Applications of Neural Networks for Speech and Language Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (15 November 2022) | Viewed by 16734

Special Issue Editors


Dr. Daniel Hládek
Guest Editor
Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Nemcovej 32, 040 01 Košice, Slovakia
Interests: speech and language processing; neural networks; fuzzy logic; machine learning; human–computer interaction

Prof. Dr. Matúš Pleva
Guest Editor
Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Nemcovej 32, 040 01 Košice, Slovakia
Interests: speech recognition; human–computer interaction; voice interaction; automatic broadcast news processing; networking protocols and applications; mobile communications systems; multimodal applications; information security and biometrics

Prof. Dr. Piotr Szczuko
Guest Editor
Multimedia Systems Department, Gdańsk University of Technology, Gabriela Narutowicza 11/12, 80-233 Gdańsk, Poland
Interests: processing of audio and video; computer animation; 3D visualization; inference methods; artificial intelligence; applications of rough sets theory; classification and perception of sounds and images; algorithms and methods of image analysis and understanding; applications of embedded systems

Dr. Andrej Zgank
Guest Editor
Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia
Interests: automatic speech recognition; language resources; acoustic classification and analysis; digital signal processing; telecommunication services; quality of experience

Special Issue Information

Dear Colleagues,

Speech and language processing are key elements of natural interaction with devices and services. Current deep-learning systems (e.g., end-to-end approaches) reveal and exploit the intrinsic knowledge hidden in records of human communication in acoustic, visual, or textual form. Advances in research have blurred the line between algorithms and methods: machine translation and speech recognition, for instance, can apply a similar-looking neural network to transcribe the input data into comprehensible output. This convergence of methods brings new possibilities, but also new challenges. This Special Issue provides an opportunity to discuss the current applications, trends, and directions in the processing of human interaction using neural networks.

We welcome contributions reporting applications or enhancements of neural networks in the following areas (topics include, but are not limited to):

• Speech processing, recognition, and synthesis:

  • Automatic speech recognition;
  • Text-to-speech synthesis;
  • Speaker recognition and verification;
  • Speech pre-processing and enhancement;
  • Paralinguistic and COVID-19 cough/speech processing.

• Natural language processing:

  • Language modeling and generation;
  • Multilingual and cross-lingual systems;
  • Dialog modeling;
  • Machine translation;
  • Machine comprehension and information retrieval.

• Multimodal processing:

  • Speech processing supported by biometrics;
  • Multimodal analysis of user behavior, including the exploration of relations between visual and acoustic speech features;
  • Personal traits detection and classification;
  • Paralinguistic effects;
  • Advances in dialogue systems and virtual assistants.

Dr. Daniel Hládek
Prof. Dr. Matúš Pleva
Prof. Dr. Piotr Szczuko
Dr. Andrej Zgank
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech processing, recognition, and synthesis
  • natural language processing
  • multimodal processing

Published Papers (7 papers)


Research

16 pages, 6207 KiB  
Article
Object Recognition System for the Visually Impaired: A Deep Learning Approach using Arabic Annotation
by Nada Alzahrani and Heyam H. Al-Baity
Electronics 2023, 12(3), 541; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics12030541 - 20 Jan 2023
Cited by 5 | Viewed by 3128
Abstract
Object detection is an important computer vision technique that has increasingly attracted the attention of researchers in recent years. The literature to date has introduced a range of object detection models. However, these models have largely been English-language-based, and only a limited number of published studies have addressed how object detection can be implemented for the Arabic language. As far as we are aware, the generation of an Arabic text-to-speech engine that utters objects’ names and their positions in images to help Arabic-speaking visually impaired people has not been investigated previously. Therefore, in this study, we propose an object detection and segmentation model based on the Mask R-CNN algorithm that is capable of identifying and locating different objects in images, then uttering their names and positions in Arabic. The proposed model was trained on the Pascal VOC 2007 and 2012 datasets and evaluated on the Pascal VOC 2007 test set. We believe that this is one of the few studies to use these datasets to train and test the Mask R-CNN model. The performance of the proposed object detection model was evaluated against previous object detection models in the literature, and the results demonstrated its superiority, achieving an accuracy of 83.9%. Moreover, experiments were conducted to evaluate the performance of the incorporated translator and TTS engines, and the results showed that the proposed model could be effective in helping Arabic-speaking visually impaired people understand the content of digital images.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
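
To make the detect-then-describe pipeline concrete, the sketch below (an illustration under our own assumptions, not the authors' code) uses an off-the-shelf Mask R-CNN from torchvision to find objects, then turns each confident detection into a phrase naming the object and its horizontal position — the kind of phrase a translator and Arabic TTS engine would subsequently voice. The score threshold and three-way position rule are illustrative choices.

```python
import torch
from torchvision.models.detection import (
    MaskRCNN_ResNet50_FPN_Weights,
    maskrcnn_resnet50_fpn,
)

# Off-the-shelf Mask R-CNN; the paper trains on Pascal VOC, this checkpoint is COCO.
weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()
labels = weights.meta["categories"]

def describe(image: torch.Tensor, score_thresh: float = 0.7) -> list[str]:
    """Return one 'name at position' phrase per confident detection."""
    with torch.no_grad():
        detections = model([image])[0]  # image: CHW float tensor in [0, 1]
    _, _, width = image.shape
    phrases = []
    for box, label, score in zip(
        detections["boxes"], detections["labels"], detections["scores"]
    ):
        if score < score_thresh:
            continue
        cx = (box[0] + box[2]).item() / 2  # horizontal centre of the bounding box
        pos = "left" if cx < width / 3 else "right" if cx > 2 * width / 3 else "center"
        phrases.append(f"{labels[int(label)]} at the {pos}")
    # The paper would translate each phrase and pass it to an Arabic TTS engine here.
    return phrases

print(describe(torch.rand(3, 480, 640)))  # a random image usually yields no detections
```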

12 pages, 456 KiB  
Article
Recognition of Emotions in Speech Using Convolutional Neural Networks on Different Datasets
by Marta Zielonka, Artur Piastowski, Andrzej Czyżewski, Paweł Nadachowski, Maksymilian Operlejn and Kamil Kaczor
Electronics 2022, 11(22), 3831; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics11223831 - 21 Nov 2022
Cited by 7 | Viewed by 2698
Abstract
Artificial Neural Network (ANN) models, specifically Convolutional Neural Networks (CNNs), were applied to recognize emotions from spectrograms and mel-spectrograms. This study investigates which of the two feature extraction methods better represents emotions and how large the resulting differences in efficiency are. The conducted experiments demonstrated that mel-spectrograms are the better-suited input for training CNN-based speech emotion recognition (SER). The experiments employed five popular datasets: Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Surrey Audio-Visual Expressed Emotion (SAVEE), Toronto Emotional Speech Set (TESS), and The Interactive Emotional Dyadic Motion Capture (IEMOCAP). Six classes of emotions were used: happiness, anger, sadness, fear, disgust, and neutral; however, some experiments recognized only four emotions due to the characteristics of the IEMOCAP dataset. A comparison of classification efficiency across datasets and an attempt to develop a universal model trained on all datasets were also performed. This approach achieved an accuracy of 55.89% when recognizing four emotions. The most accurate model for six-emotion recognition achieved 57.42% accuracy when trained on a combination of four datasets (CREMA-D, RAVDESS, SAVEE, TESS). Furthermore, an additional study demonstrated that improper division of data into training and test sets significantly influences the test accuracy of CNNs; this problem, which has affected the results of studies known from the literature, was therefore addressed extensively. The experiments employed the popular ResNet18 architecture to demonstrate the reliability of the results and to show that these problems are not unique to the custom CNN architecture proposed here. Finally, the label correctness of the CREMA-D dataset was studied by means of a prepared questionnaire.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
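
As a rough illustration of the paper's setup (the layer sizes and front-end parameters here are assumptions, not the authors' architecture), the sketch below feeds a log-compressed mel-spectrogram into a small CNN that scores the six emotion classes; adaptive pooling lets clips of any duration pass through.

```python
import torch
import torch.nn as nn
import torchaudio

EMOTIONS = ["happiness", "anger", "sadness", "fear", "disgust", "neutral"]

class SmallSER(nn.Module):
    def __init__(self, n_classes: int = len(EMOTIONS)):
        super().__init__()
        # Mel-spectrogram front end; the paper compares this against plain spectrograms.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=256, n_mels=64
        )
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # pool away time/frequency so any clip length works
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        x = self.mel(wave).unsqueeze(1).log1p()  # (batch, 1, n_mels, frames), log-compressed
        return self.classifier(self.features(x).flatten(1))

logits = SmallSER()(torch.randn(2, 16000))  # two one-second clips at 16 kHz
print(logits.shape)  # torch.Size([2, 6])
```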

13 pages, 1439 KiB  
Article
RECA: Relation Extraction Based on Cross-Attention Neural Network
by Xiaofeng Huang, Zhiqiang Guo, Jialiang Zhang, Hui Cao and Jie Yang
Electronics 2022, 11(14), 2161; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics11142161 - 11 Jul 2022
Viewed by 1685
Abstract
Extracting entities and relations, a crucial part of many natural language processing tasks, transforms unstructured text into structured information and provides data support for knowledge graph (KG) and knowledge vault (KV) construction. Nevertheless, the mainstream relation-extraction methods, the pipeline method and the joint method, ignore the dependency between the subject entity and the object entity. This work introduces a pre-trained BERT model and a dilated gated convolutional neural network (DGCNN) as an encoder to capture long-range semantic representations of the input sequence. In addition, we propose a cross-attention neural network as a decoder to learn the importance of each subject word for each word of the input sequence. Experiments were undertaken on two extensive datasets, the New York Times Corpus (NYT) and the WebNLG Corpus, and showed that our model performs significantly better than the CasRel model, outperforming that baseline by absolute gains of 1.9% and 0.7% in F1-score.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
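
A hedged sketch of the cross-attention idea follows (dimensions are illustrative, and nn.MultiheadAttention stands in for the paper's decoder): sentence tokens from a BERT/DGCNN-style encoder act as queries over the subject-entity span, so each token's representation is reweighted by its relevance to the subject before object and relation tagging.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    def __init__(self, hidden: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, sentence: torch.Tensor, subject: torch.Tensor) -> torch.Tensor:
        # sentence: (batch, seq_len, hidden) from a BERT/DGCNN encoder
        # subject:  (batch, subj_len, hidden) vectors of the subject-entity span
        attended, _ = self.attn(query=sentence, key=subject, value=subject)
        return self.norm(sentence + attended)  # subject-aware token representations

tokens = torch.randn(2, 32, 768)  # stand-in for encoder output
subj = torch.randn(2, 3, 768)     # stand-in for a 3-token subject span
print(SubjectCrossAttention()(tokens, subj).shape)  # torch.Size([2, 32, 768])
```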

13 pages, 1495 KiB  
Article
A Speech Recognition Model Building Method Combined Dynamic Convolution and Multi-Head Self-Attention Mechanism
by Wei Liu, Jiaming Sun, Yiming Sun and Chunyi Chen
Electronics 2022, 11(10), 1656; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics11101656 - 23 May 2022
Viewed by 1632
Abstract
The Conformer enhances the Transformer by connecting a convolution module in series with multi-head self-attention (MHSA); this strengthens local attention computation and achieves better results in automatic speech recognition. This paper proposes a hybrid attention mechanism that combines dynamic convolution CNNs (DY-CNNs) with multi-head self-attention. The study focuses on generating local attention by embedding DY-CNNs in the MHSA, computing global and local attention in parallel inside the attention layer, and finally concatenating the global and local attention results to form the output. In the experiments, we trained on the Aishell-1 (178 hours) Chinese database and obtained character error rates (CER) of 4.5%/4.8% on the dev/test sets. The proposed method shows better performance in computation speed and number of parameters, with results extremely close to the best result of the Conformer (4.4%/4.7%).
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
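
The sketch below illustrates the parallel global/local structure described above, under stated assumptions: a plain depthwise Conv1d stands in for the paper's dynamic convolution, the MHSA branch supplies global context, and the two branch outputs are concatenated and projected back to the model dimension.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, d_model: int = 256, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, heads, batch_first=True)
        # Depthwise convolution as a stand-in for the paper's dynamic convolution.
        self.local = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model)
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse the concatenated branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        global_out, _ = self.mhsa(x, x, x)                         # global attention
        local_out = self.local(x.transpose(1, 2)).transpose(1, 2)  # local context
        return self.proj(torch.cat([global_out, local_out], dim=-1))

frames = torch.randn(2, 100, 256)  # stand-in for 100 acoustic frames
print(HybridAttention()(frames).shape)  # torch.Size([2, 100, 256])
```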

10 pages, 590 KiB  
Article
Research on Joint Extraction Model of Financial Product Opinion and Entities Based on RoBERTa
by Jiang Liao and Hanxiao Shi
Electronics 2022, 11(9), 1345; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics11091345 - 23 Apr 2022
Cited by 1 | Viewed by 1674
Abstract
With the rapid development of the Internet and its enormous impact on all aspects of life, traditional financial companies increasingly focus on users’ online reviews, aiming to improve the competitiveness and service quality of their products. Because review text is harder to extract information from than structured data and tends to be highly colloquial, traditional models represent sentence semantics insufficiently, resulting in unsatisfactory extraction results. Therefore, this paper selects RoBERTa, a pre-trained language model that has exhibited excellent performance in recent years, and proposes a joint model based on RoBERTa multi-layer fusion for the two tasks of financial product opinion and entity extraction. The experimental results show that the performance of the proposed joint model on the financial reviews dataset is significantly better than that of the single model.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
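
One plausible reading of "RoBERTa multi-layer fusion" is sketched below (the Hugging Face model name, head sizes, and softmax layer weighting are our assumptions, not the authors' exact design): all encoder hidden layers are fused by a learned weighting, and an entity-tagging head and an opinion-polarity head share the fused representation, giving the joint extraction setup.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NAME = "hfl/chinese-roberta-wwm-ext"  # an illustrative Chinese RoBERTa checkpoint

class JointExtractor(nn.Module):
    def __init__(self, n_tags: int = 5, n_polarities: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(NAME, output_hidden_states=True)
        n_layers = self.encoder.config.num_hidden_layers + 1  # embeddings + each block
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        hidden = self.encoder.config.hidden_size
        self.entity_head = nn.Linear(hidden, n_tags)          # per-token BIO-style tags
        self.opinion_head = nn.Linear(hidden, n_polarities)   # sentence-level polarity

    def forward(self, **inputs):
        layers = torch.stack(self.encoder(**inputs).hidden_states)  # (L, B, T, H)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * layers).sum(0)  # learned multi-layer fusion
        return self.entity_head(fused), self.opinion_head(fused[:, 0])  # [CLS] for polarity

tok = AutoTokenizer.from_pretrained(NAME)
tags, polarity = JointExtractor()(**tok("这款理财产品收益稳定", return_tensors="pt"))
print(tags.shape, polarity.shape)
```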

15 pages, 5546 KiB  
Article
Domain-Adversarial Based Model with Phonological Knowledge for Cross-Lingual Speech Recognition
by Qingran Zhan, Xiang Xie, Chenguang Hu, Juan Zuluaga-Gomez, Jing Wang and Haobo Cheng
Electronics 2021, 10(24), 3172; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10243172 - 20 Dec 2021
Cited by 2 | Viewed by 2638
Abstract
Phonological features (articulatory features, AFs) describe the movements of the vocal organs, which are shared across languages. This paper investigates a domain-adversarial neural network (DANN) for extracting reliable AFs, and different multi-stream techniques are used for cross-lingual speech recognition. First, a novel universal definition of phonological attributes is proposed for Mandarin, English, German, and French. Then a DANN-based AF detector is trained on the source languages (English, German, and French). For cross-lingual speech recognition, the AF detectors transfer the phonological knowledge from the source languages to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, a monolingual AF system (i.e., AFs extracted directly from the target language) is also investigated. Experiments show that the performance of the AF detector can be improved by using convolutional neural networks (CNNs) with a domain-adversarial learning method. The multi-head attention (MHA)-based multi-stream approach reaches the best performance compared to the baseline, the cross-lingual adaptation approach, and the other approaches. More specifically, the MHA mode with cross-lingual AFs yields significant improvements over monolingual AFs when training data are limited, and it can easily be extended to other low-resource languages.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
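
The core domain-adversarial trick can be shown in a few lines (a minimal sketch, not the authors' system; the layer sizes and the 24 attribute classes are placeholders): a gradient-reversal function flips gradients flowing from a language discriminator, so the shared encoder learns AF-predictive features that cannot distinguish the source languages and therefore transfer better to the target language.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None  # flip the gradient on the way back

feat = nn.Sequential(nn.Linear(40, 256), nn.ReLU())  # shared acoustic encoder
af_head = nn.Linear(256, 24)    # predicts phonological attribute classes (placeholder count)
lang_head = nn.Linear(256, 3)   # discriminates English/German/French

x = torch.randn(8, 40)          # a batch of acoustic feature frames
h = feat(x)
af_logits = af_head(h)                               # trained to be accurate
lang_logits = lang_head(GradReverse.apply(h, 1.0))   # trained adversarially
print(af_logits.shape, lang_logits.shape)
```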

13 pages, 573 KiB  
Article
A Self-Supervised Model for Language Identification Integrating Phonological Knowledge
by Qingran Zhan, Xiang Xie, Chenguang Hu and Haobo Cheng
Electronics 2021, 10(18), 2259; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10182259 - 14 Sep 2021
Cited by 3 | Viewed by 1843
Abstract
In this paper, a self-supervised pre-trained model is proposed and successfully applied to the language identification (LID) task. A Transformer encoder is employed, and a multi-task strategy is used to train the self-supervised model: the first task is to reconstruct masked spans of input frames, and the second is a supervised task in which phoneme and phonological labels are used with a Connectionist Temporal Classification (CTC) loss. Through this multi-task learning loss, the model is expected to capture high-level speech representations in the phonological space. Meanwhile, an adaptive loss is applied to balance the weights between the tasks. After the pre-training stage, the self-supervised model is used in x-vector systems. Our LID experiments are carried out on the oriental language recognition (OLR) challenge corpus, using the 1 s, 3 s, and full-length test sets. Experimental results show that on the 1 s test set the feature-extraction approach achieves the best performance, while on the 3 s and full-length tests the fine-tuning approach performs best. Furthermore, the results confirm that the multi-task training strategy is effective and that the proposed model achieves the best overall performance.
(This article belongs to the Special Issue Applications of Neural Networks for Speech and Language Processing)
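
A compact sketch of the multi-task pre-training loss follows (frame dimensions, label inventory, and the uncertainty-style adaptive weighting are assumptions rather than the paper's exact recipe): a Transformer encoder is trained both to reconstruct a masked span of input frames and to predict phoneme/phonological labels with CTC, with learned per-task weights balancing the two losses.

```python
import torch
import torch.nn as nn

enc_layer = nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
recon_head = nn.Linear(80, 80)   # reconstructs the masked frames
ctc_head = nn.Linear(80, 50)     # 49 phoneme/attribute labels + CTC blank
log_sigma = nn.Parameter(torch.zeros(2))  # adaptive per-task weights

frames = torch.randn(2, 120, 80)  # two utterances of 120 frames each
masked = frames.clone()
masked[:, 40:60] = 0.0            # mask a span of frames
hidden = encoder(masked)

# Task 1: reconstruct the masked span.
recon_loss = nn.functional.mse_loss(recon_head(hidden[:, 40:60]), frames[:, 40:60])

# Task 2: CTC over phoneme/phonological labels (random stand-in targets here).
log_probs = ctc_head(hidden).log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTC
targets = torch.randint(1, 50, (2, 30))
ctc_loss = nn.CTCLoss()(log_probs, targets,
                        input_lengths=torch.full((2,), 120),
                        target_lengths=torch.full((2,), 30))

# Uncertainty-style adaptive weighting: each task loss gets a learned scale.
total = sum(torch.exp(-s) * l + s for s, l in zip(log_sigma, [recon_loss, ctc_loss]))
print(float(total))
```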