
VOICE Sensors with Deep Learning

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (30 June 2023) | Viewed by 42325

Special Issue Editor


Prof. Wookey Lee
Guest Editor
Professor and Director of the VOICE AI Research Institute, Inha University, Incheon, Korea
Interests: voice sensors; deep learning; patents; information retrieval

Special Issue Information

Dear Colleagues,

Deep learning is driving rapid progress in sensor technologies, especially those related to voice.

Human communication relies on a speaking-and-listening cycle that combines verbal, physiological, and acoustic processes. Voice technologies that exploit sensors capable of measuring the biosignals produced during these activities have developed rapidly in recent years. In particular, with the advance of voice recognition based on artificial intelligence and deep learning, the related market has expanded and such technology is appearing in a wide range of services. Many open sensor issues remain to be explored.

Prof. Wookey Lee
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Voice technology
  • Deep learning
  • Voice sensors
  • Biosignals
  • Voice recognition
  • Voice generation

Published Papers (10 papers)


Research


11 pages, 2184 KiB  
Article
Research on Speech Synthesis Based on Mixture Alignment Mechanism
by Yan Deng, Ning Wu, Chengjun Qiu, Yan Chen and Xueshan Gao
Sensors 2023, 23(16), 7283; https://doi.org/10.3390/s23167283 - 20 Aug 2023
Viewed by 1330
Abstract
In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment information between text sequences and the mel-spectrogram. Mixture-TTS uses a linguistic encoder based on soft phoneme-level and hard word-level alignment, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to predict the rhythmic information of the audio. Specifically, Mixture-TTS introduces a post-net based on a five-layer 1D convolution network to improve the reconstruction of the mel-spectrogram; the output of the decoder is connected to the post-net through a residual connection. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. We evaluate the performance of Mixture-TTS on the AISHELL3 and LJSpeech datasets. Experimental results show that Mixture-TTS achieves better alignment between the text sequences and the mel-spectrogram and is able to produce high-quality audio. Ablation studies demonstrate that the structure of Mixture-TTS is effective. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
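As a rough illustration of the post-net described above (a five-layer 1D-convolution network whose output is added to the decoder output through a residual connection), a minimal PyTorch sketch might look as follows; the channel width, kernel size, dropout, and mel dimension are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of a five-layer 1D-convolution post-net with a residual
# connection, in the spirit of the architecture the abstract describes.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class PostNet(nn.Module):
    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers += [nn.Tanh(), nn.Dropout(0.5)]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        return mel + self.net(mel)           # residual refinement of the decoder output

decoder_out = torch.randn(4, 80, 200)        # dummy decoder output
print(PostNet()(decoder_out).shape)          # torch.Size([4, 80, 200])
```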

17 pages, 1332 KiB  
Article
gRDF: An Efficient Compressor with Reduced Structural Regularities That Utilizes gRePair
by Tangina Sultana and Young-Koo Lee
Sensors 2022, 22(7), 2545; https://doi.org/10.3390/s22072545 - 26 Mar 2022
Cited by 3 | Viewed by 1528
Abstract
The explosive volume of semantic data published in the Resource Description Framework (RDF) data model demands efficient management and compression with a better compression ratio and runtime. Although extensive work has been carried out on compressing RDF datasets, existing methods do not perform well in all dimensions, and they rarely exploit the graph patterns and structural regularities of real-world datasets. A variety of existing approaches reduce the size of a graph by using grammar-based graph compression algorithms. In this study, we introduce a novel approach named gRDF (graph repair for RDF) that uses gRePair, one of the most efficient grammar-based graph compression schemes, to compress RDF datasets. In addition, we improve the performance of HDT (header-dictionary-triple), an efficient approach for compressing RDF datasets based on structural properties, by introducing modified HDT (M-HDT), which detects frequent graph patterns in a single pass over the dataset using a data-structure-oriented approach. In our proposed system, we use M-HDT for indexing the nodes and edge labels. We then employ the gRePair algorithm to identify a grammar from the RDF graph. Afterwards, the system improves the performance of k2-trees by introducing a more efficient algorithm to create the trees and serialize the RDF datasets. Our experiments confirm that the proposed gRDF scheme achieves approximately 26.12%, 13.68%, 6.81%, 2.38%, and 12.76% better compression ratios than the most prominent state-of-the-art schemes (HDT, HDT++, k2-trees, RDF-TR, and gRePair) on real-world datasets. Moreover, the processing efficiency of our proposed scheme outperforms the others. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
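The indexing stage mentioned above (mapping RDF nodes and edge labels to compact integer identifiers before grammar-based compression) can be illustrated with a small, generic dictionary-encoding sketch in Python; this is not the paper's M-HDT algorithm, only the common first step that such compressors share, and the example triples are made up.

```python
# Illustrative dictionary encoding of RDF triples: every distinct node and
# predicate is replaced by a small integer ID, the usual precursor to
# structure-aware compression (HDT, gRePair, k2-trees, ...).
def encode_triples(triples):
    node_ids, pred_ids, encoded = {}, {}, []
    for s, p, o in triples:
        sid = node_ids.setdefault(s, len(node_ids))
        oid = node_ids.setdefault(o, len(node_ids))
        pid = pred_ids.setdefault(p, len(pred_ids))
        encoded.append((sid, pid, oid))
    return encoded, node_ids, pred_ids

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name",  '"Alice"'),
]
encoded, nodes, preds = encode_triples(triples)
print(encoded)   # [(0, 0, 1), (1, 0, 2), (0, 1, 3)]
```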

18 pages, 860 KiB  
Article
Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System
by June-Woo Kim, Hyekyung Yoon and Ho-Young Jung
Sensors 2022, 22(4), 1509; https://doi.org/10.3390/s22041509 - 15 Feb 2022
Cited by 4 | Viewed by 2716
Abstract
Successful applications of deep learning technologies in the natural language processing domain have improved text-based intent classifications. However, in practical spoken dialogue applications, the users’ articulation styles and background noises cause automatic speech recognition (ASR) errors, and these may lead language models to misclassify users’ intents. To overcome the limited performance of the intent classification task in the spoken dialogue system, we propose a novel approach that jointly uses both recognized text obtained by the ASR model and a given labeled text. In the evaluation phase, only the fine-tuned recognized language model (RLM) is used. The experimental results show that the proposed scheme is effective at classifying intents in the spoken dialogue system containing ASR errors. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
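A highly simplified sketch of the data-level idea (training an intent classifier on both the labeled clean transcripts and their ASR-recognized counterparts, then testing on ASR output) is shown below; a TF-IDF plus logistic-regression pipeline stands in for the fine-tuned language model purely for illustration, and the example utterances, intents, and ASR errors are invented.

```python
# Train on clean transcripts plus their (error-containing) ASR versions so the
# classifier becomes robust to recognition errors; evaluate on ASR text only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

gold = [("play some jazz music", "play_music"),
        ("what is the weather today", "get_weather")]
asr  = [("play sum jazz music", "play_music"),            # simulated ASR errors
        ("what is the whether today", "get_weather")]

texts  = [t for t, _ in gold + asr]
labels = [y for _, y in gold + asr]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

print(clf.predict(vec.transform(["play sum jaz musik"])))  # expected: ['play_music']
```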

20 pages, 3408 KiB  
Article
Emotional Speech Recognition Using Deep Neural Networks
by Loan Trinh Van, Thuy Dao Thi Le, Thanh Le Xuan and Eric Castelli
Sensors 2022, 22(4), 1414; https://doi.org/10.3390/s22041414 - 12 Feb 2022
Cited by 21 | Viewed by 5238
Abstract
The expression of emotions plays a very important role in the information conveyed in human communication. Humans express emotions in many forms, including body language, facial expressions, eye contact, laughter, and tone of voice. Even without understanding the language being spoken, people can often grasp part of the message from these emotional expressions alone. Among the forms of human emotional expression, the expression of emotions through voice is perhaps the most studied. This article presents our research on speech emotion recognition using deep neural networks such as CNN, CRNN, and GRU models. We used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus with four emotions: anger, happiness, sadness, and neutrality. The feature parameters used for recognition include the Mel spectral coefficients and other parameters related to the spectrum and intensity of the speech signal. Data augmentation was performed by changing the voice and adding white noise. The results show that the GRU model gave the highest average recognition accuracy, 97.47%, which is superior to existing studies on speech emotion recognition with the IEMOCAP corpus. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
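As a hedged sketch of the kind of recurrent model the abstract evaluates, the following PyTorch snippet defines a small GRU classifier over mel-spectrogram frames for the four emotion classes; the layer sizes and time pooling are assumptions, not the configuration reported in the paper.

```python
# Minimal GRU-based emotion classifier over mel-spectrogram frames
# (4 classes: anger, happiness, sadness, neutrality).  Sizes are illustrative.
import torch
import torch.nn as nn

class GRUEmotionClassifier(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, frames, n_mels)
        out, _ = self.gru(x)
        return self.fc(out.mean(dim=1))  # average over time, then classify

logits = GRUEmotionClassifier()(torch.randn(8, 300, 40))
print(logits.shape)                      # torch.Size([8, 4])
```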

15 pages, 1260 KiB  
Article
Using a Low-Power Spiking Continuous Time Neuron (SCTN) for Sound Signal Processing
by Moshe Bensimon, Shlomo Greenberg and Moshe Haiut
Sensors 2021, 21(4), 1065; https://doi.org/10.3390/s21041065 - 04 Feb 2021
Cited by 7 | Viewed by 2355
Abstract
This work presents a new approach based on a spiking neural network for sound preprocessing and classification. The proposed approach is biologically inspired, using spiking neurons and a Spike-Timing-Dependent Plasticity (STDP)-based learning rule. We propose a biologically plausible sound classification framework that uses a Spiking Neural Network (SNN) for detecting the frequencies embedded within an acoustic signal. This work also demonstrates an efficient hardware implementation of the SNN based on the low-power Spiking Continuous Time Neuron (SCTN). The proposed sound classification framework suggests direct Pulse Density Modulation (PDM) interfacing of the acoustic sensor with the SCTN-based network, avoiding costly digital-to-analog conversions. This paper presents a new connectivity approach applied to Spiking Neuron (SN)-based neural networks. We suggest considering the SCTN as a basic building block in the design of programmable analog electronic circuits. Usually, a neuron is used as a repeated modular element in a neural network, and the connectivity between neurons in different layers is well defined, generating a modular network composed of several layers with full or partial connectivity. The proposed approach instead controls the behavior of the spiking neurons and applies smart connectivity to enable the design of simple analog circuits based on SNNs. Unlike existing NN-based solutions, in which the preprocessing phase is carried out using analog circuits and analog-to-digital conversion, we suggest integrating the preprocessing phase into the network. This allows the basic SCTN to be treated as an analog module, enabling the design of simple analog circuits based on an SNN with unique interconnections between the neurons. The efficiency of the proposed approach is demonstrated by implementing SCTN-based resonators for sound feature extraction and classification. The proposed SCTN-based sound classification approach achieves a classification accuracy of 98.73% on the Real-World Computing Partnership (RWCP) database. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
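The abstract does not give the SCTN equations, so the sketch below only illustrates the general behaviour of a discrete-time spiking neuron driven by a pulse stream (here a toy pulse-density-modulation-like input); it is a generic leaky integrate-and-fire neuron, not the SCTN model itself, and all constants are assumptions.

```python
# Generic discrete-time leaky integrate-and-fire neuron driven by a binary
# pulse stream; included only to illustrate spiking behaviour.
import numpy as np

def lif_neuron(input_spikes, leak=0.95, weight=0.3, threshold=1.0):
    v, out = 0.0, []
    for s in input_spikes:
        v = leak * v + weight * s        # leaky integration of input pulses
        if v >= threshold:               # fire and reset
            out.append(1)
            v = 0.0
        else:
            out.append(0)
    return out

pdm_like = np.random.binomial(1, 0.4, size=50)   # toy pulse-density input
print(lif_neuron(pdm_like))
```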

17 pages, 1296 KiB  
Article
Design and Implementation of Fast Spoken Foul Language Recognition with Different End-to-End Deep Neural Network Architectures
by Abdulaziz Saleh Ba Wazir, Hezerul Abdul Karim, Mohd Haris Lye Abdullah, Nouar AlDahoul, Sarina Mansor, Mohammad Faizal Ahmad Fauzi, John See and Ahmad Syazwan Naim
Sensors 2021, 21(3), 710; https://doi.org/10.3390/s21030710 - 21 Jan 2021
Cited by 9 | Viewed by 3324
Abstract
Given the prevalence of foul language in audio and video files and its detrimental consequences for an individual's character and behaviour, content censorship is crucial for filtering profanities away from young viewers, who have high exposure to uncensored content. Manual detection and censorship have been used, but they are tedious, and misidentifications inevitably occur owing to human weariness and the limits of human perception over long screening times. This paper therefore proposes an intelligent system for foul language censorship based on a robust, automated detection method using deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells. Data on foul language were collected, annotated, augmented, and analysed for the development and evaluation of both CNN and RNN configurations. The results indicate the feasibility of the proposed systems, which identify a high proportion of curse words with a False Negative Rate (FNR) of only 2.53% to 5.92%. The proposed system outperforms state-of-the-art pre-trained neural networks on the new foul language dataset and reduces the computational cost with minimal trainable parameters. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
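For a flavour of the CNN side of the comparison, a toy 2D-CNN over log-mel spectrogram patches with a binary foul/neutral output could look like the sketch below; the topology, input size, and class set are illustrative assumptions and do not reproduce the architectures evaluated in the paper.

```python
# Toy 2D-CNN for binary foul/neutral classification of spectrogram patches.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 20 * 25, 64), nn.ReLU(),
    nn.Linear(64, 2),                        # foul vs. neutral
)

spec = torch.randn(4, 1, 80, 100)            # (batch, channel, mel bins, frames)
print(model(spec).shape)                     # torch.Size([4, 2])
```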

17 pages, 3045 KiB  
Article
Multi-TALK: Multi-Microphone Cross-Tower Network for Jointly Suppressing Acoustic Echo and Background Noise
by Song-Kyu Park and Joon-Hyuk Chang
Sensors 2020, 20(22), 6493; https://doi.org/10.3390/s20226493 - 13 Nov 2020
Viewed by 1943
Abstract
In this paper, we propose a multi-channel cross-tower with attention mechanisms in latent domain network (Multi-TALK) that suppresses both the acoustic echo and background noise. The proposed approach consists of the cross-tower network, a parallel encoder with an auxiliary encoder, and a decoder. For the multi-channel processing, a parallel encoder is used to extract latent features of each microphone, and the latent features including the spatial information are compressed by a 1D convolution operation. In addition, the latent features of the far-end are extracted by the auxiliary encoder, and they are effectively provided to the cross-tower network by using the attention mechanism. The cross tower network iteratively estimates the latent features of acoustic echo and background noise in each tower. To improve the performance at each iteration, the outputs of each tower are transmitted as the input for the next iteration of the neighboring tower. Before passing through the decoder, to estimate the near-end speech, attention mechanisms are further applied to remove the estimated acoustic echo and background noise from the compressed mixture to prevent speech distortion by over-suppression. Compared to the conventional algorithms, the proposed algorithm effectively suppresses the acoustic echo and background noise and significantly lowers the speech distortion. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
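A rough reading of the multi-microphone front end described above (a shared encoder applied to each microphone signal, with the stacked per-channel latent features compressed by a 1D convolution) is sketched below in PyTorch; the number of microphones, latent size, and filter settings are assumptions for illustration only.

```python
# Sketch: shared 1D-conv encoder per microphone, then 1x1-conv compression
# of the stacked per-channel latent features (spatial fusion).
import torch
import torch.nn as nn

class MultiMicEncoder(nn.Module):
    def __init__(self, n_mics=4, latent=64):
        super().__init__()
        self.enc = nn.Conv1d(1, latent, kernel_size=16, stride=8)   # shared encoder
        self.compress = nn.Conv1d(n_mics * latent, latent, kernel_size=1)

    def forward(self, x):                        # x: (batch, mics, samples)
        b, m, t = x.shape
        z = self.enc(x.reshape(b * m, 1, t))     # encode each channel separately
        z = z.reshape(b, m * z.shape[1], -1)     # stack channel features
        return self.compress(z)                  # fuse spatial information

mix = torch.randn(2, 4, 16000)                   # 2 utterances, 4 microphones
print(MultiMicEncoder()(mix).shape)              # torch.Size([2, 64, 1999])
```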

18 pages, 695 KiB  
Article
Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network
by Misbah Farooq, Fawad Hussain, Naveed Khan Baloch, Fawad Riasat Raja, Heejung Yu and Yousaf Bin Zikria
Sensors 2020, 20(21), 6008; https://doi.org/10.3390/s20216008 - 23 Oct 2020
Cited by 71 | Viewed by 6885
Abstract
Speech emotion recognition (SER) plays a significant role in human–machine interaction. Emotion recognition from speech and its precise classification is a challenging task because a machine is unable to understand its context. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional state of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in the speaker-dependent SER experiments. Moreover, our method yields the best speaker-independent SER results when compared with existing handcrafted-feature-based SER approaches. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
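The correlation-based selection step can be illustrated with a simplified stand-in: rank the extracted DCNN features by the absolute correlation of each feature with the class label and keep the top k. This is not the exact selection scheme from the paper, and the data below are random placeholders for pooled DCNN activations.

```python
# Simplified correlation-based feature ranking: keep the k features most
# correlated (in absolute value) with the emotion label.
import numpy as np

def select_by_correlation(features, labels, k=50):
    # features: (n_samples, n_features), labels: (n_samples,)
    corr = np.array([abs(np.corrcoef(features[:, j], labels)[0, 1])
                     for j in range(features.shape[1])])
    return np.argsort(corr)[::-1][:k]            # indices of the k best features

X = np.random.randn(200, 512)                    # stand-in for DCNN features
y = np.random.randint(0, 4, size=200)            # 4 emotion classes
X_selected = X[:, select_by_correlation(X, y)]   # input for SVM / RF / k-NN
print(X_selected.shape)                          # (200, 50)
```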

19 pages, 899 KiB  
Article
Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data
by Ayesha Pervaiz, Fawad Hussain, Huma Israr, Muhammad Ali Tahir, Fawad Riasat Raja, Naveed Khan Baloch, Farruh Ishmanov and Yousaf Bin Zikria
Sensors 2020, 20(8), 2326; https://doi.org/10.3390/s20082326 - 19 Apr 2020
Cited by 31 | Viewed by 5451
Abstract
The advent of new devices, technologies, and machine learning techniques, together with the availability of large free speech corpora, has led to rapid and accurate speech recognition. In the last two decades, extensive research has been carried out by researchers and organizations to experiment with new techniques and their applications in speech processing systems. Speech command based applications appear in robotics, IoT, ubiquitous computing, and various human-computer interfaces. Various researchers have worked on enhancing the efficiency of speech command based systems using the speech command dataset; however, none of them have addressed noise in that dataset. Noise is one of the major challenges in any speech recognition system, as real-world noise is a versatile and unavoidable factor that degrades the performance of speech recognition systems, particularly those that have not learned the noise efficiently. We thoroughly analyse the latest trends in speech recognition and evaluate the speech command dataset with different machine learning based and deep learning based techniques. A novel technique is proposed for noise robustness by augmenting the training data with noise. Our proposed technique is tested on clean and noisy data, along with locally generated data, and achieves much better results than existing state-of-the-art techniques, thus setting a new benchmark. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
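One common way to realize the noise augmentation described above is to scale a noise clip so that the clean command and the noise mix at a chosen signal-to-noise ratio; the sketch below shows that idea with synthetic signals, and the SNR values and signal lengths are illustrative assumptions rather than the paper's setup.

```python
# Mix a clean command with noise at a target SNR (in dB) for training-data
# augmentation.  Synthetic signals stand in for real recordings here.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)         # stand-in for a 1 s spoken command
noise = rng.standard_normal(16000)         # stand-in for recorded background noise
augmented = [mix_at_snr(clean, noise, snr) for snr in (20, 10, 5, 0)]
print(len(augmented), augmented[0].shape)  # 4 (16000,)
```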

Review


22 pages, 4595 KiB  
Review
Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review
by Wookey Lee, Jessica Jiwon Seong, Busra Ozlu, Bong Sup Shim, Azizbek Marakhimov and Suan Lee
Sensors 2021, 21(4), 1399; https://doi.org/10.3390/s21041399 - 17 Feb 2021
Cited by 55 | Viewed by 9096
Abstract
Voice is one of the essential mechanisms for communicating and expressing one's intentions as a human being. There are several causes of voice inability, including disease, accident, vocal abuse, medical surgery, ageing, and environmental pollution, and the risk of voice loss continues to increase. Because voice loss seriously undermines quality of life and sometimes leads to isolation from society, novel approaches to speech recognition and production need to be developed. In this review, we survey mouth interface technologies, i.e., mouth-mounted devices for speech recognition, production, and volitional control, and the corresponding research on artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images, and 3-axial magnetic sensors, with a particular focus on deep learning techniques. We examine the deep learning technologies related to voice recognition, including visual speech recognition and silent speech interfaces, analyze their workflow, and organize them into a taxonomy. Finally, we discuss methods for solving the communication problems of people with speaking disabilities and future research directions with respect to deep learning components. Full article
(This article belongs to the Special Issue VOICE Sensors with Deep Learning)
