Selected Papers from 16th National Conference on Man-Machine Speech Communication (NCMMSC2021)

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (10 November 2021) | Viewed by 23691

Special Issue Editors

Prof. Dr. Changchun Bao
Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Interests: speech and audio coding; speech enhancement; frequency band expansion; three-dimensional audio reconstruction; speech segmentation and retrieval

Dr. Yong Ma
School of Physics and Electronic Engineering, Jiangsu Normal University, Xuzhou 221116, China
Interests: speaker segmentation and clustering; speaker recognition; language and dialect recognition; acoustic scene classification; acoustic event detection

Special Issue Information

Dear Colleagues,

The 16th National Conference on Man–Machine Speech Communication (NCMMSC), the largest and most influential event on speech signal processing in China, will be hosted by the Chinese Information Processing Society of China and the China Computer Federation, co-organized by the Language, Hearing and Music Acoustics Branch of the Acoustical Society of China, the Phonetic Association of China, and the Signal Processing Branch of the Chinese Institute of Electronics, and undertaken jointly by Jiangsu Normal University and Beijing University of Technology. NCMMSC is an important forum for experts, scholars, and researchers in this field to exchange their latest research results and to promote continued progress in research and development across the field.

Papers published in the Special Issue “National Conference on Man–Machine Speech Communication (NCMMSC2021)” will focus on speech recognition, synthesis, enhancement, and coding, as well as experimental phonetics, speech prosody analysis, pathological speech analysis, speech analysis, acoustic scene classification, and human–computer dialogue understanding.

Prof. Dr. Changchun Bao
Dr. Yong Ma
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech and speaker recognition
  • speech synthesis and voice conversion
  • speech coding and enhancement
  • language recognition
  • speech emotion recognition
  • acoustic scene classification
  • voice detection and speech separation
  • phonetics and phonology
  • language model

Published Papers (11 papers)


Research

8 pages, 322 KiB  
Article
The XMUSPEECH System for Accented English Automatic Speech Recognition
by Fuchuan Tong, Tao Li, Dexin Liao, Shipeng Xia, Song Li, Qingyang Hong and Lin Li
Appl. Sci. 2022, 12(3), 1478; https://doi.org/10.3390/app12031478 - 29 Jan 2022
Cited by 4 | Viewed by 2351
Abstract
In this paper, we present the XMUSPEECH systems for Track 2 of the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC2020). Track 2 is an Automatic Speech Recognition (ASR) task in which non-native English speakers have various accents, which reduces the accuracy of the ASR system. To solve this problem, we experimented with acoustic models and input features. Furthermore, we trained a TDNN-LSTM language model for lattice rescoring to obtain better results. Compared with our baseline system, we achieved relative word error rate (WER) improvements of 40.7% and 35.7% on the development set and evaluation set, respectively. Full article

9 pages, 437 KiB  
Article
Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Using Classification-Based Methods
by Yaoguang Wang, Yaohao Zheng, Yunxiang Zhang, Yongsheng Xie, Sen Xu, Ying Hu and Liang He
Appl. Sci. 2021, 11(23), 11128; https://doi.org/10.3390/app112311128 - 24 Nov 2021
Cited by 6 | Viewed by 3000
Abstract
Unsupervised anomalous sound detection (ASD) is a challenging task: anomalous sounds must be detected in a large audio database without any annotated anomalous training data. Many unsupervised methods have been proposed, but previous work has confirmed that classification-based models far exceed purely unsupervised models in ASD. In this paper, we adopt two classification-based anomaly detection models: (1) an outlier classifier distinguishes anomalous sounds, or outliers, from normal ones; (2) an ID classifier identifies anomalies using both the classification confidence and the similarity of hidden embeddings. We conduct experiments on Task 2 of the DCASE 2020 Challenge, and our ensemble method achieves an average area under the curve (AUC) of 95.82% and an average partial AUC (pAUC) of 92.32%, which outperforms the state-of-the-art models. Full article
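
As a rough illustration of the ID-classifier idea described above — scoring anomalies from both the classification confidence and the similarity of hidden embeddings — the following sketch combines the two evidence sources. It is not the authors' implementation; the reference-embedding bank, the cosine-distance measure, and the weighting factor are assumptions.

```python
import numpy as np

def anomaly_score(probs, embedding, ref_embeddings, machine_id, alpha=0.5):
    """Illustrative anomaly score for a classification-based ASD model.

    probs          -- softmax output of a machine-ID classifier, shape (num_ids,)
    embedding      -- hidden embedding of the test clip, shape (d,)
    ref_embeddings -- mean embeddings of normal training clips per ID, shape (num_ids, d)
    machine_id     -- index of the machine the clip claims to come from
    alpha          -- assumed weight between the two evidence sources
    """
    # 1) Confidence evidence: a normal clip should be classified as its own ID
    #    with high probability, so low confidence suggests an anomaly.
    conf_score = 1.0 - probs[machine_id]

    # 2) Embedding evidence: cosine distance to the centroid of normal clips
    #    from the same machine ID.
    ref = ref_embeddings[machine_id]
    cos = np.dot(embedding, ref) / (np.linalg.norm(embedding) * np.linalg.norm(ref) + 1e-8)
    emb_score = 1.0 - cos

    # Higher score means more anomalous.
    return alpha * conf_score + (1.0 - alpha) * emb_score
```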

9 pages, 870 KiB  
Article
Elastic CRFs for Open-Ontology Slot Filling
by Yinpei Dai, Yichi Zhang, Hong Liu, Zhijian Ou, Yi Huang and Junlan Feng
Appl. Sci. 2021, 11(22), 10675; https://doi.org/10.3390/app112210675 - 12 Nov 2021
Viewed by 1483
Abstract
Slot filling is a crucial component of task-oriented dialog systems; it parses user utterances into semantic concepts called slots. An ontology is defined by the collection of slots and the values that each slot can take. The most widely used practice of treating slot filling as a sequence labeling task suffers from two main drawbacks. First, the ontology is usually pre-defined and fixed, and therefore cannot detect new labels for unseen slots. Second, the one-hot encoding of slot labels ignores the correlations between slots with similar semantics, which makes it difficult to share knowledge learned across different domains. To address these problems, we propose a new model called the elastic conditional random field (eCRF), in which each slot is represented by the embedding of its natural language description and modeled by a CRF layer. New slot values can be detected by the eCRF whenever a language description is available for the slot. Our experiments show that eCRFs outperform existing models on both in-domain and cross-domain tasks, especially in predicting unseen slots and values. Full article
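
A minimal sketch of the central eCRF idea — replacing a fixed one-hot label space with embeddings of natural-language slot descriptions — is given below. The encoder output, the projection layer, and the dot-product scoring are assumptions for illustration; the CRF transition layer used in the paper is omitted.

```python
import torch

def slot_emission_scores(token_hiddens, slot_desc_embeddings, proj):
    """Illustrative emission scores for an elastic-CRF-style tagger.

    token_hiddens        -- encoder outputs for one utterance, shape (seq_len, hidden_dim)
    slot_desc_embeddings -- one embedding per slot description, shape (num_slots, emb_dim);
                            new slots can be added at test time by embedding their descriptions
    proj                 -- torch.nn.Linear(hidden_dim, emb_dim) mapping tokens into the
                            slot-description space (an assumed design choice)
    """
    projected = proj(token_hiddens)              # (seq_len, emb_dim)
    # Dot-product similarity between each token and each slot description plays the
    # role of the per-label emission score that a CRF layer would then consume.
    return projected @ slot_desc_embeddings.t()  # (seq_len, num_slots)

# Toy usage with random tensors
hidden_dim, emb_dim, num_slots, seq_len = 256, 128, 10, 12
proj = torch.nn.Linear(hidden_dim, emb_dim)
scores = slot_emission_scores(torch.randn(seq_len, hidden_dim),
                              torch.randn(num_slots, emb_dim), proj)
```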

9 pages, 1277 KiB  
Article
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
by Xiao Zhou, Zhenhua Ling, Yajun Hu and Lirong Dai
Appl. Sci. 2021, 11(21), 10475; https://doi.org/10.3390/app112110475 - 8 Nov 2021
Viewed by 1334
Abstract
An encoder–decoder with attention has become a popular method for sequence-to-sequence (Seq2Seq) acoustic modeling in speech synthesis. To improve the robustness of the attention mechanism, methods that exploit the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes inserting hidden states into phone sequences to deal with the situation in which pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, the hidden states absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA achieves better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA match the manually labeled phrase boundaries quite well. Full article

14 pages, 1570 KiB  
Article
A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition
by Huiyun Zhang, Heming Huang and Henry Han
Appl. Sci. 2021, 11(21), 9897; https://doi.org/10.3390/app11219897 - 22 Oct 2021
Cited by 17 | Viewed by 2032
Abstract
Speech emotion recognition is an important component of natural language processing (NLP). It places strict requirements on the effectiveness of feature extraction and of the acoustic model. With that in mind, a Heterogeneous Parallel Convolution Bi-LSTM model is proposed to address these challenges. It consists of two heterogeneous branches: the left one contains two dense layers and a Bi-LSTM layer, while the right one contains a dense layer, a convolution layer, and a Bi-LSTM layer. The model exploits spatiotemporal information more effectively and achieves unweighted average recalls of 84.65%, 79.67%, and 56.50% on the benchmark databases EMODB, CASIA, and SAVEE, respectively. Compared with previous research results, the proposed model consistently achieves better performance. Full article
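
The two-branch structure described in the abstract can be sketched roughly as follows; the input feature dimension, layer widths, pooling, and fusion by concatenation are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HeterogeneousParallelConvBiLSTM(nn.Module):
    """Rough sketch of a two-branch model: dense+dense+Bi-LSTM on the left,
    dense+convolution+Bi-LSTM on the right, fused for emotion classification."""

    def __init__(self, feat_dim=40, hidden=128, num_emotions=7):
        super().__init__()
        # Left branch: two dense layers followed by a Bi-LSTM
        self.left_dense = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.left_lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

        # Right branch: dense layer, 1-D convolution over time, then a Bi-LSTM
        self.right_dense = nn.Linear(feat_dim, hidden)
        self.right_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.right_lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

        self.classifier = nn.Linear(4 * hidden, num_emotions)

    def forward(self, x):                                        # x: (batch, time, feat_dim)
        left, _ = self.left_lstm(self.left_dense(x))             # (batch, time, 2*hidden)
        r = torch.relu(self.right_dense(x)).transpose(1, 2)      # (batch, hidden, time)
        r = torch.relu(self.right_conv(r)).transpose(1, 2)       # (batch, time, hidden)
        right, _ = self.right_lstm(r)                             # (batch, time, 2*hidden)
        fused = torch.cat([left, right], dim=-1).mean(dim=1)     # average-pool over time
        return self.classifier(fused)                             # (batch, num_emotions)
```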

10 pages, 732 KiB  
Article
Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification
by Zheying Huang, Pei Wang, Jian Wang, Haoran Miao, Ji Xu and Pengyuan Zhang
Appl. Sci. 2021, 11(19), 9106; https://doi.org/10.3390/app11199106 - 30 Sep 2021
Cited by 5 | Viewed by 1840
Abstract
Recurrent Neural Network (RNN)-based attention models have been used for code-switching speech recognition (CSSR). However, due to the sequential computation constraint of RNNs, such models capture stronger short-range dependencies and weaker long-range dependencies, which makes it hard to switch languages immediately in CSSR. To deal with this problem, we first introduce the CTC-Transformer, which relies entirely on a self-attention mechanism to draw global dependencies and adopts connectionist temporal classification (CTC) as an auxiliary task for better convergence. Second, we propose two multi-task learning recipes, in which a language identification (LID) auxiliary task is learned in addition to the CTC-Transformer automatic speech recognition (ASR) task. Third, we study a decoding strategy that incorporates LID into the ASR task. Experiments on the SEAME corpus demonstrate the effectiveness of the proposed methods, achieving a mixed error rate (MER) of 30.95%. This is a relative MER reduction of up to 19.35% compared to the baseline RNN-based CTC-Attention system and of 8.86% compared to the baseline CTC-Transformer system. Full article
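
The multi-task recipe — attention-based ASR with CTC as an auxiliary task plus an LID auxiliary task — amounts to a weighted sum of losses. The sketch below shows one plausible combination; the interpolation weights are placeholders, not values from the paper.

```python
def multitask_loss(ctc_loss, attention_loss, lid_loss, ctc_weight=0.3, lid_weight=0.1):
    """Illustrative multi-task objective for a CTC-Transformer with an LID auxiliary task.
    Inputs are scalar losses (floats or framework tensors); the weights are assumptions."""
    # Interpolate the CTC and attention losses for the ASR task, then add the LID term.
    asr_loss = ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
    return asr_loss + lid_weight * lid_loss
```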

10 pages, 463 KiB  
Article
Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
by Guolun Sun, Zhihua Huang, Li Wang and Pengyuan Zhang
Appl. Sci. 2021, 11(19), 9056; https://doi.org/10.3390/app11199056 - 28 Sep 2021
Cited by 2 | Viewed by 1594
Abstract
Articulatory features have proved effective in speech recognition and speech synthesis. However, acquiring articulatory features has always been difficult and remains a research hotspot; a lightweight and accurate articulatory model is therefore of great value. In this study, we propose a novel temporal convolution network (TCN)-based acoustic-to-articulatory inversion system. The acoustic features are converted into a high-dimensional hidden feature map through temporal convolutions that take frame-level feature correlations into account. Meanwhile, we construct a two-part objective function combining the Root Mean Square Error (RMSE) of the predictions and the Pearson Correlation Coefficient (PCC) of the sequences to jointly optimize the inversion model from both aspects. We further analyze the impact of the weight between the two parts on the final performance of the inversion model. Extensive experiments show that our TCN model outperforms the Bidirectional Long Short-Term Memory model by 1.18 mm in RMSE and 0.845 in PCC with 14 model parameters when weighting the RMSE and PCC terms evenly. Full article
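
The two-part objective combining RMSE and PCC can be written as a single differentiable loss; the sketch below, with an assumed weighting parameter, shows one way to do it and is not the paper's exact formulation.

```python
import torch

def joint_rmse_pcc_loss(pred, target, weight=0.5, eps=1e-8):
    """Weighted sum of RMSE and (1 - mean Pearson correlation), so that minimizing the
    loss lowers RMSE and raises PCC. `weight` balances the two terms and is an assumption.

    pred, target -- articulatory trajectories, shape (time, n_channels)
    """
    rmse = torch.sqrt(torch.mean((pred - target) ** 2) + eps)

    # Pearson correlation per articulatory channel, then averaged.
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    pcc = (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)

    return weight * rmse + (1.0 - weight) * (1.0 - pcc.mean())
```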

9 pages, 1103 KiB  
Article
Acoustic Word Embeddings for End-to-End Speech Synthesis
by Feiyu Shen, Chenpeng Du and Kai Yu
Appl. Sci. 2021, 11(19), 9010; https://doi.org/10.3390/app11199010 - 27 Sep 2021
Cited by 3 | Viewed by 2575
Abstract
The most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore the information about which word the phonemes come from. However, many words have a specific prosody type, which may significantly affect naturalness. Prior works have employed pre-trained linguistic word embeddings as TTS system input. However, since linguistic information is not directly relevant to how words are pronounced, the TTS quality improvement of these systems is modest. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations of naturalness demonstrate that incorporating acoustic word embeddings significantly outperforms both the pure phone-based system and a TTS system with pre-trained linguistic word embeddings. Full article
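
A minimal sketch of feeding jointly trained phone and word embeddings to a TTS encoder is shown below: each phone embedding is combined with the embedding of the word it belongs to. Vocabulary sizes, dimensions, and additive fusion are assumptions for illustration only.

```python
import torch.nn as nn

class PhoneWordEmbedding(nn.Module):
    """Sketch of combining phone embeddings with acoustic word embeddings as encoder input."""

    def __init__(self, num_phones=100, num_words=10000, dim=256):
        super().__init__()
        self.phone_emb = nn.Embedding(num_phones, dim)
        self.word_emb = nn.Embedding(num_words, dim)

    def forward(self, phone_ids, word_ids):
        # phone_ids: (batch, seq_len) phone indices
        # word_ids:  (batch, seq_len) index of the word each phone belongs to,
        #            i.e. the same word id is repeated for every phone of that word
        return self.phone_emb(phone_ids) + self.word_emb(word_ids)
```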

11 pages, 2350 KiB  
Article
Confidence Learning for Semi-Supervised Acoustic Event Detection
by Yuzhuo Liu, Hangting Chen, Jian Wang, Pei Wang and Pengyuan Zhang
Appl. Sci. 2021, 11(18), 8581; https://doi.org/10.3390/app11188581 - 15 Sep 2021
Viewed by 1332
Abstract
In recent years, the use of synthetic strongly labeled data, weakly labeled data, and unlabeled data has drawn much research attention in semi-supervised acoustic event detection (SAED). The classic self-training method makes predictions for unlabeled data and then selects predictions with high probabilities as pseudo-labels for retraining. Such models have shown their effectiveness in SAED. However, probabilities are poorly calibrated confidence estimates, and samples with low probabilities are ignored. Hence, we introduce a confidence-based semi-supervised acoustic event detection (C-SAED) framework. The C-SAED method learns confidence explicitly and retrains on all data, weighting each sample by its confidence. Additionally, we apply a power pooling function whose coefficient can be trained automatically, which uses weakly labeled data more efficiently. The experimental results demonstrate that the generated confidence is proportional to the accuracy of the predictions, and our C-SAED framework achieves a relative error rate reduction of 34% compared to the baseline model. Full article
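
The two ingredients highlighted above — retraining on all data with confidence as sample weights, and a power pooling function with a trainable coefficient for weakly labeled data — can be sketched as follows. The binary-cross-entropy form of the weighted loss and the particular pooling formula are assumptions; the paper's exact definitions may differ.

```python
import torch
import torch.nn as nn

def confidence_weighted_bce(frame_probs, pseudo_labels, confidence):
    """Illustrative retraining loss: every sample contributes, weighted by its confidence,
    instead of keeping only high-probability pseudo-labels."""
    bce = nn.functional.binary_cross_entropy(frame_probs, pseudo_labels, reduction="none")
    return (confidence * bce).mean()

class PowerPooling(nn.Module):
    """One common learnable power-pooling form for turning frame-level probabilities
    into a clip-level probability; the exponent n is trained automatically."""

    def __init__(self, init_n=1.0):
        super().__init__()
        self.n = nn.Parameter(torch.tensor(init_n))

    def forward(self, frame_probs):                 # (batch, time, classes)
        w = frame_probs.clamp(min=1e-8) ** self.n   # higher-probability frames get more weight
        return (frame_probs * w).sum(dim=1) / w.sum(dim=1)
```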

20 pages, 442 KiB  
Article
Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview
by Xiaojiao Chen, Sheng Li and Hao Huang
Appl. Sci. 2021, 11(18), 8450; https://doi.org/10.3390/app11188450 - 12 Sep 2021
Cited by 5 | Viewed by 3075
Abstract
Voice Processing Systems (VPSes), now widely deployed, have become deeply involved in people’s daily lives, helping to drive cars, unlock smartphones, make online purchases, etc. Unfortunately, recent research has shown that systems based on deep neural networks are vulnerable to adversarial examples, which has attracted significant attention to VPS security. This review presents a detailed introduction to the background of adversarial attacks, including the generation of adversarial examples, psychoacoustic models, and evaluation indicators. We then provide a concise introduction to defense methods against adversarial attacks. Finally, we propose a systematic classification of adversarial attacks and defense methods, with which we hope to give beginners in this field a better understanding of its classification and structure. Full article
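
As a concrete example of the adversarial-example generation that the review surveys, the following sketch applies a plain FGSM-style perturbation to a waveform. This is a generic textbook attack, not a method proposed in the paper; the perturbation budget and clamping range are assumptions.

```python
import torch

def fgsm_audio_example(model, waveform, target, loss_fn, epsilon=0.001):
    """Minimal FGSM-style adversarial perturbation of an audio waveform."""
    waveform = waveform.clone().detach().requires_grad_(True)
    loss = loss_fn(model(waveform), target)
    loss.backward()
    # Move each sample in the direction that increases the loss, within +/- epsilon.
    adversarial = waveform + epsilon * waveform.grad.sign()
    return adversarial.clamp(-1.0, 1.0).detach()
```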

12 pages, 657 KiB  
Article
A Pronunciation Prior Assisted Vowel Reduction Detection Framework with Multi-Stream Attention Method
by Zongming Liu, Zhihua Huang, Li Wang and Pengyuan Zhang
Appl. Sci. 2021, 11(18), 8321; https://doi.org/10.3390/app11188321 - 8 Sep 2021
Cited by 1 | Viewed by 1576
Abstract
Vowel reduction is a common pronunciation phenomenon in stress-timed languages such as English: native speakers tend to weaken unstressed vowels into a schwa-like sound. It is an essential factor that makes the accents of language learners sound unnatural. To improve vowel reduction detection within a phoneme recognition framework, we propose an end-to-end vowel reduction detection method that introduces pronunciation prior knowledge as auxiliary information. In particular, we design two methods for automatically generating pronunciation prior sequences from reference texts and implement a main-and-auxiliary encoder structure that uses hierarchical attention mechanisms to exploit the pronunciation prior information and acoustic information dynamically. In addition, we propose a post-encoding feature enhancement method that applies attention across different streams to obtain expanded multi-stream features. Compared with the HMM-DNN hybrid method and a general end-to-end method, the average F1 scores of our approach on the two types of vowel reduction detection increased by 8.8% and 6.9%, respectively, and the overall phoneme recognition rate increased by 5.8% and 5.0%, respectively. The experiments further analyze why the pronunciation prior auxiliary input is effective and how different types of pronunciation prior knowledge affect performance. Full article
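
One way to picture the multi-stream idea — attending from the acoustic stream to the pronunciation-prior stream and expanding the representation — is sketched below. The dimensions, the single cross-attention layer, and concatenation-based fusion are assumptions, not the paper's exact hierarchical design.

```python
import torch
import torch.nn as nn

class CrossStreamFusion(nn.Module):
    """Sketch of fusing a main (acoustic) encoder stream with an auxiliary
    (pronunciation-prior) stream via cross-attention."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, acoustic, prior):
        # acoustic: (batch, T_a, dim) main stream; prior: (batch, T_p, dim) auxiliary stream
        attended, _ = self.cross_attn(query=acoustic, key=prior, value=prior)
        # Expanded multi-stream representation: original stream plus the prior-attended stream.
        return torch.cat([acoustic, attended], dim=-1)   # (batch, T_a, 2*dim)
```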
