Artificial Intelligence-Based Audio Signal Processing

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Electronic Sensors".

Deadline for manuscript submissions: closed (31 December 2022) | Viewed by 20695

Special Issue Editors


Dr. Michele Scarpiniti
Guest Editor
Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana 18, 00184 Rome, Italy
Interests: audio signal processing and machine learning for signal processing, nonlinear adaptive filtering, blind signal processing, and fog computing

Prof. Dr. Jen-Tzung Chien
Guest Editor
Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
Interests: speech processing, machine learning, deep learning, signal processing, natural language processing, and computer vision

Prof. Dr. Stefano Squartini
Guest Editor
Department of Information Engineering, Università Politecnica delle Marche, 60121 Ancona, Italy
Interests: computational intelligence and digital signal processing, with special focus on speech/audio/music processing and energy management

Special Issue Information

Dear Colleagues,

Nowadays, artificial intelligence is widely used to tackle complex modelling, prediction, and recognition tasks in many research fields. The application of artificial intelligence methods to audio sensors has attracted great interest in the scientific community over the last decade, with a wide diversification of research topics depending on the nature of the “microphone” sensors under study (i.e., music, speech, sound). The focus is on suitably processing the audio streams, often acquired under harsh acoustic conditions, to extract the information contained therein and to create and control knowledgeable services. More recently, end-to-end computational models, which directly handle the raw acoustic data, and cross-domain approaches, which jointly exploit the information contained in heterogeneous audio sensors, have been widely adopted for this purpose. The aim of this Special Issue is therefore to present the most recent advances in the application of novel artificial intelligence algorithms to a wide range of audio sensing and processing tasks in real acoustic environments.

Potential topics include, but are not limited to:

  • Machine/Deep Learning for Audio Sensing and Processing
  • Cross-domain Audio Analysis
  • Deep Learning for Audio Applications in Real Acoustic Environments
  • Audio-based Security Systems and Surveillance
  • Speech and Audio Forensic Applications
  • Transfer Learning for Changing Environments
  • Big Data Audio Analysis
  • Separation and Localization of Real Recorded Audio Sources
  • Computational Acoustic Scene Understanding
  • Artificial Intelligence in Wireless Acoustic Sensor Networks
  • Context-aware Audio Interfaces
  • Microphone Array Signal Processing

This Special Issue covers machine/deep learning and artificial intelligence for sensing, signal processing, and data fusion in “microphone” sensor systems. Submissions are expected to be strongly cross-field and cross-domain contributions.

Dr. Michele Scarpiniti
Prof. Dr. Jen-Tzung Chien
Prof. Dr. Stefano Squartini
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (8 papers)

Research

17 pages, 931 KiB  
Article
CST: Complex Sparse Transformer for Low-SNR Speech Enhancement
by Kaijun Tan, Wenyu Mao, Xiaozhou Guo, Huaxiang Lu, Chi Zhang, Zhanzhong Cao and Xingang Wang
Sensors 2023, 23(5), 2376; https://doi.org/10.3390/s23052376 - 21 Feb 2023
Cited by 1 | Viewed by 1600
Abstract
Speech enhancement for audio with a low SNR is challenging. Existing speech enhancement methods are mainly designed for high-SNR audio and usually use RNNs to model audio sequence features, which prevents the model from learning long-distance dependencies and thus limits its performance in low-SNR speech enhancement tasks. We design a complex transformer module with sparse attention to overcome this problem. Unlike the traditional transformer model, this model is extended to effectively model complex-domain sequences: a sparse attention mask balances the model’s attention to long-distance and nearby relations, a pre-layer positional embedding module enhances the model’s perception of position information, and a channel attention module enables the model to dynamically adjust the weight distribution between channels according to the input audio. The experimental results show that, in low-SNR speech enhancement tests, our models achieve noticeable improvements in both speech quality and intelligibility.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
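
As a concrete illustration of the kind of masked attention this abstract describes, the following is a minimal sketch (not the authors' code) of a sparse attention mask that combines local and strided long-distance connections, applied to ordinary scaled dot-product attention; the window size, stride, and tensor shapes are illustrative assumptions.

```python
# Sparse attention mask balancing nearby and long-distance relations (illustrative sketch).
import torch
import torch.nn.functional as F

def sparse_attention_mask(seq_len, window=16, stride=64):
    """Boolean mask: True marks positions a query may attend to."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window   # nearby relations
    strided = (idx[None, :] % stride) == 0                   # long-distance anchor positions
    return local | strided

def sparse_attention(q, k, v, mask):
    # q, k, v: (batch, heads, seq_len, dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))        # forbid masked-out positions
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 256, 32)
out = sparse_attention(q, k, v, sparse_attention_mask(256))  # (1, 4, 256, 32)
```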

17 pages, 1178 KiB  
Article
Novel Speech Recognition Systems Applied to Forensics within Child Exploitation: Wav2vec2.0 vs. Whisper
by Juan Camilo Vásquez-Correa and Aitor Álvarez Muniain
Sensors 2023, 23(4), 1843; https://doi.org/10.3390/s23041843 - 07 Feb 2023
Cited by 9 | Viewed by 3197
Abstract
The growth in online child exploitation material is a significant challenge for European Law Enforcement Agencies (LEAs). One of the most important sources of such online information is audio material that needs to be analyzed to find evidence in a timely and practical manner. That is why LEAs require a next-generation AI-powered platform to process audio data from online sources. We propose the use of speech recognition and keyword spotting to transcribe audiovisual data and to detect the presence of keywords related to child abuse. The considered models are based on two of the most accurate neural architectures to date: Wav2vec2.0 and Whisper. The systems were tested under an extensive set of scenarios in different languages. Additionally, keeping in mind that data obtained from LEAs are highly sensitive, we explore the use of federated learning to provide more robust systems for the addressed application, while maintaining the privacy of the LEAs’ data. The considered models achieved a word error rate between 11% and 25%, depending on the language. In addition, the systems are able to recognize a set of spotted words with true-positive rates between 82% and 98%, depending on the language. Finally, federated learning strategies show that they can maintain and even improve the performance of the systems compared to centrally trained models. The proposed systems set the basis for an AI-powered platform for the automatic analysis of audio in the context of forensic applications concerning child abuse. The use of federated learning is also promising for the addressed scenario, where data privacy is an important issue to be managed.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
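
As an illustration of the transcription-plus-keyword-spotting pipeline outlined in this abstract, here is a minimal sketch using the open-source openai-whisper package; the audio file name and keyword list are placeholders, not the authors' data or lexicon.

```python
# Transcribe an audio file with a pretrained Whisper model, then flag segments
# containing keywords of interest (illustrative sketch, placeholder keywords).
import whisper

model = whisper.load_model("small")             # pretrained Whisper checkpoint
result = model.transcribe("interview.wav")      # returns full text plus timestamped segments

keywords = {"example", "keyword"}               # placeholder terms, not a real LEA lexicon
hits = [seg for seg in result["segments"]
        if any(kw in seg["text"].lower() for kw in keywords)]
for seg in hits:
    print(f'{seg["start"]:7.2f}s  {seg["text"].strip()}')
```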

20 pages, 5549 KiB  
Article
Non-Contact Vibro-Acoustic Object Recognition Using Laser Doppler Vibrometry and Convolutional Neural Networks
by Abdel Darwish, Benjamin Halkon and Sebastian Oberst
Sensors 2022, 22(23), 9360; https://doi.org/10.3390/s22239360 - 01 Dec 2022
Cited by 4 | Viewed by 2183
Abstract
Laser Doppler vibrometers (LDVs) have been widely adopted due to their many benefits in comparison to traditional contacting vibration transducers. Their high sensitivity, among other unique characteristics, has also led to their use as optical microphones, where the measured vibration of an object in the vicinity of a sound source acts as the microphone signal. Recent work enabling full correction of LDV measurements in the presence of sensor head vibration unlocks new potential applications, including integration within autonomous vehicles (AVs). In this paper, the common AV challenge of object classification is addressed by presenting and evaluating a novel, non-contact vibro-acoustic object recognition technique. This technique utilises a custom set-up in which a synchronised loudspeaker and scanning LDV remotely excite various objects with a periodic chirp and simultaneously record their responses. The 864 recorded signals per object were pre-processed into spectrograms of various forms, which were used to train a ResNet-18 neural network via transfer learning to accurately recognise the objects based only on their vibro-acoustic characteristics. A five-fold cross-validation optimisation approach is described, through which the effects of dataset size and pre-processing type on classification accuracy are assessed. A further assessment of the ability of the CNN to classify never-before-seen objects belonging to groups of similar objects on which it has been trained is then described. In both scenarios, the CNN obtained excellent classification accuracy of over 99.7%. The work described here demonstrates the significant promise of such an approach as a viable non-contact object recognition technique suitable for various machine automation tasks, for example, defect detection in production lines or loose rock identification in underground mines.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
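
The transfer-learning step described above can be sketched as follows: a pretrained torchvision ResNet-18 whose final layer is replaced for spectrogram classification. The number of object classes, input size, and the choice to freeze the backbone are illustrative assumptions, not details taken from the paper.

```python
# ResNet-18 transfer learning for spectrogram classification (illustrative sketch).
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                          # hypothetical number of object classes
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():                              # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable classification head

# Spectrograms are fed as 3-channel images, e.g. (batch, 3, 224, 224)
x = torch.randn(8, 3, 224, 224)
logits = model(x)                                         # (8, num_classes)
```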

13 pages, 1235 KiB  
Article
Automatic Recognition of Giant Panda Attributes from Their Vocalizations Based on Squeeze-and-Excitation Network
by Qijun Zhao, Yanqiu Zhang, Rong Hou, Mengnan He, Peng Liu, Ping Xu, Zhihe Zhang and Peng Chen
Sensors 2022, 22(20), 8015; https://doi.org/10.3390/s22208015 - 20 Oct 2022
Cited by 2 | Viewed by 1498
Abstract
The giant panda (Ailuropoda melanoleuca) has long attracted the attention of conservationists as a flagship and umbrella species. Collecting attribute information on the age structure and sex ratio of wild giant panda populations can support our understanding of their status and the design of more effective conservation schemes. In view of the shortcomings of traditional methods, which cannot automatically recognize the age and sex of giant pandas, we designed a SENet (Squeeze-and-Excitation Network)-based model to automatically recognize these attributes from their vocalizations. We focused on the recognition of age group (juvenile or adult) and sex. The reason for using vocalizations is that, among the modes of animal communication, sound has the advantages of long transmission distance, strong penetrating power, and rich information content. We collected a dataset of calls from 28 captive giant panda individuals, with a total duration of 1298.02 s of recordings. We used MFCCs (Mel-frequency cepstral coefficients), an acoustic feature, as inputs to the SENet. Considering that small datasets are not conducive to convergence during training, we enlarged the training data via SpecAugment. In addition, we used focal loss to reduce the impact of data imbalance. Our results showed that the F1 scores of our method for recognizing age group and sex reached 96.46% ± 5.71% and 85.85% ± 7.99%, respectively, demonstrating that the automatic recognition of giant panda attributes from their vocalizations is feasible and effective. This more convenient, fast, time-saving, and labor-saving attribute recognition method can be used in future surveys of wild giant pandas.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
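
Since the model is built on SENet, the following minimal sketch shows a generic squeeze-and-excitation block of the kind that architecture uses; the reduction ratio and tensor shapes are illustrative and not taken from the paper.

```python
# Generic squeeze-and-excitation (SE) block: channel descriptors are "squeezed" by
# global pooling and "excited" through a small bottleneck MLP that rescales channels.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, H, W)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))               # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)      # excite: per-channel weights in (0, 1)
        return x * w                         # rescale feature maps channel-wise

x = torch.randn(4, 64, 32, 32)               # e.g. feature maps computed from MFCC inputs
print(SEBlock(64)(x).shape)                  # torch.Size([4, 64, 32, 32])
```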

26 pages, 2460 KiB  
Article
Few-Shot Emergency Siren Detection
by Michela Cantarini, Leonardo Gabrielli and Stefano Squartini
Sensors 2022, 22(12), 4338; https://doi.org/10.3390/s22124338 - 08 Jun 2022
Cited by 6 | Viewed by 2853
Abstract
It is a well-established practice to build a robust system for sound event detection by training supervised deep learning models on large datasets, but audio data collection and labeling are often challenging and require considerable effort. This paper proposes a workflow based on few-shot metric learning for emergency siren detection performed in two steps: prototypical networks are trained on publicly available sources or synthetic data in multiple combinations, and at inference time, the best knowledge learned in associating a sound with its class representation is transferred to identify ambulance sirens, given only a few instances for the prototype computation. Performance is evaluated on siren recordings acquired by sensors inside and outside the cabin of an equipped car, investigating the contribution of filtering techniques for background noise reduction. The results show the effectiveness of the proposed approach, achieving AUPRC scores of 0.86 and 0.91 in unfiltered and filtered conditions, respectively, and outperforming a convolutional baseline model with and without fine-tuning for domain adaptation. Extensive experiments conducted on several recording sensor placements show that few-shot learning is a reliable technique even in real-world scenarios and provide valuable insights for developing an in-car emergency vehicle detection system.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
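
The inference step of a prototypical network, as outlined in this abstract, can be sketched as follows: class prototypes are the mean embeddings of a few support examples, and a query is scored by its distance to each prototype. The embedding dimension and shot counts below are illustrative assumptions.

```python
# Prototypical-network inference: nearest-prototype classification by squared
# Euclidean distance (illustrative sketch).
import torch

def prototypes(support, labels, n_classes):
    # support: (n_support, dim), labels: (n_support,)
    return torch.stack([support[labels == c].mean(dim=0) for c in range(n_classes)])

def classify(queries, protos):
    # queries: (n_query, dim), protos: (n_classes, dim)
    dists = torch.cdist(queries, protos) ** 2       # squared Euclidean distances
    return torch.softmax(-dists, dim=1)             # higher score = closer prototype

support = torch.randn(10, 128)                      # e.g. 5 "siren" + 5 "noise" embeddings
labels = torch.tensor([0] * 5 + [1] * 5)
queries = torch.randn(3, 128)
probs = classify(queries, prototypes(support, labels, 2))   # (3, 2)
```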

17 pages, 3539 KiB  
Article
Blind Source Separation Based on Double-Mutant Butterfly Optimization Algorithm
by Qingyu Xia, Yuanming Ding, Ran Zhang, Minti Liu, Huiting Zhang and Xiaoqi Dong
Sensors 2022, 22(11), 3979; https://doi.org/10.3390/s22113979 - 24 May 2022
Cited by 6 | Viewed by 1467
Abstract
The conventional independent component analysis method for blind source separation suffers from low separation performance. In addition, the basic butterfly optimization algorithm suffers from insufficient search capability. To solve these problems, an independent component analysis method based on the double-mutant butterfly optimization algorithm (DMBOA) is proposed in this paper. The proposed method employs the kurtosis of the signal as the objective function; by optimizing this objective function, blind source separation of the signals is realized. Building on the original butterfly optimization algorithm, DMBOA introduces a dynamic transformation probability and a population reconstruction mechanism to coordinate global and local search: when the optimization stagnates, the population is reconstructed to increase diversity and avoid falling into local optima. A differential evolution operator is introduced to mutate at the global position update, and a sine-cosine operator is introduced to mutate at the local position update, hence enhancing the local search capability of the algorithm. First, 12 classical benchmark test problems were selected to evaluate the effectiveness of DMBOA; the results reveal that DMBOA outperformed the other benchmark algorithms. DMBOA was then utilized for the blind source separation of mixed image and speech signals. The simulation results show that DMBOA can successfully realize the blind source separation of an observed signal and achieves higher separation performance than the compared algorithms.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
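
The kurtosis objective mentioned in this abstract can be sketched as follows; this is a generic ICA-style fitness function for a candidate unmixing vector, not the authors' implementation, and the butterfly-style optimizer itself is omitted.

```python
# Kurtosis-based objective for ICA-style blind source separation (illustrative sketch):
# an unmixing vector w extracts y = w @ X from whitened mixtures, and the magnitude of
# the excess kurtosis of y (its non-Gaussianity) is what the metaheuristic maximizes.
import numpy as np

def kurtosis_objective(w, X):
    """w: (n_sources,), X: whitened mixtures of shape (n_sources, n_samples)."""
    w = w / np.linalg.norm(w)                 # keep the unmixing vector unit-norm
    y = w @ X
    kurt = np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2
    return abs(kurt)                          # larger |kurtosis| = more non-Gaussian

rng = np.random.default_rng(0)
X = rng.laplace(size=(2, 5000))               # toy super-Gaussian "sources"
print(kurtosis_objective(rng.standard_normal(2), X))
```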

15 pages, 699 KiB  
Article
Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models
by Mohamed Nabih Ali, Daniele Falavigna and Alessio Brutti
Sensors 2022, 22(1), 374; https://doi.org/10.3390/s22010374 - 04 Jan 2022
Cited by 5 | Viewed by 2295
Abstract
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases speech quality from an intelligibility perspective, it tends to introduce distortions which deteriorate the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, optimizing a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches that rely on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shedding light on the most promising training approaches.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
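
The joint-training idea can be sketched with a combined loss through which gradients flow into both the enhancement front-end and the classifier; the stand-in modules and loss weights below are illustrative, whereas the actual system uses a Wave-U-Net front-end and a temporal convolutional classifier.

```python
# Joint training of a time-domain enhancement front-end and an intent classifier
# via a combined loss (illustrative sketch with placeholder modules).
import torch
import torch.nn as nn

enhancer = nn.Conv1d(1, 1, kernel_size=9, padding=4)        # placeholder front-end
classifier = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(1, 5))
optimizer = torch.optim.Adam(list(enhancer.parameters()) + list(classifier.parameters()))

noisy = torch.randn(8, 1, 16000)            # raw waveforms (batch, 1, samples)
clean = torch.randn(8, 1, 16000)
intent = torch.randint(0, 5, (8,))

enhanced = enhancer(noisy)
loss = 0.5 * nn.functional.l1_loss(enhanced, clean) \
     + 0.5 * nn.functional.cross_entropy(classifier(enhanced), intent)
loss.backward()                             # gradients flow into both components
optimizer.step()
```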

16 pages, 70469 KiB  
Article
Improved Swarm Intelligent Blind Source Separation Based on Signal Cross-Correlation
by Jiali Zi, Danju Lv, Jiang Liu, Xin Huang, Wang Yao, Mingyuan Gao, Rui Xi and Yan Zhang
Sensors 2022, 22(1), 118; https://doi.org/10.3390/s22010118 - 24 Dec 2021
Cited by 6 | Viewed by 2507
Abstract
In recent years, separating effective target signals from mixed signals has become a hot and challenging topic in signal research. SI-BSS (blind source separation (BSS) based on swarm intelligence (SI) algorithms) has become an effective method for linear-mixture BSS. However, SI-BSS suffers from incomplete separation, as not all of the signal sources can be separated. An improved algorithm for BSS with SI based on signal cross-correlation (SI-XBSS) is proposed in this paper. Our method creates a candidate separation pool that contains more separated signals than the traditional SI-BSS does, and it identifies the final separated signals by the minimum cross-correlation value in the pool. The SI-XBSS was compared with the traditional SI-BSS across six SI algorithms (Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Differential Evolution (DE), Sine Cosine Algorithm (SCA), Butterfly Optimization Algorithm (BOA), and Crow Search Algorithm (CSA)). The results showed that SI-XBSS achieved a considerably higher separation success rate, which was over 35% higher than that of the traditional SI-BSS on average. Moreover, the SI-SDR increased by 14.72 on average.
(This article belongs to the Special Issue Artificial Intelligence-Based Audio Signal Processing)
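
The selection step described above can be sketched as follows for the two-source case: from a candidate separation pool, pick the pair of signals with the smallest absolute cross-correlation, on the assumption that well-separated sources are the least correlated; pool size and signal length are illustrative.

```python
# Pick the least-correlated pair from a candidate separation pool (illustrative sketch).
import numpy as np
from itertools import combinations

def least_correlated_pair(pool):
    """pool: (n_candidates, n_samples) of zero-mean candidate separated signals."""
    best, best_score = None, np.inf
    for i, j in combinations(range(len(pool)), 2):
        score = abs(np.corrcoef(pool[i], pool[j])[0, 1])   # correlation coefficient
        if score < best_score:
            best, best_score = (i, j), score
    return best, best_score

rng = np.random.default_rng(1)
pool = rng.standard_normal((6, 4000))          # toy candidate separation pool
print(least_correlated_pair(pool))
```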
