Article

DFNet: Decoupled Fusion Network for Dialectal Speech Recognition

School of Digital and Intelligence Industry, Inner Mongolia University of Science and Technology, Baotou 014010, China
* Author to whom correspondence should be addressed.
Submission received: 6 May 2024 / Revised: 19 May 2024 / Accepted: 13 June 2024 / Published: 17 June 2024
(This article belongs to the Special Issue Complex Network Modeling in Artificial Intelligence Applications)

Abstract

Deep learning is often inadequate for achieving effective dialect recognition in situations where data are limited and model training is complex. Differences between Mandarin and dialects, such as the varied pronunciation variants and distinct linguistic features of dialects, often result in a significant decline in recognition performance. In addition, existing work often overlooks the similarities between Mandarin and its dialects and fails to leverage these connections to enhance recognition accuracy. To address these challenges, we propose the Decoupled Fusion Network (DFNet). This network extracts acoustic private and shared features of different languages through feature decoupling, which enhances adaptation to the uniqueness and similarity of these two speech patterns. In addition, we designed a heterogeneous information-weighted fusion module to effectively combine the decoupled Mandarin and dialect features. This strategy leverages the similarity between Mandarin and its dialects, enabling the sharing of multilingual information, and notably enhances the model’s recognition capabilities on low-resource dialect data. An evaluation of our method on the Henan and Guangdong datasets shows that the DFNet performance has improved by 2.64% and 2.68%, respectively. Additionally, a significant number of ablation comparison experiments demonstrate the effectiveness of the method.

1. Introduction

Dialects are unique linguistic variants that have developed within specific regions and are distinctive in terms of pronunciation, grammar, and language use. The formation of dialects is influenced by many factors, such as mother tongue background, ethnicity, social class, age, and even emotional and health status. Each dialect profoundly reflects the socio-historical and cultural characteristics of its location and plays an indispensable role in facilitating communication and cultural heritage in the local community. Dialect recognition technology is crucial in numerous modern applications, including linguistic research, speech-to-text conversion [1], forensic speech recognition, and various security-related speech systems [2]. This technology can help confirm the identity of the speaker and has a profound impact on advancing the field of speech recognition [3]. A more in-depth examination of the distinctive characteristics and difficulties linked to various dialects can enhance the accuracy of speech recognition systems in environments that involve multiple dialects.
The early days of speech recognition technology mainly revolved around simple systems based on rules and acoustic models. These systems heavily relied on hand-designed features and strict language models [4], and showed obvious limitations due to their low adaptability and limited recognition capabilities. With the rise in deep learning and the widespread application of neural networks, speech recognition technology has experienced a significant breakthrough. Deep learning enables speech recognition systems to automatically extract acoustic and linguistic features by learning complex patterns [5] from a large amount of data, significantly improving recognition accuracy and flexibility. End-to-end automatic speech recognition systems have achieved remarkable results, especially on resource-rich linguistic datasets like Mandarin. These systems simplify multiple steps in the traditional speech recognition process, such as the separation of the acoustic model, speech decoder, and language model, by directly converting audio data to text [6], making the entire recognition process more efficient. However, even state-of-the-art end-to-end systems face many challenges when dealing with language varieties such as dialects, where training data are scarce. Most speech recognition systems rely on standard Mandarin data for training, and therefore lack sufficient modeling capabilities for acoustic variants in dialects [7], leading to significant performance degradation in real-world applications. Existing research in the field of dialect recognition has mainly focused on processing different dialects independently, often overlooking the potential similarities between various dialects and between Mandarin and dialects. For example, Ref. [8] proposed an approach for multi-accent recognition using one-hot vectors. This method encodes various dialects into distinct numerical vectors to facilitate learning and recognition by machine learning models. A Gaussian Mixture Model (GMM) is a commonly used probabilistic model. Through maximum likelihood estimation or expectation maximization algorithms, GMM can estimate model parameters and cluster speech signals to classify similar speech features into the same component [9]. In deep neural networks, pre-trained models have been exposed to a vast amount of linguistic data. Consequently, they possess a profound understanding of the acoustic, rhythmic, and lexical features of speech signals [10,11,12]. These features play a crucial role in dialect recognition and can enhance the accuracy of such recognition. Although this strategy has been effective in recognizing specific dialects, it is generally limited by issues of data scarcity and inadequate training samples, particularly for dialects with fewer available resources. Most studies have developed acoustic models independently for each dialect. This approach results in weak model generalization ability, increases the complexity of the design, and limits the further improvement of the model performance to some extent. In fact, despite the significant differences between Mandarin and dialects in some aspects, a certain degree of phonological and semantic similarity is retained between them [13]. This similarity provides the model with the opportunity to use Mandarin data to assist in dialect recognition. Thus, by leveraging the similarity between these languages, acoustic feature extraction and speech recognition can be effectively performed in the absence of a sufficient number of dialect samples.
In order to overcome the limitations of existing dialect recognition techniques and to better utilize the commonalities between Mandarin and dialects, we designed a novel Decoupled Fusion Network (DFNet). The network first analyzes the acoustic feature data of Mandarin and dialects using a feature decoupled module, which maps them into different spaces—private feature space and shared feature space. The private feature space is dedicated to capturing those unique acoustic attributes in each dialect that are not shared with Mandarin, such as specific phoneme variations or speech rhythms. The shared feature space is responsible for extracting acoustic attributes that are common across dialects and Mandarin, such as the fundamental frequency of speech, shared vowels, and consonants. The decoupled features can easily distinguish the private and shared parts of acoustic features between Mandarin and various dialects. Subsequently, the heterogeneous information-weighted fusion module is utilized to effectively combine the decoupled shared features with the dialect features. This process strengthens the model’s understanding of the unique aspects of the dialects while also incorporating commonalities with Mandarin. The heterogeneous information-weighted fusion module optimizes the quality of the final feature representation by adaptively and dynamically adjusting the weights of the shared and private features. This ensures that the network maintains highly accurate recognition performance in variable language environments. After extensive testing on the datasets of Henan and Guangdong, DFNet demonstrates a significant advantage in reducing the Character Error Rate by 2.64% and 2.68% compared to previous approaches. Furthermore, the effectiveness of DFNet is further validated in a large number of ablation experiments. The contributions of this article can be summarized in the following three points:
  • Current studies on dialect recognition often overlook the inherent relationship between Mandarin and dialects at the acoustic and semantic levels, thereby constraining the enhancement of recognition accuracy. To cope with this problem, we have developed an innovative Decoupled Fusion Network (DFNet), which leverages common features in the extensive Mandarin dataset to improve the recognition of different dialects.
  • One of the core components of DFNet is the feature decoupled module. This module accurately identifies and separates the acoustic features unique to Mandarin and its dialects, as well as the features they share. This greatly enhances the model’s ability to capture dialect-specific phonemes while also revealing the similarities between different languages. Further, the heterogeneous information-weighted fusion module we designed effectively combines the decoupled Mandarin-shared features with dialect-specific features. This fusion enhances the model’s comprehension of dialect features.
  • Extensive experiments have verified the superior performance of DFNet in processing dialect data. In tests on multiple dialect datasets, our network achieves significant progress in reducing word error rate, especially when paired with the use of resource-rich Mandarin data. This pairing improves the performance of the Henan and Guangdong recognition tasks by 2.64% and 2.68%, respectively. In addition, the results of the ablation experiments further corroborate the importance of the individual components in the DFNet architecture, demonstrating the promising broad application of our approach in dealing with multi-dialect speech recognition.

2. Related Works

2.1. Dialect Speech Recognition

In the dialect speech recognition task, researchers have employed various approaches. In Gaussian Mixture Model–Hidden Markov Model (GMM-HMM)-based systems, dictionary adaptation methods [14] and multi-factor decision tree adaptation [15] were utilized. These methods made significant progress in dialect recognition; however, the relative complexity of their model components resulted in low efficiency in the training and decoding process. Deep neural networks (DNNs) are now the primary technique for acoustic modeling in automatic speech recognition (ASR), and their use as potent feature extractors can mitigate the performance deterioration observed when recognizing dialectal speech. Researchers in the literature [16] proposed a multi-dialect deep neural network acoustic model with a layered design. Among them, the top and bottom hidden layers were modeled for specific dialects and shared features, respectively. Specifically, the dialect-specific top layer was used to model various dialect-specific patterns extracted from a small amount of dialectal speech. This design allowed the model to better capture the differences and commonalities between individual dialects. Research in the literature [17] shows that an ASR model’s recognition of dialectal speech can be improved by encoding each dialect in a dialect space. This approach utilized an adaptive structure to extract dialect information and converted the input acoustic features into dialect-related features through linear combination. This approach was shown to be effective in dialectal speech recognition. However, a significant gap [18,19] remained between the recognition performance on Mandarin Chinese and on dialects in DNN-based ASR systems. In order to enhance dialect recognition performance, one approach was to improve adaptation to specific dialect data by fine-tuning the model [20,21,22,23,24]. According to the literature [25], better results were produced by fine-tuning a specific subset of layers (with significantly fewer parameters) than by fine-tuning the entire model. This could be achieved by fine-tuning the model with dialect-specific data during the training process. The goal of these methods was to enhance the model’s ability to adapt to different dialects by introducing dialect-related features or fine-tuning the model to improve the performance of dialectal speech recognition. However, these fine-tuning methods were limited by the quality of the underlying model or the amount of data available for fine-tuning. In addition to fine-tuning methods, multi-task learning has also been applied to dialectal speech recognition. Multi-task learning combined all dialect datasets into a single model and adjusted the training target by incorporating one-hot encoding at the end of the original sequence based on specific information [26]. Furthermore, by co-training dialect classifiers [27], it was possible to explicitly supervise multi-dialect acoustic models with dialect information. It was also possible to learn accent embeddings [28] and integrate them into a multitasking framework as auxiliary input. Another approach was the utilization of domain adversarial training (DAT), which had been implemented in the field of dialectal speech recognition.
By utilizing a gradient inversion layer, adversarial training empowered the acoustic model to extract dialect-independent features, thereby learning domain-invariant features to alleviate the mismatch problem. Adversarial training successfully addressed the domain adaptation problem in the field of computer vision [29,30].
Each of these approaches demonstrated some success in dialect speech recognition, but the performance of the multitask learning models on each dialect still lagged behind models that were independently fine-tuned on each dialect [31]. Therefore, researchers are exploring various methods and techniques to enhance the performance of dialect speech recognition.

2.2. Decoupled and Fusion Learning

Research has demonstrated the successful application of feature decoupled and weighted fusion techniques in tasks such as sentiment analysis [32], 3D point cloud data sparsity [33], and speaker recognition [34]. These techniques aim to enhance the performance and generalization ability of the models by leveraging the shared information across tasks. Typically, researchers processed lexical and sentiment information of natural language sentences separately and executed them in a pipelined manner. However, this serial processing may not have fully utilized the shared information between tasks. To address this problem, the literature proposed an interactive learning network that was able to pass useful information from different tasks back to a shared potential representation [35]. By combining this information with the shared potential representation, all tasks could collaborate to advance processing, thereby enhancing overall performance. In the literature [33], an end-to-end collaborative perception framework was introduced to address the problem of sparse 3D point cloud data. The framework captured proprietary and shared feature maps transmitted between different agents through feature decomposition. It determined the information to be delivered, resulting in a trade-off between perception performance and communication bandwidth. In the literature [36,37], the researchers introduced i-vectors and x-vectors as additional inputs to better distinguish between different dialects. These additional features provide more information about dialect differences and enhance the model’s ability to handle dialects. The goal of this approach was to compensate for the limitations of acoustic features in distinguishing different dialects and to provide more comprehensive information for better processing dialectal speech data. In summary, feature decoupled and weighted fusion techniques, along with interactive learning networks, lead to significant improvements in tasks such as sentiment analysis, 3D point cloud data sparsity, and speaker recognition. These methods enhance model performance, broaden application areas, and generate new ideas for solving real-world problems by leveraging the correlations and shared information between tasks.
Inspired by the above, we can explore how to utilize the feature decoupled module to capture the proprietary and shared features between dialect speech and Mandarin. The feature decoupled module allows each language to preserve its distinct acoustic characteristics by segregating the feature representations of dialectal speech and Mandarin, while also capturing the shared features between the two languages. When dealing with dialectal speech, the feature decoupled module helps extract the unique acoustic features of a dialect, such as specific variations in phonemes or speech rhythms. These features are crucial for dialect recognition to distinguish between different dialects. Additionally, the feature decoupled module captures common features between dialectal speech and Mandarin, such as the audio spectrum of speech and shared speech patterns. Similarly, when processing Mandarin, the feature decoupled module extracts the unique features of Mandarin, such as standard pronunciation and speech rhythm, while also capturing the common features between Mandarin and dialects, such as the shared audio spectrum and speech patterns. To better utilize these features, we can introduce a weighted fusion strategy. By weighting and fusing the dialect features with the common features of Mandarin, we obtain a richer and more comprehensive acoustic representation. The implementation of this weighted fusion technique ensures a proper balance between dialectal features and Mandarin features in the final representation. With these methods and strategies, we are better equipped to address the challenge of identifying low-resource dialects and enhance the performance of the dialect recognition system. This is crucial for addressing the challenges of dialect data scarcity and model training complexity. Moreover, this approach facilitates acoustic learning across different languages, enhancing the robustness and generalization of speech recognition systems in cross-linguistic environments. In conclusion, feature decoupled and weighted fusion techniques offer innovative solutions for addressing the dialect recognition problem. By capturing the proprietary and shared features of dialect speech and Mandarin and performing weighted fusion, we can better utilize the limited resources to enhance the performance of the dialect recognition system and tackle the challenges of dialect data scarcity and model training complexity. This is of great significance for the development and practical application of speech recognition technology.

3. Method

This section details our proposed Decoupled Fusion Network (DFNet), which aims to enhance the accuracy of dialect speech recognition. Specifically, Section 3.1 first describes the entire model framework. Section 3.2 describes the working principle of the decoupled feature module, which separates the proprietary and shared acoustic features of Mandarin and dialects. Then, Section 3.3 describes the heterogeneous information-weighted fusion module, which effectively fuses the decoupled features through dynamic weight adjustment to enhance the model’s performance in recognizing dialects. Section 3.4 introduces four decoding methods to achieve predictions, while Section 3.5 describes the overall optimization objective of the proposed method.

3.1. Model Framework

In our proposed dialect speech recognition system, we first preprocess the audio of the dialect and Mandarin languages and extract spectral features using an 80-dimensional log Mel-filter bank (FBANK) technique. The spectral features are then processed through a series of convolutional neural networks (CNNs) that include convolutional and pooling layers. These layers reduce the dimensionality of the features while extracting a higher-level and more stable feature representation. To encode positional information, capture temporal patterns, and represent semantic associations in the speech data, we utilize positional coding. This allows the model to be sensitive to the order of elements in the input sequence. We then utilize feature decoupling to separate speech features and refine the proprietary and shared features in the audio signal. This enables the model to distinguish unique acoustic attributes among different languages or dialects while maintaining common phonetic features, optimizing the model’s performance and adaptability to various types of dialect data. To enhance sparse dialectal features using Mandarin data, we introduce a feature-weighted fusion strategy. This strategy dynamically adjusts weights to combine dialect-specific features obtained from decoupling with the shared features of the dialect, and then integrates them with the shared features of Mandarin. The fusion process focuses on feature interactions to compensate for information scarcity caused by limited data, resulting in a more accurate and comprehensive feature representation. Finally, we design a joint Connectionist Temporal Classification (CTC) and attention decoder to process and transform the features into the final speech recognition result. The model utilizes both CTC and attention mechanisms to achieve accurate speech recognition. During the decoding stage, the model fully leverages the advantages of both methods. The overall architecture of the model is depicted in Figure 1, illustrating the complete processing flow from the raw audio signal to the final text output.
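To make the front end of this pipeline concrete, the following is a minimal PyTorch sketch of FBANK extraction, CNN sub-sampling, and additive positional encoding. The 80-dimensional FBANK features, the 25 ms/10 ms framing, and the two stride-2 3 × 3 convolutions follow the configuration described here and in Section 4.3; the class names, the sinusoidal (absolute) positional encoding, and the projection layer are illustrative assumptions, not the authors’ released implementation.

```python
import torch
import torch.nn as nn
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """80-dim log Mel-filter bank features with a 25 ms window and 10 ms shift."""
    return kaldi.fbank(waveform, num_mel_bins=80,
                       frame_length=25.0, frame_shift=10.0,
                       sample_frequency=float(sample_rate))   # (frames, 80)

class FrontEnd(nn.Module):
    """CNN sub-sampling (x4 in time) followed by additive sinusoidal positional encoding."""
    def __init__(self, n_mels: int = 80, d_model: int = 256, max_len: int = 5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)
        # Precomputed sinusoidal positional encoding table (an illustrative choice).
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:
        # fbank: (batch, frames, n_mels)
        x = self.conv(fbank.unsqueeze(1))                  # (B, C, T/4, F/4)
        b, c, t, f = x.shape
        x = self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        return x + self.pe[:t].unsqueeze(0)                # add positional information

# usage sketch: feats = FrontEnd()(extract_fbank(torch.randn(1, 16000)).unsqueeze(0))
```

The resulting frame-level features are then passed to the feature decoupled module (Section 3.2) and the weighted fusion module (Section 3.3).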

3.2. Feature Decoupled

In the context of dialectal speech recognition, it is crucial to understand and distinguish the unique acoustic properties of each dialect. To enhance the model’s ability to distinguish between dialects, we introduce the feature decoupled module. This module meticulously decomposes the acoustic properties present in the audio signal to generate unique and common features. The decoupled feature extraction process starts with the CNN sub-sampling step, which reduces the data’s dimensionality and enhances the feature abstraction level through convolution and pooling operations. Combined with positional coding, the resulting compressed feature map contains important time–frequency information of the speech signal and serves as input to the decoupled module. The decoupled module utilizes timing information to process the feature map through two separate paths. One approach focuses on extracting language-specific features, capturing the distinct articulatory and rhythmic patterns of the language. The other path is responsible for capturing generic acoustic features shared among languages, such as fundamental frequency and common phoneme structure. The advantage of this decoupled feature strategy is that it maximizes the utilization of the limited samples in the dialect dataset and enhances the model’s personalized understanding of the dialect through unique features. Simultaneously, by learning shared features, the model can infer the general characteristics of dialects with the assistance of the extensive Mandarin dataset, even with a limited number of dialect samples. This strategy enhances the processing performance of the dialect dataset and improves the model’s generalization ability. Please refer to Figure 2 for a visual representation of the feature decoupled module.
In terms of implementation, the feature decoupled module operates through two parallel attention mechanisms, each corresponding to proprietary and shared feature extraction. The proprietary features focus on language specificity using one attention network, while the shared features learn commonalities across languages using another attention network. The Sigmoid function is a typical nonlinear activation function that maps input features to the interval (0, 1) and has an “S” shape. Because the function saturates, the resulting channel weights are pushed toward 0 or 1, yielding a near-binary gating of the feature channels. This property can promote the decoupling between features, making each feature independent of other features to a certain extent. This helps to reduce redundant information between features and improve the generalization ability of the model. In addition, applying the Sigmoid function for weighting and activation can precisely adjust the contribution of each feature channel. As a result, two sets of feature representations are output to capture the proprietary and shared aspects of the languages, respectively. Assuming the features that have been sub-sampled by the CNN are denoted as $F_{\text{Raw}}$, the process can be represented by the following equation:
$$F_{\text{Flatten}} = \text{Flatten}(F_{\text{Raw}}), \quad F_{\text{MLP}} = \text{MLP}(F_{\text{Flatten}}), \quad F_{\text{Sigmoid}} = \sigma(F_{\text{MLP}}), \quad F_{\text{Exclusive/Shared}} = F_{\text{Sigmoid}} \odot F_{\text{Flatten}}$$
where $F_{\text{Raw}}$ represents the features after CNN sub-sampling, $F_{\text{Flatten}}$ represents the flattened features of a dialect or Mandarin utterance, $\sigma$ denotes the Sigmoid activation function, and $\odot$ denotes element-wise multiplication.
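A minimal sketch of how such a gating-based decoupler could be implemented is given below. The two parallel MLP-plus-Sigmoid branches mirror the equation above applied per frame; the hidden size, the per-frame treatment of the flattening step, and the class name FeatureDecoupler are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class FeatureDecoupler(nn.Module):
    """Sketch of the feature decoupled module: two parallel MLP branches with Sigmoid
    gating produce exclusive (private) and shared feature representations."""
    def __init__(self, d_model: int = 256, hidden: int = 512):
        super().__init__()
        self.exclusive_mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                           nn.Linear(hidden, d_model))
        self.shared_mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                        nn.Linear(hidden, d_model))

    def forward(self, f_raw: torch.Tensor):
        # f_raw: (batch, time, d_model); "Flatten" is treated per frame in this sketch
        gate_ex = torch.sigmoid(self.exclusive_mlp(f_raw))   # gate for the private branch
        gate_sh = torch.sigmoid(self.shared_mlp(f_raw))      # gate for the shared branch
        f_exclusive = gate_ex * f_raw                         # element-wise gating
        f_shared = gate_sh * f_raw
        return f_exclusive, f_shared
```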
Empirically, shared and proprietary features are expected to be complementary, as they collectively represent the complete language feature. This implies that there should be minimal overlap and similarity between them. To further enforce the independence of the decoupled features, we introduce a cosine-similarity-based constraint that drives the representations of shared and proprietary features apart. The process can be represented by the following equation:
$$\mathcal{L}_{\text{Contrastive}} = 1 - \frac{E_{\text{Shared}} \cdot E_{\text{Exclusive}}}{\lVert E_{\text{Shared}} \rVert \, \lVert E_{\text{Exclusive}} \rVert}$$
where $E_{\text{Shared}}$ and $E_{\text{Exclusive}}$ represent the embedding vectors of the common and exclusive features decoupled from the audio signal, respectively, and $\lVert\cdot\rVert$ denotes the vector norm. The fraction $\frac{E_{\text{Shared}} \cdot E_{\text{Exclusive}}}{\lVert E_{\text{Shared}}\rVert\,\lVert E_{\text{Exclusive}}\rVert}$ is the cosine similarity between the two embeddings, so $\mathcal{L}_{\text{Contrastive}}$ grows as the embeddings become more dissimilar. This loss function encourages the common and proprietary feature vectors to be as orthogonal as possible in the high-dimensional feature space, thereby enhancing the decoupling capability of the model. By maximizing $\mathcal{L}_{\text{Contrastive}}$, the model learns to clearly distinguish between shared and proprietary features during training, ensuring the complementarity of the decoupled features and the robustness of the model. In our task, we need to decouple dialect and Mandarin simultaneously. The loss for decoupling the dialect is denoted as $\mathcal{L}_{\text{Contrastive}}^{\text{Dialects}}$, and the loss for decoupling Mandarin is denoted as $\mathcal{L}_{\text{Contrastive}}^{\text{Mandarin}}$. Therefore, the total constraint loss after decoupling for dialect and Mandarin is
$$\mathcal{L}_{\text{Decoupled}} = \mathcal{L}_{\text{Contrastive}}^{\text{Dialects}} + \mathcal{L}_{\text{Contrastive}}^{\text{Mandarin}}$$
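The following short sketch implements the two equations above directly: a cosine-similarity term per language, summed over the dialect and Mandarin branches. Pooling each feature sequence to a single utterance-level embedding before computing the similarity is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_decoupling_loss(e_shared: torch.Tensor, e_exclusive: torch.Tensor) -> torch.Tensor:
    """1 minus the cosine similarity between shared and exclusive embeddings."""
    cos = F.cosine_similarity(e_shared, e_exclusive, dim=-1)   # (batch,)
    return (1.0 - cos).mean()

def decoupled_loss(dialect_pair, mandarin_pair) -> torch.Tensor:
    """Sum of the dialect and Mandarin contrastive terms; each argument is a
    (shared, exclusive) pair of utterance-level embeddings, e.g. time-averaged features."""
    return (contrastive_decoupling_loss(*dialect_pair)
            + contrastive_decoupling_loss(*mandarin_pair))
```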

3.3. Weighted Fusion of Heterogeneous Information

Based on the feature decoupled approach, we propose a heterogeneous information-weighted fusion strategy to enhance the accuracy of the dialect recognition system. This strategy aims to effectively integrate common and individualized features from dialects and Mandarin. Here, “heterogeneous” refers to the integration of data from various sources, and this integration process improves the personalized representation of dialects by dynamically adjusting the weights. It also utilizes the abundant Mandarin data to make up for the limited availability of dialect data. The fusion strategy consists of two key steps. The first step is to weight the features of the dialect data. In this step, we dynamically assign weights to each channel in the feature map using a weight tensor learned from a deep neural network. This process strengthens the features that are crucial for dialect recognition, enabling the model to concentrate more on discriminative information unique to each dialect. To enhance the efficiency of per-frame features in deep learning, pooling is an essential step. Pooling allows the model to selectively focus on frames from different channels by utilizing channel- and context-dependent statistics. This enables the model to weigh the more relevant frames based on the average trend of the overall features and the importance of key features. Pooling also helps reduce the dimensionality of the features, making the network easier to train, and facilitates the aggregation of spatial information. Currently, the average pooling approach is widely used because it extracts overall statistical information about various target features. However, maximum pooling can capture different representations of target features compared to average pooling. Therefore, in this paper, we utilize both the average pooling layer and the maximum pooling layer to acquire global statistical information about each channel. This results in two C-dimensional pooled feature maps, denoted as $F_{\text{avg}}$ and $F_{\text{max}}$. Next, $F_{\text{avg}}$ and $F_{\text{max}}$ are passed through two fully connected layers to learn the weights of the channels. This process generates two $1 \times 1 \times C$ channel attention maps. The results obtained from these two processes are summed, and the weights are normalized between 0 and 1 using a Sigmoid function, scaling each channel. Finally, the scaled channel features are multiplied by the original features to generate features $M_c$ with enhanced channel importance. The computation of the final feature $M_c$ can be expressed as follows:
$$M_c(F) = \sigma\big(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{\text{avg}})) + W_1(W_0(F^{c}_{\text{max}}))\big)$$
Here, $W_0$ and $W_1$ are the learnable weight matrices of the shared MLP applied to the average-pooled and max-pooled features, respectively, and $M_c$ denotes the final feature.
By combining the average and maximum pooling approaches and learning the channel weights through fully connected layers, this strategy enhances the importance of individual channels. The Sigmoid function normalizes the weights, and the multiplication with the original features generates the final features $M_c$ with enhanced channel importance. This pooling and channel attention mechanism enables the model to capture and emphasize the most relevant information from different channels, leading to more efficient and discriminative per-frame features.
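As an illustration, a channel-weighting step matching the equation above could look like the sketch below, with average and maximum pooling taken over the time axis and a shared two-layer MLP playing the role of $W_0$ and $W_1$. The reduction ratio and the class name ChannelWeighting are assumptions.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Sketch of the channel-weighting step: average- and max-pooled channel statistics
    pass through a shared two-layer MLP, are summed, squashed with a Sigmoid, and used
    to rescale the original feature channels."""
    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(                    # W0 then W1, shared by both branches
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, channels) per-frame features
        f_avg = feats.mean(dim=1)                    # average pooling over time
        f_max = feats.max(dim=1).values              # maximum pooling over time
        weights = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max))   # (batch, channels)
        return feats * weights.unsqueeze(1)          # channel-rescaled features M_c
```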
The second step involves merging the weighted features of the dialect data with the common features of Mandarin. Once the private features of Mandarin are discarded, we proceed to combine the weighted dialect features with the common Mandarin features. This fusion process combines the unique acoustic signals of the dialect with the common information present in Mandarin. To achieve this, we employ channel aggregation and feature splicing. By aggregating the channels, we synthesize the unique dialect-specific acoustic signals with the shared Mandarin information. This process effectively combines the exclusivity of dialects with the commonalities found in Mandarin, resulting in a feature representation that enhances the diversity of dialect features. By implementing these fusion strategies, the model can efficiently process frame-by-frame features and generate output features that are relevant to the dialect being spoken. The resulting feature representation retains the dialect-specific characteristics while incorporating the common aspects shared with Mandarin. Please refer to Figure 3 for a visual representation of the described process.
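A possible realization of this second fusion step is sketched below, under the assumption that the Mandarin shared features are pooled into an utterance-level context and concatenated channel-wise with the weighted dialect features before a linear projection. The paper does not prescribe these exact operations, so treat this purely as an illustration; ChannelWeighting refers to the class from the previous sketch.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of heterogeneous information-weighted fusion: channel-weighted dialect
    features are combined with the Mandarin shared features (Mandarin private features
    are discarded) and projected back to the model dimension."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.weighting = ChannelWeighting(d_model)   # class defined in the previous sketch
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, dialect_private, dialect_shared, mandarin_shared):
        # inputs: (batch, time, d_model); the Mandarin utterance need not be time-aligned
        d_priv = self.weighting(dialect_private)
        d_shar = self.weighting(dialect_shared)
        # broadcast a time-pooled Mandarin shared context over the dialect frames (assumption)
        m_ctx = mandarin_shared.mean(dim=1, keepdim=True).expand_as(d_priv)
        return self.proj(torch.cat([d_priv, d_shar, m_ctx], dim=-1))
```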

3.4. Decoding Method

In speech recognition, decoding is a crucial step that converts speech features into corresponding textual representations. It plays a vital role in speech recognition systems by mapping acoustic features (such as speech spectrograms) to text, enabling the transformation of speech signals into a usable and understandable form. The primary goal of decoding is to convert the speech signal into a text representation that can be further processed for tasks such as semantic understanding, text analysis, and various applications. Through the decoding process, we obtain a textual representation of the spoken input, enabling automation and the application of speech recognition tasks. There are two main decoding methods commonly used in speech recognition: methods based on Connectionist Temporal Classification (CTC) and methods based on the attention mechanism (Att). CTC decoding methods can be categorized into two types: greedy decoding and prefix beam search. Greedy decoding involves selecting the most probable output label at each time step, leading to a simple and direct decoding process. Prefix beam search, on the other hand, explores multiple hypotheses by maintaining a beam of the most likely sequences. Attention decoding methods, as the name suggests, utilize the attention mechanism. They can be further categorized into attention decoding and attention rescoring. Attention decoding incorporates the attention mechanism to dynamically align the input speech features with the output text, focusing on relevant parts of the speech signal during decoding. Attention rescoring refines the decoding results by re-evaluating and adjusting the scores of the generated hypotheses.
CTC Greedy Decoding: This method selects the label with the highest probability as the predicted output at each time step. It simplifies the decoding process by merging consecutive repeated labels and blank labels. While this method allows for quick prediction results, it heavily relies on the maximum probability output at each time step. As a result, it may not effectively utilize contextual information and can potentially lead to the selection of words with similar pronunciations, resulting in incorrect predictions.
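The collapse rule used by CTC greedy decoding (merge consecutive repeated labels, then drop blanks) can be stated in a few lines; the sketch below assumes a blank label id of 0 and frame-level log-probabilities of shape (time, vocabulary).

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank_id: int = 0) -> list:
    """CTC greedy decoding: pick the most probable label per frame, merge consecutive
    repeats, then remove blanks. log_probs: (time, vocab); blank_id=0 is an assumption."""
    best = log_probs.argmax(dim=-1).tolist()          # best label at each time step
    decoded, prev = [], None
    for label in best:
        if label != prev and label != blank_id:       # merge repeats, skip blanks
            decoded.append(label)
        prev = label
    return decoded
```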
CTC Prefix Beam Search: The CTC prefix beam search algorithm enhances CTC greedy decoding by retaining a set of the initial m prefix results with the highest probability (where m denotes the search width) and adjusting the probability of these prefixes at each time step. The method not only considers a single best path but also increases the diversity of search results and improves the decoding accuracy by merging intermediate results with the same prefix. Additionally, the algorithm can be used in conjunction with external language models to further enhance the quality of the decoded results.
Attention Decoding: In attention decoding, the decoder utilizes attention weights to dynamically focus on the portion of the input sequence that is most relevant to the current output position. It calculates the weights for each input position and generates the corresponding context vector, enabling the decoder to adjust to the processing requirements of long sequences. By incorporating the current hidden state, the decoder generates the output while considering the dynamically determined context information.
Attention Rescoring: Attention rescoring is a technique that enhances the results of the N-best hypotheses for CTC prefix beam search decoding. It utilizes the attention mechanism to re-evaluate the N-best decoding assumptions and combines it with Teacher-Forcing-Ratio parameter tuning to optimize the decoding performance. This approach reduces chaining errors caused by mistakes in previous outputs and prevents syntax errors resulting from excessive correction. In traditional speech recognition systems, Connectionist Temporal Classification (CTC) methods are popular for their simplicity and efficiency. However, CTC methods rely on independent labeling assumptions and may not effectively capture complex dependencies in speech. To address this limitation, we combine the attention mechanism with CTC decoding to leverage the advantages of both approaches. The attention mechanism enables the effective capture of long-distance dependencies by dynamically adjusting the focus during decoding. This enhances the model’s sensitivity to contextual information. Our combined approach first performs a fast initial decoding using CTC to generate a rough textual output. Subsequently, the output is refined and rescored using the attention mechanism to enhance decoding accuracy. During this process, the attention mechanism evaluates the correlation between each candidate output and the input features, favoring decoding paths that align well with the input data. By combining CTC and attention mechanisms, our speech recognition system efficiently processes large amounts of data. This integration also enhances the system’s capability to capture intricate details and achieve a comprehensive semantic understanding through attention-based re-evaluation. Subsequent experiments demonstrate that this combined approach outperforms using CTC or attention alone on multiple standard datasets, leading to significant improvements in recognition accuracy. The joint Connectionist Temporal Classification (CTC) and attention decoder framework utilizes both CTC and attention losses for training the network, with the loss function denoted as follows:
$$\mathcal{L}_{\text{Joint}}(x, y, \lambda) = \lambda\, \mathcal{L}_{\text{CTC}}(x, y) + (1 - \lambda)\, \mathcal{L}_{\text{Att}}(x, y)$$
In this section, we use the symbol $x$ to represent the acoustic feature and $y$ to represent the corresponding label. To balance the importance of the CTC loss $\mathcal{L}_{\text{CTC}}(x, y)$ and the attention decoder loss $\mathcal{L}_{\text{Att}}(x, y)$, we introduce the hyperparameter $\lambda$.
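In a PyTorch-style implementation, this joint objective could be assembled roughly as follows. The use of nn.CTCLoss for the CTC branch, token-level cross-entropy for the attention branch, the blank id, the padding id, and the example value of $\lambda$ are all illustrative assumptions, not the paper’s exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)   # blank id 0 is an assumption

def joint_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
               att_logits, att_targets, lam: float = 0.3, pad_id: int = -100):
    """Weighted sum of the CTC loss and the attention-decoder loss."""
    # CTC branch: log_probs must be (time, batch, vocab) for nn.CTCLoss
    l_ctc = ctc_loss_fn(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
    # Attention branch: per-token cross-entropy over the decoder outputs
    l_att = F.cross_entropy(att_logits.transpose(1, 2), att_targets, ignore_index=pad_id)
    return lam * l_ctc + (1.0 - lam) * l_att
```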

3.5. Loss Optimization Objective

Our DFNet utilizes the Adam optimizer [38] and updates the parameters through backpropagation to minimize the training loss. Our approach combines the joint CTC-attention loss with the feature decoupled loss to optimize the model’s performance. The expression for the total loss function is
$$\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{Joint}}(x, y, \lambda) + \mathcal{L}_{\text{Decoupled}}$$
Here, $\mathcal{L}_{\text{Joint}}$ represents the loss of the joint Connectionist Temporal Classification and attention mechanism, which combines the advantages of fast decoding and sensitivity to context. $\mathcal{L}_{\text{Decoupled}}$ is the feature decoupled loss that aims to enhance the model’s ability to distinguish and extract unique and shared acoustic features between dialects and Mandarin. The hyperparameter $\lambda$ inside $\mathcal{L}_{\text{Joint}}$ balances the CTC and attention loss terms. With this design, our loss function enables efficient and accurate decoding while also ensuring effective feature decoupling. This enhances the model’s adaptability to complex linguistic environments and improves recognition accuracy.

4. Experimental Setup

4.1. Introduction to the Dataset

The Mandarin dataset used in our study is the open-source Aishell-1 dataset [39]. It is a Chinese speech dataset that contains 178 h of speech data. The recordings were made by 400 Chinese speakers and cover 11 domains, including smart home technology and drones. The text in this dataset has a high accuracy rate and undergoes strict quality checks. The dataset is divided into training, development, and test sets, providing a solid foundation for academic research on Chinese speech recognition. Figure 4 illustrates the audio and labeled text of the Aishell-1 dataset.
The Henan dialect dataset is a collection of speech resources that gathers authentic recordings of the Henan dialect. The dataset contains a substantial amount of speech data recorded by local individuals in Henan, showcasing the distinctive flavor and phonetic characteristics of the Henan dialect. It covers various aspects such as daily speech and local culture. The dataset has undergone meticulous organization and proofreading, ensuring high-quality text transcriptions that guarantee data accuracy. This dataset provides us with a valuable resource to gain an in-depth understanding of the phonetic characteristics of the Henan dialect. It also plays a positive role in promoting the development of speech recognition technologies related to the Henan dialect. Figure 5 displays the audio recordings and labeled text from the Henan dialect dataset.
The Guangdong dialect dataset is a highly distinctive resource in the field of dialect recognition. It gathers a large number of high-quality Guangdong speech samples, covering a wide range of language scenarios from reading articles aloud to professional domains. The dataset has clear audio and precise text annotation, fully demonstrating the unique phonetic features and linguistic charm of Guangdong. This rich resource provides a solid foundation for researching Guangdong dialect recognition technology and also promotes the innovation and development of related technologies. Furthermore, it has a positive impact on the preservation and transmission of Guangdong as a significant local culture. The audio and labeled text of the Guangdong dialect dataset are shown in Figure 6:
In order to enhance the accuracy and efficiency of dialect recognition, we meticulously clean and preprocess the collected datasets from Henan and Guangdong. First, we conduct an initial screening of the original dataset to eliminate records containing invalid audio. Then, we employ a series of audio processing techniques to enhance the audio quality, such as noise reduction, de-reverberation, and volume equalization. These processes help reduce background noise and other interfering factors, making the speech signal clearer and easier to recognize. For each audio file, we check its corresponding tags. We have identified errors or inconsistencies in some of the labels, which could be attributed to manual labeling mistakes or issues in the data conversion process. To correct these errors, we recheck all the labels and make the necessary changes and updates. Then, for the outliers present in the dataset, we perform special treatment. These outliers may adversely affect model training, so we use appropriate statistical methods and business logic to identify and handle them. Finally, after completing the cleaning process, we obtained a 30-h usable Henan dataset and a 30-h usable Guangdong dataset. We validate both datasets to ensure that the cleaning process has the expected effect. We utilized a speech recognition model to conduct initial tests on the cleaned dataset and compare the recognition results before and after cleaning. The results show that the cleaned dataset has significantly improved speech recognition accuracy, laying a solid foundation for subsequent training and application of speech recognition models.

4.2. Dataset Setup

This section describes how the datasets from Henan and Guangdong were acquired and processed. Given the characteristics of low-resource dialects and experimental needs, we process the raw complex audio data by cleaning, truncating, and labeling. In dataset cleaning and preprocessing, we use frequency-domain analysis based on the Fourier transform to detect and eliminate background noise. For labeling errors, we implement a rule-based error detection mechanism and conduct manual reviews to calibrate and correct them. For samples that are difficult to identify, we adopt majority voting for labeling. For outliers, we use statistical indicators such as audio duration and silence percentage to eliminate samples with obvious anomalies. The cleaned dataset is fed back into the model training to continuously optimize the data quality. The cycle is iterated until satisfactory data quality and model performance are achieved. In order to prevent model overfitting, we employ the SpecAugment strategy [40], which involves maximum frequency masking and temporal masking operations. This approach enhances data diversity by randomly masking speech data in terms of frequency and time, thereby enhancing the model’s generalization capability. Detailed information can be found in Table 1.
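For reference, a minimal SpecAugment-style masking routine over an FBANK matrix might look like the sketch below; the numbers of masks and the maximum mask widths are illustrative assumptions, not the values used in the paper.

```python
import torch

def spec_augment(fbank: torch.Tensor, num_freq_masks: int = 2, freq_width: int = 10,
                 num_time_masks: int = 2, time_width: int = 50) -> torch.Tensor:
    """Random frequency and time masking on an FBANK matrix of shape (frames, mel_bins)."""
    x = fbank.clone()
    frames, bins = x.shape
    for _ in range(num_freq_masks):
        f = int(torch.randint(0, freq_width + 1, (1,)))
        f0 = int(torch.randint(0, max(bins - f, 1), (1,)))
        x[:, f0:f0 + f] = 0.0                          # frequency mask
    for _ in range(num_time_masks):
        t = int(torch.randint(0, time_width + 1, (1,)))
        t0 = int(torch.randint(0, max(frames - t, 1), (1,)))
        x[t0:t0 + t, :] = 0.0                          # time mask
    return x
```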
We conducted extensive experiments on two dialect datasets, Henan and Guangdong, to evaluate the effectiveness of our proposed DFNet model. In addition, to simulate automatic speech recognition for low-resource dialects, we also utilize the Aishell-1 dataset as training data for the non-dialect language. In our experiments, we divided the 30-h Henan and Guangdong datasets into a 24-h training set, a 3-h validation set, and a 3-h test set, following an 8:1:1 ratio. During the validation phase, the cleaned dataset can more accurately reflect the performance of the model in real-world applications. Meanwhile, by evaluating the model’s performance on the validation set, the data cleaning strategy can be further optimized to create a virtuous circle. This, in turn, enhances the accuracy and efficiency of the DFNet model in the dialect recognition task.

4.3. Model Configurations

We use an 80-dimensional FBANK as an acoustic feature extractor and augment it with one-dimensional pitch information. The specific reasons are as follows: According to the spectral characteristics of speech signals, the human audible frequency range typically falls between 20 Hz and 20 kHz. The FBANK technique can effectively capture the energy distribution characteristics of speech signals within this frequency band. For this reason, 80 dimensions are a common choice to better characterize the spectral properties of speech signals. This choice not only considers the perceptual range of the human auditory system but also effectively extracts the spectral characteristics of the speech signal using the FBANK technique, which can provide high-quality input features for the following dialect speech recognition task. Due to limitations in computational resources, high-dimensional features can lead to increased computational complexity and storage overhead, which may impact real-time performance. Having 80 dimensions is a relative compromise, which ensures the ability to describe features effectively while also taking into account the limitations of computational resources. These features are obtained using a window length of 25 ms and a step size of 10 ms. Then, we perform utterance-level cepstrum mean and variance normalization (CMVN) on the FBANK features to standardize the features. All models are trained or fine-tuned using the Adam optimizer, and a warm-up learning strategy is implemented in the first 25 iterations of training to optimize performance. In addition, we applied a label smoothing weight [41] of 0.1 and a dropout ratio of 0.2 to enhance model predictions and mitigate overfitting [42]. Our model utilizes a feature decoupled and heterogeneous information-weighted fusion structure within the Wenet framework [43], enabling the discrimination and integration of shared and unique features of Mandarin and its dialects. At the front end of the encoder, we utilize two convolutional sub-sampling layers with a 3 × 3 kernel size and a stride of 2. Additionally, there are 12 Conformer blocks as described in [42] with 2048 linear units in the feed-forward network, 256 model dimensions, 4 attention heads, and a CNN kernel size of 15. In the decoupled feature module, we initially transform the input $x$ to align with the flattening operation. Subsequently, we process it through two Multilayer Perceptron (MLP) branches and a Sigmoid activation function to extract both the shared and proprietary features of Mandarin or dialects. These features are differentiated by the cosine similarity function and then enter the weighted fusion module. The fusion weights are determined dynamically during learning, ultimately forming the output features of the model. The attention decoder consists of 6 Transformer blocks [44]. In the Mandarin task, the DFNet model selected 4231 Chinese characters as the modeling unit. For decoding, we use the Character Error Rate (CER) as the performance evaluation metric, which is calculated using the following formula:
$$\text{CER} = \frac{R + I + D}{N}$$
where $R$ represents the number of replacement (substitution) errors, $I$ represents the number of insertion errors, $D$ represents the number of deletion errors, and $N$ represents the total number of characters in the correct labeling sequence.
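The CER can be computed from a standard Levenshtein alignment between the reference and hypothesis character sequences, as in the following sketch (the example strings are purely illustrative).

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER via Levenshtein edit distance: substitutions, insertions, and deletions
    divided by the reference length N."""
    ref, hyp = list(reference), list(hypothesis)
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[n][m] / max(n, 1)

# example: character_error_rate("今天天气", "今天天汽") -> 0.25
```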

5. Results

In this section, we present the experimental results of our proposed DFNet model in a dialect speech recognition task. First, we conduct a detailed comparison of the model’s performance on the Henan and Guangdong datasets by combining Mandarin data. This evaluation aims to assess its performance relative to other models in the field of speech recognition. Through ablation studies, we further validate the effectiveness of each component in DFNet. Finally, in Table 2, we conduct a performance comparison by including only the Mandarin dataset and not utilizing the proposed method. The results show that simply incorporating other datasets does not lead to performance improvement. Our method can effectively utilize the shared information of the Mandarin dataset, thus enhancing the recognition ability of the model.

5.1. Comparative Experiment

As can be seen from Table 3 and Table 4, DFNet achieves the best performance in both dialectal speech recognition tasks. Specifically, in the Henan dialect recognition task (Table 3), the Character Error Rate (CER) of DFNet is 11.69%, demonstrating a significant advantage over other models. For example, compared with the WeNet model, which has the closest performance, the Character Error Rate of DFNet is reduced by 2.64%. Compared with the latest Icefall model, DFNet shows an improvement of 4.4%.
In the Guangdong recognition task (Table 4), the Character Error Rate of DFNet is 21.79%, outperforming the other models. Although the recognition error rate of all models on the Guangdong dataset is generally higher than that on the Henan dataset, which may be due to the more complex phonetic features of Guangdong speech itself, DFNet still demonstrates strong robustness and adaptability. This is particularly evident in the comparison with the second-place WeNet: DFNet reduces the error rate by 2.68% in Guangdong recognition.
These results demonstrate that DFNet, with its innovative decoupled and fusion strategies, can significantly enhance recognition accuracy in automatic speech recognition tasks involving complex dialects. Especially on the low-resource dialect dataset, DFNet significantly enhances the model’s understanding of dialect variants by leveraging both shared and proprietary features between Mandarin and dialects, thereby enhancing the overall recognition performance.

5.2. Ablation Experiment

To further investigate the role of each module in our proposed method, we conducted ablation experiments as shown in Table 5. Table 5 presents the results of the ablation study of the DFNet model under various experimental settings. Series A experiments involve Henan plus Mandarin, while series B experiments involve Guangdong plus Mandarin. We note that for Henan (series A), the full DFNet model (A1), which includes both decoupled and fusion modules, achieves the best performance in all tests. When the fusion module is removed (A2), there is a significant drop in performance, indicating that the fusion module is crucial for speech recognition in Henan. In the case without the decoupled module (A3), although the performance drops, it is not as significant as when the fusion module is removed. This may indicate that the model can still recognize the features of Henan to a certain extent even without decoupling. The complete DFNet model (B1) also performs optimally in the experiments for Guangdong (series B). The performance degradation is most significant in the experiment with the fusion module removed (B2), a result that is consistent with the series A experiments and confirms the importance of the fusion module. In contrast, the performance of experiment B3, which does not use decoupling at all and retains only the fusion module, is comparable to that of experiment B2, which retains only the decoupled module. This observation suggests that Guangdong recognition may rely more on other model features in this particular configuration.
Four decoding methods were used on our test set: attention decoding, CTC greedy search, CTC prefix beam search, and attention rescoring. From the two decoding methods, CTC greedy search and CTC prefix beam search, the prefix beam search usually provides better results. This suggests that utilizing the language model can improve the decoding performance. The traditional CTC-based decoding approach, although capable of handling input and output sequences of unequal length, may not fully leverage the information from the acoustic model in certain cases. This may lead to less accurate decoding results or poor performance in certain complex scenarios. The attention rescoring mechanism is introduced to solve this problem. The core idea is to reevaluate the candidate sequences during the decoding stage by incorporating the outputs of the acoustic model, the language model, and the information from the attention mechanism. In this way, sequences that are more compatible with the input speech will receive higher scores, thereby increasing the probability of being selected.
These results emphasize the importance of feature decoupling and information fusion in dialect speech recognition tasks, as well as the potential value of leveraging different decoding strategies. Especially when working with dialectal speech data that have limited resources, model complexity and precise feature processing are crucial for improving accuracy. By comparing the above experimental results, it is evident that utilizing feature decoupling and heterogeneous information-weighted fusion can significantly improve the performance of the DFNet model in terms of word error rate.

5.3. Add Mandarin Experiments

In Table 2, it is evident that C2 and C4, when compared to C1 and C3, demonstrate that merely augmenting the Mandarin dataset in a dialectal speech recognition task without a specific targeted strategy does not yield any improvement in dialect recognition. On the contrary, these Mandarin datasets may have the side effect of interfering with the model’s ability to capture and understand dialect features. The features in the Mandarin dataset do not exactly match those in the dialect dataset. If they are directly mixed, the model may be influenced by the Mandarin features, leading to a decline in performance in dialect recognition. The details are shown in Table 2.
However, the situation is completely different when we adopt the approach of feature decoupled and weighted fusion of heterogeneous information, as illustrated in A1 and B1 in Table 5. The DFNet model effectively leverages the additional information from the Mandarin dataset through a series of well-designed algorithms and structures. It is not only able to extract the features shared between the Mandarin data and the dialect dataset, but also capable of fusing and coordinating these features with those in the dialect data, thereby enhancing the model’s ability to recognize dialects. Through the processing of the DFNet model, the Mandarin dataset ceases to be a hindrance and instead becomes a tool to improve the model’s performance. By utilizing the auxiliary dataset in this manner, the predictive ability of the model can be enhanced. This approach also increases the model’s robustness and flexibility, enabling it to better adapt to variations and changes across different languages.
Therefore, simply adding Mandarin datasets is not sufficient for the dialect speech recognition task. Appropriate methods and models must be adopted to effectively utilize the information from these additional datasets and improve the performance of the model. The successful implementation of the DFNet model serves as a good example of how to effectively utilize additional linguistic resources to enhance recognition outcomes while preserving dialect characteristics.

5.4. Feature Visualization Experiment

We now characterize the dialect and Mandarin features more intuitively, using red to indicate private features and blue to indicate shared features. Comparing Figure 7a,c, it is clear that before feature decoupling these features are intertwined with no clear distinction. This mixed state makes it difficult for the model to isolate the most essential information when extracting and interpreting linguistic features.
After passing the features through the decoupling module, we successfully separate the private and shared features of the dialect. While decoupling the dialect, we apply the same procedure to Mandarin, allowing both languages to learn their shared features, as illustrated in Figure 7b,d. The advantage of this approach is that the shared information in Mandarin can be leveraged to mitigate the limited amount of dialect data: the rich Mandarin resources supply additional common feature information for the dialect model, making it more stable and accurate.
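To make this step concrete, the following minimal PyTorch sketch shows one way a decoupling head could project encoder outputs into shared and private subspaces; the simple linear projections and the feature dimension are illustrative assumptions, and the actual module in Figure 2 is more elaborate.

```python
import torch
import torch.nn as nn


class DecouplingHead(nn.Module):
    """Illustrative decoupling head: maps encoder frames to a shared subspace
    (features expected to be common to Mandarin and the dialect) and a private
    subspace (features specific to the input language)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.shared_proj = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Tanh())
        self.private_proj = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Tanh())

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) output of the acoustic encoder
        shared = self.shared_proj(x)
        private = self.private_proj(x)
        return shared, private
```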
During training, the common features of Mandarin give the model a stable recognition basis, serving as a benchmark and framework. Through extensive corpus learning and optimization, the model gradually masters the pronunciation patterns, intonation changes, vocabulary, and grammatical structure of Mandarin, enabling efficient recognition of Mandarin speech. The diversity of dialects, however, poses challenges: different dialects vary considerably in pronunciation, vocabulary, and grammar, and traditional speech recognition models frequently struggle with them. To address this, we make full use of the information in Mandarin, analyze and extract dialect features in depth, and incorporate them into the speech recognition model.
In Figure 8a, red indicates dialect features and blue indicates the shared Mandarin features. During fusion, we apply the heterogeneous information-weighted fusion technique and, through continued iteration and optimization, aim to bring the distributions of the Mandarin and dialect features closer together. We visualize the feature distributions at different epochs during training. At the 80th epoch, Figure 8b shows that the shared Mandarin features begin to gradually merge with the dialect features. By the 160th epoch, Figure 8c shows that most of the features have converged, and by the 240th epoch, Figure 8d shows that the two sets of features have almost completely converged. This process demonstrates the effective use of Mandarin information to enhance the dialect features.
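As an illustration of how such scatter plots could be produced, the sketch below projects frame-level features to two dimensions with t-SNE and colors them by group; the perplexity value, function name, and file naming are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_feature_distribution(dialect_feats: np.ndarray,
                              mandarin_shared_feats: np.ndarray,
                              epoch: int) -> None:
    """Project both feature sets to 2-D with t-SNE and plot them by color."""
    feats = np.concatenate([dialect_feats, mandarin_shared_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    n = len(dialect_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c="red", s=4, label="dialect features")
    plt.scatter(emb[n:, 0], emb[n:, 1], c="blue", s=4, label="Mandarin shared features")
    plt.legend()
    plt.title(f"Feature distribution at epoch {epoch}")
    plt.savefig(f"features_epoch_{epoch}.png", dpi=200)
    plt.close()
```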

5.5. Different Fusion Methods

Feature fusion combines features from different levels of abstraction under a specific strategy to form more comprehensive and representative feature representations; this can improve the performance and generalization ability of a model and help reveal the intrinsic structure and patterns of the data. Additive fusion (Add) (Figure 9a) is a parallel strategy that merges two feature vectors into one composite vector; for input features x and y it can be written as z = x + iy, treating the two inputs as the real and imaginary parts of a single composite representation. Concatenate (Figure 9b) is a serial strategy that directly joins two features: if the input features x and y have dimensions p and q, respectively, the output feature z has dimension p + q. Softmax attention (Figure 9c) computes softmax(QK^T)V; although effective, direct global attention is computationally expensive and can become a bottleneck. Linear attention (Figure 9d) enables interaction between different modalities or features and helps the model capture the structure and context of the data, but it still involves many matrix operations and weight calculations, which can be costly in resource-limited environments. In summary, each fusion method, whether additive fusion, concatenation, Softmax attention, or linear attention, has its own advantages and disadvantages. Figure 9a–d shows the details of these fusion methods.
Suppose the two input channels are $X_1, X_2, \ldots, X_c$ and $Y_1, Y_2, \ldots, Y_c$, where $K_i$ denotes the corresponding convolution kernel and $*$ denotes convolution. The single output channel of the Add operation is
$$Z_{\mathrm{Add}} = \sum_{i=1}^{c}\left(X_i + Y_i\right) * K_i = \sum_{i=1}^{c} X_i * K_i + \sum_{i=1}^{c} Y_i * K_i$$
Then, the single output channel of Concatenate is
$$Z_{\mathrm{Concatenate}} = \sum_{i=1}^{c} X_i * K_i + \sum_{i=1}^{c} Y_i * K_{i+c}$$
For ease of writing, we abbreviate Softmax attention and linear attention, respectively, as follows:
$$Z_{\mathrm{Softmax\text{-}att}} = \sigma\left(Q K^{T}\right) V \triangleq \mathrm{Attn}_{S}(Q, K, V)$$
$$Z_{\mathrm{Linear\text{-}att}} = \phi(Q)\, \phi(K)^{T} V \triangleq \mathrm{Attn}_{\phi}(Q, K, V)$$
where Q, K, and V represent the Query, Key, and Value matrices, respectively; σ represents the Softmax function; and ϕ is the mapping function in linear attention.
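For reference, the toy PyTorch sketch below implements the four baseline fusion operators compared here; the tensor shapes, scaling, and the elu-based feature map ϕ in linear attention are illustrative choices (normalization is omitted), and the proposed heterogeneous information-weighted fusion module is not reproduced.

```python
import torch
import torch.nn.functional as F


def add_fusion(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y                       # element-wise Add


def concat_fusion(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.cat([x, y], dim=-1)   # channel concatenation


def softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v            # sigma(Q K^T) V


def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    phi = lambda t: F.elu(t) + 1.0                  # positive feature map phi(.)
    return phi(q) @ (phi(k).transpose(-2, -1) @ v)  # phi(Q) (phi(K)^T V)
```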
We conducted a series of experiments on the Henan plus Mandarin and Guangdong plus Mandarin data to investigate the impact of the feature fusion method on the results. The outcomes are presented in Table 6 and Table 7.
In the evaluation of fusion methods on the Henan plus Mandarin dataset (Table 6), the fusion strategy has a significant impact on the final Character Error Rate (CER). The 'Add' and 'Concatenate' methods yield higher CERs of 19.11% and 19.43%, respectively, while 'Softmax Attention' and 'Linear Attention' perform better with CERs of 16.38% and 15.61%, indicating that attention mechanisms play a positive role in the fusion process on this dataset. Our proposed heterogeneous information-weighted fusion strategy achieves a CER of 11.69%, much lower than the other methods, showing that it is more effective at extracting and exploiting the correlations between Mandarin and the dialect. The results on the Guangdong plus Mandarin dataset (Table 7) show a similar trend: the CERs of 'Add' and 'Concatenate' are high at 30.16% and 28.84%, respectively, whereas 'Softmax Attention' and 'Linear Attention' achieve lower CERs of 25.69% and 24.27%. Again, the heterogeneous information-weighted fusion method performs best on the Guangdong dataset with a CER of 21.79%, further confirming its superiority in cross-language feature fusion. These experiments show that the heterogeneous information-weighted fusion strategy is crucial for improving the accuracy of dialect speech recognition, and its advantage over the other fusion strategies validates the design of the fusion module in DFNet.

5.6. Ways to Maximize Differences

In the process of feature decoupling, we need to maximize the difference between shared and exclusive features to ensure the independence of each feature. To investigate the impact of different difference-maximization loss constraints on feature decoupling, we conducted experiments on the Henan plus Mandarin and Guangdong plus Mandarin datasets, as shown in Table 8 and Table 9.
The experimental results show that the best decoupling is achieved when only cosine similarity is used as the constraint: it keeps the similarity between shared and private features low and thus ensures the independence of the two feature sets.
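A minimal sketch of such a cosine-similarity constraint is given below, assuming the shared and private features are frame-level tensors of the same shape; the reduction over batch and time is an illustrative choice.

```python
import torch
import torch.nn.functional as F


def decoupling_difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Penalize similarity between shared and private features.

    shared, private: (batch, time, feat_dim). Minimizing the mean absolute
    cosine similarity pushes the two representations toward orthogonality,
    keeping shared and private features independent.
    """
    cos = F.cosine_similarity(shared, private, dim=-1)  # (batch, time)
    return cos.abs().mean()
```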

5.7. Hyperparameter Analysis

During model training, each dataset and model requires its own set of hyperparameters, which can be regarded as adjustable variables of the model. In practice, the optimal hyperparameters are determined through repeated experiments, after which the best configuration is adopted; this process is called hyperparameter tuning. The choice of hyperparameters can strongly affect model behavior, touching on aspects such as the architecture, learning rate, and model complexity, so tuning is an essential step for achieving good performance and generalization. In this experiment, we investigate the effect of different values of the hyperparameter λ on the results; the outcomes are shown in Figure 10.
In summary, the results show that setting the hyperparameter λ to 0.3 yields the best performance.
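As an illustration, the sketch below shows one way λ could enter the training objective; we assume here that λ weights the feature-difference constraint against the recognition loss, which is a simplified reading rather than the exact DFNet formulation, and `train_and_evaluate` is a hypothetical helper.

```python
def total_loss(asr_loss: float, difference_loss: float, lam: float = 0.3) -> float:
    # lam trades the feature-difference constraint off against the ASR loss;
    # lam = 0.3 corresponds to the best setting observed in Figure 10.
    return asr_loss + lam * difference_loss


# A simple sweep over candidate values of lambda, as in the analysis of Figure 10:
# for lam in (0.1, 0.2, 0.3, 0.4, 0.5):
#     cer = train_and_evaluate(lam)   # hypothetical training/evaluation helper
```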

6. Discussion

In this paper, we address the scarcity of dialect data and the underuse of existing Mandarin datasets in dialect recognition tasks. We propose DFNet, an acoustic model based on feature decoupling and weighted fusion of heterogeneous information. The model first decouples the input Mandarin and dialect audio into private and shared features, and the independence of the two feature sets is ensured by a difference-maximization constraint. The shared Mandarin features are then combined with the dialect features through a heterogeneous information-weighted fusion module, which enhances the expressiveness of the dialect features and alleviates the sparsity of dialect data. The experimental results show that, compared with the baseline model, our scheme improves dialect recognition accuracy by 2.64% and 2.68% on the test sets of the Henan-Mandarin and Guangdong-Mandarin datasets, respectively, approaching the current state-of-the-art performance. A large number of ablation and comparison experiments further demonstrate the effectiveness of our method.
We have alleviated this problem through feature decoupling and weighted fusion of heterogeneous information, but there is still room for improvement. In future work, we will consider additional data augmentation methods, such as mixup-style augmentation.

Author Contributions

Q.Z. defined the research questions, formulated the methodology, performed data analysis, drafted the initial manuscript, acquired funding, and reviewed and edited the manuscript. L.G. contributed to writing, reviewing, and editing; conducted formal analysis; and secured funding. L.Q. conducted formal analysis and secured funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62161041), and the Science and Technology Project of Inner Mongolia Autonomous Region (2021GG0046).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bhukya, S. Effect of gender on improving speech recognition system. Int. J. Comput. Appl. 2018, 179, 22–30. [Google Scholar] [CrossRef]
  2. Ramoji, S.; Ganapathy, S. Supervised I-vector modeling for language and accent recognition. Comput. Speech Lang. 2020, 60, 101030. [Google Scholar] [CrossRef]
  3. Singh, G.; Sharma, S.; Kumar, V.; Kaur, M.; Baz, M.; Masud, M. Spoken language identification using deep learning. Comput. Intell. Neurosci. 2021, 2021, 5123671. [Google Scholar] [CrossRef] [PubMed]
  4. Byrne, W.; Beyerlein, P.; Huerta, J.M.; Khudanpur, S.; Marthi, B.; Morgan, J.; Peterek, N.; Picone, J.; Vergyri, D.; Wang, T. Towards language independent acoustic modeling. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, 5–9 June 2000; Volume 2, pp. II1029–II1032. [Google Scholar]
  5. Kumar, A.; Verma, S.; Mangla, H. A survey of deep learning techniques in speech recognition. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 179–185. [Google Scholar]
  6. Dong, L.; Zhou, S.; Chen, W.; Xu, B. Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin. arXiv 2018, arXiv:1806.06342. [Google Scholar]
  7. Kibria, S.; Rahman, M.S.; Selim, M.R.; Iqbal, M.Z. Acoustic analysis of the speakers’ variability for regional accent-affected pronunciation in Bangladeshi bangla: A study on Sylheti accent. IEEE Access 2020, 8, 35200–35221. [Google Scholar] [CrossRef]
  8. Deng, K.; Cao, S.; Ma, L. Improving accent identification and accented speech recognition under a framework of self-supervised learning. arXiv 2021, arXiv:2109.07349. [Google Scholar]
  9. Chen, J.; Wang, Y.; Wang, D. A feature study for classification-based speech separation at low signal-to-noise ratios. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1993–2002. [Google Scholar] [CrossRef]
  10. Shi, X.; Yu, F.; Lu, Y.; Liang, Y.; Feng, Q.; Wang, D.; Qian, Y.; Xie, L. The accented english speech recognition challenge 2020: Open datasets, tracks, baselines, results and methods. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6918–6922. [Google Scholar]
  11. Chandra, E.; Sunitha, C. A review on Speech and Speaker Authentication System using Voice Signal feature selection and extraction. In Proceedings of the 2009 IEEE International Advance Computing Conference, Patiala, India, 6–7 March 2009; pp. 1341–1346. [Google Scholar]
  12. Liu, Z.T.; Rehman, A.; Wu, M.; Cao, W.H.; Hao, M. Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Inf. Sci. 2021, 563, 309–325. [Google Scholar] [CrossRef]
  13. Zhu, C.; An, K.; Zheng, H.; Ou, Z. Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 1034–1041. [Google Scholar]
  14. Chen, M.; Yang, Z.; Liang, J.; Li, Y.; Liu, W. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 3620–3624. [Google Scholar]
  15. Nallasamy, U.; Metze, F.; Schultz, T. Enhanced polyphone decision tree adaptation for accented speech recognition. In Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012. [Google Scholar]
  16. Jain, A.; Singh, V.P.; Rath, S.P. A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 779–783. [Google Scholar]
  17. Qian, Y.; Gong, X.; Huang, H. Layer-wise fast adaptation for end-to-end multi-accent speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2842–2853. [Google Scholar] [CrossRef]
  18. Seide, F.; Li, G.; Yu, D. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of the Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011. [Google Scholar]
  19. Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
  20. Sim, K.C.; Narayanan, A.; Misra, A.; Tripathi, A.; Pundak, G.; Sainath, T.N.; Haghani, P.; Li, B.; Bacchiani, M. Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 892–896. [Google Scholar]
  21. Seki, H.; Yamamoto, K.; Akiba, T.; Nakagawa, S. Rapid speaker adaptation of neural network based filterbank layer for automatic speech recognition. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 574–580. [Google Scholar]
  22. Chen, X.; Meng, Z.; Parthasarathy, S.; Li, J. Factorized neural transducer for efficient language model adaptation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 8132–8136. [Google Scholar]
  23. Bansal, S.; Kamper, H.; Livescu, K.; Lopez, A.; Goldwater, S. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv 2018, arXiv:1809.01431. [Google Scholar]
  24. Zuluaga-Gomez, J.; Ahmed, S.; Visockas, D.; Subakan, C. CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice. arXiv 2023, arXiv:2305.18283. [Google Scholar]
  25. Shor, J.; Emanuel, D.; Lang, O.; Tuval, O.; Brenner, M.; Cattiau, J.; Vieira, F.; McNally, M.; Charbonneau, T.; Nollstadt, M.; et al. Personalizing ASR for dysarthric and accented speech with limited data. arXiv 2019, arXiv:1907.13511. [Google Scholar]
  26. Li, B.; Sainath, T.N.; Sim, K.C.; Bacchiani, M.; Weinstein, E.; Nguyen, P.; Chen, Z.; Wu, Y.; Rao, K. Multi-dialect speech recognition with a single sequence-to-sequence model. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4749–4753. [Google Scholar]
  27. Na, H.J.; Park, J.S. Accented speech recognition based on end-to-end domain adversarial training of neural networks. Appl. Sci. 2021, 11, 8412. [Google Scholar] [CrossRef]
  28. Jain, A.; Upreti, M.; Jyothi, P. Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2454–2458. [Google Scholar]
  29. Li, R.; Jiao, Q.; Cao, W.; Wong, H.S.; Wu, S. Model adaptation: Unsupervised domain adaptation without source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9641–9650. [Google Scholar]
  30. Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185. [Google Scholar] [CrossRef] [PubMed]
  31. Gaman, M.; Hovy, D.; Ionescu, R.T.; Jauhiainen, H.; Jauhiainen, T.; Lindén, K.; Ljubešić, N.; Partanen, N.; Purschke, C.; Scherrer, Y.; et al. A report on the VarDial evaluation campaign 2020. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain, 13 December 2020; pp. 1–14. [Google Scholar]
  32. Zhang, W.; Li, X.; Deng, Y.; Bing, L.; Lam, W. A survey on aspect-based sentiment analysis: Tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. 2022, 35, 11019–11038. [Google Scholar] [CrossRef]
  33. Yang, K.; Yang, D.; Zhang, J.; Wang, H.; Sun, P.; Song, L. What2comm: Towards communication-efficient collaborative perception via feature decoupling. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7686–7695. [Google Scholar]
  34. Sang, M.; Xia, W.; Hansen, J.H. Deaan: Disentangled embedding and adversarial adaptation network for robust speaker representation learning. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6169–6173. [Google Scholar]
  35. He, R.; Lee, W.S.; Ng, H.T.; Dahlmeier, D. An interactive multi-task learning network for end-to-end aspect-based sentiment analysis. arXiv 2019, arXiv:1906.06906. [Google Scholar]
  36. Pappagari, R.; Wang, T.; Villalba, J.; Chen, N.; Dehak, N. x-vectors meet emotions: A study on dependencies between emotion and speaker recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7169–7173. [Google Scholar]
  37. Lin, Y.; Wang, L.; Dang, J.; Li, S.; Ding, C. Disordered speech recognition considering low resources and abnormal articulation. Speech Commun. 2023, 155, 103002. [Google Scholar] [CrossRef]
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  39. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
  40. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
  41. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  42. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  43. Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv 2021, arXiv:2102.01547. [Google Scholar]
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  45. Gao, Z.; Li, Z.; Wang, J.; Luo, H.; Shi, X.; Chen, M.; Li, Y.; Zuo, L.; Du, Z.; Xiao, Z.; et al. Funasr: A fundamental end-to-end speech recognition toolkit. arXiv 2023, arXiv:2305.11013. [Google Scholar]
  46. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 173–182. [Google Scholar]
  47. Zhu, X.; Zhang, F.; Gao, L.; Ren, X.; Hao, B. Research on Speech Recognition Based on Residual Network and Gated Convolution Network. Comput. Eng. Appl. 2022, 58, 185–191. [Google Scholar]
  48. Yang, Y.; Shen, F.; Du, C.; Ma, Z.; Yu, K.; Povey, D.; Chen, X. Towards universal speech discrete tokens: A case study for asr and tts. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10401–10405. [Google Scholar]
Figure 1. The overall framework of DFNet.
Figure 2. Feature decoupling module ("*" denotes multiplication).
Figure 3. Heterogeneous information-weighted fusion module.
Figure 4. Aishell-1 dataset. (a) Selected audio catalogs from the Aishell-1 dataset. (b) An audio sample from the Aishell-1 dataset. (c) Examples of labels in the Aishell-1 dataset.
Figure 5. Henan dialect dataset. (a) Selected audio catalogs from the Henan dialect dataset. (b) An audio sample from the Henan dialect dataset. (c) Examples of labels in the Henan dialect dataset.
Figure 6. Guangdong dialect dataset. (a) Selected audio catalogs from the Guangdong dialect dataset. (b) An audio sample from the Guangdong dialect dataset. (c) Examples of labels in the Guangdong dialect dataset.
Figure 7. Dialect and Mandarin features before and after decoupling. (a) Dialect features before decoupling. (b) Dialect features after decoupling. (c) Mandarin features before decoupling. (d) Mandarin features after decoupling. (Red represents shared features; blue represents exclusive features.)
Figure 8. Visualization of the shared Mandarin features and the dialect features at different stages of convergence. (a) Features before fusion. (b) Feature fusion after 80 epochs. (c) Feature fusion after 160 epochs. (d) Feature fusion after 240 epochs. (Red represents dialect features; blue represents Mandarin features.)
Figure 9. Different fusion approaches. (a) Add. (b) Concatenate. (c) Softmax attention. (d) Linear attention.
Figure 10. The impact of different hyperparameter values on performance.
Table 1. Detailed information on the Aishell-1 Mandarin dataset, Henan dataset, and Guangdong dataset.
| Dataset | Total Duration (Hours) | Sampling Rate (Hz) | Style |
| Aishell-1 | 178 | 16,000 | reading |
| Henan | 30 | 16,000 | reading |
| Guangdong | 30 | 16,000 | reading |
Table 2. The impact of Mandarin on dialect recognition without employing any specific methodology. (× indicates that the corresponding method is not used; ↓ indicates that a lower character error rate is better.)
Test CER (%) ↓ is reported under four decoding methods.
| ID | Language | Decoupled | Fusion | Attention | Attention Rescoring | CTC Greedy Search | CTC Prefix Beam Search |
| C1 | Henan | × | × | 24.51 | 15.13 | 16.14 | 16.13 |
| C2 | Henan + Mandarin | × | × | 25.10 | 15.71 | 16.77 | 16.79 |
| C3 | Guangdong | × | × | 38.10 | 24.47 | 25.90 | 25.84 |
| C4 | Guangdong + Mandarin | × | × | 41.01 | 35.27 | 36.23 | 36.16 |
Table 3. Comparison of DFNet with other classical models on the combination of Henan dialect and Mandarin. (↓ indicates that a lower character error rate is better.)
| Model | CER (%) ↓ |
| WeNet [43] | 14.33 |
| FunASR [45] | 17.19 |
| DeepSpeech [46] | 20.98 |
| ResNet-GCFN [47] | 22.36 |
| Icefall [48] | 16.09 |
| DFNet (ours) | 11.69 |
Table 4. Comparison of DFNet with other classical models on the combination of Guangdong dialect and Mandarin. (↓ indicates that a lower character error rate is better.)
| Model | CER (%) ↓ |
| WeNet [43] | 24.47 |
| FunASR [45] | 26.19 |
| DeepSpeech [46] | 32.16 |
| ResNet-GCFN [47] | 35.44 |
| Icefall [48] | 24.61 |
| DFNet (ours) | 21.79 |
Table 5. Ablation experiments with DFNet. (× indicates that the corresponding method is not used, ✓ indicates that it is used; ↓ indicates that a lower character error rate is better.)
Test CER (%) ↓ is reported under four decoding methods.
| ID | Language | Decoupled | Fusion | Attention | Attention Rescoring | CTC Greedy Search | CTC Prefix Beam Search |
| A1 | Henan + Mandarin | ✓ | ✓ | 16.59 | 11.69 | 12.64 | 12.53 |
| A2 | Henan + Mandarin | ✓ | × | 20.42 | 13.92 | 14.99 | 14.88 |
| A3 | Henan + Mandarin | × | ✓ | 17.29 | 12.52 | 13.83 | 13.72 |
| A4 | Henan + Mandarin | × | × | 25.10 | 15.71 | 16.77 | 16.79 |
| B1 | Guangdong + Mandarin | ✓ | ✓ | 25.57 | 21.79 | 22.93 | 22.89 |
| B2 | Guangdong + Mandarin | ✓ | × | 27.57 | 23.26 | 24.73 | 24.68 |
| B3 | Guangdong + Mandarin | × | ✓ | 25.31 | 22.75 | 23.84 | 23.82 |
| B4 | Guangdong + Mandarin | × | × | 41.01 | 35.27 | 36.23 | 36.16 |
Table 6. Different ways of fusing Henan dialect and Mandarin features. (↓ indicates that a lower character error rate is better.)
| Fusion | CER (%) ↓ |
| Add | 19.11 |
| Concatenate | 19.43 |
| Softmax Attention | 16.38 |
| Linear Attention | 15.61 |
| Heterogeneous Information-Weighted Fusion (ours) | 11.69 |
Table 7. Different ways of fusing Guangdong dialect and Mandarin features. (↓ indicates that a lower character error rate is better.)
| Fusion | CER (%) ↓ |
| Add | 30.16 |
| Concatenate | 28.84 |
| Softmax Attention | 25.69 |
| Linear Attention | 24.27 |
| Heterogeneous Information-Weighted Fusion (ours) | 21.79 |
Table 8. Comparison of difference-maximization methods on Henan and Mandarin. (↓ indicates that lower similarity between shared and exclusive features is better.)
| Method | Shared and Exclusive Feature Similarity ↓ |
| L1 norm | 0.24 |
| L2 norm | 0.32 |
| KL divergence | 0.24 |
| Cosine similarity (ours) | 0.15 |
Table 9. Comparison of difference-maximization methods on Guangdong and Mandarin. (↓ indicates that lower similarity between shared and exclusive features is better.)
| Method | Shared and Exclusive Feature Similarity ↓ |
| L1 norm | 0.32 |
| L2 norm | 0.26 |
| KL divergence | 0.29 |
| Cosine similarity (ours) | 0.19 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

