Article

Multimodal Feature Fusion Method for Unbalanced Sample Data in Social Network Public Opinion

1 School of Cyber Security, Changchun University, Changchun 130022, China
2 School of Computer Science and Technology, Changchun University, Changchun 130022, China
3 Jilin Provincial Key Laboratory of Human Health Status Identification and Function Enhancement, Changchun 130022, China
4 School of Electronic Information Engineering, Changchun University, Changchun 130022, China
* Author to whom correspondence should be addressed.
Submission received: 30 May 2022 / Revised: 19 July 2022 / Accepted: 20 July 2022 / Published: 25 July 2022
(This article belongs to the Special Issue Security and Privacy in Large-Scale Data Networks)

Abstract

With the wide application of social media, public opinion analysis in social networks can no longer rely on text alone, because public opinion information now includes data of multiple modalities, such as voice, text, and facial expressions. Multi-modal emotion analysis has therefore become the current focus of public opinion analysis, and multi-modal emotion recognition of speech is one of the main factors restricting it. In this paper, an emotion feature retrieval method for speech is first explored, and the processing of sample-imbalanced data is then analyzed. By comparing different feature fusion methods for text and speech, a multi-modal feature fusion method for sample-imbalanced data is proposed to realize multi-modal emotion recognition. Experiments on two publicly available datasets (IEMOCAP and MELD) show that processing multi-modal data with this method yields good fine-grained emotion recognition results, laying a foundation for subsequent social public opinion analysis.

1. Introduction

Online public opinion gathers public views on social events and has a huge impact on the parties involved [1]; examples include the melamine incident in 2008 [2], the Wei Zexi incident in 2016 [3], and COVID-19 in 2020 [4]. The exposure of such events deals a heavy blow to the parties concerned, reveals many underlying problems, and can significantly affect people's safety and daily life. People's emotions change greatly as public opinion develops and spreads, which creates considerable difficulty for public opinion management bodies, including the World Health Organization and national governments [5]. Moreover, online public opinion is hard to analyze because of its complex content, the coexistence of truth and falsehood, and the ease with which it spreads [6]. As an important part of public opinion analysis, emotion recognition plays an important role in artificial intelligence and remains a challenging task even with the development of deep learning and natural language processing. The main reason is that emotion is expressed in many ways and forms, such as implicit emotion and dialogue emotion. Emotional features can be captured through different channels, such as speech features, video features, facial features, and EEG features [6,7,8,9]. Fine-grained multi-modal emotion recognition has therefore become one of the current hotspots in emotion analysis. Among the candidate modalities, speech is the easiest to obtain and the most widely used in daily communication, and speech signals carry a large number of emotional features, so speech-based emotion analysis is a core topic in multi-modal research [10,11,12].
In the field of multi-modal emotion analysis, many models have been proposed for different modalities. Wöllmer et al. [13] were the first to fuse audio and video modalities, using bidirectional long short-term memory networks for multi-modal emotion analysis; their experiments showed that the multi-modal results were superior to the single-modality results. Morency et al. [14] conducted validation experiments for multi-modality and demonstrated that a joint model integrating video, audio, and text features can effectively identify emotions in online videos. Poria et al. [15] proposed a new approach for multi-modal emotion analysis that collects emotion from web videos using audio, video, and text modalities as information sources. Soleymani et al. [16] proposed a multi-modal emotion data analysis framework to retrieve user opinions and sentiments from video content. Poria et al. [17] proposed an LSTM-based multi-modal analysis model that enables utterances to capture contextual information from their surroundings in the same video, thereby aiding emotion analysis. In 2018, Majumder et al. [18] proposed a novel hierarchical feature fusion strategy that first fuses two modalities and then fuses the third, providing a new direction for multi-modal feature fusion. More and more studies investigate and optimize different fusion methods, but the sample imbalance of the datasets themselves, as well as the single-modality emotion feature retrieval methods that restrict modal fusion, have not been analyzed.
Based on these problems, the main contributions of this paper are as follows:
(1) For the text and speech modalities, the paper first analyzes the speech emotion feature retrieval methods that restrict modal fusion in existing approaches, analyzes the emotion features in the speech signal, and then proposes the MA2PE speech feature retrieval method.
(2) For the problem of sample imbalance, several common processing methods are analyzed, and the SOM oversampling method is proposed.
(3) A fine-grained emotion recognition method for the text and speech modalities of sample-imbalanced data is proposed and validated on the IEMOCAP and MELD datasets, showing that it outperforms existing models.

2. Data Preprocessing and Model Architecture

2.1. Audio Feature Extraction Method–MA2PE

The traditional audio feature retrieval method, MFCC, obtains audio features with more or less loss in the treble or bass range. Moreover, because the audio segments in a dataset have different lengths, the generated feature sequences also differ in length and must be cut or padded before being fed into the model, which to a large extent introduces deletions or redundancies into the audio features [19,20,21].
Based on this, a new feature retrieval method, MA2PE, is proposed, which adopts a series of operations to retrieve and convert all speech data of different lengths into 8-dimensional feature vectors. The specific operation methods are as follows:
First, each audio clip is read at a 44.1 kHz sampling rate, and the mean and standard deviation of the absolute value of the audio time series are taken as the first two features. These statistics map signals of any length onto fixed-size values, integrating vectors of all lengths while still reflecting certain feature differences.
Second, because the energy of a speech signal is related to its tone, it can be used to detect elevated emotions such as anger and excitement, since the tone and loudness of the voice differ when a person is angry. Therefore, the following formula is used to compute the root mean square of the audio signal frame by frame, and the corresponding standard deviation is also calculated, to represent the tone and loudness characteristics of the speech signal:
E = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y[i]^{2}}
Here, n denotes the length of the time series and y[i] denotes the input signal.
Then, the ratio of mute (silent) frames to the total number of frames in the audio signal is taken as the fifth feature, because speech rate varies with emotional state, which changes the proportion of silence and is therefore informative for learning emotion features. In addition, the harmonic energy of the time-frequency signal is calculated as the sixth feature; harmonics change as people's emotions change, providing a further reference for fine-grained emotion analysis.
Finally, the waveform produced during articulation changes with mood, which is reflected in the pitch signal. There are many commonly used pitch detection algorithms, including the modified autocorrelation function method (MACF) and the normalized cross-correlation function method (NCCF). This paper adopts an autocorrelation algorithm based on center clipping of each frame, computed by the following formula:
res[n] = \begin{cases} y[n] - C_l, & y[n] \geq C_l \\ 0, & |y[n]| < C_l \\ y[n] + C_l, & y[n] \leq -C_l \end{cases}
where y[n] is the input signal and C_l is half of the mean value of the input signal. The center-clipped signal res[n] is then autocorrelated, and the resulting autocorrelation coefficient and its normalized value are added to the speech emotion features.
Through the above steps, the 8-dimensional MA2PE feature vector is obtained.
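As an illustration, the sketch below assembles such an 8-dimensional feature vector with librosa and NumPy (the paper does not name its audio toolkit). The silence threshold, the single analysis frame used for the autocorrelation, and the use of |y| when computing C_l are assumptions made for this sketch rather than the authors' exact settings.

```python
import numpy as np
import librosa  # assumed audio backend; the paper does not name a library


def ma2pe_features(path, sr=44100, silence_db=-40.0, frame_len=4096):
    """Sketch of an 8-dimensional MA2PE-style feature vector (Section 2.1)."""
    y, _ = librosa.load(path, sr=sr)          # fixed 44.1 kHz sampling rate
    abs_y = np.abs(y)

    f1 = abs_y.mean()                         # (1) mean of |y|
    f2 = abs_y.std()                          # (2) std of |y|

    rms = librosa.feature.rms(y=y)[0]         # frame-wise root mean square, Eq. (1)
    f3 = rms.mean()                           # (3) overall energy level
    f4 = rms.std()                            # (4) energy variability (tone/loudness changes)

    # (5) ratio of "mute" frames to all frames; the dB threshold is an assumption
    f5 = float(np.mean(librosa.amplitude_to_db(rms, ref=np.max) < silence_db))

    harmonic, _ = librosa.effects.hpss(y)     # harmonic / percussive separation
    f6 = float(np.mean(harmonic ** 2))        # (6) harmonic energy

    # (7)-(8) centre-clipped autocorrelation, Eq. (2); C_l taken as mean(|y|) / 2
    c_l = abs_y.mean() / 2.0
    clipped = np.where(y >= c_l, y - c_l, np.where(y <= -c_l, y + c_l, 0.0))
    frame = clipped[:frame_len]               # one analysis frame keeps the sketch cheap
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    f7 = float(ac[1:].max())                  # strongest non-zero-lag correlation
    f8 = f7 / (float(ac[0]) + 1e-12)          # normalised value

    return np.array([f1, f2, f3, f4, f5, f6, f7, f8], dtype=np.float32)
```

Because every clip, regardless of length, is reduced to the same eight statistics, no cutting or padding of feature sequences is needed downstream.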

2.2. Oversampling Processing Method Based on SOM

Traditional methods for handling sample imbalance include oversampling, undersampling, and reweighting [22,23,24]; traditional oversampling and undersampling simply copy or delete original samples. However, obtaining text features in this way causes repetition and waste of data features and contributes little to the learning of fine-grained emotion features, while multi-modal reweighting methods also have their own problems [25,26,27]. Therefore, this paper proposes the SOM oversampling method, an oversampling method based on TF-IDF synonymous substitution. It segments the text with a common word segmentation method, obtains keywords through TF-IDF, expands the data according to the needs of the text by finding multiple candidate replacement words in a large thesaurus, and substitutes them to generate new sample data. In addition, for the speech data corresponding to the original text, the moviepy library is used to modify the audio content of the generated samples without changing audio properties such as pitch. The new samples generated in this way differ from those produced by simple copying and are more diverse, so the model can learn more characteristics from them, which is very useful for analyzing fine-grained emotion features.
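As a rough illustration of the text side of SOM, the sketch below replaces top TF-IDF keywords with synonyms to generate new samples. The tiny THESAURUS dictionary and the top_k cut-off are hypothetical stand-ins for the large thesaurus and keyword selection described above; the moviepy-based audio modification is omitted.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-thesaurus standing in for the paper's large thesaurus.
THESAURUS = {
    "terrible": ["awful", "dreadful"],
    "happy": ["glad", "delighted"],
    "angry": ["furious", "irritated"],
}


def som_oversample(texts, n_new, top_k=3, seed=0):
    """TF-IDF synonym-substitution oversampling (SOM), text side only.

    For every new sample, pick a minority-class text, find its top-k
    TF-IDF keywords, and replace keywords that have thesaurus entries
    with a randomly chosen synonym.
    """
    rng = random.Random(seed)
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(texts)
    vocab = vec.get_feature_names_out()

    new_samples = []
    for i in range(n_new):
        src = i % len(texts)                          # cycle over the minority samples
        row = tfidf[src].toarray()[0]
        keywords = {vocab[j] for j in row.argsort()[::-1][:top_k]}
        tokens = texts[src].split()
        replaced = [
            rng.choice(THESAURUS[t.lower()])
            if t.lower() in keywords and t.lower() in THESAURUS
            else t
            for t in tokens
        ]
        new_samples.append(" ".join(replaced))
    return new_samples


# e.g. som_oversample(["I am so angry about this terrible news"], n_new=2)
```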

2.3. Modal Fusion Method

For modal fusion, fusion at the decision layer and fusion at the feature layer are both selected for comparative analysis and verification. The two corresponding fusion models are shown in Figure 1 and Figure 2.
Common decision-layer fusion methods include the voting method, the fuzzy integration method, and the D-S evidence reasoning method. However, these methods cannot adequately allocate weights according to the features of each modality [28]. Hence, this paper proposes a dynamic weight allocation method: the speech and text classification results are first given equal weights, the weight ratio is then adjusted iteratively by comparing the linearly weighted predictions against the correct results, and a suitable weight is finally obtained, as sketched below.
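A minimal sketch of this dynamic weight allocation is given below, assuming the two unimodal classifiers output class probabilities; the grid step and the accuracy criterion used for scoring are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np


def fuse_decisions(p_text, p_audio, y_true, step=0.05):
    """Decision-layer fusion with a simple dynamic weight search.

    p_text and p_audio are (n_samples, n_classes) class-probability
    outputs of the two unimodal classifiers; y_true holds integer labels.
    Starting from equal weights (w = 0.5), every linear weighting
    w * text + (1 - w) * audio on a coarse grid is scored against the
    correct results, and the best-performing weight is kept.
    """
    best_w, best_acc = 0.5, 0.0
    for w in np.arange(0.0, 1.0 + step, step):
        fused = w * p_text + (1.0 - w) * p_audio     # linear weighting plan
        acc = float(np.mean(fused.argmax(axis=1) == y_true))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```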

2.4. Model Framework

The multi-modal fine-grained emotion analysis model based on feature-layer fusion designed in this paper is mainly composed of four modules: a processing module for few-shot data, a text feature retrieval module, a speech feature retrieval module, and a multi-modal feature-layer fusion module. As shown in Figure 3, the model processes the text and speech modalities separately.
For the text modality, the text data expansion method is first used to amplify the few-shot classes in the sample-imbalanced data, and features are then retrieved from the amplified text.
For the speech modality, the corresponding audio data generation method is used to produce the audio matching the generated text, and audio features are then retrieved. Because the audio feature vector proposed in this paper is only 8-dimensional, element-wise or dot-product operations would unnecessarily complicate the features. Therefore, the feature layer is fused by simple concatenation (stitching), and the fused features are fed into the model for analysis to obtain the required results.

3. Parameter Setting and Result Analysis

3.1. Datasets

Due to the lack of emotional data for existing online discourse, we mainly use two multi-modal dialogue emotion datasets, MELD and IEMOCAP, for the experiments. The MELD dataset consists of about 13,000 utterances from 1433 dialogues of the TV series Friends and is divided into seven emotion categories, as shown in Table 1. The IEMOCAP dataset contains about 12 h of audiovisual data, including video, speech, facial motion capture, and text, covering six emotions (angry, happy, sad, fear, surprise, and neutral). This study focuses on the text and audio modalities; the corresponding data volumes are shown in Table 1.

3.2. Experiment Procedures

(1) For the few-shot classes, the oversampling method is used to generate the corresponding text and speech data. The specific generation results are shown in Figure 4.
(2) Text dialogues are split into tokens and each word is lowercased.
(3) For text processing, TF-IDF is used to obtain the weight matrix of the text.
(4) The audio sampling rate is set to 44,100 Hz.
(5) Audio features are retrieved using the MA2PE audio feature retrieval method.
(6) A new two-modality feature is generated by simply concatenating the generated speech and text features.
(7) The model is trained with the Random Forest algorithm and evaluated on the test set, and the results for each modality are calculated. A sketch of steps (3)-(7) is given below.
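As a concrete reading of steps (3)-(7), the sketch below combines TF-IDF text features with the ma2pe_features function sketched in Section 2.1, stitches the two modalities at the feature layer, and trains a Random Forest. The hyperparameters (e.g., n_estimators=300) are illustrative defaults, not the paper's tuned settings.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report


def run_fusion_pipeline(train_texts, train_wavs, y_train,
                        test_texts, test_wavs, y_test):
    """Feature-layer fusion baseline: TF-IDF text features concatenated
    with the 8-dimensional MA2PE audio features, classified by Random Forest."""
    vec = TfidfVectorizer(lowercase=True)             # steps (2)-(3)
    x_text_tr = vec.fit_transform(train_texts)
    x_text_te = vec.transform(test_texts)

    # steps (4)-(5): ma2pe_features is the sketch from Section 2.1 (44.1 kHz)
    x_audio_tr = csr_matrix(np.vstack([ma2pe_features(p) for p in train_wavs]))
    x_audio_te = csr_matrix(np.vstack([ma2pe_features(p) for p in test_wavs]))

    # step (6): simple stitching of the two modalities at the feature layer
    x_tr = hstack([x_text_tr, x_audio_tr])
    x_te = hstack([x_text_te, x_audio_te])

    # step (7): Random Forest training and evaluation on the test set
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(x_tr, y_train)
    print(classification_report(y_test, clf.predict(x_te)))
    return clf
```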

3.3. Experiment Results and Analysis

In this part, the experimental results are presented and analyzed. Table 2 and Table 3 present the results on the MELD and IEMOCAP datasets for text, speech, and multi-modal fusion, respectively. Because MELD and IEMOCAP differ in data volume and in the degree of sample imbalance, the two datasets are analyzed separately.

3.3.1. Experiment Results and Analysis of MELD Data Set

This paper first conducts a comparative evaluation on the MELD dataset; the experimental results are shown in Table 2, and the heat map of the results of the multi-modal fine-grained emotion analysis structure based on feature-layer fusion is shown in Figure 5. The benchmark models of this experiment are Text-CNN, HiGRU-sf [29], cMKL, bcLSTM [17], and DialogueRNN [15], listed in Table 2; their entries are the optimal published results for multi-modal fine-grained analysis of the MELD dataset. The other models, including SVC, LR, etc., are all applied after oversampling the samples. From the results in the table and the following analysis, we can draw the following conclusions:
Table 2. MELD dataset analysis result table.

| Methods | Model | Anger | Disgust | Fear | Joy | Neutral | Sadness | Surprise | acc | w_avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Text-CNN | text | 34.49 | 8.22 | 3.74 | 49.39 | 74.88 | 21.05 | 45.45 | | 55.02 |
| HiGRU-sf | text | 21.16 | 0 | 2 | 47.01 | 91.56 | 0.48 | 40.93 | | 58.58 |
| cMKL | text + audio | 39.50 | 16.10 | 3.75 | 51.39 | 72.73 | 23.95 | 46.25 | | 55.51 |
| bcLSTM | text | 42.06 | 21.69 | 7.75 | 54.31 | 71.63 | 26.92 | 48.15 | | 56.44 |
| | audio | 25.85 | 6.06 | 2.90 | 15.74 | 61.86 | 14.71 | 19.34 | | 39.08 |
| | text + audio | 43.39 | 23.66 | 9.38 | 54.48 | 76.67 | 24.34 | 51.04 | | 59.25 |
| DialogueRNN | text | 40.59 | 2.04 | 8.93 | 50.27 | 75.75 | 24.19 | 49.38 | | 57.03 |
| | audio | 35.18 | 5.13 | 5.56 | 13.17 | 65.57 | 14.01 | 20.47 | | 41.75 |
| | text + audio | 43.65 | 7.89 | 11.68 | 54.40 | 77.44 | 34.59 | 52.51 | | 60.25 |
| LR | audio | 21.47 | 0.00 | 0.00 | 20.71 | 27.54 | 27.79 | 19.67 | 21.05 | 17.11 |
| | text | 84.71 | 93.71 | 88.96 | 74.28 | 60.21 | 90.50 | 67.69 | 80.13 | 79.76 |
| | text + audio | 79.14 | 93.14 | 88.01 | 73.96 | 62.41 | 90.20 | 71.74 | 79.94 | 79.39 |
| | text + audio (back) | 87.76 | 93.75 | 89.24 | 77.65 | 70.71 | 92.98 | 63.37 | 82.32 | 82.60 |
| MLP | audio | 23.64 | 25.69 | 25.78 | 21.32 | 22.22 | 27.08 | 23.52 | 24.07 | 24.04 |
| | text | 89.68 | 96.03 | 92.81 | 79.95 | 84.91 | 93.97 | 68.14 | 84.4 | 85.69 |
| | text + audio | 95.10 | 95.95 | 97.94 | 86.02 | 82.40 | 94.23 | 81.80 | 90.23 | 90.24 |
| | text + audio (back) | 93.08 | 95.70 | 93.22 | 86.64 | 93.33 | 96.88 | 70.23 | 87.51 | 88.94 |
| MNB | audio | 20.03 | 0 | 0 | 100 | 1 | 7.2 | 1 | 7.86 | 22.73 |
| | text | 81.40 | 90.69 | 82.08 | 64.91 | 60.77 | 91.19 | 63.68 | 75.62 | 75.85 |
| | text + audio | 84.00 | 93.10 | 82.33 | 61.22 | 63.97 | 93.46 | 63.31 | 75.34 | 76.56 |
| | text + audio (back) | 83.32 | 92.89 | 83.49 | 65.56 | 76.16 | 93.82 | 65.12 | 77.47 | 79.02 |
| SVC | audio | 20.80 | 46.43 | 28.39 | 21.30 | 0.00 | 29.16 | 21.10 | 21.71 | 23.79 |
| | text | 90.24 | 95.32 | 88.96 | 75.48 | 67.71 | 93.17 | 70.24 | 82.59 | 82.71 |
| | text + audio | 89.87 | 94.31 | 89.37 | 77.18 | 68.79 | 92.65 | 71.43 | 83.02 | 83.07 |
| Our model | audio | 84.91 | 97.95 | 97.91 | 82.12 | 87.96 | 90.69 | 86.42 | 88.88 | 89.08 |
| | text | 87.74 | 94.74 | 90.93 | 80.01 | 77.46 | 93.44 | 71.46 | 84.08 | 84.50 |
| | text + audio | 93.80 | 99.44 | 99.32 | 86.14 | 96.77 | 97.88 | 89.83 | 93.73 | 94.13 |
For the text data: the accuracy of almost every emotion is higher than that of audio, probably because the features obtained from text are richer than those from audio. As shown in Figure 6, compared with the benchmark models, almost all oversampling-based methods are superior, and the multi-modal fine-grained emotion analysis method based on feature-layer fusion proposed in this paper gives better results and the best classification effect. The accuracy of the emotion categories is around 80% whether acc or w_avg is used as the evaluation index, and the classification result reaches 84.50%.
For the audio data: compared with the original audio feature retrieval method, the proposed one is more general and gives better results, and it is especially effective at preventing overfitting on sample-imbalanced data. As shown in Figure 7, the results of logistic regression, the multi-layer perceptron, and the support vector machine are relatively low, while ours are the highest: the overall classification result reaches 88.88%, and the accuracy for the disgust emotion reaches 97.95%. It is also found that this method judges negative emotions with high accuracy. Audio signals are hard to learn from and prone to overfitting, and commonly used speech feature retrieval methods such as MFCC produce features of varying lengths that must be cut or padded for the model, leading to redundant or lost feature information. Our feature retrieval method effectively avoids this problem while retaining the audio features, so the accuracy obtained from the MA2PE audio features is even higher than that obtained from text.
For the multi-modal results: modal fusion, whether at the feature layer or at the decision layer, is superior to any single modality. In the figure, text + audio (back) denotes decision-layer fusion, while text + audio denotes feature-layer fusion; the best fusion method improves the accuracy by almost 10%. Feature-layer fusion performs similarly to the weighted decision-layer fusion, but decision-layer fusion requires learning reasonable weights for the different modalities, which is very time-consuming. The concatenation-based feature-layer fusion not only retains the features of both text and speech but also adds little computation, because the speech features are only 8-dimensional; it is therefore the more appropriate choice. Compared with the baseline models, our model performs better on every fine-grained emotion category, and its final classification result reaches 94.13%, which is 33.88 percentage points higher than DialogueRNN, the best benchmark model. From Figure 8, we can also see that all models processed with the oversampling method outperform the benchmark models overall.

3.3.2. Experiment Results and Analysis of IEMOCAP Data Set

A comparative evaluation of the model is carried out on the IEMOCAP dataset; the experimental results are shown in Table 3, and the heat maps of the results of the multi-modal fine-grained emotion analysis structure based on feature-layer fusion are shown in Figure 9 and Figure 10. As shown in Table 3, the benchmark models of this experiment are HiGRU [29], HiGRU-sf [29], MemNet [30], cLSTM [17], TFN [31], MFN [32], CMU [33], and ICON [10]; their entries are the optimal published results for multi-modal fine-grained analysis of the IEMOCAP dataset. The other models, including SVC, LR, etc., are all applied after oversampling the samples. From the results in the table and the following analysis, we can reach the following conclusions:
Table 3. IEMOCAP dataset analysis result table.

| Method | Model | ang | hap | exc | sad | fru | neu | acc | w_avg |
|---|---|---|---|---|---|---|---|---|---|
| HiGRU | text | 64.12 | 39.86 | 62.21 | 78.37 | 60.37 | 62.50 | | 62.52 |
| HiGRU-sf | text | 71.18 | 51.75 | 62.88 | 70.20 | 61.68 | 64.84 | | 64.06 |
| LR | audio | 40.07 | 24.36 | 16.67 | 42.85 | 15.38 | 0 | 32.44 | 24.47 |
| | text | 72.71 | 69.36 | 69.89 | 61.78 | 49.03 | 45.37 | 64.01 | 62.94 |
| | text + audio | 73.18 | 73.66 | 73.35 | 68.77 | 51.87 | 51.65 | 67.82 | 66.71 |
| MLP | audio | 40.22 | 25.91 | 28.86 | 42.20 | 17.41 | 46.66 | 34.10 | 33.07 |
| | text | 80.32 | 88.61 | 77.19 | 60.53 | 55.37 | 50.60 | 71.43 | 71.37 |
| | text + audio | 77.66 | 89.64 | 84.92 | 82.12 | 58.41 | 54.33 | 77.94 | 76.48 |
| MNB | audio | 69.76 | 23.24 | 0 | 0 | 0 | 0 | 24.78 | 17.28 |
| | text | 73.58 | 64.17 | 65.35 | 57.35 | 49.64 | 46.24 | 61.47 | 60.59 |
| | text + audio | 70.97 | 60.27 | 69.16 | 67.54 | 50.64 | 55.87 | 63.05 | 62.79 |
| SVC | audio | 39.66 | 24.73 | 100 | 39.37 | 11.36 | 0 | 32.67 | 36.57 |
| | text | 75.29 | 75.33 | 69.02 | 68.60 | 50.00 | 46.69 | 67.62 | 66.08 |
| | text + audio | 73.22 | 83.18 | 77.02 | 68.89 | 50.85 | 49.61 | 70.51 | 69.07 |
| XGB | audio | 59.81 | 62.93 | 55.34 | 67.40 | 29.60 | 39.72 | 58.96 | 54.13 |
| | text | 70.16 | 42.78 | 71.30 | 59.61 | 47.66 | 45.80 | 51.46 | 55.91 |
| | text + audio | 55.20 | 58.89 | 73.67 | 63.31 | 47.39 | 66.40 | 60.16 | 60.68 |
| MemNet | text + audio + video | 67.1 | 24.4 | 65.2 | 60.4 | 68.4 | 56.8 | | 59.9 |
| cLSTM | text + audio + video | 70.0 | 25.5 | 58.8 | 58.6 | 67.4 | 56.5 | | 59.8 |
| TFN | text + audio + video | 69.1 | 23.2 | 63.1 | 58.0 | 65.5 | 56.6 | | 58.8 |
| MFN | text + audio + video | 72.3 | 24.0 | 64.3 | 65.6 | 67.9 | 55.5 | | 60.1 |
| CMU | text + audio + video | 67.6 | 25.7 | 69.9 | 66.5 | 71.7 | 53.9 | | 61.9 |
| ICON | text + audio + video | 68.2 | 23.6 | 72.2 | 70.6 | 71.9 | 59.9 | | 64.0 |
| Our model | audio | 57.97 | 71.67 | 59.89 | 66.30 | 38.55 | 41.76 | 61.62 | 57.88 |
| | text | 77.46 | 84.78 | 77.35 | 55.81 | 56.45 | 53.72 | 69.89 | 69.77 |
| | text + audio | 72.76 | 92.38 | 80.72 | 73.73 | 53.21 | 58.13 | 75.87 | 73.93 |
Analysis of the text: as shown in Figure 11, comparing the methods with and without oversampling, the results of the oversampling methods are greatly improved. Compared with the other oversampled models, the emotion analysis model based on feature-layer fusion proposed in this paper achieves a classification result of 67.77, outperforming the vast majority of models.
Analysis of the audio: because there is no audio-only benchmark model for comparison, we compare the different models under the same processing. As shown in Figure 12, the result of the emotion analysis model based on feature-layer fusion proposed in this paper is higher than that of the other models, such as SVC. Relatively speaking, our model is superior to the other models because of its better stability.
Analysis of multi-modal fusion: as shown in Figure 13, compared with the three-modality benchmark models, the fusion of our two modalities outperforms the fusion of three modalities and shows good results across the emotion categories. Compared with the other models under the same processing, the classification result of our model is basically on par.
However, by comparing the classification results on IEMOCAP and MELD, we find that the method has a very strong effect on the MELD dataset and only a good effect on IEMOCAP; relatively speaking, its classification effect there is weaker. We therefore analyze the differences between the two datasets. Comparing the statistics shown in Table 1, the MELD dataset is much larger and far more imbalanced; after data expansion, its per-class data volume is around 8000, as shown in Figure 4. The imbalance of the IEMOCAP data itself is not obvious, and after expansion its per-class volume remains around 1500. Since IEMOCAP has much less data and its samples are relatively balanced, it can be concluded that the multi-modal fine-grained emotion analysis model based on feature-layer fusion proposed in this paper works better on sample-imbalanced data.

4. Conclusions

This paper addresses the limitations of traditional speech emotion features in multi-modal emotion recognition and the drop in accuracy caused by imbalanced multi-modal sample data. After further study of speech emotion features and the sample imbalance problem, the MA2PE speech emotion feature retrieval method and the SOM oversampling method are proposed. MA2PE expresses emotion features better and improves the accuracy of emotion recognition by nearly 30%; SOM amplifies the minority samples and improves data utilization. Both are verified on two public datasets, MELD and IEMOCAP, and the results show that they achieve good performance.
Based on MA2PE and SOM, we propose a multi-modal fine-grained emotion analysis model based on feature-layer fusion. Combining the two methods, the model fuses text and speech features at the feature layer and feeds them into the classifier for analysis. From the results of the different methods on the two datasets, we conclude that this approach works well for multi-modal data with imbalanced samples, and the more imbalanced the data, the greater the benefit.
This paper mainly analyzes public emergencies in the speech and text modalities, but public opinion events also involve other modalities, such as video and facial expression, which likewise provide important guidance for event outcomes. Subsequent research will therefore pay more attention to public opinion data of more modalities.

Author Contributions

W.D. and J.Z. conceived the research ideas and carried out the research. W.D., J.Z. and L.S. were in charge of the calculations and experimental data. W.D., J.Z., W.Q., Z.K. and D.X. collected information and wrote the paper. T.A. suggested revisions and refinements to the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Science and Technology (grant numbers 20190201195JC, 20200601004JC, 20200301054RQ, and 20200404207YY), the Science and Technology Development Plan of Jilin Province (grant number 20200403120SF), the Natural Science Foundation of Jilin Province (grant number 20210101477JC), and the National Natural Science Foundation of China (grant number 61502052).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cai, M.; Luo, H.; Meng, X.; Cui, Y. Topic-Emotion Propagation Mechanism of Public Emergencies in Social Networks. Sensors 2021, 21, 4516.
2. China Dairy Products Found Tainted with Melamine. BBC News, 9 July 2010.
3. Death of Wei Zexi. Available online: https://en.wikipedia.org/w/index.php?title=Death_of_Wei_Zexi&oldid=1071405010 (accessed on 5 March 2022).
4. COVID-19. Available online: https://covid19.rs (accessed on 26 September 2021).
5. Fei, G.; Yue, C.; Yaxin, W.; Yueyi, L.; Zhiran, C. Emotional health status and social mentality of the Chinese general public during the 2019 novel coronavirus pneumonia pandemic. Sci. Technol. Rev. 2020, 38, 68–76.
6. Soleymani, M.; Pantic, M.; Pun, T. Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 2011, 3, 211–223.
7. Koromilas, P.; Giannakopoulos, T. Deep multimodal emotion recognition on human speech: A review. Appl. Sci. 2021, 11, 7962.
8. Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1359–1367.
9. Zheng, W.L.; Dong, B.N.; Lu, B.L. Multimodal emotion recognition using EEG and eye tracking data. In Proceedings of the 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014; pp. 5040–5043.
10. Hazarika, D.; Poria, S.; Mihalcea, R.; Cambria, E.; Zimmermann, R. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2594–2604.
11. Jiang, Q.; Chen, L.; Xu, R.; Ao, X.; Yang, M. A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6280–6285.
12. Mai, S.; Xing, S.; Hu, H. Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM With Channel-Aware Temporal Convolution Network. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1424–1437.
13. Wöllmer, M.; Metallinou, A.; Eyben, F.; Schuller, B.; Narayanan, S. Context-Sensitive Multimodal Emotion Recognition from Speech and Facial Expression using Bidirectional LSTM Modeling. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
14. Morency, L.P.; Mihalcea, R.; Doshi, P. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. In Proceedings of the 13th International Conference on Multimodal Interfaces, Alicante, Spain, 14–18 November 2011; Association for Computing Machinery: Stroudsburg, PA, USA, 2011; pp. 169–176.
15. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 527–536.
16. Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14.
17. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 873–883.
18. Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
19. Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching. IEEE Trans. Multimed. 2018, 20, 1576–1590.
20. Chi, P.H.; Chung, P.H.; Wu, T.H.; Hsieh, C.C.; Chen, Y.H.; Li, S.W.; Lee, H.Y. Audio ALBERT: A Lite BERT for Self-Supervised Learning of Audio Representation. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 344–350.
21. Hou, Y.; Yu, X.; Yang, J.; Ouyang, X.; Fan, D. Acoustic Sensor-Based Soundscape Analysis and Acoustic Assessment of Bird Species Richness in Shennongjia National Park, China. Sensors 2022, 22, 4117.
22. Zhou, Y.; Xie, H.; Fang, S.; Wang, J.; Zha, Z.; Zhang, Y. TDI TextSpotter: Taking Data Imbalance into Account in Scene Text Spotting. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 2510–2518.
23. Wang, Z.; Wu, C.; Zheng, K.; Niu, X.; Wang, X. SMOTETomek-Based Resampling for Personality Recognition. IEEE Access 2019, 7, 129678–129689.
24. Wasikowski, M.; Chen, X.W. Combating the Small Sample Class Imbalance Problem Using Feature Selection. IEEE Trans. Knowl. Data Eng. 2010, 22, 1388–1400.
25. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
26. Li, R.; Chen, H.; Feng, F.; Ma, Z.; Wang, X.; Hovy, E. Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 6319–6329.
27. Aye, Y.M.; Aung, S.S. Sentiment analysis for reviews of restaurants in Myanmar text. In Proceedings of the 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Kanazawa, Japan, 26–28 June 2017; pp. 321–326.
28. Song, X. Research on Multimodal Emotion Recognition Based on Text, Speech and Video. Master's Thesis, Shandong University, Shandong, China, 2019.
29. Jiao, W.; Yang, H.; King, I.; Lyu, M.R. HiGRU: Hierarchical Gated Recurrent Units for Utterance-level Emotion Recognition. arXiv 2019, arXiv:1904.04446.
30. Sukhbaatar, S.; Szlam, A.; Weston, J.; Fergus, R. End-To-End Memory Networks. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
31. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1103–1114.
32. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory Fusion Network for Multi-View Sequential Learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Palo Alto, CA, USA, 2018.
33. Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.P.; Zimmermann, R. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2122–2132.
Figure 1. Feature Layer Fusion Scheme for Fewer Samples.
Figure 2. Decision Level Fusion Approach for Few Shot.
Figure 3. Multi-modal fine-grained emotion classification model based on sample disequilibrium.
Figure 4. The representation of sample data volume after over-sampling in the MELD dataset (where 0-6 in the label represents the seven emotion labels: 'anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise').
Figure 5. Results Heat Map of MELD Data Set Fusion Model.
Figure 6. Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Text).
Figure 7. Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Speech).
Figure 8. Fine-grained Emotion Accuracy Rate of Different Methods of MELD Data Set (Multi-modalities).
Figure 9. Result Heat Map of Fusion Model of IEMOCAP Data Set (our model).
Figure 10. Speech Feature Heat Map of IEMOCAP Data Set (our model).
Figure 11. Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Text).
Figure 12. Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Audio).
Figure 13. Fine-grained Emotion Accuracy Rate Analysis of Different IEMOCAP Methods (Multi-modalities).
Table 1. IEMOCAP dataset and MELD dataset data volume.

| Emotion Category | IEMOCAP Total Data Volume | MELD Total Data Volume |
|---|---|---|
| ang | 1103 | 1607 |
| hap/joy | 595 | 2308 |
| sad/sadness | 1041 | 1002 |
| fear | 1084 | 358 |
| surprise | 1849 | 1636 |
| neutral | 1708 | 6436 |
| disgust | | 361 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
