1. Introduction
Voice is a significant means for humans to communicate with each other and to express their emotions, cognitive processes, and objectives. Humans produce distinct voices through a natural biological system in which the lungs exhale air that is converted into sound by several organs, including the lips, tongue, and teeth [
1]. One of the most important systems for voice recognition is the ear, which can differentiate between male and female voices based on properties such as loudness and frequency. One of the most significant properties of the voice is its sexual dimorphism, particularly in pitch, which is especially marked in human beings [
2]. Gender recognition is used in several applications such as human-to-machine interaction, automatic salutations, speech emotion recognition and sorting the phone calls via gender categorization [
3,
4]. According to the acoustic properties, information regarding voice can be acquired by several acoustic factors, such as in the spectral formant frequencies and perceptual relevance frequencies [
5,
6]. Software for voice recognition converts analog signals to digital signals, known as analog-to-digital conversion [
7]. To decode a signal, a computer needs a vocabulary or dictionary of syllables, as well as a way to compare the data to the signals. The speech patterns are stored on a hard disk and loaded into memory when the program runs. A comparator checks these patterns against the output of the analog-to-digital converter; this process is known as pattern recognition. Artificial intelligence (AI), machine learning, and deep learning have driven advances in speech recognition technology [
8,
9]. Machine learning plays a vital role in addressing problems in many fields, such as medicine, banking, and finance, and has been applied in many studies of gender voice recognition and classification using data mining techniques [
10,
11,
12].
Ensemble learning has expanded over recent decades and is used to obtain higher accuracy and better classification results [
13,
Ensemble learning addresses the limitations of traditional machine learning models by combining multiple classifiers to achieve better predictive performance than any single classifier [
15,
16]. Stacking is an ensemble method that combines multiple base classifiers through a meta classifier [
17]. In this paper, a novel stacked model is used to improve the classification of male and female voices. This model uses four base classifiers, namely, support vector machine (SVM), k-nearest neighbor (KNN), logistic regression (LR), and stochastic gradient descent (SGD), with linear discriminant analysis (LDA) as the meta classifier. The results of the stacked model are compared with those of other machine learning models and with other studies that used the same dataset as this work. Experimental results show that the proposed stacked model achieved high accuracy and proved to be a suitable model for gender voice recognition.
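A minimal scikit-learn sketch of the stacked model described above, with SVM, KNN, LR, and SGD as base classifiers and LDA as the meta classifier. Hyperparameter values here are illustrative defaults, not the tuned values used in this paper.

```python
# Sketch of the proposed stacked model: four base classifiers whose
# out-of-fold predictions are combined by an LDA meta classifier.
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each base classifier is wrapped with feature scaling, since SVM, KNN,
# and SGD are sensitive to feature magnitudes.
base_classifiers = [
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("sgd", make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))),
]

stacked_model = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=LinearDiscriminantAnalysis(),  # meta classifier
    cv=10,  # out-of-fold predictions feed the meta classifier
)
```

Internally, `StackingClassifier` trains the base classifiers via cross-validation so that the meta classifier learns from out-of-fold predictions rather than resubstitution outputs, which reduces overfitting of the combination step.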
The rest of this paper is organized as follows:
Section 2 summarizes the related work on gender voice recognition.
Section 3 describes the materials and methods.
Section 4 demonstrates the machine learning models used in this study.
Section 5 presents the experimental results and discussion.
Section 6 presents the conclusion.
2. Related Work
Voice recognition and classification have been studied for a long time. In recent decades, many data mining and machine learning models have been applied to gender voice recognition. A system to recognize gender from voice was proposed in [
1], using data from 46 speakers. The model combines two classifiers, a neural network and a support vector machine (SVM), via stacking. The accuracy obtained by the proposed model was about 93.48%. In [
6], an Android smartphone application for language and gender classification using multiple support vector machine models was presented. The challenge in this work was utilizing dynamic training on the characteristics extracted for each user via the smartphone. This yielded a robust classifier that achieved high accuracy during classification. In [
10], a multilayer perceptron (MLP) deep learning model was developed to identify gender from voice. The dataset contained 3168 male and female voice samples, which were prepared via acoustic analysis. The accuracy obtained by the model was about 96.74%. In [
11], the authors reported a study using 438 males and 192 females for gender voice recognition using different scenes (indoor and outdoor). The experimental results demonstrated that using non-linear smoothing improved accuracy to about 99.4%. In [
18], the hyperparameters of a random forest model were optimized using the grid search method, and the model was used for gender voice classification. The experimental results on the gender voice dataset indicated an accuracy of 96.90%. In [
19], a Gaussian mixture model (GMM) classifier was used to differentiate gender and age. The classifier's accuracy for gender recognition was above 90%. In [
20], a gender voice classification model was built using feature selection via random forest recursive feature elimination together with a gradient boosting classifier. The dataset consisted of 1584 male and 1584 female voice samples. The model achieved an accuracy of 97.58%. In [
21], the authors presented an ensemble-based self-labeled algorithm (iCST-Voting) for voice gender classification. The algorithm combines three efficient self-labeled methods, namely, co-training, tri-training, and self-training, using an ensemble as the base classifier. The proposed algorithm achieved an accuracy of 98.4%. In [
22], a deep long short-term memory (LSTM) model was used to recognize gender from voice. The model consisted of three steps: first, ten effective attributes of the data were selected; second, a deep learning network with a double-layer LSTM frame was constructed; and third, the specificity, sensitivity, and accuracy were calculated. The study obtained an accuracy of 98.4%. In [
23], several machine learning models, namely, k-nearest neighbor (KNN), artificial neural network (ANN), logistic regression, support vector machine (SVM), naïve Bayes, decision tree, and random forest were used for gender voice recognition. The results demonstrated that ANN achieved the best accuracy, with 98.35%.
5. Results and Discussion
This section reports a set of key experiments carried out to evaluate the performance of the stacked model. The stacked model was implemented in Jupyter Notebook (version 6.4.6), which simplifies writing and executing Python code and is widely used as an open-source environment for implementing machine learning models for classification. The performance of the final prediction of the stacked model was evaluated using five metrics: accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC) [
52]. Accuracy is the ratio between the number of correct predictions and the total number of predictions. Accuracy is calculated using Equation (12):
Accuracy = (TP + TN)/(TP + TN + FP + FN)  (12)
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
Recall is the proportion of true positives among all actual positives. Recall is calculated using Equation (13):
Recall = TP/(TP + FN)  (13)
Precision is the proportion of true positives among all predicted positives. Precision is calculated using Equation (14):
Precision = TP/(TP + FP)  (14)
The F1 score is the harmonic mean of recall and precision; it is maximized when recall equals precision. The F1 score is computed using Equation (15):
F1 score = 2 × (Precision × Recall)/(Precision + Recall)  (15)
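As a quick sanity check, the four metrics above can be computed directly from the confusion-matrix counts and compared against scikit-learn's implementations. The labels below are hypothetical, not drawn from the voice dataset.

```python
# Verify the confusion-matrix formulas for accuracy, recall,
# precision, and F1 score against scikit-learn's metrics.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# ravel() returns the counts in the order TN, FP, FN, TP for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (12)
recall = tp / (tp + fn)                             # Equation (13)
precision = tp / (tp + fp)                          # Equation (14)
f1 = 2 * precision * recall / (precision + recall)  # Equation (15)

print(accuracy, recall, precision, f1)
```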
The area under the receiver operating characteristic curve (AUC) is a vital metric for evaluating classification models. The receiver operating characteristic (ROC) curve is a graph that illustrates the execution of a binary classification model [
53]. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis at various thresholds. The AUC measures the whole area under the ROC curve and reflects the model's ability to differentiate between classes. An AUC close to 1 indicates good model performance, while an AUC close to 0 indicates poor performance.
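A small illustration of the ROC/AUC computation described above, using scikit-learn on hypothetical scores rather than the paper's data:

```python
# Compute the ROC curve (FPR vs. TPR at varying thresholds) and the
# area under it for a toy set of scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # x-axis: FPR, y-axis: TPR
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
print(auc)
```

The AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is why values near 1 indicate strong class separation.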
In this study, the performance of the stacked model was compared with that of its component base classifiers, namely, KNN, SVM, SGD, and LR. The performance of these classification models was evaluated using accuracy, precision, recall, F1 score, and AUC. All the models were evaluated using 10-fold cross validation to avoid overfitting, where the error of the testing was 95%.
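The 10-fold cross-validation protocol with the five evaluation metrics can be sketched as follows; the dataset and classifier here are placeholders, not the voice dataset or the tuned models from this study.

```python
# 10-fold cross-validation reporting all five evaluation metrics
# used in this study for a single classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

scores = cross_validate(
    KNeighborsClassifier(),
    X, y,
    cv=10,  # 10-fold cross validation
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
print(scores["test_accuracy"].mean())
```

Each metric is computed on the held-out fold in every round, so the reported mean estimates generalization performance rather than training-set fit.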
Table 1 illustrates the configuration of the parameters for the four classification models, namely, KNN, SVM, SGD, and LR, respectively.
The experimental results of accuracy, F1 score, recall, precision, and AUC for KNN model, SVM model, SGD model, LR model, and the stacked model, respectively, are presented in
Table 2. The best results of the evaluation metrics are highlighted in bold.
As illustrated in
Table 2, the stacked model achieved the best accuracy, with 99.64%. The lowest accuracy was obtained by the SGD model, with 96.20%. In terms of the F1 score, the stacked model and the SVM model produced the best results, with 99.42%, while the SGD model produced the poorest result, with 96.20%. The stacked model, SVM model, and LR model obtained the best results in terms of recall, with 99.50%, while the SGD model achieved the lowest recall, with 96.83%. In terms of precision, the stacked model and the SVM model demonstrated the best results, with 99.60%, while the SGD model achieved the worst result, with only 95.62%. The best AUC was achieved by the stacked model, with 0.999639; this value is considered excellent, as it is close to 1. The lowest AUC was achieved by the SGD model, with only 0.997797.
Figure 2 demonstrates the area under the ROC curve for the models, namely, KNN model, SVM model, SGD model, LR model, and the stacked model, respectively.
Figure 3 demonstrates a comparison between the actual values and the predicted values for the models, namely, KNN, SVM, SGD, LR, and the stacked model, respectively.
Table 3 demonstrates the accuracy and the running time in milliseconds for the models, namely, the KNN, SVM, SGD, LR, and the stacked model, respectively.
To demonstrate the influence of the stacked model, its performance was compared with that of three conventional machine learning models, namely, the decision tree (DT), random forest (RF), and adaptive boosting (Adaboost) models. A decision tree (DT) is a supervised learning model used for classification and regression problems [
54].
A DT is a set of consecutive decisions made to reach a desired result. It consists of a root node, branches, child nodes, and leaf nodes. The main role of a decision tree is to find the descriptive features that carry vital information about the target feature [
55]. The dataset is then split on the features' values so that the target feature values within each subset are as homogeneous as possible. Random forest (RF) is an ensemble learning model constructed from multiple decision trees of varying depth [
56]. The decision trees utilize multiple features and variables to produce the final classification. Adaptive boosting (Adaboost) is an ensemble learning model that is constructed using decision stumps [
57]. A decision stump is a decision tree with only one node and two leaves. Adaboost combines multiple decision stumps, where each stump uses a single feature or variable.
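The three baseline models described above can be instantiated in scikit-learn as follows; parameter values are illustrative, not the configurations from Table 4. Note that `AdaBoostClassifier`'s default base estimator is already a depth-1 decision tree, i.e., a decision stump, matching the description in the text.

```python
# The three conventional baselines compared against the stacked model:
# a single decision tree, a forest of trees, and boosted decision stumps.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

dt = DecisionTreeClassifier()                    # single tree
rf = RandomForestClassifier(n_estimators=100)    # many trees of varying depth
ada = AdaBoostClassifier(n_estimators=100)       # 100 depth-1 stumps (default base)
```

The contrast with stacking is that RF and Adaboost combine many instances of the same weak learner, whereas the stacked model combines heterogeneous classifiers through a trained meta classifier.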
Table 4 shows the configuration of the parameters for the three conventional machine learning models, namely, DT model, RF model, and Adaboost model, respectively.
The experimental results for accuracy, F1 score, recall, precision, and AUC for the supervised machine learning models, namely the DT model, RF model, Adaboost model, and the stacked model, respectively, are demonstrated in
Table 5.
The stacked model achieved the best results compared with the supervised machine learning models, as shown in
Table 5. The DT model exhibited the poorest performance across all metrics.
Table 6 compares the results of several studies that used the same dataset used in this study.
From
Table 6, it can be seen that the proposed stacked model achieved the best accuracy compared with the previous studies.
6. Conclusions
In this work, an effective stacked model was constructed for gender voice recognition. The proposed model uses four base classifiers, namely, the KNN, SVM, SGD, and LR models, with the LDA model as the meta classifier. Several performance metrics, namely, accuracy, recall, precision, F1 score, and AUC, were used to evaluate the impact of the proposed model. Its performance was compared with that of traditional machine learning models, and it achieved the best results for accuracy (99.64%), F1 score (99.42%), recall (99.50%), precision (99.60%), and AUC (0.999639).