Nuclear Family Type Identification Based on Deep Forest Algorithm in Residential Power Consumption

Huang, Zhaoxiang; Wang, Hangjun

doi:10.3390/app13116602


by Zhaoxiang Huang ¹ and Hangjun Wang ²,*

¹ College of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China

² College of Engineering and Technology, Jiyang College of Zhejiang Agriculture and Forestry University, Shaoxing 311800, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(11), 6602; https://0-doi-org.brum.beds.ac.uk/10.3390/app13116602

Submission received: 18 April 2023 / Revised: 20 May 2023 / Accepted: 22 May 2023 / Published: 29 May 2023


Abstract

As the fertility rate declines, it becomes increasingly necessary for governments to guide power companies in introducing preferential tariffs to encourage nuclear families to have children. However, traditional household statistics are time-consuming to collect and insufficient for enterprises seeking to adopt intelligent marketing schemes for different types of households. To address these issues, this paper proposes a nuclear family type identification method for residential electricity consumption based on a deep forest algorithm. The method first classifies nuclear households according to the number of children in them. Then, features are selected by combining the daily 48-point load and prior knowledge of nuclear families. The Pearson correlation coefficient and random forest importance ranking are used to remove features with low correlation and low importance. Additionally, features are classified based on their importance, and the number of features is balanced by stratified sampling to optimize the multi-grained scanning results and improve the model’s generalization. Finally, the improved cascade forest with replaced base learners is trained, and the model is evaluated using accuracy metrics. The experimental results demonstrate that the proposed model accurately recognizes the number of children in different nuclear families and can be used in power companies to improve lean management. The improved method reaches a recognition accuracy of 94%, which is 5.1% higher than the random forest method and 0.7% higher than the original deep forest method.

Keywords: residential electricity consumption; lean management; deep forest; nuclear family; fertility rate

1. Introduction

The current low fertility rate may threaten the future demographic dividend [1]. To increase residents’ willingness to have children, the government has introduced a fertility subsidy policy that primarily benefits nuclear families [2,3]. However, the current tariff policy for childbirth is based solely on the number of people in a household and does not consider the type of nuclear family [4]. This approach may lead to imprecise benefits and wasted resources, particularly as the number of three-child families increases [5]. Under the current implementation model, residents must complete a range of paperwork, including online declarations, signed commitments, and regular extension procedures, and it remains difficult to track changes in nuclear family membership. As a result, the differential tariff policy in the fertility package has been slow to implement [6]. To promote the policy and enable power companies to achieve refined management and provide differentiated electricity services, it is imperative to identify nuclear family types accurately, simplify the application procedure, and respond promptly to changes in nuclear family mobility [7]. Intelligent algorithms based on electricity big data therefore offer a new avenue for identifying specific resident types [8].

Current research on identifying specific types of electricity users employs various classification and identification methods, depending on the research purpose and user features. For example, Shuai et al. proposed a CHAID decision tree algorithm to identify outage-sensitive users in the residential, industrial, and commercial sectors [9]. Lu et al. developed a library of electricity consumption feature indicators for empty-nest electricity users and used a weighted random forest to improve the random forest’s ability to recognize the data distribution [10]. Shen et al. studied changes in the electricity consumption load features of coal-to-electricity users after heating began and developed a support vector machine model tuned by particle swarm optimization to identify typical daily electricity consumption loads in winter [11]. Niu and Jin proposed an improved Bayesian algorithm for better recognition of relatively poor households in high-rise buildings [12]. Zhou et al. proposed a quadratic programming model for probabilistic identification to improve the recognition rate of vacant users, while Cai et al. used a multi-model stratified fusion method to identify distribution data for 20 industry samples in Chongqing [13,14]. Li et al. used a BP neural network to classify users into high, medium, and low energy consumption categories by combining electricity consumption features across the three peak-valley levels [15].

The innovation of this paper lies in proposing a novel method for identifying nuclear families in residential electricity consumption using a deep forest algorithm [16]. The proposed method combines stratified sampling with an improved deep forest approach to determine the number of children in nuclear families based on load data. The method also uses feature selection based on the Pearson correlation coefficient and random forest importance ranking to remove features with low correlation and importance. Additionally, the proposed method balances the number of features by classifying them based on their importance and using stratified sampling to optimize the scan results. Finally, an improved cascade forest with feature input replacement base learner is trained and evaluated using accuracy evaluation metrics. The experimental results show that the proposed method outperforms the original deep forest and random forest methods, achieving a recognition accuracy of 94%. The proposed method has potential applications in power companies for improving lean management and can contribute to intelligent marketing schemes for different types of households.

However, there is a research gap in identifying nuclear families in residential electricity consumption. Several studies have used machine learning algorithms to identify and predict energy consumption in residents. For example, a research team at Princeton University achieved increased accuracy and speed in detecting energy consumption in households using deep learning algorithms [17]. Similarly, researchers at Stanford University improved the detection and prediction accuracy of energy consumption in households using stationary wavelet transform [18]. The random forest algorithm was also used to achieve an increase in accuracy and a reduction in energy consumption costs [10]. Moreover, MacDermott et al. developed an intelligent energy management system for households using deep learning algorithms [19]. Another study at the Massachusetts Institute of Technology also improved the accuracy of detecting and predicting energy consumption in households while reducing energy consumption costs using deep learning algorithms [20]. These studies demonstrate the effectiveness of deep learning algorithms in identifying and predicting energy consumption in residents, which can contribute to the development of intelligent energy management systems in homes and industries, resulting in energy cost savings and reduced consumption.

Using machine learning algorithms to predict household energy consumption has both advantages and disadvantages. The advantages include high prediction accuracy, the ability to predict consumption at different times of the day and week, reduced energy costs for households and industries, and the ability to exploit large and complex data. The disadvantages include a higher risk of biased predictions when the input data are of low quality, the need to train and tune the models to reach high accuracy, and the dependence of the achievable accuracy on the algorithm selected. In the field of machine learning and artificial intelligence, a GAN-based generative model can be used to generate new data that resemble the original data [21].

This model consists of two independent neural networks, one acting as a generator and the other as a discriminator. Generated data can help address the issue of low-quality input data by providing a larger and more diverse dataset for training, and generative models have shown promising results in improving the accuracy and robustness of machine learning models in this way. One major disadvantage of using generative models is the potential for overfitting, where the model becomes too specialized to the training data and performs poorly on new data. To address this issue, regularization techniques such as dropout and weight decay can be employed, as well as early stopping, which halts training once performance on a validation set begins to worsen. Another potential concern is bias in the generated data. This can be mitigated by techniques such as adversarial training, where a discriminator network is trained to distinguish between real and generated data, and the generator network is trained to produce data that is more difficult for the discriminator to distinguish. Despite these challenges, generative models have shown great promise in a variety of applications, including natural language processing, computer vision, and healthcare, and they are likely to become increasingly important as the field of machine learning evolves.

Although there are research studies on identifying specific electricity users, there is a gap in research when it comes to identifying nuclear families. To fill this gap, we propose a method that combines stratified sampling with an improved deep forest approach to determine the number of children in nuclear families based on load data. This method can assist power companies in providing targeted and differentiated electricity services and provides methodological support for enterprises’ response policies.

2. Family Type Identification Model

This paper analyses the Irish load dataset [22]. A large number of linearly related load features were found in the dataset, so the features need to be used selectively. Given these characteristics, the model must perform feature engineering before learning from the feature vectors. Therefore, an improved deep forest model based on stratified sampling is proposed. The model is derived from the deep forest algorithm, whose core idea is an ensemble of decision forests arranged in a deep architecture. The model comprises three primary modules: stratified sampling, multi-grained scanning, and cascade forest.

Stratified sampling adjusts the feature set to address two problems. First, dividing the load feature set into four categories of importance leaves the categories unequal in size, so the deep forest tends to learn mostly from the more numerous features. Second, multi-grained scanning tends to ignore the head and tail of the feature set, so the head and tail features are under-learnt after scanning. The cascade forest is improved by adding XGBoost as a base learner, for the following reasons. Multi-grained scanning significantly increases the number of feature vectors, which can easily lead to a complex tree model and thus to overfitting. XGBoost controls model complexity through regularisation and has more complex decision boundaries, which help capture complex patterns. Adding XGBoost as a new base learner in the cascade forest enriches the variety and number of base learners, makes more effective use of the features produced by multi-grained scanning, and thus improves generalisation.

2.1. Stratified Sampling

Stratified sampling consists of two main steps: stratification and sampling. The original feature set is first stratified to obtain four categories of features with different levels of importance. Sampling is then used to balance the number of features in each category and reduce the similarity between them, resulting in a more balanced feature set. This approach addresses two issues. One is the crosstalk property of multi-grained scanning, which can cause the model to overlook features at the beginning and end of the feature set because the sliding windows cover them less often than features in the middle. The other is the model’s susceptibility to unbalanced feature sets during training. In detail, the Pearson correlation coefficient [23] and random forest importance ranking [24] were first used to eliminate features with low importance and high similarity. Then, based on prior knowledge, the remaining features were grouped into four categories by importance. Next, features were replicated by sampling according to the proportion of each feature type. Finally, the features with low importance were placed at the beginning and end of the feature set, while the remaining features were randomly arranged. The process is illustrated in Figure 1.
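The correlation-based pruning step can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `pearson` and `drop_correlated`, the 0.9 threshold, the toy importance scores, and the rule of keeping the more important feature of each highly correlated pair are all assumptions made for the example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, importance, threshold=0.9):
    """Visit features in order of decreasing importance and keep a feature
    only if |r| with every already kept feature is at most `threshold`."""
    kept = []
    for name in sorted(columns, key=lambda k: importance[k], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly related to "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
}
importance = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical importance scores
selected = drop_correlated(columns, importance)
```

In this toy case the redundant feature "b" is dropped and "a" and "c" are kept. In practice the importance scores would come from a trained random forest rather than being set by hand.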

2.2. Multi-Grained Scanning

Multi-grained scanning is inspired by deep learning; like a convolutional neural network, it uses multiple sliding windows to enhance feature learning. To mine more feature information, multi-grained scanning samples the feature set with sliding windows of various sizes. First, a D-dimensional feature vector is scanned with a sliding window of length W, producing D - W + 1 feature vectors of dimension W. These vectors are then used to train a random forest and a completely random forest, each of which outputs a class probability vector for every window, with a length equal to the number of classes. Finally, the probability vectors are concatenated to obtain the result of multi-grained scanning. Figure 2 shows the structure of multi-grained scanning.
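The window arithmetic above can be checked with a short sketch. The helper names and the window sizes 4, 8, and 16 are hypothetical; the 64-dimensional input is chosen only to match the size of the balanced feature set described in Section 3.2. The forests themselves are omitted, so only the number of windows and the length of the concatenated output are computed.

```python
def sliding_windows(features, window):
    """Return every contiguous slice of length `window`, using stride 1."""
    d = len(features)
    if not 1 <= window <= d:
        raise ValueError("window must be between 1 and len(features)")
    return [features[i:i + window] for i in range(d - window + 1)]

def scan_output_length(d, window, n_classes, n_forests=2):
    """Length of the concatenated probability vectors for one window size:
    each forest emits one n_classes-long vector per window."""
    return n_forests * n_classes * (d - window + 1)

features = list(range(64))  # stand-in for one 64-dimensional sample
window_counts = {w: len(sliding_windows(features, w)) for w in (4, 8, 16)}
output_lengths = {w: scan_output_length(64, w, n_classes=5) for w in (4, 8, 16)}
```

With stride 1, the first and last features each appear in only one window, while features in the middle appear in up to W windows. This is the edge effect that the stratified ordering of features is meant to compensate for.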

2.3. Cascade Forest

The cascade forest is an ensemble of decision-tree forests; in the original design, each cascade layer uses two random forests and two completely random forests as base learners. The overall structure is organized layer by layer using a stacking strategy for representation learning. The class vectors produced by one layer are concatenated with the original feature vector and used as training data for the next layer, which alleviates the overfitting that occurs when features are passed from layer to layer. The model also employs cross-validation to adaptively adjust the number of cascade layers. To enhance the diversity of learning, this paper replaces the original base learners with two random forests, two completely random forests, and two XGBoost learners. Multi-grained scanning increases the number of features significantly, so it tends to make the tree model more complex. XGBoost, as a base learner in the cascade forest, can complement random forests and completely random forests in two ways. First, XGBoost reduces the risk of overfitting by introducing regularization to control the complexity of the model. Second, XGBoost has more complex decision boundaries that help capture complex patterns. At the same time, increasing the type and number of base learners allows the features from multi-grained scanning to be used more effectively, thus improving the generalization ability of the model. The structure of the cascade forest with modified base learners is illustrated in Figure 3.
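The way each cascade level builds the input for the next one can be sketched as below. This is an illustration of the data flow only: `cascade_level` and `uniform_stub` are hypothetical names, and the base learners are replaced by stubs that return a uniform class distribution instead of trained random forests, completely random forests, or XGBoost models.

```python
def cascade_level(level_input, original_features, base_learners):
    """Run one cascade level: every base learner maps the level input to a
    class probability vector, and all vectors are appended to the original
    feature vector to form the input of the next level."""
    class_vectors = []
    for learner in base_learners:
        class_vectors.extend(learner(level_input))
    return list(original_features) + class_vectors

def uniform_stub(n_classes):
    """Hypothetical stand-in for a trained model: ignores its input and
    returns a uniform distribution over the classes."""
    return lambda features: [1.0 / n_classes] * n_classes

# Six base learners and five household classes, as described in the text.
learners = [uniform_stub(5) for _ in range(6)]
x = [0.2, 0.7, 0.1]                      # toy original feature vector
level_1 = cascade_level(x, x, learners)  # first level sees the raw features
level_2 = cascade_level(level_1, x, learners)
```

With six learners and five classes, every level appends 6 × 5 = 30 values to the original features, so the input width stays constant from the second level onward.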

2.4. Identification Process

The experimental process is divided into data pre-processing, feature selection, and identification. The dataset is first pre-processed to eliminate the negative effects of dirty, incomplete, and unbalanced data. The required feature vectors are then computed from the dataset and normalized. The feature set is fed into the deep forest model for training, and finally a test set is used to evaluate the effectiveness of the model. The complete experimental process is shown in Figure 4.

3. Experimental

3.1. Data Cleaning

The data comprise the electricity consumption of 4225 households in Ireland, collected every 30 min, resulting in 48 load data points per customer per day. The data span a period of 500 days, from 15 July 2009 to 31 December 2010. Incomplete samples with missing, negative, or zero values were excluded from the dataset, and only households with complete data were retained, resulting in a total of 1054 households. Table 1 shows the types and numbers of households.
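The exclusion rule can be sketched as a simple filter. The function names, the dictionary layout, and the toy readings are assumptions for illustration; a real series would contain 48 readings for each day in the study period.

```python
import math

def is_complete(readings, expected_length):
    """True if the series has the expected length and every reading is a
    finite, strictly positive number (no missing, negative, or zero values)."""
    if len(readings) != expected_length:
        return False
    return all(r is not None and math.isfinite(r) and r > 0 for r in readings)

def clean_households(load_by_household, expected_length):
    """Keep only the households whose whole load series is complete."""
    return {
        household_id: series
        for household_id, series in load_by_household.items()
        if is_complete(series, expected_length)
    }

# Toy input with four readings per household (hypothetical IDs and values).
raw = {
    "h1": [0.3, 0.5, 0.2, 0.4],
    "h2": [0.3, None, 0.2, 0.4],  # missing value
    "h3": [0.3, -0.1, 0.2, 0.4],  # negative value
    "h4": [0.0, 0.5, 0.2, 0.4],   # zero value
    "h5": [0.3, 0.5, 0.2],        # too short
}
kept = clean_households(raw, expected_length=4)
```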

An imbalanced sample distribution in the dataset could limit the generalization ability of the model. To address this, the Borderline-SMOTE method [25] was employed to balance the five classes at a ratio of 1:1:1:1:1. The samples were then randomly divided into training and test sets at a ratio of 7:3.

3.2. Feature Engineering

The feature construction process consists of two main steps: feature extraction and stratified sampling. The first step is feature extraction. Because the classification task involves children, load data were extracted for two four-week periods, one during the summer holiday and one during the autumn school term. The summer period spanned 20 July to 2 August and 10 August to 23 August 2009, while the autumn period spanned 7 September to 4 October 2009. Time periods were divided and load features were extracted according to the literature [26] and a priori knowledge. The four-week load data were used to calculate the average daily load at each collection point on workdays and weekends. The daily load was then divided into five time periods: night (0:00–6:00), morning (6:00–10:00), noon (10:00–14:00), afternoon (14:00–18:00), and evening (18:00–0:00), resulting in a total of 49 feature vectors. To eliminate the effect of differing scales among features and reduce computational complexity, the feature vectors are normalized. In this study, min-max normalization is adopted.

f(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \quad x \in [x_{\min}, x_{\max}] \qquad (1)

where x is the value of a feature, x_{\max} and x_{\min} are the maximum and minimum values of that feature, respectively, and f(x) represents the normalization function.
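The period split and Equation (1) can be illustrated with a short sketch. The slot boundaries follow the five time periods listed in the text, with each half-hour reading indexed from 0 to 47. Averaging the readings within each period and mapping a constant feature to 0 are assumptions of this example, and the names `PERIODS`, `period_means`, and `min_max` are hypothetical.

```python
# Half-hour slot ranges (start inclusive, end exclusive) for the five periods.
PERIODS = {
    "night": (0, 12),       # 0:00-6:00
    "morning": (12, 20),    # 6:00-10:00
    "noon": (20, 28),       # 10:00-14:00
    "afternoon": (28, 36),  # 14:00-18:00
    "evening": (36, 48),    # 18:00-0:00
}

def period_means(daily_load):
    """Average a 48-point daily load profile within each time period."""
    if len(daily_load) != 48:
        raise ValueError("expected 48 half-hourly readings")
    return {
        name: sum(daily_load[start:end]) / (end - start)
        for name, (start, end) in PERIODS.items()
    }

def min_max(values):
    """Apply Equation (1) to one feature across all samples."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant feature maps to 0
    return [(v - lo) / (hi - lo) for v in values]

# Toy profile: a flat load within each period, rising over the day.
profile = [1.0] * 12 + [2.0] * 8 + [3.0] * 8 + [4.0] * 8 + [5.0] * 12
means = period_means(profile)

# Normalize one feature (evening mean) across three hypothetical households.
evening_scaled = min_max([2.0, 4.0, 6.0])
```

Normalization is applied per feature across households, not across the features of a single household, so that each feature ends up in the range [0, 1].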

The second step is stratified sampling. First, the features were divided into four categories by level of importance using the random forest importance ranking method, and the Pearson coefficients were used to remove feature vectors that were both strongly correlated with other features and unimportant. This resulted in four categories of feature vectors, ordered from weak to strong, with 2, 18, 12, and 9 feature vectors, respectively, for a total of 41. Second, the four categories were randomly sampled and ordered to create feature sets with 4, 18, 24, and 18 features, respectively. This addresses both the imbalance between feature categories and the incomplete scanning of features at the beginning and end of the feature set in multi-grained scanning, and it helps to improve the generalization ability of the model.
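One way to reproduce the category sizes above is sketched below. The text does not specify the sampling procedure, so this example assumes that each category is replicated by an integer factor (2, 1, 2, and 2), that the weakest category is split between the head and tail of the feature set, and that the remaining features are shuffled in between. The function name, category labels, and feature names are hypothetical.

```python
import random
from collections import Counter

def build_scan_order(strata, copies, weakest, seed=0):
    """Replicate each category `copies[level]` times, put the weakest
    category at both ends, and shuffle everything else in the middle."""
    rng = random.Random(seed)
    edge = [f for f in strata[weakest] for _ in range(copies[weakest])]
    middle = [
        f
        for level, feats in strata.items() if level != weakest
        for f in feats
        for _ in range(copies[level])
    ]
    rng.shuffle(edge)
    rng.shuffle(middle)
    half = len(edge) // 2
    return edge[:half] + middle + edge[half:]

# Category sizes from the text: 2, 18, 12, and 9 features (weak to strong).
strata = {
    "weak": [f"w{i}" for i in range(2)],
    "low": [f"l{i}" for i in range(18)],
    "medium": [f"m{i}" for i in range(12)],
    "strong": [f"s{i}" for i in range(9)],
}
copies = {"weak": 2, "low": 1, "medium": 2, "strong": 2}  # assumed factors
order = build_scan_order(strata, copies, weakest="weak")
sizes = Counter(name[0] for name in order)  # count by category prefix
```

The resulting order has 4 + 18 + 24 + 18 = 64 features, with the weak features occupying the positions that sliding windows cover least often.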

3.3. Identification

The experiments used both confusion matrices and accuracy for model evaluation. A confusion matrix is used to assess the performance of a classification model by showing the difference between the model’s predictions and the actual classes. For a binary classification problem, the confusion matrix is a 2 × 2 matrix, as shown in Table 2, where the rows represent the true categories and the columns represent the predicted categories. Its four elements are defined as follows. TP (True Positive) is the number of actual positive samples that are correctly predicted as positive. FP (False Positive) is the number of actual negative samples that are incorrectly predicted as positive. FN (False Negative) is the number of actual positive samples that are incorrectly predicted as negative. TN (True Negative) is the number of actual negative samples that are correctly predicted as negative. Accuracy is calculated as

Acc = \frac{TP + TN}{TP + FN + FP + TN} \qquad (2)

Table 2 shows the definitions of TP, FP, TN and FN.
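Equation (2) can be computed directly from a confusion matrix, as in the sketch below. For more than two classes, accuracy generalizes to the sum of the diagonal divided by the total number of samples, which is the form used here. The function name and both example matrices are hypothetical and are not results from the paper.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix
    (rows: true class, columns: predicted class)."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

# Binary case laid out as [[TP, FN], [FP, TN]].
binary = [[40, 10], [5, 45]]
binary_acc = accuracy(binary)  # (40 + 45) / 100

# Three-class toy example: 24 of 30 samples lie on the diagonal.
multi = [[8, 1, 1], [2, 7, 1], [0, 1, 9]]
multi_acc = accuracy(multi)
```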

4. Example Analysis

4.1. Different Time Periods’ Experiment

In order to investigate the impact of load data from different time periods on classification performance, various classification algorithms including SVM, Adaboost, Decision Tree, KNN, Random Forest, and Deep Forest were tested using accuracy rates. The results of each classification model for identifying the number of children in different nuclear families are presented in Figure 5.

Based on the experimental data shown in Figure 5, it is evident that the model trained on the selected workday data performs significantly better in identifying the number of children in different nuclear families than the model trained on the selected weekend data. This could be attributed to the fact that nuclear families exhibit more stable electricity consumption patterns on workdays compared to weekends. However, a few classifiers displayed better recognition results on weekends than on workdays, and this was observed only during summer when children’s holiday uncertainties lead to irregular loads.

The models trained using the entire week’s data exhibited better recognition results than those trained using only workday or weekend data. This suggests that the electricity usage patterns of nuclear families vary between workdays and weekends, which facilitates the model’s learning and training.

The recognition results of the model trained with autumn data were better than those of the model trained with summer data, indicating that the model is more effective in identifying children from nuclear families while they are in school than during the holiday period. The data from the school period is therefore more useful for actual identification of children. The Deep Forest classifier has significantly improved recognition accuracy over traditional machine learning classifiers.

4.2. Improved Algorithms’ Experiment

The experiments above showed that the model trained on whole-week data performed better than the models trained on either workday or weekend data, and that the model trained on autumn data performed better than the model trained on summer data. The current study therefore used whole-week household load data from autumn to train the improved deep forest model with stratified sampling. The proposed method was validated, and the experimental results are presented in Table 3.

Table 3 presents the experimental results. The proposed model, which combines stratified sampling with the improved deep forest, outperformed the original deep forest model by 0.7% in accuracy. Applied individually, stratified sampling and the improved deep forest increased recognition accuracy by 0.3% and 0.5%, respectively, compared with the original deep forest model. These findings suggest that both stratified sampling and cascade forests with replaced base learners can enhance the model’s generalization ability.

An example of the confusion matrix for the Irish dataset experiment is shown in Figure 6a. Nuclear families with two or three children are identified best, followed by the other-households category and single-child families. The second, third and fourth categories are more easily confused with one another; the confused samples are nuclear families with one, two or three children whose electricity use patterns are similar. The first category is confused to varying degrees with all the other categories, which reflects the more complex electricity use of childless households. An example of the confusion matrix for the improved deep forest experiment based on stratified sampling is shown in Figure 6b, where the recognition of each class is improved and the overall pattern of confusion is similar to that in Figure 6a. The composition of the sample set shows that two-adult childless households have the largest number of real samples, and that the real samples of the other households in the fifth category are more complex. Real electricity use is therefore complex, and many households with children have similar electricity use profiles.

In summary, both stratified sampling and cascade forests with additional base learners improve the generalization ability of the model. Customers with similar demographics have similar electricity usage patterns, but these patterns are complex and partly overlap between categories.

The proposed nuclear family type identification method for residential electricity consumption based on a deep forest algorithm can have potential applications in improving lean management in power companies, as well as contributing to intelligent marketing schemes for different types of households. Recent developments in energy technology and battery estimation, such as the electrodeless nanogenerator for dust recovery (Wang et al., 2022) and online estimation of SOH for lithium-ion battery based on SSA-Elman neural network, further highlight the importance of efficient and accurate energy consumption analysis and management in the current energy landscape [27,28]. The proposed nuclear family type identification method for residential electricity consumption based on a deep forest algorithm can contribute to improving energy management and enhancing the profitability of renewable energy sources.

5. Discussion and Conclusions

This paper investigates a deep forest-based approach to nuclear family identification in residential electricity consumption. Because the feature sets in electricity consumption data are unbalanced in importance and multi-grained scanning in deep forests tends to ignore head and tail features, the model can easily learn incomplete features. To balance the feature set and improve the ability to learn complex patterns, this paper proposes a stratified sampling method and adds XGBoost as a base learner of the cascade forest. In addition, five time periods of load data (summer, autumn, weekday, weekend, and whole week) are distinguished to explore the impact of electricity consumption data from different time periods on nuclear family identification. The results show that the improved method reaches a recognition accuracy of 94%, which is 5.1% higher than the random forest method and 0.7% higher than the original deep forest method. The best results were achieved by the model trained on a full week of load data in autumn, probably because electricity consumption related to children is relatively stable during the school term. The current study still has shortcomings and room for improvement. The model is less effective during the summer period, so its applicability is limited in time. The feature set is expanded by stratified sampling, which increases the training time of the cascade forest and reduces operational efficiency. Future work will aim to improve the temporal generality and performance of the method.

Author Contributions

Conceptualization, H.W. and Z.H.; methodology, Z.H.; software, Z.H.; validation, Z.H.; formal analysis, H.W. and Z.H.; investigation, Z.H.; resources, H.W.; data curation, Z.H.; writing—original draft preparation, Z.H.; writing—review and editing, Z.H.; visualization, Z.H.; supervision, H.W.; project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Natural Science Foundation of China, grant number LY23C140004.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Zhejiang Agriculture and Forestry University for its great support.

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhang, X.D.; Zhang, Y.L.; Jia, G.P.; Tang, M.J.; Chen, G.; Zhang, L. Research Progress on Low Fertility Rate in China: A Literature Review. Popul. Dev. 2021, 5, 39–48. [Google Scholar]
Hu, Z.; Peng, X.Z.; Wang, X.H. The current cognitive misunderstanding in the field of family changes and family policy in country. Study Pract. 2018, 11, 101–108. [Google Scholar]
Tong, H.J.; Song, D. Analysis of the characteristics and development Trend of family structure in China. J. Shenzhen Univ. Humanit. Soc. Sci. 2016, 33, 118–123+149. [Google Scholar]
Wang, Y.S. Family structure in the mid and late 18th century in China. Soc. Sci. China 2000, 2, 167–177+209. [Google Scholar]
Chen, Y.Y. Problems an countermeasures in the implementation of the ladder electricity price policy in Huizhou city. Master’s Thesis, Guangxi Normal University, Guilin, China, 2019. [Google Scholar]
Sun, L.; Tian, M. Study on migration mode in nuclear family of migrants: Based on the perspective of family life cycle theory. Hum. Geogr. 2021, 4, 48–82. [Google Scholar]
Xu, C.Q.; Zhao, H.D.; Song, X.H. Research on method of power user group identification and analysis based on large data. Zhengzhou Univ. Nat. Sci. Ed. 2016, 48, 113–117. [Google Scholar]
Zhu, T.Y.; Ai, Q.; He, X.; Li, Z.L.; Sun, D.L.; Li, X.L. An Overview of Data-driven Electricity Consumption Behavior. Power Syst. Technol. 2020, 44, 3497–3507. [Google Scholar] [CrossRef]
Shuai, C.Y.; Yang, H.C.; Ouyang, X.; He, M.W.; Gong, Z.W.Y.; Shu, W.N. Analysis and Identification of Power Blackout-Sensitive Users by Using Big Data in the Energy System. IEEE Access. 2019, 7, 19488–19501. [Google Scholar] [CrossRef]
Lu, Z.M.; Chen, J.Y.; Li, J.; Xie, Y.; Jiang, X.L.; Han, L.; Guo, Q. An empty-nest power user identification method based on weighted random forest algorithm. Telecommun. Sci. 2020, 36, 112–121. [Google Scholar]
Shen, H.T.; Zhang, C.; Li, C.R.; Wu, Y.D. Load identification of coal to electricity users based on intelligent algorithm. Electr. Meas. Instrum. 2020. Available online: http://kns.cnki.net/kcms/detail/23.1202.TH.20201126.1747.015.html (accessed on 20 May 2023).
Niu, N.; Jin, H. Identifying urban households in relative poverty with multi-source data: A case study in Zhengzhou. J. Urban Aff. 2022.
Zhou, L.; Zhang, Y.; Liu, S.; Li, K.; Li, C.B.; Yi, Y.Q.; Tang, J. Consumer phase identification in low-voltage distribution network considering vacant users. Int. J. Electr. Power Energy Syst. 2020, 121, 106079.
Cai, J.; Xie, H.; Wu, G.X.; Tang, X.L.; Zhou, M. User data identification of power distribution system based on multi-model hierarchical fusion. Adv. Technol. Electr. Eng. Energy 2022, 41, 49–58.
Li, Q.S.; Wang, Y.; Sun, Y.J.; Xiao, Y.; Ou, Y.T. Application of BP neural network in electricity users classification. Mod. Electron. Tech. 2017, 40, 156–158+162.
Zhou, Z.H.; Feng, J. Deep forest: Towards an alternative to deep neural networks. IJCAI 2017, 3553–3559.
Lawal, A.S.; Servadio, J.L.; Davis, T.; Ramaswami, A.; Botchwey, N.; Russell, A.G. Orthogonalization and machine learning methods for residential energy estimation with social and economic indicators. Appl. Energy 2021, 283, 116114.
Saoud, L.S.; Al-Marzouqi, H.; Hussein, R. Household Energy Consumption Prediction Using the Stationary Wavelet Transform and Transformers. IEEE Access 2022, 10, 5171–5183.
Mohi-Ud-Din, G.; Marnerides, A.K.; Shi, Q.; Dobbins, C.; MacDermott, A. Deep COLA: A Deep COmpetitive Learning Algorithm for Future Home Energy Management Systems. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 860–870.
Abreu, J.M.; Pereira, F.C.; Ferrao, P. Using pattern recognition to identify habitual behavior in residential electricity consumption. Energy Build. 2012, 49, 479–487.
Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative Adversarial Networks: An Overview. IEEE Signal Process. Mag. 2018, 35, 53–56.
Commission for Energy Regulation. CER Smart Metering Project - Electricity Customer Behavior Trial, 1st ed.; Commission for Energy Regulation: Dublin, Ireland, 2012; Available online: https://www.ucd.ie/issda/data/commissionforenergyregulationcer/ (accessed on 20 May 2023).
Wang, J.; Wu, X.M.; Wang, A.F. The application of Pearson correlation coefficient algorithm in searching for the users with abnormal watt-hour meters. Power Demand Side Manag. 2014, 16, 52–54.
Qiu, Y.; Li, S.; Jin, L.; Zhang, M.M.; Wang, J. Bridge Anomaly Monitoring Data Identification Method Based on the Statistical Feature Mixture and Random Forest Permutation Importance Index. Chin. J. Sens. Actuators 2022, 35, 756–762.
Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Lect. Notes Comput. Sci. 2005, 878–887.
Shaukat, M.A.; Shaukat, H.R.; Qadir, Z.; Munawar, H.S.; Kouzani, A.Z.; Mahmud, M.A.P. Cluster analysis and model comparison using smart meter data. Sensors 2021, 21, 3157.
Wang, W.; Yang, D.; Huang, Z.; Hu, H.; Wang, L.; Wang, K. Electrodeless nanogenerator for dust recover. Energy Technol. 2022, 10, 699.
Guo, Y.; Yang, D.; Zhang, Y.; Wang, L.; Wang, K. Online estimation of SOH for lithium-ion battery based on SSA-Elman neural network. Prot. Control. Mod. Power Syst. 2022, 7, 40.

Figure 1. Stratified sampling structure.

Figure 2. Multi-grained scanning.

Figure 3. Improved cascade forest.

Figure 4. Identification process.

Figure 5. Performance comparison of various algorithms.

Figure 6. Irish dataset experiment: confusion matrices. (a) Deep forest. (b) Improved deep forest based on stratified sampling.

Table 1. Family sample size, types and numbers of households.

Family Type	2 Adults and 0 Children	2 Adults and 1 Child	2 Adults and 2 Children	2 Adults and 3 Children	Other Families
Number of households	682	127	74	61	100
Label	0	1	2	3	4
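
The class sizes in Table 1 are strongly unbalanced, which is worth keeping in mind when reading the accuracy figures. The short sketch below computes the share of each label and the accuracy of a trivial classifier that always predicts the largest class. The counts are copied from Table 1; the variable names and the majority-class baseline are illustrative assumptions, not part of the paper's method.

```python
# Household counts copied from Table 1, keyed by the Label row.
counts = {0: 682, 1: 127, 2: 74, 3: 61, 4: 100}

total = sum(counts.values())
# Fraction of the sample that falls in each class.
shares = {label: n / total for label, n in counts.items()}
# Hypothetical baseline: always predict the most frequent class.
majority_label = max(counts, key=counts.get)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"label {label}: {counts[label]:4d} households, {share:.1%}")
print(f"total households: {total}")
print(f"majority-class baseline accuracy: {shares[majority_label]:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Under these counts, a model that ignores the load data entirely would already label roughly two thirds of households correctly, so accuracy alone says little about how well the smaller classes are recognized.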

Table 2. Definition for TP, FP, TN and FN.

	Predicted Positive	Predicted Negative
Actual Positive	TP	FN
Actual Negative	FP	TN
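
As a minimal sketch of how the four quantities in Table 2 turn into evaluation metrics, the function below computes accuracy together with precision and recall for one class treated as "positive". The function name `binary_metrics` and the example counts are hypothetical. Accuracy is the metric reported in Table 3; including precision and recall here is an assumption made for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=40, fp=5, tn=50, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For the five-class problem in Table 1, the same counts can be obtained per class by treating that class as positive and all others as negative.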

Table 3. Optimization algorithm comparison experiment.

Models	Acc (%)
Random Forest	88.9
Deep Forest	93.3
Deep Forest Based On Stratified Sampling	93.6
Improved Deep Forest	93.8
Deep Forest Based On Improved Stratified Sampling	94.0
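
The improvements quoted in the abstract (5.1 and 0.7 percentage points) can be checked directly against Table 3. The sketch below does that arithmetic; the accuracy values are copied from the table, while the dictionary and variable names are illustrative.

```python
# Accuracy values (%) copied from Table 3.
acc = {
    "Random Forest": 88.9,
    "Deep Forest": 93.3,
    "Deep Forest Based On Stratified Sampling": 93.6,
    "Improved Deep Forest": 93.8,
    "Deep Forest Based On Improved Stratified Sampling": 94.0,
}

best = "Deep Forest Based On Improved Stratified Sampling"
# Gain of the best model over each other model, in percentage points.
gains = {
    name: round(acc[best] - value, 1)
    for name, value in acc.items()
    if name != best
}

for name, gain in gains.items():
    print(f"{best} vs {name}: +{gain} points")
```

Note that these are differences in percentage points, not relative improvements.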

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

