Article

PMV Dimension Reduction Utilizing Feature Selection Method: Comparison Study on Machine Learning Models

Kyung-Yong Park 1 and Deok-Oh Woo 2,*
1 Center for Housing Environment Research and Innovation of Korea Land and Housing Research Institute, 66, Raon-ro, Sejong 30065, Republic of Korea
2 College of Engineering, Lawrence Technological University, 21000 W 10 Mile Rd., Southfield, MI 48075, USA
* Author to whom correspondence should be addressed.
Submission received: 1 February 2023 / Revised: 22 February 2023 / Accepted: 28 February 2023 / Published: 3 March 2023
(This article belongs to the Section G: Energy and Buildings)

Abstract

Since P.O. Fanger proposed PMV, it has been the most widely used index for estimating thermal comfort. In practice, however, it is challenging to measure all six parameters essential for PMV estimation within indoor spaces, and a couple of them, such as Clo and Met, tend to show large deviations in accuracy. For these reasons, several studies have suggested simplified methods for estimating PMV, but their accuracy was significantly compromised. In this vein, this study proposes a way to reduce the dimensionality of the PMV input parameters using machine learning, in order to provide fast PMV calculation without compromising prediction accuracy. Throughout this study, the most influential features for PMV were pinpointed using PCA, Best Subset, and Gini Importance, and each resulting model was compared with the others. The results showed that the combination of PCA and ANN achieved the highest accuracy, at 89.70%, and the combination of Best Subset and Random Forest showed the fastest prediction performance of all.

1. Introduction

Global energy consumption has increased significantly, with building energy accounting for a large portion of the total [1]. To reduce energy use in the building sector, the European Union (EU) approved the Energy Performance of Buildings Directive (EPBD) to improve building energy performance [2,3], and various countries have benchmarked the passive house standards from Germany [4]. To achieve the passive house standard, buildings must reduce energy use significantly without compromising thermal comfort requirements [5]. In this regard, accurate estimation of the thermal comfort level in a conditioned space allows the systems to provide just the right amount of heating or cooling to maintain occupants' thermal comfort.
Predicted mean vote (PMV), developed by P.O. Fanger, is the most widely used index for estimating thermal comfort [6]. PMV is expressed on a seven-point predicted thermal sensation scale (Table 1) and can be calculated from four environmental parameters: air temperature (Ta), relative humidity (RH), mean radiant temperature (MRT), and air velocity (Vel), plus two personal parameters: clothing insulation (Clo) and metabolic rate (Met). PMV is used in various standards such as ASHRAE and ISO [7,8]. However, in real building settings, it is challenging to monitor MRT and Vel because the measuring equipment is difficult to install or occupies too much space [9]; Clo and Met are also difficult to measure accurately because these parameters depend on human behavior [10,11].
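For reference, the PMV value itself follows from these six parameters via the iterative heat-balance procedure standardized in ISO 7730 [7]. Below is a minimal Python sketch of that standard procedure (our illustration, not the authors' code); the function and variable names are our own.

```python
import math

def pmv_fanger(ta, tr, vel, rh, met, clo):
    """Fanger's PMV via the ISO 7730 iterative procedure.
    ta: air temperature [degC]; tr: mean radiant temperature [degC];
    vel: air velocity [m/s]; rh: relative humidity [%];
    met: metabolic rate [met]; clo: clothing insulation [clo]."""
    pa = rh * 10 * math.exp(16.6536 - 4030.183 / (ta + 235))  # vapour pressure [Pa]
    icl = 0.155 * clo                     # clothing insulation [m2.K/W]
    m = met * 58.15                       # metabolic rate [W/m2]; external work = 0
    fcl = 1 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl
    hcf = 12.1 * math.sqrt(vel)           # forced-convection coefficient
    taa, tra = ta + 273, tr + 273
    # Iterate to find the clothing surface temperature.
    tcla = taa + (35.5 - ta) / (3.5 * icl + 0.1)
    p1 = icl * fcl
    p2, p3, p4 = p1 * 3.96, p1 * 100, p1 * taa
    p5 = 308.7 - 0.028 * m + p2 * (tra / 100) ** 4
    xn, xf = tcla / 100, tcla / 50
    hc = hcf
    for _ in range(150):
        if abs(xn - xf) <= 0.00015:
            break
        xf = (xf + xn) / 2
        hcn = 2.38 * abs(100 * xf - taa) ** 0.25   # natural convection
        hc = max(hcf, hcn)
        xn = (p5 + p4 * hc - p2 * xf ** 4) / (100 + p3 * hc)
    tcl = 100 * xn - 273
    # Heat losses: skin diffusion, sweating, latent/dry respiration,
    # radiation, and convection.
    hl1 = 3.05e-3 * (5733 - 6.99 * m - pa)
    hl2 = 0.42 * (m - 58.15) if m > 58.15 else 0.0
    hl3 = 1.7e-5 * m * (5867 - pa)
    hl4 = 0.0014 * m * (34 - ta)
    hl5 = 3.96 * fcl * (xn ** 4 - (tra / 100) ** 4)
    hl6 = fcl * hc * (tcl - ta)
    ts = 0.303 * math.exp(-0.036 * m) + 0.028
    return ts * (m - hl1 - hl2 - hl3 - hl4 - hl5 - hl6)

# Example: 25 degC air/radiant temperature, near-still air, 50% RH,
# a sedentary occupant (1.2 met) in light clothing (0.5 clo).
print(round(pmv_fanger(25, 25, 0.1, 50, 1.2, 0.5), 2))
```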
Even though Fanger's PMV study was conducted in a carefully controlled laboratory, the results guaranteed only 95% reliability with a 25% deviation. Under these circumstances, several research studies were conducted to find simpler ways to calculate PMV [12,13,14]. Rohles [12] fixed Clo and Met, the parameters that are difficult to measure, at 0.6 clo and 1.2 met in the PMV calculation formula, and presented a PMV prediction model using Ta and RH, which are relatively easy to measure. However, this model was limited to the sedentary condition because the personal parameters were assumed constant. Later, Buratti et al. [13] proposed a linear regression model using the correlation between operative temperature and PMV, building on Rohles' simplified PMV prediction model. Sherman [14], arguing that thermal comfort prediction requires greater accuracy, proposed a simplified PMV equation that calculates a comfort coefficient based on Fanger's PMV formula.
Recently, a number of research studies have utilized machine learning methods to predict PMV values more accurately [11,15,16]. Feng et al. [15] suggested a framework for personal thermal comfort prediction through a literature review of data-driven methods, showing the possibilities opened up for data measurement and prediction by the development of sensors and computing. Based on the ASHRAE RP-884 dataset, Farhan et al. [16] proposed a support vector machine (SVM)-based model to predict TSV, an individual thermal sensation index, and compared the results with those of the PMV model. The accuracy of the proposed prediction model was 76.7%, more than double that of Fanger's PMV model (35.4%). However, these studies focused solely on TSV, leaving Fanger's PMV aside. The limitation therefore remained that the hard-to-measure physical variables still had to be measured or assumed from references.
Furthermore, many researchers have explored ways to predict the hard-to-measure variables with a certain reliability [17,18,19]. Lee et al. [17] used an infrared (IR) camera to predict Clo, a personal factor. Compared with the conventional method, the estimated PMV accuracy was lower; however, the study showed that Clo can be predicted in real time and that this approach can be utilized in automatic HVAC control. Park et al. [18] developed an MRT prediction model with several machine learning models, such as linear regression, regression tree, and ANN, and compared energy consumption and thermal comfort through EnergyPlus simulation. Indoor air temperature, outdoor air temperature, set-point temperature, and time were adopted as input variables for the MRT prediction model; ANN and the regression tree outperformed the other models. Ruivo et al. [19] presented a PMV-PS model that incorporates atmospheric pressure, mainly because Met is directly affected by air pressure. Although these studies predicted the variables required for PMV estimation, their accuracy was still inevitably compromised. Thus, dimension reduction, which prioritizes input variables by importance, is essential for effective PMV prediction in the conditioned space.
Several ways to select input variables have been proposed. Bolon-Canedo et al. [20] classified feature selection methods into three main categories: filters, embedded methods, and wrappers. Filter methods (e.g., correlation coefficient analysis) select features according to intrinsic characteristics of the data; embedded methods (e.g., the perceptron) perform selection within the learning process itself; wrapper methods (e.g., those built around SVM, Random Forest, etc.) evaluate features by running the learning algorithm and deriving importance from it. Figure 1 shows the types in Bolon-Canedo's classification. On the other hand, Kumar et al. [21] noted that there is no single best method among the diverse feature selection methods, and that the most suitable one must be found according to the characteristics of the data. Therefore, a comparison of feature selection methods is essential to pinpoint the most important input variables for PMV prediction.
Considering the previous studies, the discrepancy between TSV and PMV can be minimized by reducing the number of input variables essential for PMV calculation, which in turn reduces the amount of measuring equipment that must be installed. Thus, this study explores an optimal feature selection method for PMV prediction, with a resulting PMV prediction model developed from that method. The proposed machine learning-based feature selection and prediction methods can be utilized to reduce the number of PMV parameters.

2. Methodology

2.1. Research Procedure

The optimal PMV prediction model was developed using the procedure shown in Table 2.
Step 1 involves establishing a dataset. In this study, the ASHRAE Global Thermal Comfort Database II was used [22]. At this stage, input variables for PMV prediction were selected to establish training data and test data.
Step 2 is key variable selection. In this step, based on the data established in Step 1, the importance of each variable is evaluated to select appropriate input variables. In this study, principal component analysis (PCA), Best Subset, and Gini Importance were estimated and compared to find a variable selection method suitable for the PMV prediction model.
Step 3 is the prediction model development. In this step, the optimal machine learning model for PMV prediction is found using the input variables derived in Step 2. To find a machine learning algorithm which is suitable for PMV prediction, prediction models were generated using ANN, LSTM, and Random Forest, and the results were compared with each other.
Step 4 is verifying the performance of the PMV prediction model created through Steps 1–3. Comparative analyses were performed based on the model calculation time and prediction accuracy, which vary depending on the variable selection method and the type of prediction model.

2.2. Dataset

The ASHRAE Global Thermal Comfort Database II was used as the major dataset. It summarizes the results of field studies conducted globally from 1995 to 2015 and consists of a total of 81,846 records of indoor environmental measurements and survey data. From the 52 fields denoting building information and environmental data, the 6 variables essential for PMV (Ta, RH, Vel, MRT, Clo, and Met) were collected, together with the PMV value, to build a dataset with 7 columns. A total of 25,261 records were used for training and verification, after excluding incomplete data. The ratio of training data to verification data was kept at 70% to 30% throughout the study. Figure 2 shows the data composition and utilization.
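As a minimal sketch of this step (the file name and column labels below are our placeholders; the actual Database II CSV uses its own field names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder file/column names; map them to the actual Database II fields.
COLS = ["Ta", "Tg", "RH", "Vel", "Clo", "Met", "PMV"]
df = pd.read_csv("ashrae_db2.csv", usecols=COLS).dropna()  # drop incomplete rows

X, y = df.drop(columns="PMV"), df["PMV"]
# 70/30 train/verification split, as used throughout the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```

The tables in Section 3 report Tg (globe temperature) rather than MRT, so the sketch uses Tg.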

2.3. Feature Selection

When a predictive model is developed from several variables, there is a risk of low accuracy caused by the interference of noise, or of an unnecessarily heavy prediction model caused by redundant data. Therefore, when developing a predictive model using machine learning, it is essential to extract the key variables. Conventionally, six variables are used to calculate PMV; here, the importance of these variables was calculated using PCA, Best Subset, and Gini Importance to reduce the variable dimensions. Minitab, Weka, and Python were used for feature selection in this study [23,24].
PCA is a statistical variable selection method and a representative multivariate technique for dimensionality reduction in data of various patterns [25]. Principal components are formed by projecting the data onto the axes of maximum variance; by discarding components with small variance, both dimensionality and noise can be reduced. Best Subset is also a statistical variable selection method that finds an optimal model among all possible subsets of input variables in a regression equation [26]. This technique has been used for decades as a highly effective method, especially for linear regression prediction models [27]. Finally, feature selection was performed using the Gini Importance computed by the Random Forest model as an index [28]. Gini Importance is a value between 0 and 1 derived from the Gini impurity of the Random Forest; the closer to 1, the more important the variable [29].
Through these techniques, the importance of the input variables was determined, and three different datasets were constructed accordingly. The variables in each dataset were then used as inputs to build predictive models with various machine learning techniques.
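The three techniques can be sketched as follows with scikit-learn (the paper used Minitab, Weka, and Python [23,24]; this code is our approximation, reusing X_train and y_train from the dataset sketch above):

```python
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# --- PCA: eigenanalysis of the correlation matrix (standardize first) ---
pca = PCA().fit(StandardScaler().fit_transform(X_train))
print("Eigenvalues:", pca.explained_variance_.round(4))
print("Cumulative proportion:", np.cumsum(pca.explained_variance_ratio_).round(3))

# --- Best Subset: exhaustive search over all variable combinations ---
def r2_of(subset):
    cols = list(subset)
    return LinearRegression().fit(X_train[cols], y_train).score(X_train[cols], y_train)

for k in range(1, X_train.shape[1] + 1):
    best = max(combinations(X_train.columns, k), key=r2_of)
    print(f"{k} variables: {best}, R2 = {r2_of(best):.1%}")

# --- Gini Importance: impurity-based importances from a Random Forest ---
# (scikit-learn's feature_importances_ is the regression analogue of
# the Gini importance used in the paper.)
rf = RandomForestRegressor(n_estimators=1000, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
for name, imp in sorted(zip(X_train.columns, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

With only six candidate variables, the exhaustive search covers just 63 subsets, so Best Subset remains cheap here.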

2.4. Prediction

After selecting the input variables, various machine learning techniques were reviewed to develop a predictive model for PMV. In this study, prediction models were built using ANN, a representative neural network model; LSTM, a type of RNN; and Random Forest, an ensemble of decision trees.
The ANN used in this study is a neural network model developed by W.S. McCulloch and W. Pitts in 1943, inspired by the structure of human neural transmission [30]. The model consists of an input layer composed of several input variables, hidden layers, and an output layer for the predicted value. Here, the input layer is set to the input variables selected in Step 2 and the output layer to the PMV value. Figure 3 shows the structure of the ANN used in this study.
The second model compared, LSTM, is a neural network that mitigates the vanishing and exploding gradient problems of the recurrent neural network (RNN) and is specialized for handling time series data [31]. In particular, as can be seen in Figure 4, the LSTM model contains memory cells and a forget gate that determines the degree to which previous data affect the present; this makes prediction effective when past outputs influence the current output value [32].
Finally, Random Forest is an effective algorithm for classification, clustering, and regression, and is an ensemble technique proposed by Breiman [33]. A random forest is composed of several decision trees, each of which grows by recursively splitting the training data into two sub-nodes [34]. Figure 5 conceptually shows the structure of a random forest.
To find the most suitable feature selection method for PMV prediction, a total of nine combinations of selection and prediction methods were set, as shown in Table 3. The performance of the prediction model for each case was compared based on calculation time and accuracy.

3. Results and Discussion

3.1. Feature Importance

In this study, major variables were selected using PCA, Best Subset, and Gini Importance. The PCA analysis found that the cumulative proportion through PC4 accounted for 91% of the total eigenvalue. Components are commonly retained at an eigenvalue threshold of 1, but this study used a threshold of 0.7, as suggested by Jolliffe [35], and four variables, Ta, Tg, Clo, and Met, were selected as major variables. Table 4 shows the eigenanalysis results for PCA.
The results of Best Subset are shown in Table 5. The R2 value changed with the number of input variables; in particular, accuracy increased significantly when the number of input variables rose from two to three. Among the three-variable sets, the combination of Tg, Clo, and Met gave the highest R2 at 87.4%, so these three were selected as the main variables.
Finally, when the Gini Importance of the Random Forest was calculated, the importance of Vel was less than 0.1, significantly smaller than that of the other variables. Therefore, using a Gini Importance threshold of 0.1, the five variables excluding Vel were selected as the main variables. Figure 6 shows the Gini Importance calculation results, and Table 6 summarizes the main variables selected by each feature selection method.

3.2. Prediction Performance

A prediction model was created with ANN, LSTM, and Random Forest using the main variables selected under each criterion as input variables; the performance of each model was then analyzed in terms of accuracy and computation time. Accuracy was evaluated with the R2, MAPE, and cvRMSE values computed between the predictions (y_pred) and the measured test values (y_real). MAPE, RMSE, and cvRMSE are given by Equations (1)-(3). R2 is commonly used as a measure of prediction accuracy, while MAPE and cvRMSE quantify the error between predictions and measured data. Therefore, the R2 value represents the accuracy of the prediction model.
$$\mathrm{MAPE} = \frac{1}{N}\sum \left| \frac{y_{pred} - y_{real}}{y_{real}} \right| \times 100 \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum \left( y_{pred} - y_{real} \right)^{2}}{N}} \tag{2}$$

$$\mathrm{cvRMSE} = \frac{\mathrm{RMSE}}{\bar{y}_{real}} \times 100 \tag{3}$$
where:
  • MAPE = Mean absolute percentage error (%);
  • RMSE = Root mean square error;
  • cvRMSE = Coefficient of variation RMSE (%);
  • ypred = Prediction data;
  • yreal = Real data;
  • N = Number of data.
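A direct implementation of Equations (1)-(3), plus R2, could look like the following (a sketch; note that MAPE is undefined wherever y_real = 0, which matters for PMV values at exactly neutral):

```python
import numpy as np

def evaluate(y_pred, y_real):
    """Return (MAPE, cvRMSE, R2) following Equations (1)-(3)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_real = np.asarray(y_real, dtype=float)
    mape = np.mean(np.abs((y_pred - y_real) / y_real)) * 100      # Eq. (1)
    rmse = np.sqrt(np.mean((y_pred - y_real) ** 2))               # Eq. (2)
    cvrmse = rmse / np.mean(y_real) * 100                         # Eq. (3)
    # Coefficient of determination, as in sklearn.metrics.r2_score.
    r2 = 1 - np.sum((y_real - y_pred) ** 2) / np.sum((y_real - y_real.mean()) ** 2)
    return mape, cvrmse, r2
```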
Each machine learning model has a different structure, but within each model the same hyper-parameters were shared across all cases. For the ANN, a multilayer perceptron regressor with ReLU activation was used; the numbers of hidden layers and nodes were 20 and 8, respectively, the momentum was set to 0.4, and the number of iterations to 1000. For LSTM, a sequential model with linear activation was used, with the number of units set between 50 and 64; owing to the independent data shape, the Dense output layer was set to 1, for a total of 39,905 parameters. For the Random Forest regressor, the number of estimators was set to 1000 and the minimum leaf and split sizes to 2. All other hyper-parameters were left at their automatic defaults.
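Under one reading of these settings, the three models could be configured roughly as below (a sketch, not the authors' code: we interpret "20 and 8" as 20 hidden layers of 8 nodes each, pick scikit-learn's SGD solver so that its momentum parameter applies, and fix the LSTM at 64 units):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
import tensorflow as tf

def make_ann():
    # MLP regressor: ReLU activation, momentum 0.4, 1000 iterations.
    return MLPRegressor(hidden_layer_sizes=(8,) * 20, activation="relu",
                        solver="sgd", momentum=0.4, max_iter=1000)

def make_lstm(n_features):
    # Sequential LSTM with a linear single-unit Dense output; inputs must be
    # reshaped to (samples, 1, n_features) before fitting.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(1, n_features)),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def make_rf():
    # 1000 trees; minimum leaf and split sizes of 2; other settings default.
    return RandomForestRegressor(n_estimators=1000, min_samples_leaf=2,
                                 min_samples_split=2, n_jobs=-1)
```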
Among Cases 1–3, which were based on the ANN prediction model, the model using the four input variables selected by PCA performed best; the accuracies for Cases 1–3 were 89.70%, 88.96%, and 87.83%, respectively. In Cases 4–6, based on the LSTM prediction model, the Best Subset variant had the highest accuracy at 87.97%, even with fewer variables than PCA. Finally, for the Random Forest prediction model, Case 8, which used the smallest number of input variables, showed the highest accuracy, although the differences across Cases 7–9 were minor.
Regarding calculation time, the time required was mainly governed by the volume of data determined by the number of input variables, but there were also significant differences between algorithms. For ANN, Case 1 was the fastest at 140 s, while Case 3 was the slowest at 156 s. The LSTM model's calculation speed followed the order of Case 5, Case 4, and Case 6, in proportion to the amount of input data; the Random Forest model's calculation time was likewise proportional to the amount of data. The performance of the predictive models in relation to the selection of input variables is shown in Table 7.

3.3. Discussion

In this study, the dimensions of the six input variables conventionally used for PMV calculation were reduced, a prediction model was created through machine learning, and its performance was analyzed. The number of input variables could be reduced to between three and five, depending on the feature selection method. Overall, Tg, Clo, and Met turned out to be of higher importance for PMV prediction, whereas RH and Vel showed relatively low importance. Previous studies confirm that even when RH is measured at a single location, sensor readings tend to fluctuate, since the value is dynamically affected by Ta and Vel. In addition, the calculation speed of the predictive model was found to be determined by the reduced number of input variables (according to their importance).
Among the nine cases examined in this study, Case 1, an ANN prediction model with input variables selected by PCA, had the highest accuracy at 89.70%. In terms of operation speed, the other performance indicator, Case 8, a Random Forest model with input variables selected by Best Subset, was the fastest. On average, the ANN model had high accuracy but slow operation, while among the feature selection methods Best Subset achieved high accuracy relative to its number of input variables. Random Forest showed somewhat lower accuracy than the other prediction models, but its accuracy barely changed with the input variable selection method, and its calculation time was the shortest. These analyses show that, compared with the other methods for predicting PMV, ANN has strength in accuracy and Random Forest in stability. Figure 7 plots the computation time and performance of the prediction models; overall, computation time and accuracy were directly proportional.
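Figure 7 can be reproduced directly from the Table 7 values; a minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

# (computation time [s], R2 [%]) per case, taken from Table 7.
results = {
    "ANN+PCA": (140, 89.70),  "ANN+Best": (142, 88.96),  "ANN+Gini": (156, 87.83),
    "LSTM+PCA": (153, 84.90), "LSTM+Best": (119, 87.97), "LSTM+Gini": (169, 82.23),
    "RF+PCA": (39, 81.75),    "RF+Best": (30, 82.28),    "RF+Gini": (51, 82.02),
}
for label, (t, r2) in results.items():
    plt.scatter(t, r2)
    plt.annotate(label, (t, r2), textcoords="offset points", xytext=(4, 4))
plt.xlabel("Computation time [s]")
plt.ylabel("R\u00b2 [%]")
plt.title("Accuracy vs. computation time (data from Table 7)")
plt.show()
```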

3.4. Contributions and Limitations

Conventionally, Tg, Clo, Met, and Vel have been considered difficult-to-measure metrics because of the risk of high deviations during measurement. In this vein, this study explored ways to exclude these metrics from PMV estimation, yet it confirmed the relatively high importance of Tg, Clo, and Met. Future studies should therefore seek approaches that reduce the measurement deviation of these three metrics.
Despite these limitations, the major goals of this study, finding an appropriate feature selection method and building a framework for predicting PMV through machine learning, were thoroughly addressed. Additionally, this study contributes a comprehensive quantitative analysis of those difficult-to-measure metrics.

4. Conclusions

PMV is the most commonly used thermal comfort index, but its six conventional input variables are difficult to measure accurately in practice. Thus, in this study, machine learning was used to predict PMV with a minimized set of input variables, allowing simpler and faster PMV prediction. Since the appropriate preprocessing method differs with the type and nature of the data, input variables were selected separately with PCA, Best Subset, and Gini Importance.
According to each criterion, three main input variables were selected by Best Subset, four by PCA, and five by Gini Importance. Although there were minor differences between the selection methods, Tg, Clo, and Met consistently appeared as important variables, while RH and Vel turned out to be relatively less important. Predictive models were then created with ANN, LSTM, and Random Forest for each variable selection method, and their performance was compared in terms of accuracy and computation time. The ANN model using the four variables selected through PCA was the most accurate at 89.70%, while the Random Forest model using the three input variables selected by Best Subset was the fastest, taking only 30 s. Based on these results, the most suitable prediction model can be chosen according to whether accuracy or calculation time matters more. Furthermore, it was confirmed that the number of input variables essential to predict PMV can be reduced to three or four.
Although this study has limitations, in that it neither weighs the relative importance of accuracy against computation time nor answers whether the model represents TSV better than Fanger's PMV, it demonstrated that the conventional six parameters can be reduced to almost half. As a result, PMV can be estimated with a simpler approach and a faster calculation time.
With the growing demands of building energy management systems (BEMS) and IoT-integrated smart buildings, this research addresses their need to minimize the effort and time required for PMV prediction. Given easier and faster PMV prediction, HVAC systems can be operated optimally to minimize operational energy without compromising user thermal comfort. Furthermore, as the computational load on the CPU decreases, the overall budget for BEMS or IoT-based smart technology can be reduced significantly, lowering the barrier to entry for building owners to adopt these advanced technologies.

Author Contributions

Writing—original draft preparation, K.-Y.P.; writing—review and editing, D.-O.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

This study was conducted through the E-Challenge 5 program with the support of the Engineering Society of Detroit and DTE Energy.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

Clo     Clothing insulation
Cv      Coefficient of variation
Met     Metabolic rate
MRT     Mean radiant temperature (°C)
RH      Relative humidity (%)
Ta      Air temperature (°C)
Tg      Globe temperature (°C)
Vel     Air velocity (m/s)
Acronym
ANN     Artificial neural network
ASHRAE  American Society of Heating, Refrigerating and Air-Conditioning Engineers
BEMS    Building energy management system
EPBD    Energy Performance of Buildings Directive
HVAC    Heating, ventilation, and air-conditioning
IoT     Internet of Things
IR      Infrared
LSTM    Long short-term memory
MAPE    Mean absolute percentage error
SVM     Support vector machine
PCA     Principal component analysis
PMV     Predicted mean vote
RNN     Recurrent neural network
RMSE    Root mean square error
TSV     Thermal sensation vote

References

  1. IEA. Global Energy & CO2 Status Report; IEA: Paris, France, 2019.
  2. EPBD. Directive 2002/91/EC of the European Parliament and the Council, 16 December 2002, Concerning the Energy Efficiency of the Buildings. 2002. Available online: https://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX%3A32002L0091 (accessed on 23 January 2022).
  3. Figueiredo, A.; Figueira, J.; Vicente, R.; Maio, R. Thermal comfort and energy performance: Sensitivity analysis to apply the Passive House concept to the Portuguese climate. Build. Environ. 2016, 103, 276–288.
  4. Park, K.Y.; Woo, D.O.; Leigh, S.B.; Junghans, L. Impact of hybrid ventilation strategies in energy savings of buildings: In regard to mixed-humid climate regions. Energies 2022, 15, 1960.
  5. Feist, W.; Schnieders, J.; Dorer, V.; Hass, A. Re-inventing air heating: Convenient and comfortable within the frame of the Passive House concept. Energy Build. 2005, 37, 1186–1203.
  6. Fanger, P.O. Thermal Comfort; McGraw-Hill: New York, NY, USA, 1972.
  7. ISO 7730; Ergonomics of the Thermal Environment. ISO: Geneva, Switzerland, 2005.
  8. ASHRAE Handbook of Fundamentals, Ch. 9: Thermal Comfort; American Society of Heating, Refrigerating, and Air Conditioning Engineers: Peachtree Corners, GA, USA, 2017.
  9. Guenther, J.; Sawodny, O. Feature selection and Gaussian Process regression for personalized thermal comfort prediction. Build. Environ. 2019, 148, 448–458.
  10. Chaudhuri, T.; Soh, Y.C.; Li, H.; Xie, L. Machine learning based prediction of thermal comfort in buildings of equatorial Singapore. In Proceedings of the 2017 IEEE International Conference on Smart Grid and Smart Cities, Singapore, 23–26 July 2017; pp. 72–77.
  11. Khan, M.H.; Pao, W. Thermal comfort analysis of PMV model prediction in air conditioned and naturally ventilated buildings. Energy Procedia 2015, 75, 1373–1379.
  12. Rohles, F.H., Jr. Thermal sensations of sedentary man in moderate temperatures. Hum. Factors 1971, 13, 553–560.
  13. Buratti, C.; Ricciardi, P.; Vergoni, M. HVAC systems testing and check: A simplified model to predict thermal comfort conditions in moderate environments. Appl. Energy 2013, 104, 117–127.
  14. Sherman, M. A simplified model of thermal comfort. Energy Build. 1985, 8, 37–50.
  15. Feng, Y.; Liu, S.; Wang, J.; Jao, Y.L.; Wang, N. Data-driven personal thermal comfort prediction: A literature review. Renew. Sustain. Energy Rev. 2022, 161, 112357.
  16. Farhan, A.A.; Pattipati, K.; Wang, B.; Luh, P. Predicting individual thermal comfort using machine learning algorithms. In Proceedings of the 2015 IEEE International Conference on Automation Science and Engineering, Gothenburg, Sweden, 24–28 August 2015; pp. 708–713.
  17. Lee, K.S.; Choi, H.N.; Kim, H.K.; Kim, D.D.; Kim, T.Y. Assessment of a real-time prediction method for high clothing thermal insulation using a thermoregulation model and an infrared camera. Atmosphere 2020, 11, 106.
  18. Park, J.S.; Choi, H.N.; Kim, D.H.; Kim, T.Y. Development of novel PMV-based HVAC control strategies using a mean radiant temperature prediction model by machine learning in Kuwaiti climate. Build. Environ. 2021, 206, 108357.
  19. Ruivo, C.R.; da Silva, M.G.; Broday, E.E. Study on thermal comfort by using an atmospheric pressure dependent predicted mean vote index. Build. Environ. 2021, 206, 108370.
  20. Bolon-Canedo, V.; Sanchez-Marono, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2013, 34, 483–519.
  21. Kumar, V.; Minz, S. Feature selection: A literature review. SmartCR 2014, 4, 211–229.
  22. ASHRAE. ASHRAE Global Thermal Comfort Database II; American Society of Heating, Refrigerating, and Air Conditioning Engineers: Peachtree Corners, GA, USA, 2018.
  23. Helsel, D.R. Statistics for Censored Environmental Data Using Minitab and R; John Wiley & Sons: Hoboken, NJ, USA, 2011; p. 77.
  24. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18.
  25. Smith, L.I. A Tutorial on Principal Component Analysis; University of Otago: Dunedin, New Zealand, 2002.
  26. Zhang, Z. Variable selection with stepwise and best subset approaches. Ann. Transl. Med. 2016, 4, 136.
  27. Furnival, G.M.; Wilson, R.W. Regressions by leaps and bounds. Technometrics 2000, 42, 69–79.
  28. Han, H.; Guo, X.; Yu, H. Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In Proceedings of the 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 26–28 August 2016; pp. 219–224.
  29. Jang, J.; Lee, J.; Son, E.; Park, K.; Kim, G.; Lee, J.; Leigh, S.B. Development of an improved model to predict building thermal energy consumption by utilizing feature selection. Energies 2019, 12, 4187.
  30. McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biol. 1990, 52, 99–115.
  31. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  32. Liang, C.; Li, H.; Lei, M.; Du, Q. Dongting Lake water level forecast and its relationship with the Three Gorges Dam based on a long short-term memory network. Water 2018, 10, 1389.
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  34. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; Volume 2, pp. 1–758.
  35. Jolliffe, I.T. Principal Component Analysis for Special Types of Data; Springer: New York, NY, USA, 2002; pp. 338–372.
Figure 1. Classification of feature selection techniques [20].
Figure 2. Diagram of dataset for predicting PMV values.
Figure 3. Structure of artificial neural network to predict the PMV value.
Figure 4. Structure of long short-term memory.
Figure 5. Structure of a random forest.
Figure 6. Calculation results of Gini Importance by Random Forest.
Figure 7. Comparison of the performance of the prediction model according to prediction time and accuracy.
Table 1. ASHRAE thermal sensation scale [8].

Thermal Sensation    Scale
Hot                  +3
Warm                 +2
Slightly warm        +1
Neutral               0
Slightly cool        −1
Cool                 −2
Cold                 −3
Table 2. Steps of the research process.

Step 1 (Establishing dataset): establish the dataset using the ASHRAE Global Thermal Comfort Database II; eliminate all variables except those correlated with PMV, i.e., Ta, RH, Vel, MRT, Clo, and Met.
Step 2 (Feature selection): calculate the importance of each variable with various feature selection methods; find the optimum feature selection method for predicting the PMV value.
Step 3 (Prediction model): predict the PMV value using the input variables selected in Step 2; find the optimum prediction model among the machine learning algorithms.
Step 4 (Evaluation): compare the computation time and accuracy for each case.
Table 3. Cases to find the PMV prediction method.

Prediction       PCA      Best Subset   Gini Importance
ANN              Case 1   Case 2        Case 3
LSTM             Case 4   Case 5        Case 6
Random Forest    Case 7   Case 8        Case 9
Table 4. Eigenanalysis of the correlation matrix.

              PC1      PC2      PC3      PC4      PC5      PC6
Eigenvalue    2.8035   1.0288   0.8778   0.7490   0.4702   0.0707
Proportion    0.467    0.171    0.146    0.125    0.078    0.012
Cumulative    0.467    0.639    0.785    0.910    0.988    1.000
Table 5. Results of Best Subset regression (R2 [%] of the two best models for each number of input variables).

Num. of Variables    R2 (best)    R2 (second best)
1                    44.1         38.7
2                    75.1         69.0
3                    87.4         81.6
4                    89.8         88.6
5                    92.5         90.7
6                    92.9         -
Table 6. Results of feature selection according to method (✓ = variable selected; cf. Section 3.1).

Feature Selection    Ta   Tg   RH   Vel   Clo   Met   Number of Inputs
PCA                  ✓    ✓    -    -     ✓     ✓     4
Best Subset          -    ✓    -    -     ✓     ✓     3
Random Forest        ✓    ✓    ✓    -     ✓     ✓     5
Table 7. Comparison of the prediction models' performance.

Model            Feature Selection   MAPE [%]   cvRMSE [%]   R2 [%]   Time [s]
ANN              PCA                 6.36       7.68         89.70    140
ANN              Best Subset         6.62       7.87         88.96    142
ANN              Gini Importance     7.01       8.15         87.83    156
LSTM             PCA                 6.18       8.23         84.90    153
LSTM             Best Subset         5.45       7.03         87.97    119
LSTM             Gini Importance     5.99       7.42         82.23    169
Random Forest    PCA                 6.81       8.62         81.75    39
Random Forest    Best Subset         6.72       8.50         82.28    30
Random Forest    Gini Importance     6.69       8.56         82.02    51