1. Introduction
According to the Global Alliance for Buildings and Construction (GABC) Global Status Report, buildings account for approximately 40% of global energy use and 38% of global greenhouse gas (GHG) emissions [1]. Commercial buildings play a crucial role as the primary contributors to energy consumption and GHG emissions in the sector. In line with this, the World Green Building Council (World GBC) has emphasized the urgency of adopting net-zero carbon buildings as standard practice in the commercial sector, with a target starting in 2030 [2]. Accurately predicting building energy consumption and developing building energy efficiency strategies (e.g., optimization of air conditioning and lighting systems) is among the most effective ways to reduce GHG emissions from buildings. It is therefore of great practical importance to study the mechanisms and patterns of building energy consumption and to develop accurate and effective building energy prediction models that can assess average annual energy consumption at the early stage of building design.
Current building energy consumption prediction methods fall into three main categories: physical modeling (white-box models), data-driven modeling (black-box models), and hybrid modeling (gray-box models) [3,4]. Physical modeling methods use thermodynamic principles for energy consumption modeling and analysis, e.g., thermodynamic models [5] and computational fluid dynamics (CFD) models [6]. Accurate simulations usually require very detailed building information as input, such as individual spatial characteristics and the thermal properties of building materials, which are often difficult to obtain [7]. In addition, physical modeling is often applied to individual buildings or small subsets of buildings and is not well suited to depicting the spatial variability of energy use in neighborhoods, blocks, or broader areas [8]. Compared with physical modeling approaches, data-driven approaches do not require extensive expertise in parametric mechanisms and internal building components; they use historical data for energy consumption prediction and deliver more accurate and faster predictions [9]. Hybrid approaches combine physical and data-driven methods, using the outputs of physical models as inputs to data-driven models [10]. These models aim to offset some of the limitations of physical modeling through the flexibility of statistical methods [11]. However, hybrid modeling requires more computational resources and incurs high runtimes and computational costs, as two models must be run simultaneously. Moreover, because two different types of models are involved, the predicted results may be more difficult to interpret and understand.
Energy consumption in buildings is influenced by a variety of factors, such as building type, climatic conditions, building structure, heating system, and human activities. Predicting building energy consumption is therefore a complex and variable problem, and machine learning models hold advantages over traditional statistical models for this task. For example, Arjunan [12] compared the performance of multiple linear regression (MLR), multiple linear regression with feature interactions (MLRi), and gradient boosted trees (GBT) and showed that GBT outperformed the two linear regression models. Chen et al. [13] developed a neural network model specifically for high-rise buildings, showcasing its accuracy, computational speed, and flexibility compared with conventional building simulation methods. This model enables designers to predict building performance during the early stages of design, thereby improving energy efficiency and occupant comfort. Fu et al. [14] applied a clustering method to study the factors influencing energy consumption in a school building, and the results showed that occupancy rate was the main influencing factor.
In addition to the above machine learning models, many scholars have begun to explore the effectiveness of deep learning for building energy consumption prediction and have achieved promising results. For example, Razak's study [15] compared the performance of eight machine learning models and one deep learning model in predicting the annual energy consumption of residential buildings. The results showed that the deep neural network (DNN) outperformed the other machine learning models with an R2 of 0.95 and an RMSE of 1.19, which motivates building designers to use it to make informed decisions and to manage and optimize their designs prior to construction. In Moisés' study [3], the hourly electricity consumption of a single-family home was predicted using a variety of machine learning and deep learning models, and the results showed that LSTM had the best prediction performance (nRMSE of 4.74%). Beyond residential energy consumption, LSTM also shows excellent results for other building types. For example, the LSTM models developed in the studies of Kim [16] and Li [17] achieved better performance in predicting the energy consumption of hospital buildings. The study of Dinh [18] also demonstrated the effectiveness of LSTM in predicting the energy consumption of a commercial building.
Although numerous studies [19,20,21] have shown that LSTM performs well in predicting building energy consumption, it is limited to short-term prediction (energy consumption at a daily or hourly time granularity) [18]. Moreover, most previous studies have been conducted on a specific building or a small number of buildings in one region, which provides limited guidance on energy efficiency for other building types and scales poorly. More generalized predictive models are lacking and, with the increased availability of data on actual energy consumption, there is now ample opportunity to apply advanced techniques to large datasets in order to build such models [22].
The Commercial Building Energy Consumption Survey (CBECS) [23] is the most comprehensive publicly available dataset on commercial building energy use in the United States, originally developed to make statistical inferences about the national commercial building population [24]. The CBECS is also attractive to building energy modelers due to its ample amount of data, the representativeness of its sample, and its broad geographic coverage [22]. Deng et al. [22] predicted the total EUI, heating EUI, cooling EUI, and plug load EUI of office buildings using the CBECS dataset, and the results showed that the RF model, with an RMSE of 28.3, was superior to other machine learning models. Norouziasl et al. proposed a data-driven framework for predicting the energy consumption of lighting in office buildings; the results showed that the Support Vector Machine (SVM) algorithm provided the best prediction performance, with an R-squared value of 0.78. Robinson et al. [11] used the CBECS dataset to develop separate prediction models for 18 building types, including office buildings, hotels, and educational buildings, using machine learning models such as linear regression, random forest, and gradient boosting. The results showed that Extreme Gradient Boosting (XGBoost) achieved higher prediction performance (R2 of 0.82) than the other machine learning models. However, this approach is time-consuming and has low scalability, as each building type has to be modeled individually. Kumar et al. [25] developed a Random Forest prediction model for whole-building fuel consumption in commercial buildings, which offers better prediction and scalability, with training and validation delays of only 0.82 s and 1.14 s, respectively.
Despite the widespread use of various data-driven models in the field of building energy consumption, machine learning methods remain the dominant approach due to limitations in data volume and dimensionality. However, numerous studies have shown the great potential of deep learning methods in this field [26,27,28,29]. For instance, Goodfellow et al. [30] argued that around 5000 labeled examples are enough for deep learning algorithms to achieve acceptable performance. Nevertheless, deep learning methods have their own limitations and face significant challenges in interpreting predicted results. Furthermore, most prediction models are built for a single building type, lacking the ability to characterize other similar buildings and limiting their guidance for energy saving and emission reduction. Therefore, this study aims to develop a model based on Deep Forest (DF) and SHAP value theory for assessing energy consumption in multiple types of commercial buildings. In summary, the novelty of this study is as follows:
Current deep neural network algorithms in the field of building energy consumption prediction achieve good predictive performance but sacrifice interpretability. The Deep Forest algorithm adopted in this paper fills the gap of poor interpretability of deep learning in this field;
Unlike the predictive models developed in the literature based on a single or a few building types, the building energy assessment model in this paper was developed based on 20 commercial building types, such as office buildings, warehouses, and schools. It therefore has broader adaptability and can provide energy consumption predictions and energy-saving recommendations for all 20 commercial building types.
The rest of this study is organized as follows. Section 2 describes the proposed modeling framework, dataset, and related theories for the energy consumption assessment of commercial buildings. Section 3 presents the performance evaluation of the deep learning models used to predict energy consumption and the interpretation results of the DF model. Finally, Section 4 concludes this study.
2. Materials and Methods
The building energy consumption assessment model proposed in this paper is a deep learning-based prediction method built on DF. The framework of this study is shown in Figure 1. First, the CBECS dataset is pre-processed, including missing value, outlier, and constant value handling. Subsequently, after removing a large number of redundant features with Spearman correlation filtering, feature selection is performed using three different types of methods: mutual information (MI) feature selection, sequential forward selection (SFS), and the Least Absolute Shrinkage and Selection Operator (LASSO). The features extracted by the three selection methods are then used as inputs for energy consumption prediction in three deep learning and two machine learning models, and the prediction performance of each combination of feature selection algorithm and prediction model is compared to develop an effective energy consumption assessment model. Finally, SHAP values are used for an interpretability study of the established prediction model: the main influencing factors are analyzed, the causes of high energy consumption are identified, and the prediction results and analysis conclusions serve as feedback for formulating energy-saving strategies that, in turn, guide the energy-saving design of new buildings.
2.1. Data Description
CBECS 2012 is the largest commercial building energy survey conducted by the U.S. Energy Information Administration (EIA) to date, providing data on the annual energy consumption of over 6700 commercial buildings [22]. This dataset represents approximately 5.6 million commercial buildings across the United States [31]. Figure 2a displays the distribution of different building types, including office buildings, school buildings, and shopping malls, among others. Office buildings are of particular interest to most researchers due to their higher representation compared with other building types. The dataset comprises more than 1181 features related to four types of energy consumption, covering heating, cooling, ventilation, and others. Figure 2b illustrates the percentage of energy consumption in commercial buildings: fuel and electricity account for over 80% of total energy consumption and are the primary sources of energy use and emissions, while the remaining two energy sources contribute only 17.3% of the total. Because the remaining two sources have a significant number of missing values and a relatively low contribution to total energy consumption (less than 20%), this study combines fuel and electricity consumption to represent total energy consumption.
2.2. Data Pre-Processing
While the CBECS 2012 dataset provides valuable insights for energy efficiency analysis and prediction research, it is not exempt from certain limitations. These include a considerable number of missing values, the presence of outliers, and inconsistencies within the data. These factors can potentially compromise the accuracy and reliability of the analysis. Therefore, it is essential to preprocess the raw data before conducting any data analysis. The preprocessing phase involves various operations, such as data cleaning, missing value imputation, data standardization, and data filtering. These measures are implemented to enhance the quality and reliability of the data, enabling improved data analysis and inference. In this study, the following preprocessing steps were executed on the CBECS 2012 dataset:
Empirical removal of variables not relevant to predicting energy consumption (e.g., 'Imputed roof replacement', which describes whether the roof parameter is estimated or measured) and removal of constant and quasi-constant parameters (<5% variation), leaving 614 features;
To reduce the uncertainty of the data, 162 variables with more than 80% missing values were removed to ensure data quality, leaving 452 features;
Missing values were filled with 0, because a missing value marks a question that was not applicable to the building in the database and 0 values have no significant effect on regression results; in addition, samples containing outliers were removed directly to preserve the characteristics of the data;
To better satisfy the models' assumption that the input data approximate a normal distribution, Z-Score standardization was applied to the data.
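As a concrete illustration of these steps, the preprocessing could be sketched as follows (a minimal sketch in pandas; the file path, the 95% quasi-constant cutoff, and the |z| > 3 outlier rule are illustrative assumptions, not the authors' exact code):

```python
# Sketch of the preprocessing pipeline described above. File path, the
# quasi-constant cutoff, and the outlier rule are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("cbecs2012.csv")  # hypothetical path to the raw microdata

# Step 1: drop constant and quasi-constant variables (<5% variation),
# operationalized here as "top value covers >95% of samples".
quasi_const = [c for c in df.columns
               if df[c].value_counts(normalize=True, dropna=False).iloc[0] > 0.95]
df = df.drop(columns=quasi_const)

# Step 2: drop variables with more than 80% missing values.
df = df.loc[:, df.isna().mean() <= 0.80]

# Step 3: fill remaining missing values with 0 ("not applicable" in CBECS)
# and drop samples with outliers (assumed rule: any |z-score| > 3).
df = df.fillna(0)
num = df.select_dtypes(include=np.number)
z = (num - num.mean()) / num.std(ddof=0)
df = df[(z.abs() <= 3).all(axis=1)]

# Step 4: Z-Score standardization of the remaining numeric features.
df[num.columns] = (df[num.columns] - df[num.columns].mean()) / df[num.columns].std(ddof=0)
```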
2.3. Feature Selection
During the development of an energy consumption assessment model, feature selection plays a crucial role as it directly impacts the model's performance [32]. Building energy consumption features often exhibit high dimensionality, and as the number of features increases, the density of training samples decreases dramatically, leading to an elevated risk of overfitting. Moreover, high-dimensional features require more computational resources and longer training times, which increases the cost of the machine learning problem. Feature selection methods can be classified into three types: filter, wrapper, and embedded methods [33]. In this study, we applied and compared these three typical feature selection methods to evaluate their performance.
MI (Mutual Information) is a commonly used filter feature selection method that measures the correlation between two variables. Specifically, the MI method calculates the mutual information I(X; Y) between two variables X and Y, which quantifies their degree of interdependence. In feature selection, the original dataset is usually represented as an n × p matrix, where n denotes the number of samples and p denotes the number of features. The MI method calculates the mutual information between each feature and the target variable and selects the features with the greatest correlation with the target. The formula of the MI method is shown in Equation (1):

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} \quad (1)$$

where p(x, y) denotes the joint probability of X = x and Y = y, p(x) denotes the marginal probability of X = x, and p(y) denotes the marginal probability of Y = y. The MI method can handle both discrete and continuous variables, does not require prior normalization or standardization of the data, and is widely used in filter feature selection.
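For reference, this kind of MI-based filtering can be sketched with scikit-learn's mutual information estimator (assuming a preprocessed feature matrix X and EUI target y; keeping 20 features is an illustrative choice):

```python
# MI-based filter selection sketch using scikit-learn's estimator of I(X; Y).
from sklearn.feature_selection import SelectKBest, mutual_info_regression

selector = SelectKBest(score_func=mutual_info_regression, k=20)  # top 20
X_selected = selector.fit_transform(X, y)   # X: features, y: EUI target
mi_scores = selector.scores_                # estimated I(feature; target)
```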
SFS begins with an empty set S. It iteratively selects features from the original feature set based on evaluation criteria and adds them to the current feature subset S. This process continues until the desired number of features in the subset is achieved or no further optimal feature can be selected. SFS based on Random Forest (RF) is a popular wrapper feature selection method. It evaluates feature importance by constructing a Random Forest and gradually selects features to build the best model possible. SFS allows for stepwise selection of features, preventing overfitting, and can find the optimal feature subset within a predefined range of feature numbers. SFS has demonstrated excellent performance in fields such as gene expression data analysis, image processing, and text classification.
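A minimal sketch of RF-based SFS with scikit-learn's SequentialFeatureSelector follows (the forest size and the number of selected features are illustrative assumptions):

```python
# Sequential forward selection sketch wrapped around a Random Forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

rf = RandomForestRegressor(n_estimators=200, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=20,
                                direction="forward", cv=5)
X_selected = sfs.fit_transform(X, y)        # greedy forward search
```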
LASSO is a widely employed feature selection method known for its interpretability and stability compared with traditional approaches [33,34,35]. The core concept behind LASSO is to select features based on linear regression while imposing a constraint that the sum of the absolute values of the regression coefficients is less than a threshold. By minimizing the sum of squared residuals under this constraint, the LASSO method forces the coefficients of independent variables weakly correlated with the target variable to become zero. Several techniques are available to solve the LASSO model; in this study, we employ the widely recognized Least Angle Regression (LAR) method [36,37], known for its computational efficiency. The LAR algorithm identifies the next most influential feature along the direction defined by the features already selected, requiring only a few iterations of least squares fitting to obtain the coefficient vector β, thereby achieving a rapid solution.
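A minimal sketch of LASSO selection solved via LARS, using scikit-learn (the cross-validated variant picks the constraint strength automatically; X and y are the preprocessed features and target):

```python
# LASSO feature selection sketch, solved with Least Angle Regression (LARS).
import numpy as np
from sklearn.linear_model import LassoLarsCV

lasso = LassoLarsCV(cv=5).fit(X, y)         # alpha chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)      # indices of nonzero-beta features
```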
2.4. Predictive Modeling
2.4.1. Deep Forest Model
The Deep Forest (DF) model consists of a Multi-Grained Scanning (MGS) structure and a cascade forest structure. The MGS process is illustrated in Figure 3. Each sample in the input dataset is divided by MGS into multiple subsamples, each containing a continuous segment of the feature vector, which are fed into a Random Forest (R-Forest) and a completely random tree forest (E-Forest) for prediction. Each forest outputs a real value, so the model obtains multiple real numbers as prediction outputs and stitches these results into a vector that is input into the cascade forest structure for further prediction.
One source of the superior performance of deep neural networks is their multilayer connection structure [38]; similarly, the cascade forest structure in the Deep Forest model adopts a multilayer cascade structure. The cascade forest consists of multiple cascade layers, each comprising two R-Forests and two E-Forests. As shown in Figure 4, each forest is trained on the multi-grained scan features and uses the out-of-bag estimate as the prediction for the corresponding samples. The four prediction vectors are combined into a new vector and spliced with the features input from the previous layer to form new features. After the second cascade layer receives the features passed down from the first layer, in addition to outputting new features, it compares the prediction performance of this layer with that of the previous layer using an evaluation function. If the performance improvement exceeds a set threshold, the new cascade layer is retained and the output vector of the previous layer is updated and passed down. When the prediction performance of a new cascade layer does not meet expectations, the cascade forest stops training, and all previous cascade layers are combined to form the final cascade forest model. The cascade forest's prediction is obtained by averaging the output values of the last cascade layer.
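To make the layer-growing logic concrete, the following is a highly simplified cascade-forest sketch (no Multi-Grained Scanning; out-of-fold predictions stand in for the out-of-bag estimates, and all hyperparameters are illustrative):

```python
# Simplified cascade-forest sketch: each layer holds two random forests and
# two completely random (extra-trees) forests; a new layer is kept only
# while the validation RMSE improves by more than `tol`.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

def fit_cascade(X_tr, y_tr, X_val, y_val, max_layers=5, tol=1e-4):
    layers, best_rmse = [], np.inf
    aug_tr, aug_val = X_tr, X_val
    for _ in range(max_layers):
        forests = [RandomForestRegressor(n_estimators=100, random_state=i)
                   for i in range(2)]
        forests += [ExtraTreesRegressor(n_estimators=100, random_state=i)
                    for i in range(2)]
        # Out-of-fold predictions stand in for the out-of-bag estimates
        # used by the original Deep Forest.
        oof = np.column_stack([cross_val_predict(f, aug_tr, y_tr, cv=3)
                               for f in forests])
        for f in forests:
            f.fit(aug_tr, y_tr)
        val_pred = np.column_stack([f.predict(aug_val) for f in forests])
        rmse = np.sqrt(mean_squared_error(y_val, val_pred.mean(axis=1)))
        if best_rmse - rmse < tol:     # improvement below threshold: stop
            break
        best_rmse = rmse
        layers.append(forests)
        # Splice the four prediction vectors onto the original features
        # to form the next layer's input.
        aug_tr = np.hstack([X_tr, oof])
        aug_val = np.hstack([X_val, val_pred])
    return layers
```

A maintained open-source implementation of the cascade structure is available in the deep-forest package (CascadeForestRegressor).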
2.4.2. Deep Multilayer Perceptron
Deep Multilayer Perceptron (Deep MLP) is a deep neural network used to efficiently predict nonlinearly varying data, typically with more than three hidden layers. Figure 5 illustrates a multilayer perceptron network structure with a 3-neuron input layer, a 1-neuron output layer, and 4 hidden layers of 4 neurons each. The activation function can be the Sigmoid, Tanh, or ReLU function. Its role is to introduce nonlinear operations into the network so that it can approximate any nonlinear function, substantially improving the model's generalization ability. The most widely used choice is the ReLU function, which, compared with Sigmoid and Tanh, avoids the "vanishing gradient" defect whereby the output y becomes insensitive to further increases in the input x once x is large.
The parameters to be solved in the Deep MLP model are the connection weights W and the bias constants b. The optimization objective of the Deep MLP model is established using the least squares prediction error criterion with a regularization constraint, as shown in Equation (2):

$$\min_{W,\,b} \; \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 + \lambda \lVert W \rVert_2^2 \quad (2)$$

where y_i represents the true value and ŷ_i is the predicted value. The problem is solved using the error Back Propagation strategy. Owing to its multiple hidden layers, numerous neural nodes, and activation functions, Deep MLP significantly improves on the generalization ability and prediction accuracy of the classical neural network model.
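For reference, a Deep MLP of this kind can be sketched with scikit-learn, whose MLPRegressor minimizes exactly this kind of squared-error-plus-L2 objective (layer sizes and alpha are illustrative assumptions):

```python
# Deep MLP regression sketch: four hidden layers, ReLU activations, and an
# L2 penalty (alpha) matching the regularized least-squares objective above.
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(64, 64, 32, 16),
                   activation="relu", alpha=1e-3,   # illustrative values
                   max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
```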
2.4.3. Convolutional Neural Network
Convolutional neural networks (CNNs) are one of the deep learning methods; the network architecture is shown in Figure 6. A CNN is composed of three fundamental components: the convolutional layer, the pooling layer, and the activation layer [39]. The convolutional layer first convolves the input one-dimensional feature data with a one-dimensional convolutional kernel and then applies an activation function to non-linearize the result of the convolution operation [40].
The main function of the pooling layer is to remove redundant information and extract the important features while maintaining feature invariance, i.e., feature extraction and data dimensionality reduction [41]. Common pooling methods include mean pooling and maximum pooling; mean pooling is used less often because its performance is inferior to that of maximum pooling [42,43].
The fully connected layer encodes the local features into global features of the CNN's input data. The model parameters are then adjusted and updated by computing the error between the model prediction and the true label [39] and applying the back-propagation algorithm. After several iterations of training, the loss converges and the model parameters are no longer updated. At this point, the optimal parameter values are obtained and saved for subsequent testing and evaluation of the algorithm on the validation set [44].
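A minimal sketch of such a 1D CNN for a building feature vector, written in PyTorch, follows (layer sizes and the input dimensionality are illustrative assumptions):

```python
# 1D CNN regression sketch for a vector of building features (PyTorch).
import torch
import torch.nn as nn

class CNN1DRegressor(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # activation layer
            nn.MaxPool1d(2),                             # max pooling layer
            nn.Flatten(),
            nn.Linear(16 * (n_features // 2), 1),        # fully connected layer
        )

    def forward(self, x):                  # x: (batch, n_features)
        return self.net(x.unsqueeze(1)).squeeze(-1)

model = CNN1DRegressor(n_features=20)
y_hat = model(torch.randn(8, 20))          # -> predictions of shape (8,)
```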
2.5. Model Evaluation
Based on the prediction results on the test set, the goodness of fit (R2) and the root mean square error (RMSE) are used as the evaluation indexes of regression model performance in this paper. R2 and RMSE are defined as shown in Equations (3) and (4):

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \quad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \quad (4)$$

where ŷ_i is the i-th sample predicted value; y_i is the i-th sample true value; ȳ is the sample mean; and N is the number of samples.
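In code, the two metrics can be computed directly (a sketch, assuming test-set arrays y_test and y_pred):

```python
# Computing Equations (3) and (4) with scikit-learn on test-set predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

r2 = r2_score(y_test, y_pred)                       # Equation (3)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Equation (4)
```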
R2 measures the goodness of fit of the regression model to the data; a higher R2 indicates a better fit and better interpretability [18]. RMSE is used to compare the predictive performance of models. It measures the average difference between the model's predicted values and the actual observed values, i.e., the standard deviation of the prediction error; the smaller the RMSE, the better the predictive performance of the model. It is therefore widely used to evaluate model performance in the field of building energy consumption prediction [45,46,47,48]. By using both R2 and RMSE, we aim to gain a comprehensive understanding of how well the model captures the underlying patterns in the data and how accurately it predicts unseen outcomes.
2.6. Model Interpretation
An effective energy consumption assessment model should not only be accurate but should also provide an appropriate interpretation of the model in terms of its features. Tree-based machine learning algorithms such as XGBoost and RF provide only global feature importance; they cannot show the effect of each feature on the predicted composite energy consumption of an individual sample, nor whether a feature is positively or negatively correlated with composite energy consumption.
This study uses the SHAP method to interpret the proposed energy assessment model. SHAP is an algorithm that uses a game-theoretic approach to interpret the output of any machine learning model. Traditionally, simple models such as decision trees and linear regression are easy to visualize, but visualizing other models, especially ensemble models, is not feasible. To address this problem, Lundberg et al. proposed local, game-theory-based interpretation for advanced decision tree-based models [49]. SHAP values were developed to analyze the results of ensemble decision-tree learning models, especially the importance of the output feature parameters. The specific process of interpreting the trained model using SHAP values is as follows: first, calculate the contribution of each feature vector to the integrated energy consumption; then, average the absolute SHAP values of all features over the interpreted samples to obtain the contribution of each feature to the integrated energy consumption. The SHAP value of the i-th feature is given in Equation (5):

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, \left( |F| - |S| - 1 \right)!}{|F|!} \left[ f_{S \cup \{i\}}\!\left( x_{S \cup \{i\}} \right) - f_S\!\left( x_S \right) \right] \quad (5)$$

where |F| is the number of feature parameters, F is the set of all feature parameters, f is the interpreted model, x is an instance of the interpreted feature vector, i indexes the i-th feature in the feature vector, S is a subset of F \ {i}, and φ_i is the SHAP value of the i-th feature.
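A minimal sketch of applying SHAP to a fitted model follows (assuming the shap package; the model-agnostic KernelExplainer also works for Deep Forest, at the cost of speed, and the sample sizes are illustrative):

```python
# SHAP interpretation sketch for a fitted regression model `model`.
import shap

background = shap.sample(X_train, 100)            # background distribution
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test[:50])  # per-sample contributions
shap.summary_plot(shap_values, X_test[:50])       # global importance view
```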
5. Conclusions
This study proposes a framework for a data-driven commercial building energy assessment model comprising five parts: data pre-processing, feature selection, predictive model development, model evaluation, and model interpretation. Taking the CBECS 2012 dataset as an example, three feature selection methods (the filter method MI, the wrapper method SFS, and the embedded method LASSO) were compared to select the features most favorable for energy consumption prediction; these features were then used as inputs to build three deep learning models and two machine learning models, whose performance was compared. Finally, SHAP theory was used to explain the best-performing model from multiple perspectives and to analyze the explanation results.
Among the results of the proposed framework, the feature selection algorithm SFS yields the most obvious improvement in prediction performance, especially when predicting Total EUI, improving all five prediction algorithms by 2-9.6%. In addition, the combination of SFS with the DF model exhibits the best performance, achieving an R2 value of 0.90. Although the combination of SFS with the Deep MLP model achieves a similar level of accuracy, the Deep MLP model has certain drawbacks compared with DF, such as the complexity of hyperparameter tuning and poorer model stability; the SFS + DF model is therefore the best choice for developing building energy assessment models. Model interpretability was analyzed at three levels: the impact of the 20 features on the output, the impact of individual features on the output, and the impact of the features in a single sample on the output. The global SHAP interpretation plots show that the features with the most significant impact on energy consumption are SQFT and NWKER, both positively correlated with it, followed by NELVTR, WKHRS, etc. In addition, certain features (e.g., NWKER) have stronger interactions, making it difficult to analyze the causes of energy consumption from their feature dependence plots, whereas single-sample SHAP force plots make it easier to see the role of each feature in a given sample and are not affected by feature interactions. The results show that although SQFT is the most influential factor in the global explanation, in the SHAP explanation of an individual sample the influence of certain secondary factors, such as NWKER, NELVTR, and WKHRS, may be greater than that of SQFT. Therefore, architects should take into account the extreme values of non-significant features in addition to the major factors when designing buildings for energy efficiency.
This paper demonstrates the effectiveness of deep learning algorithms in the field of building energy consumption prediction, and the proposed model can provide suggestions and references for building designers, building energy managers, and other related staff.
A limitation of this study is that tuning the hyperparameters of the DF model with a grid search algorithm is time-consuming and does not significantly improve model performance. In future work, the predictive performance of the model will be further improved by pruning and optimizing the structure of the DF model for higher-dimensional feature sets.