Article

PLS-CNN-BiLSTM: An End-to-End Algorithm-Based Savitzky–Golay Smoothing and Evolution Strategy for Load Forecasting

1 Department of Electrical and Computer Engineering, Texas A&M University at Qatar, Doha 3263, Qatar
2 Unité de Recherche de Physique des Semi-Conducteurs et Capteurs, Carthage University, Tunis 2070, Tunisia
3 Laboratory of Energy Applications and Renewable Energy Efficiency (LAPER), El Manar University, Tunis 1002, Tunisia
* Author to whom correspondence should be addressed.
Submission received: 2 September 2020 / Revised: 13 October 2020 / Accepted: 14 October 2020 / Published: 19 October 2020
(This article belongs to the Section A1: Smart Grids and Microgrids)

Abstract: This paper proposes an effective deep learning framework for Short-Term Load Forecasting (STLF) of multivariate time series. The proposed model consists of a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CBiLSTM) based on the Evolution Strategy (ES) method and the Savitzky–Golay (SG) filter (SG-CBiLSTM). The adopted methodology incorporates several preprocessing blocks to enhance the performance of the CBiLSTM model. In particular, a data augmentation strategy is employed to synthetically improve the feature representation of the CBiLSTM model. The augmented data is forwarded to the Partial Least Square (PLS) method to select the most informative features above a predefined threshold. Next, the SG algorithm is applied to smooth the load and enhance the learning capabilities of the underlying system. The structure of the SG-CBiLSTM for the ISO New England dataset is optimized using the ES technique. Finally, the CBiLSTM model generates the output forecasts. The proposed approach demonstrates a remarkable improvement in the performance of the original CBiLSTM model. Furthermore, the experimental results strongly confirm the high effectiveness of the proposed SG-CBiLSTM model compared to state-of-the-art techniques.


1. Introduction

For power systems, grid stability is becoming very sensitive to the bulk integration of renewable energy sources due to the irregularity and seasonality of numerous external factors. These factors fall into two major categories: (i) meteorological, such as temperature, irradiation, and wind speed, and (ii) social, such as working days, lifestyles, and the number of consumers [1]. Thus, grid sustainability is threatened by high fluctuations in the load profiles, which cause financial penalties and serious technical issues in the electrical grid. Load forecasting (LF) is a central solution to cope effectively with this dynamic operational environment. LF is mainly classified into three categories based on the time horizon: long-term, medium-term, and short-term. Long-Term Load Forecasting (LTLF) ranges from months to years ahead and is used for project planning and long-term capital investment return tasks. Medium-term load forecasting results, ranging from a few days to weeks ahead, are employed for asset maintenance scheduling, power interchange with other utilities, and grid security. Finally, Short-Term Load Forecasting (STLF) covers horizons from minutes to days ahead and supports unit commitment, sustainable operations, and safe economic dispatch [2].
STLF is vitally beneficial for optimizing the operational generation capacity between generation units and demand-side units. Furthermore, STLF plays an extremely important role in enabling the bulk penetration of grid-connected wind and solar energy in a reliable manner. The energy balance between the utility grid and the load demand accentuates the need for STLF techniques to comprehensively manage electrical asset utilization and operating efficiency. However, precise STLF remains a fairly challenging problem due to the greater complexity of the underlying load patterns and the high volatility of the meteorological parameters.
Many techniques have been employed to improve the forecasting accuracy of STLF [3,4,5]. Most efforts in the area of LF have focused on the application of Machine Learning (ML) models due to their mature development, large popularity, and ease of implementation [6]. Nevertheless, in many case studies, ML techniques were unable to satisfactorily meet the actual load demand in terms of prediction accuracy [7]. Moreover, manpower and human expertise are still required to tune hyperparameters such as the learning rate and the number of iterations. Since load patterns have great complexity, traditional ML-based STLF models lack accuracy and robustness under narrow variations of the load patterns [8]. Alternatively, Deep Learning (DL) presents a cutting-edge technology to model complex and nonlinear systems, with higher generalization capabilities and better reliability as the depth increases [8]. Some recent advances in DL methods can be found in [9,10,11,12,13]. Nevertheless, DL techniques remain less explored compared to the large deployment of ML techniques, since DL requires a deep understanding of model construction and deployment [2]. Many researchers have paid attention to DL deployment with the progress of hardware and software resources. DL techniques, especially the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), achieve more accurate and defensible forecasts than conventional forecasting-based modeling [14]. The authors in [15] used LSTM-based ensemble learning for multi-step industrial power demand forecasting. In [16], a multi-scale CNN with time-cognition is developed for STLF, implementing a state-of-the-art time coding to emphasize the uniqueness of each moment in a period [16]. This gives more information about the samples' periodicity [16]. To improve cognition for time series data, the authors in [17] coupled LSTM with singular spectrum analysis and variational mode decomposition. However, most CNN and LSTM deployments address classification problems such as natural language recognition and image classification; applications of LSTM and CNN networks to regression tasks are relatively limited according to recently published research.
In [18], the authors proposed a deep CNN method based on modified Residual Networks (ResNet) to perform STLF. The proposed model implements knowledge expertise based on a modified ResNet model to cope with the nonlinear variation of the load demand. However, the computational burden causes a slow and heavy training process that lasts over 8 h. Further, the proposed modified ResNet does not have as high a generalization capability as the original ResNet [18]. The authors in [19] implemented an LSTM model to forecast the load of individual residential households; the proposed method in [19] presents one of the most efficient architectures in the realm of sequential learning. The forecasting framework is coupled with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm for the consistency analysis of the daily power profile. The proposed LSTM model provided accurate results compared to the back-propagation neural network and empirical Mean Absolute Percentage Error (MAPE) minimization [19]. However, even with a deep architecture, the proposed LSTM-based framework was found to be very sensitive to the inconsistency of residents' lifestyles, which significantly diminishes its extrapolation ability [19].
The authors in [14] introduced a hybrid CNN-LSTM model for residential energy consumption forecasting. While the proposed model outperforms several benchmarks, the visualization graphs and score metrics clearly show that the experimental results are not satisfactory, with a MAPE of 34.84%, due to the irregular trend of the power consumption data. To the authors' best knowledge, the research works to date focus on the development of the forecasting engine rather than a deep investigation of the feature representations of the load data itself. DL architectures can automatically carry out the data processing; however, very few research works are oriented towards the hybridization of DL models with different feature engineering techniques to achieve higher prediction accuracy [20,21]. Therefore, this paper aims to tackle the non-linearity and volatility of STLF by coupling data augmentation, variable importance analysis, and feature smoothing with a hybrid CBiLSTM model, with the aim of providing a highly reliable hourly STLF system. The major contributions of this paper are the following:
  • A novel forecasting framework is comprehensively presented for STLF. The proposed SG-CBiLSTM model is recommended for highly nonlinear time series whose patterns are difficult to capture by simple models.
  • A novel method for STLF is introduced and verified based on a real-world dataset and simulation results. Despite the numerous benefits of STLF, current forecasting methods remain less effective, yielding non-negligible forecast errors due to the non-linearity of the load patterns. The proposed methodology greatly enhances the overall accuracy of the prediction engine for the STLF application.
  • The performance of the hybrid model is investigated using point forecast assessment, ten-fold cross-validation, score metrics, reference DL models, and simulation graphs.
The remainder of this paper is organized as follows: Section 2 introduces the existing state-of-the-art techniques. Section 3 presents the preliminaries relevant to the current study and describes the proposed model. Section 4 evaluates the feasibility of the model with numerical analysis and compares the proposed method with relevant ML techniques considered as benchmarks for this study. Section 5 concludes the paper.

2. Related Works

The need for very precise load forecasts requires the use of well-defined techniques in the feature engineering procedure. In this research study, the techniques are divided into three categories: data smoothing, feature selection, and Data Augmentation (DA). A brief introduction of these techniques is given as follows:

2.1. Data Augmentation (DA)

The DA technique is employed to reduce the underfitting problem of DL architectures by describing the data in a meaningful way [19]. DA improves the size and quality of the load information; the improvement lies in generating more information from the data representation to give better insight into the load profile for a specific forecasting horizon. However, the application of DA to time series is not very common, contrary to its vast applicability in image recognition and natural language processing. For instance, the authors in [22] derive from the datetime data type additional information on celebration days and holidays such as Christmas Day, Thanksgiving Day, and Independence Day. The authors in [23] proposed an augmented CNN method to provide better learning potential by covering unexplored input space for the CNN training; the key idea of the proposed DA procedure consists of concatenating heterogeneous residual load data generated from the electrical load of a single household. A technical paper [24] provides a complete survey on the latest taxonomy of time series DA techniques. In our work, the hourly temperature is used as an indicator feature for the load state, and the proposed DA technique consists of two stages: first, time decomposition is applied to reveal the behavior of consumers; then, lagged loads are generated to follow the chain of load variation in time, as sketched below.
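As an illustration of this two-stage procedure, the following minimal sketch (assuming a pandas DataFrame with an hourly DatetimeIndex and a hypothetical "load" column) derives calendar features and the 23 lagged loads described above:

```python
import pandas as pd

def augment(df: pd.DataFrame, n_lags: int = 23) -> pd.DataFrame:
    """Two-stage augmentation: time decomposition, then lagged loads."""
    out = df.copy()
    # Stage 1: time decomposition reveals consumer behavior patterns.
    out["hour"] = out.index.hour
    out["day"] = out.index.day
    out["month"] = out.index.month
    out["year"] = out.index.year
    out["workday"] = (out.index.dayofweek < 5).astype(int)
    out["week"] = out.index.isocalendar().week.astype(int)
    # Stage 2: lagged loads follow the chain of load variation in time.
    for lag in range(1, n_lags + 1):
        out[f"load_lag_{lag}"] = out["load"].shift(lag)
    return out.dropna()
```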

2.2. Feature Importance and Dimensionality Reduction

Feature importance analysis is highly required to identify the most relevant features of the data representation. Feature dimension optimization is essential to improve the forecasting accuracy and comprehensibility and to avoid overfitting problems. For high-dimensional data analysis, the removal of irrelevant and repetitive feature labels is the simplest technique to perform Dimensionality Reduction (DR). In [23], the authors studied the effect of DR on the load demand. Their study generates 67 vectors regarded as potential inputs of the forecasting system; they then demonstrated that DR based on Principal Components Analysis (PCA) and Canonical Variate Analysis (CVA) succeeds in enhancing the accuracy of the forecasting results, with the PCA method performing best. PCA is a nonparametric classical method frequently used to extract a smaller representative dataset from high-dimensional data while conserving its spatial characteristics. The mechanism of PCA lies in using an orthogonal linear transformation to map a dataset to another coordinate system in which the first coordinate has the greatest variance, followed by the remaining coordinates, respectively. The number of coordinates is fixed based on the data variability.
However, the Feedforward Neural Network (FFNN) is not very effective in high-dimensional spaces; thus, PCA-based FFNN produces non-negligible forecast errors. On the other hand, PCA presents a high sensitivity to the data distribution in the search space. Another study in [25] employed Echo State Networks with PCA decomposition for the STLF application. While PCA achieves acceptable results for correlated data and accelerates the forecasting system, it has been noticed that the selection of the number of Principal Components (PC) requires a level of expertise to achieve the desired results. A wrong PC parameter causes a significant decrease in model performance since some relevant features are omitted. Further, interpreting the features of a PCA decomposition is a relatively difficult task.
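The sensitivity to the PC parameter can be sketched with scikit-learn; the retained-variance threshold here (95%) is an assumed tuning choice rather than a value from the cited study:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 67)     # stand-in for the 67 candidate input vectors
pca = PCA(n_components=0.95)     # keep enough PCs to explain 95% of the variance
X_reduced = pca.fit_transform(X)
# A threshold that is too low drops relevant features; inspect the trade-off:
print(pca.n_components_, pca.explained_variance_ratio_.cumsum()[-1])
```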

2.3. Data Smoothing

To improve the temporal cognition of the data representation, the removal of noisy samples is mandatory to boost the forecasting system accuracy by allowing important patterns to stand out. Furthermore, smoothing techniques are very important for tracking the seasonality of the data and thus improving the performance of ML techniques. Most of the commonly used techniques implement moving averages and seasonal exponential smoothing due to their simplicity and good performance in time series forecasting [26]. For instance, the authors in [27] reviewed exponential smoothing techniques, specifically single exponential smoothing, double exponential smoothing, Holt's method, and adaptive-response-rate exponential smoothing. The cited methods are advantageous in terms of ease of interpretability and integration of the changing pattern effect of the data series into the prediction model. However, most of these approaches face problems with seasonal data and parameter tuning difficulties [28]. To overcome these problems, a Savitzky–Golay (SG) filter is proposed in this paper to tackle the rapid variations in the signals. In [29], the authors used an SG filter to remove the noise from phonocardiogram signals in the preprocessing stage; an automatic CNN classification model then receives the smoothed results to generate outputs. In [30], an LSTM-SG model was proposed to predict large-scale water quality time series. However, SG smoothing has received little attention in the field of smart grid applications despite its considerable importance. In our paper, an SG filter is implemented for the noise reduction of the load demand to further supplement the existing smoothing techniques for STLF in generating more accurate results.
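For reference, SG smoothing of a load-like series is available off the shelf; the following minimal sketch uses SciPy with an illustrative window length and polynomial order, both of which must be tuned as discussed above:

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.arange(500)  # synthetic hourly load with a daily cycle plus noise
load = 100 + 20 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 3, t.size)
smoothed = savgol_filter(load, window_length=25, polyorder=10)  # 2M + 1 = 25
```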

3. The Proposed Methodology

The various parts of the hybrid SG-CBiLSTM, particularly the SG filter, PLS, and CBiLSTM, are presented in this section. Furthermore, the workflow of the proposed technique is explicitly explained.

3.1. Bidirectional LSTM (BiLSTM) Model

The major success of the LSTM model is attributed to its excellent ability to solve the vanishing gradient problem of the vanilla Recurrent Neural Network (RNN). The LSTM mechanism deploys three memory gates, namely the input gate ($i_t$), the forget gate ($f_t$), and the output gate ($o_t$). Therefore, the LSTM network captures long-term dependencies of sequential data with high efficiency. The mathematical equations of the LSTM gates are given as follows [31]:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $x_t$, $\sigma$, and $c_t$ denote the input sample at time $t$, the sigmoid activation function, and the memory unit, respectively. $(b_f, b_i, b_o)$ and $(W_f, W_i, W_o)$ stand for the biases and weight matrices of each gate, respectively. The symbol $\odot$ denotes element-wise multiplication. First, $h_{t-1}$, $c_{t-1}$, and $x_t$ pass the input information to the LSTM unit. Then, the LSTM gates interact with the inputs to generate an action based on a logic function. $x_t$ and $h_{t-1}$ move to the forget layer, which quantifies the importance of the information between 0 and 1; at this stage, the $f_t$ gate decides whether the information is stored, maintained, or removed. The cell state $c_t$ is then updated with the new important information according to the proportions contributed by the candidate state and the previous cell state. Finally, the hidden state of the LSTM is computed from the updated cell state [31].
To enhance the learning capabilities of the traditional LSTM model, the temporal structure considers a two-way relationship in the input data. In other words, instead of processing the input in one direction through the LSTM gates, the Bidirectional LSTM (BiLSTM) model also considers future information when dealing with the current time series data, as shown in Figure 1.
This bidirectional processing captures more structural information through the gate mechanism. The BiLSTM model also encodes the information from back to front, which yields additional context information and leads to better generalization ability. First, the input sequence is processed by forward LSTM cells; then, the reversed form of the input sequence is fed into the backward LSTM network. The outputs of the forward layer ($h_t^f$) and backward layer ($h_t^b$) of the BiLSTM model are combined as [32]:
$$h_t = \alpha h_t^f + \beta h_t^b$$
$$y_t = \sigma(h_t)$$
where $\alpha$ and $\beta$ are weighting factors satisfying $\alpha + \beta = 1$ [32]. The experimental results verify that the BiLSTM significantly improves the accuracy of the original LSTM using the spatial correlation mechanism [32].
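A minimal Keras sketch of this bidirectional structure is given below; the window length, feature count, and unit count are illustrative assumptions, not the tuned values used later in this paper:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

model = Sequential([
    Input(shape=(24, 20)),    # assumed window: 24 time steps, 20 features
    Bidirectional(LSTM(64)),  # forward and backward passes over the window
    Dense(1),                 # one-step-ahead load forecast
])
model.compile(optimizer="adam", loss="mse")
```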

3.2. Convolutional Neural Network (CNN)

The CNN model is a DL network that extracts local features from the input at lower layers and combines them into more complex features at higher layers. The general architecture of this model consists of convolutional layers, activation functions, pooling layers, a fully connected layer, and a dropout layer. The convolutional layers apply a set of filters to the input to generate a set of feature maps; they are advantageous for preventing the model from overfitting, accelerating the training process, and decreasing the input volume. Max-pooling layers are the most commonly used type of pooling layer. Max-pooling selects the maximum value of the field covered by the pooling kernel, with its parametric equation given as [33]:
$$P^l(i,j) = \max_{(j-1)W + 1 \le t \le jW} \{ a^l(i,t) \}$$
where $P^l(i,j)$ denotes the pooling output, $W$ the width of the pooling window, and $a^l(i,t)$ the activation value of the $t$-th neuron. The activation function turns a linearly inseparable problem into a separable one to enhance the adaptability of the model. The Rectified Linear Unit (ReLU) is used as the activation function due to its computational simplicity and its derivative taking values of 0 or 1. The equation of the ReLU function is given by [33]:
$$a^l(i,j) = f(y^l(i,j)) = \max\{0, y^l(i,j)\}$$
where $a^l(i,j)$ denotes the activation value of the layer output $y^l(i,j)$. The fully connected layers combine the previously extracted information to generate the final prediction, as shown in the following function [33]:
$$z^{l+1}(j) = \sum_{i=1}^{n} w_{ij}^l a^l(i) + b_j^l$$
where $w_{ij}^l$, $a^l(i)$, and $b_j^l$ denote the weight of the $i$-th neuron, the $i$-th neuron of layer $l$, and the bias value, respectively, and $n$ is the length of the input data. Finally, the dropout function is employed as a regularization technique that avoids overfitting by dropping randomly selected neurons during training. Much research has emphasized CNNs due to their strong ability to approximate complex functions.

3.3. Convolutional Bidirectional LSTM Network (CBiLSTM)

The CBiLSTM model consists of initial convolutional layers that receive feature embeddings as input. Their output is pooled to a smaller dimension and then fed into BiLSTM layers. The CNN layers are included for their strong capabilities in local feature extraction. The proposed mechanism extracts the inherent characteristics of the load demand with the CNN model and the temporal dependencies with the BiLSTM layers. The motivation behind the fusion of CNN and BiLSTM is twofold: (1) the excellent feature-extraction ability of the CNN model in capturing short-term trends in time series data, and (2) the efficiency of the BiLSTM network in preserving the temporal order of the data in both directions while avoiding gradient vanishing. Owing to the association of CNN with BiLSTM, the pattern recognition for time series data is tracked both spatially and temporally. Figure 2 presents a flowchart of the CBiLSTM model.

3.4. Savitzky–Golay Filter (SG)

The SG filter is a popular smoothing technique for removing noise from abruptly fluctuating and non-stationary signals [34]. The technique applies a moving least-squares polynomial approximation while conserving the main characteristics of the signal (width, peak, and height). The core concept of SG consists of fitting a polynomial of degree $p$ to $2n+1$ equidistant points (centered at $n = 0$). Thus, the SG filter uses three elements, specifically the input signal, the order of the polynomial, and the frame size; these elements require specific tuning to achieve the desired results, and the polynomial order controls the attenuation of the signal width. The SG filter computes the value of the least-squares polynomial at the point $i = 0$ over the entire sample space. Let $x(n)$ be a time series of $n$ elements, and consider $y(n) = x(n) + w(n)$ as the observed time series, where $w(n)$ denotes the noise of the $y(n)$ signal. The output of the SG filter, denoted by $\hat{x}$, equals [35]:
$$\hat{x}(n) = \sum_{k=-M}^{M} h(k)\, y(n-k)$$
where $h(n)$ and $M$ denote the filter impulse response and the SG filter parameter, respectively. The polynomial $p(n)$ and the squared error $E$ between the smoothed and noisy signals are given as follows [35]:
$$p(n) = \sum_{k=0}^{N} c_k n^k$$
$$E = \sum_{n=-M}^{M} \left( p(n) - x(n) \right)^2 = \sum_{n=-M}^{M} \left( \sum_{k=0}^{N} c_k n^k - x[n] \right)^2$$
where $k$ is the summation index and the polynomial order $N$ is at most $2M$. $c_0$ is the best-fitting polynomial coefficient, satisfying $p(0) = c_0$; it is found by setting the partial derivatives of $E$ to zero [36]:
$$\frac{\partial E}{\partial c_i} = \sum_{n=-M}^{M} 2 n^i \left( \sum_{k=0}^{N} c_k n^k - x[n] \right) = 0$$
Analytically, the optimal coefficients $c_k$ can be calculated by interchanging the order of the summations, which yields the normal equations [36]:
$$\sum_{k=0}^{N} \left( \sum_{n=-M}^{M} n^{i+k} \right) c_k = \sum_{n=-M}^{M} n^i x[n]$$
where $i \in \{0, 1, \ldots, N\}$. Equation (15) can be reformulated as $(A^T A)\, c = A^T x$, where $c = [c_0, c_1, \ldots, c_N]^T$ denotes the vector of polynomial coefficients, $x = [x_{-M}, \ldots, x_{-1}, x_0, x_1, \ldots, x_M]^T$ denotes the vector of input samples, and $A$ is the design matrix with entries $A_{n,k} = n^k$. Thus, the coefficient vector is obtained as $c = (A^T A)^{-1} A^T x$ [36]. Typically, this digital filter uses linear least squares for data smoothing, which helps obtain a high signal-to-noise ratio while retaining the original shape of the signal. The SG filter has been employed in this paper for two major merits: (1) it retains the width and height of waveform peaks in the noisy load, which allows the system to achieve higher performance; and (2) it transforms the load demand without destroying its fundamental properties.
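The normal-equation solution above can be reproduced directly in NumPy; this sketch (with illustrative $M$ and $N$) builds the design matrix $A$ with entries $A_{n,k} = n^k$ and recovers the smoothed center sample $p(0) = c_0$:

```python
import numpy as np

M, N = 5, 3                               # half-window and polynomial order
n = np.arange(-M, M + 1)
A = np.vander(n, N + 1, increasing=True)  # A[i, k] = n_i ** k
x = np.random.rand(2 * M + 1)             # noisy samples inside the window
c = np.linalg.solve(A.T @ A, A.T @ x)     # c = (A^T A)^{-1} A^T x
x_hat_center = c[0]                       # smoothed value at i = 0
```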

3.5. Partial Least Square Method (PLS)

The PLS is a widely used dimensionality reduction technique. Unlike the PCA dimension reduction method, which maximizes a variance-based objective function, PLS maximizes a covariance-based objective function [37]. The covariance maximization is conducted by optimizing the square of the $L_2$ norm. Assume two sets of variables, $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, representing the input variables and the response variables, respectively. For the STLF application, we have $s$ samples, so $X_0$ and $Y_0$ can be written as:
$$X_0 = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{s,1} & \cdots & x_{s,n} \end{bmatrix}, \quad Y_0 = \begin{bmatrix} y_1 \\ \vdots \\ y_s \end{bmatrix}$$
The PLS method extracts two components $(t_1, u_1)$ from $X$ and $Y$, respectively. The vectors $t_1$ and $u_1$ capture the variation information of the initial dataset with maximum correlation. Then, the regression of $y_k$, $k \in \mathbb{N}$, on $t_1$ is conducted. If the regression result provides the required accuracy, the PLS algorithm stops; otherwise, it proceeds to the next pair of components $(t_r, u_r)$, where $r$ denotes the pair index, until a satisfactory result is achieved. Mathematically, the first component can be written as [38]:
$$t_1 = w_1^T X$$
where $w_1$ is the eigenvector corresponding to the maximum eigenvalue of the matrix $X_0^T Y_0 Y_0^T X_0$. The component score vector $\hat{t}_1$ and the residual matrix $X_1$ are calculated as [38]:
$$\hat{t}_1 = X_0 w_1$$
$$X_1 = X_0 - \hat{t}_1 \alpha_1^T, \quad \alpha_1 = \frac{X_0^T \hat{t}_1}{\|\hat{t}_1\|^2}$$
This process is repeated until the construction of the $r$ components for $(X, Y)$. The optimization of the objective function is based on the covariance $\mathrm{cov}(Xw, Y)$ between the output vector $Y$ and the projected input matrix $Xw$ of size $(N \times P)$. Therefore, the $k$-th PLS component is identified by searching for the weight vector $w_k$ as [37]:
$$w_k = \arg\max_{\|w\|=1} \mathrm{cov}(Xw, Y) = \arg\max_{\|w\|=1} (N-1)^{-1} w^T X^T Y$$
subject to the orthogonality constraint $w_k^T X^T X w_j = 0$ for all $1 \le j < k$. The PLS method is advantageous for its ease of implementation and its unique biased solutions.
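In practice, this component extraction is available in scikit-learn; the sketch below is a minimal illustration with assumed data shapes and an assumed number of components:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.random.rand(500, 32)        # s samples, 32 candidate features
y = np.random.rand(500)            # load demand
pls = PLSRegression(n_components=5)
pls.fit(X, y)
T = pls.transform(X)               # score vectors t_1, ..., t_r
```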

3.6. Evolution Strategy for Hyperparameter Optimization

Hyperparameter Optimization (HO) is one of the most difficult problems for DL models for two major reasons: the large number of parameters involved and the expensive computational training. These points make testing all possible values of the search space impractical. In this paper, the ES meta-heuristic method is adopted as the key solution to tackle the HO problem with high efficiency. The ES is a meta-heuristic search technique inspired by natural biological evolution. It consists of mutations and recombinations of the best individuals ($\sigma$) of a population of candidate solutions ($\lambda$) [39]. According to fitness values, natural selection promotes the best candidate solutions for mutation and reproduction [40]. These solutions dictate the distribution of future generations; reciprocally, the weak candidates are removed from the population. The mechanism of the ES is presented in Figure 3.
Regarding the flowchart in Figure 3, the ES algorithm starts with a random population of $\sigma$ parents. Successive recombinations and mutations take place, and the fitness values are computed and assessed to quantify the effectiveness of each solution. If there is room for improvement, the ES algorithm generates $\lambda$ children to improve the objective function. This is done through Gaussian distributed random modifications $G(x_i^P, \sigma_i^S)$, where $x_i^P$ and $\sigma_i^S$ are the mean and deviation parameters, respectively. Here, $i$ ($1 \le i \le n$), $P$ ($1 \le P \le \sigma$), and $S$ ($1 \le S \le \lambda$) are the parameter index, predecessor index, and successor index, respectively. The deviation parameter is calculated as follows [40]:
$$\sigma_i^S = \sigma_i^P \exp\left( G(0, \tau_1) + G(0, \tau_2) \right)$$
where $\tau_1 = 0.1$ and $\tau_2 = 0.2$ are the standard deviations [40]. The multiplicative mutation operation determines the whole vector of strategy parameters $\sigma^S = (\sigma_1^S, \ldots, \sigma_N^S)$ [41]. Each component of $\sigma^S$ is mutated individually, and the final $\sigma^S$ is mutatively scaled in order to learn axes-parallel mutation ellipsoids [41]. The evolutionary dynamics can lead to undesirable fluctuations, which must be removed to maintain the high optimization performance of the ES method [41]. The individual parameters are calculated as [40]:
$$x_i^S = G(x_i^P, \sigma_i^S)$$
From Equation (22), the individual parameters $x_i$ are modified to create $\lambda$ solutions. The mutation operation of the ES is conducted through Gaussian distributed random modifications between $x_i^P$ and $\sigma_i^S$. The best individuals are selected as parents for the next iteration. The iterative process stops when the new individuals achieve the minimal objective function value of the whole population. In this paper, the ES is used as the key solution for the global optimization of the complex CBiLSTM black-box function. After fixing the search space, the ES uses a batch-sequential process to find the most suitable hyperparameters of a specific model [42]. The ES is a robust algorithm with a highly parallelizable architecture; however, it faces the risk of being trapped in a local minimum in a large search space.
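A toy loop in the spirit of Figure 3 is sketched below; the fitness function stands in for the CBiLSTM validation error, the population sizes are assumptions, and $\tau_1$ and $\tau_2$ follow the text:

```python
import numpy as np

def evolution_strategy(fitness, dim, n_parents=5, n_children=20,
                       iters=50, tau1=0.1, tau2=0.2, seed=0):
    rng = np.random.default_rng(seed)
    parents = rng.normal(0, 1, (n_parents, dim))   # individual parameters x
    sigmas = np.ones((n_parents, dim))             # strategy parameters sigma
    for _ in range(iters):
        idx = rng.integers(0, n_parents, n_children)
        # sigma' = sigma * exp(G(0, tau1) + G(0, tau2))
        s = sigmas[idx] * np.exp(rng.normal(0, tau1, (n_children, dim)) +
                                 rng.normal(0, tau2, (n_children, dim)))
        x = rng.normal(parents[idx], s)            # Gaussian mutation G(x, sigma)
        order = np.argsort([fitness(xi) for xi in x])[:n_parents]
        parents, sigmas = x[order], s[order]       # the best become new parents
    return parents[0]

best = evolution_strategy(lambda v: np.sum(v ** 2), dim=3)  # toy objective
```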

3.7. Proposed Architecture

The proposed SG-CBiLSTM is a combination of models organized in four major stages: data preprocessing and feature engineering, object determination, model construction, and model evaluation, as illustrated in Figure 4.
According to Figure 4, the proposed methodology is conducted in these four main stages. For further explanation, a complete flowchart is illustrated in Figure 5.
According to Figure 5, the proposed methodology proceeds as follows. In the data preprocessing and feature engineering stage, the data is acquired from an open-source portal and cleaned of misleading values. Then, a data augmentation strategy is adopted to enhance the significance of the data and add value to the learning ability of the whole system. Further, data transformation is conducted to convert all the data to numeric form using a one-hot encoder function. Next, SG-filter-based data smoothing and data normalization are conducted to remove the noise from the load demand and standardize the data in the interval between 0 and 1. At this stage, the SG filter rejects the noise efficiently with the least distortion of the original load data. All the feature inputs are given importance values to assess their relevance to the load demand; features above a specific threshold are kept, and the rest are discarded. Afterwards, the object determination stage splits the data into training, validation, and testing intervals. Hyperparameter tuning is mandatory for the CBiLSTM model to generate more accurate results. The final stage is the evaluation of the model, in which point forecast assessment is conducted using score metrics. Moreover, the K-fold cross-validation technique is used in this paper as a reliable means to evaluate the accuracy of the proposed model on unknown testing data [43]. K-fold cross-validation schemes are widely used in regression problems [44,45]. The methodology of K-fold cross-validation is illustrated in Figure 6.
Regarding Figure 6, the procedure of K-fold cross-validation is conducted as follows (see the sketch after this paragraph). Firstly, $k$ equally sized sub-samples are partitioned from the original dataset. Ten-fold cross-validation (10-CV) is preferred to verify the model robustness with cross-validated data; the ten parts evaluate the model generalization to an independent set. For each 10-CV round, a random separation of the data into $k$ folds $\{D_1, D_2, \ldots, D_k\}$ is constructed. The proposed model is repeatedly trained on nine training folds and tested against the remaining held-out fold. The holdout procedure is repeated 10 times. Consequently, the bias and variance are remarkably reduced because the original dataset is used in both the fitting and the validation procedures. To check the model competitiveness, the proposed SG-CBiLSTM is compared to a bench of DL models to validate its performance superiority. The comprehensive model design and the implementation details of each part are given in the following section.
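A compact sketch of this 10-fold procedure with scikit-learn follows; the data and the linear stand-in model are placeholders for the pipeline and the SG-CBiLSTM described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

X, y = np.random.rand(1000, 20), np.random.rand(1000)  # placeholder data
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])  # stand-in model
    scores.append(r2_score(y[test_idx], reg.predict(X[test_idx])))
print(np.mean(scores))  # averaged cross-validated score
```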

4. Simulation Results

In this section, the validation of the proposed SG-CBiLSTM for STLF is demonstrated through extensive experimental studies with an open-source dataset. The evaluation procedure includes the dataset description, model construction stages, score metrics, and benchmark comparison analysis.

4.1. Data Pre-Processing and Feature Engineering

The experimental settings mainly use the ISO New England (ISO-NE) dataset [46]. The ISO-NE is a non-profit corporation that manages day-to-day operations with more than 200 market participants in New England [46]. The publicly available data covers the hourly temperature and load data for the public and private electricity market of New England from 2003 to 2014, as plotted in Figure 7.
The collected data consists of 103,753 rows. The data is cleaned of erroneous and Not-a-Number (NaN) samples; this cleaning stage is mandatory to feed the model with complete data samples and avoid outliers or misleading values. Nevertheless, the hourly temperature is the only load indicator in the ISO-NE data, which is insufficient for accurate prediction and may lead to underfitting. Therefore, data augmentation is conducted to identify and generate artificial features using Microsoft Excel, as shown in Table 1.
According to Table 1, the total features are the DateTime components (hour, day, month, year, workday, week number, weekend), the season of the year, and the hourly lagged load for the last 23 h. The Boolean features are converted to numerical values using a one-hot encoder function. Then, the load demand is smoothed via the SG filter to improve the pattern recognition of the forecasting system, and the cleaned data is forwarded to the feature engineering phase. Four case scenarios were investigated according to the polynomial degree, specifically 3, 5, 7, and 10. The load demand and its smoothed derivations are illustrated in Figure 8.
According to Figure 8, increasing the polynomial degree ($p$) results in a more smoothed signal. The generated signals are assessed together under the same conditions with the aim of selecting the optimum SG polynomial degree based on the lowest produced error. Further, data normalization is conducted using the MinMaxScaler function from the Scikit-Learn library [47]. The data is normalized from 0 to 1 to keep the values within the limits of the activation function and to reduce the influence of different units on the forecasting model.
The selection of the most informative features in the dataset is an essential step for the forecasting system performance. This technique exploits the importance score of each feature to map the domain representation and assess the weight of each feature input; the irrelevant and redundant inputs are then discarded from the feature vectors. The artificial features are associated in tandem with the temperature to make a prediction. The data augmentation technique is commonly used for low-dimensional systems to track the seasonality and trends of the target variation and acquire more domain knowledge. Ranking the features is essential at this stage to make the system more efficient with less computational time, given the high number of features. The PLS technique is used to reduce the dimension of the data; PLS is a straightforward dimensionality reduction technique that maps the variables into a new feature space with lower dimensions. The Variable Importance of load Patterns (VIP) for the 32 features is shown in Figure 9.
Regarding Figure 9, the most important features are the hour, workday, temperature, and lagged load $(t - x)$ with $x \in \{1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 17, 18, 19, 20, 21, 22, 23\}$. Thus, the selected threshold is VIP = 0.5, and the rest of the simulation results adopt a feature representation including 20 feature inputs. The data is split into training, validation, and testing sets with proportions of 75%, 15%, and 10%, respectively. More precisely, the training, validation, and testing sets comprise 77,814, 15,562, and 10,376 samples, respectively.

4.2. Score Metrics

The assessment of the model performance is conducted using Percentage Error Measures (PEM). A PEM expresses the error as a percentage regardless of the units of the measurements (scale-independent). In order to standardize the Scale-Dependent Error Measures (SDEM), particularly the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), the Min-Max normalization method is kept for the evaluation process. In other words, the data is normalized to the magnitude range [0, 1]. The data normalization is calculated as follows:
$$x(t) = \frac{x_r - x_{min}}{x_{max} - x_{min}}$$
where $x(t)$ denotes the normalized value and $x_r$ the real value. Here, $x_{min}$ and $x_{max}$ are the minimum and maximum values. The error measures in this work include the coefficient of determination ($R^2$), normalized RMSE (nRMSE), normalized MAE (nMAE), Mean Squared Logarithmic Error (MSLE), and MAPE. The selection of these scores is based on their popularity for evaluating regression-based techniques. The parametric equations of the score metrics are given as follows [48]:
$$nMAE = \frac{1}{N} \sum_{i=0}^{N-1} \left| y_i - \hat{y}_i \right|$$
$$nRMSE = \sqrt{ \frac{1}{N} \sum_{i=0}^{N-1} \left( y_i - \hat{y}_i \right)^2 }$$
$$MAPE = \frac{100\%}{N} \sum_{i=0}^{N-1} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$
$$R^2 = 1 - \frac{ \sum_{i=0}^{N-1} ( \hat{y}_i - y_i )^2 }{ \sum_{i=0}^{N-1} ( \bar{y} - y_i )^2 }, \quad \bar{y} = \frac{1}{N} \sum_{i=0}^{N-1} y_i$$
$$MSLE = \frac{1}{N} \sum_{i=0}^{N-1} \left( \log_e(y_i + 1) - \log_e(\hat{y}_i + 1) \right)^2$$
where $\hat{y}_i$ and $y_i$ denote the $i$-th forecasted and actual values, respectively, and $N$ denotes the total number of samples.
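These metrics translate directly into NumPy; the following sketch assumes normalized 1-D arrays y_true and y_pred:

```python
import numpy as np

def scores(y_true, y_pred):
    err = y_true - y_pred
    return {
        "nMAE": np.mean(np.abs(err)),
        "nRMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": 100 * np.mean(np.abs(err / y_true)),
        "R2": 1 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2),
        "MSLE": np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2),
    }
```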

4.3. Model Construction and Parameter Settings

In this part, the assessment of the proposed SG-CBiLSTM model is reported. The simulations were conducted using Python and the Keras library [49]; Keras is a high-level framework specialized in DL. All simulations were run on a Lenovo machine with an Intel® i7 CPU @ 2.30 GHz and an Nvidia GeForce GTX 1650 GPU. The characteristics of the experiments in this study are summarized in Table 2.
A trial-and-error simulation analysis is firstly adopted to limit the search space as much as possible for the optimization process. This method selects arbitrary extreme values for the numbers of BiLSTM and CNN layers and measures the accuracy of the forecasting engine in terms of the $R^2$ score. Next, new random values are defined to further reduce the possible hyperparameter values. The final results of the trial-and-error method show that three convolutional layers with a kernel size of 1 × 1 associated with three successive BiLSTM layers achieve the best results according to the highest $R^2$ criterion. For the rest of the optimization work, the ES is employed to tune the critical hyperparameters of the proposed computing framework, specifically the batch size, the number of units, and the number of training epochs [49,50]. To tune these hyperparameters, the open-source Hyperactive library is employed [39]. It is worth noting that the search space of the ES method only considers the key hyperparameters of the proposed CBiLSTM to reduce the number of combinations and thus accelerate the computation. In the model construction stage, the search space of the proposed CBiLSTM model is summarized in Table 3.
The best hyperparameters are selected as follows. Firstly, the population of the ES is randomly initialized; each candidate solution represents a possible configuration of the optimized CBiLSTM model. The CBiLSTM model is trained on the training data with the specific structure and parameter settings included in Table 3. Finally, the testing set is used to evaluate the performance of each population member based on the MAPE. Regarding the ES performance in Table 3, the ES algorithm is characterized by a fast convergence speed of 8.77 min per iteration. From Table 3, the input layers consist of three successive one-dimensional convolutional (Conv1D) layers with 32, 64, and 64 filters, respectively. The Conv1D layers are associated with a ReLU activation function, employed to address the vanishing gradient with fast convergence speed. The convolutional layers are followed by a one-dimensional max-pooling (MaxPooling1D) layer, three BiLSTM hidden layers, a dropout layer, and a dense layer. The one-dimensional CNN layers are mainly adopted for the sequential processing of the load demand. The pooling size of MaxPooling1D is 1, and the numbers of units of the BiLSTM hidden layers are 150, 50, and 32, respectively. The proposed CBiLSTM takes advantage of the nonlinear fitting ability of the BiLSTM and the feature extraction ability of the CNN. The final structure of the SG-CBiLSTM model is summarized in Table 4.
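A Keras sketch of this final structure is given below; the layer widths follow the text, while the input shape and dropout rate are assumptions standing in for the values of Table 4:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Bidirectional,
                                     LSTM, Dropout, Dense)

model = Sequential([
    Input(shape=(24, 20)),                        # assumed input window
    Conv1D(32, kernel_size=1, activation="relu"),
    Conv1D(64, kernel_size=1, activation="relu"),
    Conv1D(64, kernel_size=1, activation="relu"),
    MaxPooling1D(pool_size=1),
    Bidirectional(LSTM(150, return_sequences=True)),
    Bidirectional(LSTM(50, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dropout(0.2),                                 # assumed dropout rate
    Dense(1),                                     # hourly load forecast
])
```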
The general architecture of the reference models is fixed through repeated training, while the number of units, the batch size, and the activation function are selected using the ES method. The settings of the reference models are given in Table 5.

4.4. Experimental Results

The SG-CBiLSTM evaluation is conducted in this section. The empirical results are firstly obtained by presenting the performance of the proposed framework individually and under several case studies with different SG filter characteristics. Second, a comparative study is computed and presented, with unified simulation settings for the sake of fair assessment. The conclusive remarks and interpretations are based on the actual results simulated in this work. During the validation experiments, the most appropriate polynomial (p) degree is selected by trial and error over four p values. For the consistency and coherence of this section, the proposed SG-CBiLSTM model is firstly validated against the original CBiLSTM using the framework explained in the previous section. Secondly, the superiority of the proposed method is verified against different benchmarks. In order to yield credible results, each case scenario is performed 10 times and the results are averaged. The simulation results are shown in Figure 10.
According to Figure 10, the proposed SG-CBiLSTM generates good results and follows the actual load demand with high precision. Further, the SG filter demonstrates its ability to improve the performance of the forecasting system. As can be seen, the proposed model performs best for p = 10, where the gap between the prediction curve and the original curve is the smallest compared to the original CBiLSTM and the other models with different p values. However, the polynomial degree varies the error from one point to another. It can be concluded that the more the signal is smoothed, the better the model performs. A quantitative comparison of the proposed SG-CBiLSTM model performance under different p values is given in Table 6.
According to Table 6, the best performance is attributed to the 10 SG-CBiLSTM, which achieves the highest results. The registered MSLE and MAPE for the 10 SG-CBiLSTM are equal to 0.93% and 56.66%, compared to the 3 SG-CBiLSTM, which reaches RMSE = 1.36% and MAPE = 59.85%. This leads to the conclusion that the SG filter is a key element for achieving higher performance; furthermore, the selection of the most suitable parameters is mandatory to achieve the desired results. For the rest of the work, the 10 SG-CBiLSTM is considered the reference for the proposed forecasting system; for notational simplicity and flow of reading, it is referred to as the SG-CBiLSTM hereinafter. The loss function during the training and validation of the model is an essential criterion to follow the learning accumulation during model construction. The parametric equation of the loss function is calculated as follows [51]:
$$Loss(\hat{y}, y) = \frac{1}{B} \frac{1}{O} \sum_{i=1}^{B} \sum_{j=1}^{O} \left\| \hat{y}_{ij} - y_{ij} \right\|_2^2$$
where $B$ and $O$ are the dimensions of the output matrix. Here, the proposed model is trained with a training set starting from 2 March 2003 and comprising the first 75% of the whole set. In order to track the loss values when the proposed SG-CBiLSTM fetches the next batch of data, the MAPE is used as the loss function with the Adam optimizer [52]. The training process is fixed at 100 epochs with an early stopping function, adopted to automatically retain the best-trained model without searching for the most suitable epoch number. Figure 11 displays the variation of the loss over the first 40 epochs for the training and testing sets.
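Continuing the model sketch above, the training setup can be expressed as follows; the batch size and early-stopping patience are assumptions, and X_train/y_train and X_val/y_val are placeholders for the 75%/15% splits:

```python
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer="adam", loss="mean_absolute_percentage_error")
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=64,
                    callbacks=[EarlyStopping(monitor="val_loss", patience=10,
                                             restore_best_weights=True)])
```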
As seen in Figure 11, the training loss value starts at 6.93% and continuously decreases until it starts to stabilize at the 10th epoch, where the loss value reaches 2.08%. From this level, the training loss decreases slowly to reach 1.59% at the 40th epoch. On the other side, the validation loss starts at 4.415% and reaches 2.407% at the 10th epoch, after which it decreases slowly alongside the training loss. This indicates that the training and validation losses reach their equilibrium at the same iteration. Moreover, it is verified that the convergence speed is quite fast, and the training and validation losses follow the same behavior. The simulation results of the proposed method in tandem with its single components (CNN, BiLSTM, and CBiLSTM) under 10-CV are shown in Table 7.
According to Table 7, the 10-CV results clearly demonstrate that our proposed framework significantly outperforms the single models in terms of the error measures. The SG-CBiLSTM achieves a high mean testing result of $R^2$ = 99.18% compared to 86.96%, 89.43%, and 95.41% for the CBiLSTM, CNN, and BiLSTM, respectively. The overall increase in performance for the DL models reported in Table 7 is explained by the adoption of the data augmentation strategy for the original dataset. Further, employing the SG filter with the 10th-order polynomial fitting improves the CBiLSTM accuracy by 12.22% in terms of $R^2$. An illustration of the 10-CV results in terms of $R^2$ is shown in Figure 12 to highlight the performance improvement of the SG-CBiLSTM model.
According to Figure 12, the SG-CBiLSTM model clearly outperforms the single models in terms of $R^2$. The SG smoothing technique helps reveal patterns and detect trends. According to the CV variation results, the proposed model is characterized by a high location independence; the achieved results indicate that the SG-CBiLSTM could be applied for STLF under different weather and load conditions. The experimental results of the SG-CBiLSTM compared to the original load values are shown in Figure 13.
According to Figure 13, the proposed system achieves high accuracy and follows the real curve with great precision. The proposed model is compared to nine other DL models trained under the same conditions for the integrity of the evaluation. The reference models assessed here include Extreme Learning Machines (ELM), BiLSTM, BiGRU, ConvLSTM, CNN, BiLSTM-Autoencoders (BiLSTM-AE), RNN, and CBiLSTM, to verify its competitiveness with the existing benchmarks. It should be noted that the dataset and the features extracted in the proposed model construction process are conserved for the comparison of all the models and the rest of the validation analysis. The comparative simulation results are shown in Figure 14.
Regarding Figure 14, the experimental results clearly show that the proposed SG-CBiLSTM outperforms the rest of the models for hourly STLF, given the small gap between the actual and generated forecasts. In order to quantify the quality of the forecasts, Table 8 presents the score errors of the proposed SG-CBiLSTM and the reference models.
According to Table 8, the SG-CBiLSTM hybrid model performs best compared to its counterparts on the scale of the nRMSE, nMAE, MAPE, and $R^2$ performance measures. The $R^2$ reaches 99.22% thanks to the SG filtering. Furthermore, the model performance is closely followed by the BiLSTM-AE model, which achieves $R^2$ = 98.56%. Due to the complex nonlinear relationship between the load demand and its drivers, the ELM model achieves the lowest forecasting accuracy with $R^2$ = 83.98%. Moreover, it is evident that the proposed model is preferable to numerous DL baseline and top-of-the-line models. The proposed SG-CBiLSTM is thus found more efficient in enhancing the profitability of electrical load operations.

4.5. Multi-Step Validation

In this study, the model performance is assessed for multi-step forecasting using multiple case scenarios. The multi-step validation is crucial to evaluate the model robustness over an extended time horizon. The investigated forecasting horizons include one hour, twelve hours, twenty-four hours, and thirty-six hours ahead. In the simulations, we conserve the same repartition of the dataset, with 75%, 15%, and 10% for the training, validation, and testing sets, respectively. For multi-step-ahead forecasting, the structure of the proposed SG-CBiLSTM model conserves the shape of the CNN layers used for one-step forecasting. For one-step-ahead forecasting, the forecasted output $\hat{x}_{i+1} = f(x_i, x_{i-1}, \ldots, x_{i-n})$ takes into consideration the real values of the load data. However, in multi-step-ahead forecasting, the previously generated forecasts are fed back as inputs, as $\hat{x}_{i+2} = f(\hat{x}_{i+1}, x_i, x_{i-1}, \ldots, x_{i-n})$. The experiments for multi-step forecasting on the ISO-NE data were repeated ten times, and the averaged results are reported in Table 9. Furthermore, the errors between the predicted and actual load demand according to the forecasting horizon are shown in Figure 15.
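The recursive feedback can be sketched as follows; model and window (shape (1, n, features)) are placeholders, and the assumption that the load occupies feature index 0 is illustrative:

```python
import numpy as np

def forecast_multi_step(model, window, horizon=36):
    """Feed each forecast back into the input window for the next step."""
    preds = []
    for _ in range(horizon):
        x_hat = float(model.predict(window, verbose=0)[0, 0])
        preds.append(x_hat)
        new_step = window[:, -1:, :].copy()  # reuse last step's exogenous features
        new_step[0, 0, 0] = x_hat            # assumed: load is feature index 0
        window = np.concatenate([window[:, 1:, :], new_step], axis=1)
    return preds
```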
According to Table 9, it can be deduced that the most suitable scale is the hourly LF. Considering the score indicators, the achieved results for one hour ahead report $R^2$ = 99.11% and nRMSE = 0.54%, compared to $R^2$ = 87.22% and nRMSE = 3.59% for 36 h ahead. Thus, the lowest error measure is produced by single-step forecasting and the highest is generated by thirty-six-step-ahead forecasting, as shown in Figure 15. In order to increase the model effectiveness, state-of-the-art optimization approaches such as the Coyote Optimization Algorithm (COA) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) could be employed as alternatives to the ES for the hyperparameter optimization procedure [53,54]. CMA-ES is an enhanced version of the ES in which the search individuals are generated according to a multivariate normal distribution; the corresponding covariance matrix is updated and assessed in each iteration [53]. Similarly to the ES, COA is a population-based algorithm inspired by the adaptation of coyotes to environmental conditions [54]. In our work, the lowest performance is attributed to the sequence of thirty-six steps ahead, with a score of $R^2$ = 87.22%. This demonstrates the relatively weak stability of the SG-CBiLSTM for longer forecasting horizons, which needs further investigation.

5. Conclusions

The major contribution of this paper is an end-to-end forecasting framework called the SG-CBiLSTM model for hourly STLF. The proposed system implements a smoothing technique, the Savitzky–Golay (SG) filter, to enhance the extrapolation capability of the convolutional neural network-bidirectional long short-term memory model. The experimental results strongly reveal that the SG-CBiLSTM model is advantageous for coping with the sudden variation of the load demand and improves the overall accuracy to reach a mean $R^2$ = 99.18%. The simulation results have demonstrated that the proposed methodology has the following characteristics: (1) the 10th-order polynomial provides the most suitable form among the tested levels of the SG filter; (2) the proposed SG-CBiLSTM model acquires a higher extrapolation performance and achieves better results than the single CBiLSTM; and (3) the proposed model outperforms nine deep learning models used as benchmarks. The SG-CBiLSTM is well tailored to accurately predict the hourly load data. However, these advantages hide some limitations related to poor performance over an extended forecasting horizon. Future work will broaden the scope to include an additional evaluation of the proposed SG-CBiLSTM model on other power systems and for a variety of forecasting applications such as electricity price and renewable energy prediction.

Author Contributions

Conceptualization, Methodology, Software, original draft preparation, M.M.; validation, writing—review & editing, project administration, funding acquisition, S.S.R.; validation, formal analysis, Data curation, Writing—review & editing, H.A.-R.; validation, writing—review & editing, I.C.; supervision, project administration, F.S.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This publication was made possible by NPRP grant [NPRP10-0101-170082] from the Qatar National Research Fund (a member of Qatar Foundation), the co-funding by IBERDROLA QSTPLLC. The statements made herein are solely the responsibility of the authors.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
STLF	Short-Term Load Forecasting
ES	Evolution Strategy
PLS	Partial Least Square
BiLSTM	Bidirectional Long Short-Term Memory
DL	Deep Learning
RNN	Recurrent Neural Network
σ	Sigmoid activation function
SDEM	Scale-Dependent Error Measure
CNN	Convolutional Neural Network
k1	Kernel size = 1
PCA	Principal Components Analysis
CVA	Canonical Variate Analysis
SD	Standard deviation
CBiLSTM	Convolutional Neural Network-Bidirectional Long Short-Term Memory
PEM	Percentage Error Measure

References

  1. Mamun, A.A.; Sohel, M.; Mohammad, N.; Sunny, M.S.H.; Dipta, D.R.; Hossain, E. A Comprehensive Review of the Load Forecasting Techniques Using Single and Hybrid Predictive Models. IEEE Access 2020, 8, 134911–134939.
  2. Hernandez, L.; Baladron, C.; Aguiar, J.M.; Carro, B.; Sanchez-Esguevillas, A.J.; Lloret, J.; Massana, J. A survey on electric power demand forecasting: Future trends in smart grids, microgrids and smart buildings. IEEE Commun. Surv. Tutor. 2014, 16, 1460–1495.
  3. Caro, E.; Juan, J. Short-Term Load Forecasting for Spanish Insular Electric Systems. Energies 2020, 13, 3645.
  4. Park, R.J.; Song, K.B.; Kwon, B.S. Short-Term Load Forecasting Algorithm Using a Similar Day Selection Method Based on Reinforcement Learning. Energies 2020, 13, 2640.
  5. Trierweiler Ribeiro, G.; Guilherme Sauer, J.; Fraccanabbia, N.; Cocco Mariani, V.; dos Santos Coelho, L. Bayesian Optimized Echo State Network Applied to Short-Term Load Forecasting. Energies 2020, 13, 2390.
  6. Dong, Y.; Zhang, Z.; Hong, W.C. A hybrid seasonal mechanism with a chaotic cuckoo search algorithm with a support vector regression model for electric load forecasting. Energies 2018, 11, 1009.
  7. Kuster, C.; Rezgui, Y.; Mourshed, M. Electrical load forecasting models: A critical systematic review. Sustain. Cities Soc. 2017, 35, 257–270.
  8. Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  9. Lago, J.; De Ridder, F.; De Schutter, B. Forecasting spot electricity prices: Deep learning approaches and empirical comparison of traditional algorithms. Appl. Energy 2018, 221, 386–405.
  10. Khwaja, A.; Anpalagan, A.; Naeem, M.; Venkatesh, B. Joint bagged-boosted artificial neural networks: Using ensemble machine learning to improve short-term electricity load forecasting. Electr. Power Syst. Res. 2020, 179, 106080.
  11. Heydari, A.; Nezhad, M.M.; Pirshayan, E.; Garcia, D.A.; Keynia, F.; De Santoli, L. Short-term electricity price and load forecasting in isolated power grids based on composite neural network and gravitational search optimization algorithm. Appl. Energy 2020, 277, 115503.
  12. Haq, M.R.; Ni, Z. A new hybrid model for short-term electricity load forecasting. IEEE Access 2019, 7, 125413–125423.
  13. Massaoudi, M.; Refaat, S.S.; Chihi, I.; Trabelsi, M.; Oueslati, F.; Abu-Rub, H. A Novel Stacked Generalization Ensemble-Based Hybrid LGBM-XGB-MLP Model for Short-Term Load Forecasting. Energy 2020, 214, 118874.
  14. Kim, T.Y.; Cho, S.B. Predicting residential energy consumption using CNN-LSTM neural networks. Energy 2019, 182, 72–81.
  15. Tan, M.; Yuan, S.; Li, S.; Su, Y.; Li, H.; He, F. Ultra-Short-Term Industrial Power Demand Forecasting Using LSTM Based Hybrid Ensemble Learning. IEEE Trans. Power Syst. 2020, 35, 2937–2948.
  16. Deng, Z.; Wang, B.; Xu, Y.; Xu, T.; Liu, C.; Zhu, Z. Multi-scale convolutional neural network with time-cognition for multi-step short-term load forecasting. IEEE Access 2019, 7, 88058–88071.
  17. Moreno, S.R.; da Silva, R.G.; Mariani, V.C.; dos Santos Coelho, L. Multi-step wind speed forecasting based on hybrid multi-stage decomposition model and long short-term memory neural network. Energy Convers. Manag. 2020, 213, 112869.
  18. Chen, K.; Chen, K.; Wang, Q.; He, Z.; Hu, J.; He, J. Short-Term Load Forecasting with Deep Residual Networks. IEEE Trans. Smart Grid 2019, 10, 3943–3952.
  19. Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans. Smart Grid 2017, 10, 841–851.
  20. Yang, Y.; Shang, Z.; Chen, Y.; Chen, Y. Multi-Objective Particle Swarm Optimization Algorithm for Multi-Step Electric Load Forecasting. Energies 2020, 13, 532.
  21. Aprillia, H.; Yang, H.T.; Huang, C.M. Optimal decomposition and reconstruction of discrete wavelet transformation for short-term load forecasting. Energies 2019, 12, 4654.
  22. Hong, T.; Wilson, J.; Xie, J. Long term probabilistic load forecasting and normalization with hourly information. IEEE Trans. Smart Grid 2014, 5, 456–462.
  23. Acharya, S.K.; Wi, Y.M.; Lee, J. Short-term load forecasting for a single household based on convolution neural networks using data augmentation. Energies 2019, 12, 3560.
  24. Wen, Q.; Sun, L.; Song, X.; Gao, J.; Wang, X.; Xu, H. Time Series Data Augmentation for Deep Learning: A Survey. arXiv 2020, arXiv:2002.12478.
  25. Bianchi, F.M.; De Santis, E.; Rizzi, A.; Sadeghian, A. Short-term electric load forecasting using echo state networks and PCA decomposition. IEEE Access 2015, 3, 1931–1943.
  26. Liu, L.; Ding, F.; Xu, L.; Pan, J.; Alsaedi, A.; Hayat, T. Maximum likelihood recursive identification for the multivariate equation-error autoregressive moving average systems using the data filtering. IEEE Access 2019, 7, 41154–41163.
  26. Liu, L.; Ding, F.; Xu, L.; Pan, J.; Alsaedi, A.; Hayat, T. Maximum likelihood recursive identification for the multivariate equation-error autoregressive moving average systems using the data filtering. IEEE Access 2019, 7, 41154–41163. [Google Scholar] [CrossRef]
  27. Rahim, R.A.; Jamaluddin, F.; Ibrahim, H.; Khan, S.K.N. A review on smoothing techniques in Markov chains methods. AIP Conf. Proc. 2014, 1635, 195–200. [Google Scholar]
  28. Ryu, S.; Noh, J.; Kim, H. Deep neural network based demand side short term load forecasting. Energies 2017, 10, 3. [Google Scholar] [CrossRef]
  29. Wu, J.M.T.; Tsai, M.H.; Huang, Y.Z.; Islam, S.H.; Hassan, M.M.; Alelaiwi, A.; Fortino, G. Applying an ensemble convolutional neural network with Savitzky–Golay filter to construct a phonocardiogram prediction model. Appl. Soft Comput. 2019, 78, 29–40. [Google Scholar] [CrossRef]
  30. Dong, Q.; Lin, Y.; Bi, J.; Yuan, H. An Integrated Deep Neural Network Approach for Large-Scale Water Quality Time Series Prediction. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3537–3542. [Google Scholar]
  31. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  32. Liu, Z.X.; Zhang, D.G.; Luo, G.Z.; Lian, M.; Liu, B. A new method of emotional analysis based on CNN–BiLSTM hybrid neural network. Clust. Comput. 2020, 122, 1–13. [Google Scholar] [CrossRef]
  33. Ju, Y.; Sun, G.; Chen, Q.; Zhang, M.; Zhu, H.; Rehman, M.U. A model combining convolutional neural network and LightGBM algorithm for ultra-short-term wind power forecasting. IEEE Access 2019, 7, 28309–28318. [Google Scholar] [CrossRef]
  34. Acharya, D.; Rani, A.; Agarwal, S.; Singh, V. Application of adaptive Savitzky–Golay filter for EEG signal processing. Perspect. Sci. 2016, 8, 677–679. [Google Scholar] [CrossRef] [Green Version]
  35. Atif, A.; Khalid, M. Saviztky-Golay Filtering for Solar Power Smoothing and Ramp Rate Reduction Based on Controlled Battery Energy Storage. IEEE Access 2020, 8, 33806–33817. [Google Scholar] [CrossRef]
  36. Seo, J.; Ma, H.; Saha, T.K. On Savitzky–Golay filtering for online condition monitoring of transformer on-load tap changer. IEEE Trans. Power Deliv. 2017, 33, 1689–1698. [Google Scholar] [CrossRef]
  37. Nguyen, D.V.; Rocke, D.M. On partial least squares dimension reduction for microarray-based classification: A simulation study. Comput. Stat. Data Anal. 2004, 46, 407–425. [Google Scholar] [CrossRef]
  38. Sun, W.; Tang, J.; Bai, C. Evaluation of university project based on partial least squares and dynamic back propagation neural network group. IEEE Access 2019, 7, 69494–69503. [Google Scholar] [CrossRef]
  39. Simon Blanke. Hyperactive: A Hyperparameter Optimization and Meta-Learning Toolbox for Machine-/ Deep-Learning Models. Since 2019. Available online: https://github.com/SimonBlanke (accessed on 20 August 2020).
  40. Eichardt, R.; Haueisen, J.; Knösche, T.R.; Schukat-Talamazzini, E.G. Reconstruction of multiple neuromagnetic sources using augmented evolution strategies—A comparative study. IEEE Trans. Biomed. Eng. 2008, 55, 703–712. [Google Scholar] [CrossRef]
  41. Beyer, H.G.; Schwefel, H.P. Evolution strategies—A comprehensive introduction. Nat. Comput. 2002, 1, 3–52. [Google Scholar] [CrossRef]
  42. Loshchilov, I.; Hutter, F. CMA-ES for Hyperparameter Optimization of Deep Neural Networks. arXiv 2016, arXiv:1604.07269. [Google Scholar]
  43. Wong, T.T.; Yeh, P.Y. Reliable accuracy estimates from k-fold cross validation. IEEE Trans. Knowl. Data Eng. 2019, 32, 1586–1594. [Google Scholar] [CrossRef]
  44. Massaoudi, M.; Refaat, S.S.; Abu-Rub, H.; Chihi, I.; Wesleti, F.S. A Hybrid Bayesian Ridge Regression-CWT- Catboost Model For PV Power Forecasting. In Proceedings of the 2020 IEEE Kansas Power and Energy Conference (KPEC), Manhattan, KS, USA, 13–14 July 2020; pp. 1–5. [Google Scholar]
  45. Massaoudi, M.; Chihi, I.; Sidhom, L.; Trabelsi, M.; Oueslati, F.S. Medium and Long-Term Parametric Temperature Forecasting using Real Meteorological Data. In Proceedings of the IECON 2019-45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 September 2019; Volume 1, pp. 2402–2407. [Google Scholar]
  46. ISO New England Web Page of Pricing Reports, Ancillary Services—Final Hourly Regulation Clearing Prices—Historical Data Section. Available online: https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/day-ahead-energy-offer-data (accessed on 30 July 2020).
  47. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  48. Botchkarev, A. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. arXiv 2018, arXiv:1809.03006. [Google Scholar]
  49. Chollet, F.; Keras. Available online: https://github.com/fchollet/keras (accessed on 15 June 2020).
  50. Xie, H.; Zhang, L.; Lim, C.P. Evolving CNN-LSTM Models for Time Series Prediction Using Enhanced Grey Wolf Optimizer. IEEE Access 2020, 8, 161519–161541. [Google Scholar] [CrossRef]
  51. Shi, H.; Xu, M.; Li, R. Deep Learning for Household Load Forecasting-A Novel Pooling Deep RNN. IEEE Trans. Smart Grid 2018, 9, 5271–5280. [Google Scholar] [CrossRef]
  52. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  53. Hansen, N.; Müller, S.D.; Koumoutsakos, P. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol. Comput. 2003, 11, 1–18. [Google Scholar] [CrossRef]
  54. Pierezan, J.; Coelho, L.D.S. Coyote optimization algorithm: A new metaheuristic for global optimization problems. In Proceedings of the 2018 IEEE Congress on Evolutionary Computation (CEC), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8. [Google Scholar]
Figure 1. Bidirectional Long Short-Term Memory (BiLSTM) model structure, with σ(x) = 1/(1 + e^(−x)).
Figure 2. Flowchart of the proposed Convolutional neural network-Bidirectional Long Short-Term Memory (CBiLSTM) model.
Figure 3. Flowchart of the evolution strategy algorithm.
Figure 4. The adopted methodology for the proposed Savitzky–Golay (SG)-CBiLSTM model.
Figure 5. Flowchart of the proposed architecture.
Figure 6. Ten-fold cross-validation procedure, where E denotes the mean of the cross-validation errors E_i over the i-th iterations, so that the total effectiveness of the model can be evaluated.
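A minimal sketch of this procedure with scikit-learn's KFold is given below. The build_model factory, the 20-epoch budget, and the use of RMSE as the fold error are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(build_model, X, y, k=10):
    """Return the per-fold test errors E_i and their mean E (cf. Figure 6).

    build_model is a hypothetical factory returning a fresh compiled Keras
    model; RMSE as the fold error and 20 epochs are assumptions.
    """
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=k).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=20, batch_size=512, verbose=0)
        y_hat = model.predict(X[test_idx], verbose=0).ravel()
        fold_errors.append(float(np.sqrt(np.mean((y_hat - y[test_idx]) ** 2))))
    return fold_errors, float(np.mean(fold_errors))
```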
Figure 7. The overall system level load and temperature between 2003 and 2011 (103,753 samples) for the ISO-NE dataset.
Figure 8. Comparative results of the SG-CBiLSTM model with different polynomial functions.
Figure 9. Variable Importance of load Patterns (VIP), with the threshold (red line) at VIP = 0.5.
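The VIP scores behind Figure 9 can be computed from a fitted scikit-learn PLSRegression model, as in the sketch below; the number of PLS components is an assumption:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Variable Importance in Projection for a fitted PLSRegression model."""
    t = pls.x_scores_    # latent scores, shape (n_samples, n_components)
    w = pls.x_weights_   # feature weights, shape (n_features, n_components)
    q = pls.y_loadings_  # target loadings, shape (n_targets, n_components)
    p = w.shape[0]
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)  # SS explained per component
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2) @ ss / ss.sum())

# Keep only features whose importance exceeds the VIP = 0.5 threshold (Figure 9):
# pls = PLSRegression(n_components=5).fit(X, y)   # n_components is an assumption
# X_selected = X[:, vip_scores(pls) >= 0.5]
```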
Figure 10. Performance comparison of the SG-CBiLSTM model with different polynomial orders p = {3, 5, 7, 10}.
Figure 11. Training and testing loss variation over epochs.
Figure 12. Box plot of ten-fold cross-validation (CV) of four models in terms of R² (%).
Figure 13. Hourly predictions of the forecasting system.
Figure 14. Multiple tests of different Deep Learning (DL) models using different time windows. Although the BiLSTM autoencoder (BiLSTM-AE) is highly competitive, the proposed SG-CBiLSTM model outperforms it in most cases.
Figure 15. Multi-step Short-Term Load Forecasting (STLF) error variation, with Error (%) = 100 × (ŷ_norm − y_norm).
Table 1. Generated feature vectors.

| Data Fold | Augmented Data | Formula | Unit/Range |
|---|---|---|---|
| Load forecast | Load(t − 1), …, Load(t − 23) | L(t − 1), …, L(t − 23) | MW |
| Load demand | – | L_t | MW |
| Datetime | Year | Y_t = YEAR(DateTime) | [2003, 2014] |
| | Month | M_t = MONTH(DateTime) | [1, 12] |
| | Day | D = DAY(DateTime) | [1, 31] |
| | Weekday | W_d = WORKDAY(DateTime) | [1, 7] |
| | Hour | H_t = HOUR(DateTime) | [1, 24] |
| | Season of the year | S = SEASON(DateTime) | [0, 3] |
| | Week number | W_n = WEEKNUMBER(DateTime) | [1, 52] |
| | Weekend | W_nd = IF(W_d > 5, Yes, No) | Boolean |
| Weather data | Temperature | T_t | °C |
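The augmented feature vectors of Table 1 can be generated with pandas, as in the following sketch; the DataFrame layout and column names are illustrative assumptions:

```python
import pandas as pd

def build_features(df):
    """Build the Table 1 feature vectors from an hourly-indexed DataFrame.

    df is assumed to carry 'load' (MW) and 'temperature' (degC) columns;
    all names are illustrative.
    """
    out = pd.DataFrame(index=df.index)
    for lag in range(1, 24):                      # Load(t-1) ... Load(t-23)
        out[f"load_lag_{lag}"] = df["load"].shift(lag)
    out["year"] = df.index.year
    out["month"] = df.index.month                 # [1, 12]
    out["day"] = df.index.day                     # [1, 31]
    out["weekday"] = df.index.dayofweek + 1       # [1, 7]
    out["hour"] = df.index.hour + 1               # [1, 24]
    out["season"] = df.index.month % 12 // 3      # [0, 3]
    out["week_number"] = df.index.isocalendar().week.astype(int)  # [1, 52]
    out["weekend"] = out["weekday"] > 5           # Boolean
    out["temperature"] = df["temperature"]
    return out.dropna()                           # drop rows without full lag history
```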
Table 2. Experimental environment.

| Experimental Environment | Properties |
|---|---|
| Operating system | Windows 10 |
| Processor | Intel Core i7 9th Gen |
| Graphics card | NVIDIA GeForce GTX 1650 |
| Programming language | Python 3.6.7 |
| Deep learning framework | Keras 2.4.3 |
| Platform | TensorFlow 2.3.0 |
| Supplementary tools | Sklearn 0.23.2 |
Table 3. The search space and Evolution Strategy (ES) results of the proposed SG-CBiLSTM.

| Model | Configuration (Neurons/Filters) | Search Space | ES |
|---|---|---|---|
| CNN | CNN layer 1 | [32, 64, 72, 128] | 32 |
| | CNN layer 2 | [32, 64, 128] | 64 |
| | CNN layer 3 | [32, 64, 128] | 64 |
| BiLSTM | LSTM layer 1 | [100, 150, 250] | 150 |
| | LSTM layer 2 | [32, 50, 150, 250] | 50 |
| | LSTM layer 3 | [32, 50, 150, 250] | 32 |
| | Batch size | [32, 64, 72, 256, 512] | 512 |
| | Iterations number | [20, 30, 70, 100] | 100 |
| | Time (min) | | 131.64 |
| | Time/iteration (min) | | 8.77 |
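The optimization was carried out with the Hyperactive toolbox [39]. As a rough illustration of an ES loop over the discrete space of Table 3, a stripped-down (1+1)-ES sketch follows; the fitness function, which would train and validate a candidate CBiLSTM, is a placeholder:

```python
import random

SEARCH_SPACE = {
    "cnn_1": [32, 64, 72, 128], "cnn_2": [32, 64, 128], "cnn_3": [32, 64, 128],
    "lstm_1": [100, 150, 250], "lstm_2": [32, 50, 150, 250],
    "lstm_3": [32, 50, 150, 250], "batch_size": [32, 64, 72, 256, 512],
}

def es_search(fitness, iterations=100, seed=0):
    """Minimal (1+1)-evolution strategy over the discrete search space.

    fitness is a placeholder returning a validation error for a candidate
    configuration (lower is better); 100 iterations follows Table 3.
    """
    rng = random.Random(seed)
    parent = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    best = fitness(parent)
    for _ in range(iterations):
        child = dict(parent)
        key = rng.choice(list(SEARCH_SPACE))      # mutate one hyperparameter
        child[key] = rng.choice(SEARCH_SPACE[key])
        score = fitness(child)
        if score <= best:                         # greedy replacement
            parent, best = child, score
    return parent, best
```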
Table 4. Layer description of CBiLSTM.

| Layer | Configuration (Neurons/Filters) | Activation Function |
|---|---|---|
| Input layer | Sequential | – |
| Convolution 1-D | 32, k1 × k1 | ReLU |
| Convolution 1-D | 64, k1 × k1 | ReLU |
| Convolution 1-D | 64, k1 × k1 | ReLU |
| MaxPooling 1-D | k1 × k1 | ReLU |
| Flatten | 64 | – |
| BiLSTM | 150 | ReLU |
| BiLSTM | 50 | ELU |
| BiLSTM | 32 | ELU |
| Fully connected | 1 | Softmax |
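A Keras sketch of this stack is shown below. Because the BiLSTM layers consume sequences, the Flatten stage is omitted, and for a regression output the final Softmax of Table 4 is replaced by a linear unit; these deviations, together with the input shape, are assumptions of this sketch rather than the authors' exact implementation:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Bidirectional,
                                     LSTM, Dense, Dropout)

def build_cbilstm(n_steps=24, n_features=1):
    """CBiLSTM stack following Tables 4 and 5 (input shape is an assumption)."""
    model = Sequential([
        Conv1D(32, kernel_size=1, activation="relu",
               input_shape=(n_steps, n_features)),
        Conv1D(64, kernel_size=1, activation="relu"),
        Conv1D(64, kernel_size=1, activation="relu"),
        MaxPooling1D(pool_size=1),
        Bidirectional(LSTM(150, activation="relu", return_sequences=True)),
        Dropout(0.5),                             # dropout of 0.5 per Table 5
        Bidirectional(LSTM(50, activation="elu", return_sequences=True)),
        Dropout(0.5),
        Bidirectional(LSTM(32, activation="elu")),
        Dense(1),                                 # linear output for load regression
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```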
Table 5. Hyperparameter settings of nine methods.

| Model | Hyperparameter Settings |
|---|---|
| SG-CBiLSTM | Three convolution layers with 32, 64, and 64 filters, respectively; ReLU activation; three BiLSTM layers with 150, 50, and 32 neurons, respectively; batch size of 512; dropout between layers fixed at 0.5. |
| BiGRU | Four layers with 128, 32, and 4 neurons, ended by a fully connected layer; dropout between layers fixed at 0.4; Adam optimizer; ReLU activation in the BiGRU layers; batch size of 512. |
| BiLSTM | Seven layers with 256, 128, 64, 32, 16, and 4 neurons, finished by a fully connected layer; dropout between layers fixed at 0.2; Adam optimizer; ReLU activation in the BiLSTM layers; batch size of 512. |
| CBiLSTM | Three convolution layers with 32, 64, and 64 filters; ReLU activation; three BiLSTM layers with 150, 50, and 32 neurons, respectively; batch size of 512; dropout between layers fixed at 0.5. |
| BiLSTM-AE | Three BiLSTM layers with 500, 400, and 300 neurons, respectively; two repeat-vector layers located between the BiLSTM layers; Adam optimizer. |
| ConvLSTM-AE | An initial ConvLSTM2D layer with 64 filters and a 1 × 1 kernel; ReLU activation; followed by a flatten layer and a repeat-vector layer; an LSTM layer with 200 neurons; two time-distributed layers; Adam optimizer. |
| CNN | A Conv1D layer with 64 filters and a 2 × 2 kernel; ReLU activation; a max-pooling layer with pool size 2; a flatten layer; two fully connected layers with 50 and 1 units; Adam optimizer. |
| RNN | Two RNN layers with 500 and 100 neurons, respectively; dropout between layers fixed at 0.2; Adam optimizer; tanh activation in the fully connected layer. |
| ELM | 560 neurons in the hidden layer; tanh activation; alpha set to 0.5. |
Table 6. Performance variation of CBiLSTM with different SG filters.

| Model | MSLE | MAPE (%) | nRMSE (%) |
|---|---|---|---|
| 3 SG-CBiLSTM | 9.97 | 59.85 | 1.36 |
| 5 SG-CBiLSTM | 5.82 | 60.14 | 1.04 |
| 7 SG-CBiLSTM | 6.87 | 59.51 | 1.13 |
| 10 SG-CBiLSTM | 4.92 | 56.66 | 0.93 |
| Original CBiLSTM | 8.38 | 61.17 | 1.23 |
Table 7. The error measures of 10-fold cross-validation. The best result is marked in bold.

BiLSTM

| Fold | R² (%) Train | R² (%) Test | nMAE (%) Train | nMAE (%) Test | nRMSE (%) Train | nRMSE (%) Test |
|---|---|---|---|---|---|---|
| 1 | 95.47 | 95.23 | 3.11 | 3.22 | 2.07 | 2.13 |
| 2 | 95.40 | 95.82 | 3.14 | 2.99 | 2.08 | 2.02 |
| 3 | 95.67 | 95.89 | 3.05 | 2.95 | 1.99 | 1.98 |
| 4 | 95.54 | 95.19 | 3.09 | 3.19 | 2.06 | 2.12 |
| 5 | 95.34 | 95.23 | 3.16 | 3.20 | 2.12 | 2.12 |
| 6 | 95.58 | 95.24 | 3.07 | 3.22 | 2.02 | 2.10 |
| 7 | 95.31 | 95.47 | 3.17 | 3.11 | 2.12 | 2.10 |
| 8 | 95.24 | 95.17 | 3.19 | 3.24 | 2.13 | 2.15 |
| 9 | 95.51 | 95.43 | 3.11 | 3.10 | 2.08 | 2.09 |
| 10 | 95.37 | 95.36 | 3.15 | 3.14 | 2.14 | 2.18 |
| Mean | 95.45 | 95.41 | 3.13 | 3.14 | 2.09 | 2.10 |
| SD | 0.13 | 0.25 | 0.04 | 0.10 | 0.05 | 0.06 |
| Time (s) | 3238.38 | | | | | |

CNN

| Fold | R² (%) Train | R² (%) Test | nMAE (%) Train | nMAE (%) Test | nRMSE (%) Train | nRMSE (%) Test |
|---|---|---|---|---|---|---|
| 1 | 90.56 | 90.62 | 3.11 | 3.22 | 3.21 | 3.21 |
| 2 | 86.38 | 87.03 | 3.14 | 2.99 | 3.93 | 3.88 |
| 3 | 89.67 | 89.76 | 3.05 | 2.95 | 3.34 | 3.31 |
| 4 | 88.58 | 88.10 | 3.09 | 3.19 | 3.55 | 3.59 |
| 5 | 89.27 | 88.86 | 3.16 | 3.20 | 3.47 | 3.51 |
| 6 | 89.57 | 89.66 | 3.07 | 3.22 | 3.40 | 3.42 |
| 7 | 90.10 | 90.11 | 3.17 | 3.11 | 3.24 | 3.25 |
| 8 | 90.51 | 90.10 | 3.19 | 3.24 | 3.30 | 3.42 |
| 9 | 90.27 | 90.24 | 3.11 | 3.10 | 3.26 | 3.26 |
| 10 | 89.77 | 89.79 | 3.15 | 3.14 | 3.30 | 3.30 |
| Mean | 89.47 | 89.43 | 3.13 | 3.14 | 3.41 | 3.42 |
| SD | 1.18 | 1.05 | 0.04 | 0.10 | 0.20 | 0.19 |
| Time (s) | 3643.24 | | | | | |

CBiLSTM

| Fold | R² (%) Train | R² (%) Test | nMAE (%) Train | nMAE (%) Test | nRMSE (%) Train | nRMSE (%) Test |
|---|---|---|---|---|---|---|
| 1 | 88.54 | 87.79 | 4.96 | 5.16 | 2.84 | 2.95 |
| 2 | 88.11 | 88.46 | 5.05 | 4.97 | 3.00 | 3.96 |
| 3 | 85.84 | 85.52 | 5.52 | 5.54 | 3.17 | 3.18 |
| 4 | 87.63 | 87.79 | 5.16 | 5.09 | 2.92 | 2.90 |
| 5 | 86.55 | 85.84 | 5.37 | 5.53 | 3.12 | 3.18 |
| 6 | 86.55 | 85.02 | 5.71 | 5.73 | 3.24 | 3.29 |
| 7 | 84.80 | 89.16 | 4.76 | 4.81 | 2.74 | 2.77 |
| 8 | 89.47 | 86.9 | 5.23 | 5.34 | 2.95 | 3.04 |
| 9 | 87.26 | 84.91 | 5.68 | 5.64 | 3.31 | 3.33 |
| 10 | 85.13 | 85.49 | 5.65 | 5.56 | 3.26 | 3.26 |
| Mean | 86.84 | 86.96 | 5.31 | 5.34 | 3.06 | 3.09 |
| SD | 1.54 | 1.45 | 0.31 | 0.29 | 0.18 | 0.18 |
| Time (s) | 2416.56 | | | | | |

SG-CBiLSTM

| Fold | R² (%) Train | R² (%) Test | nMAE (%) Train | nMAE (%) Test | nRMSE (%) Train | nRMSE (%) Test |
|---|---|---|---|---|---|---|
| 1 | 99.18 | 99.05 | 0.78 | 1.43 | 0.89 | 0.90 |
| 2 | 99.27 | 99.09 | 0.73 | 1.38 | 0.65 | 0.66 |
| 3 | 99.02 | 99.28 | 0.91 | 1.22 | 0.70 | 0.71 |
| 4 | 99.12 | 99.05 | 0.77 | 1.43 | 0.70 | 0.70 |
| 5 | 99.22 | 98.97 | 0.79 | 1.48 | 0.62 | 0.64 |
| 6 | 99.16 | 99.33 | 0.73 | 1.20 | 0.68 | 0.69 |
| 7 | 99.21 | 99.16 | 0.78 | 1.33 | 0.73 | 0.73 |
| 8 | 99.13 | 99.29 | 0.73 | 1.23 | 0.67 | 0.67 |
| 9 | 99.29 | 99.02 | 0.76 | 1.45 | 0.74 | 0.75 |
| 10 | 99.37 | 99.55 | 0.64 | 0.98 | 0.70 | 0.70 |
| Mean | **99.20** | **99.18** | **0.77** | **1.32** | **0.71** | **0.72** |
| SD | 0.09 | 0.17 | 0.07 | 0.15 | 0.07 | 0.07 |
| Time (s) | 1808.4 | | | | | |
Table 8. Comparison of the proposed SG-CBiLSTM with the reference models on the ISO New England (ISO-NE) dataset (mean values of 10 runs).

| Model | nRMSE (%) ± SD | nMAE (%) ± SD | R² (%) ± SD |
|---|---|---|---|
| SG-CBiLSTM | 1.22 ± 0.20 | 0.89 ± 0.23 | 99.22 ± 0.29 |
| BiGRU | 2.71 ± 0.27 | 2.94 ± 0.37 | 95.93 ± 1.10 |
| BiLSTM-AE | 1.76 ± 0.11 | 1.16 ± 0.05 | 98.56 ± 0.18 |
| ConvLSTM-AE | 4.13 ± 0.31 | 3.13 ± 0.11 | 85.21 ± 0.23 |
| RNN | 3.77 ± 0.32 | 2.82 ± 0.25 | 93.35 ± 1.06 |
| ELM | 5.86 ± 0.30 | 4.45 ± 0.24 | 83.98 ± 1.74 |
Table 9. Results of multi-step forecasting on the ISO-NE dataset (mean values of 10 runs).

| Metric | One-Step | Twelve-Step | Twenty-Four-Step | Thirty-Six-Step |
|---|---|---|---|---|
| nRMSE (%) | 0.54 | 3.3 | 2.26 | 3.59 |
| nMAE (%) | 1.23 | 2.14 | 1.34 | 2.02 |
| R² (%) | 99.11 | 87.39 | 94.26 | 87.22 |
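The strategy used to extend the one-step forecaster to these horizons is not detailed here; one common option is the recursive strategy, sketched below under that assumption:

```python
import numpy as np

def recursive_forecast(model, history, n_steps, horizon):
    """Recursive multi-step forecast: feed each prediction back as input.

    The recursive strategy itself is an assumption; 'model' is a trained
    one-step forecaster over univariate windows of length n_steps.
    """
    window = list(np.asarray(history[-n_steps:], dtype="float32"))
    predictions = []
    for _ in range(horizon):
        x = np.array(window[-n_steps:], dtype="float32").reshape(1, n_steps, 1)
        y_hat = float(model.predict(x, verbose=0)[0, 0])
        predictions.append(y_hat)
        window.append(y_hat)
    return predictions
```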
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
