Article

An Intelligent Homogeneous Model Based on an Enhanced Weighted Kernel Self-Organizing Map for Forecasting House Prices

1 Department of Information Management, National Yunlin University of Science and Technology, Douliou 640301, Taiwan
2 Department of Business Administration, I-Shou University, Kaohsiung City 840301, Taiwan
* Author to whom correspondence should be addressed.
Submission received: 9 July 2022 / Revised: 19 July 2022 / Accepted: 21 July 2022 / Published: 24 July 2022
(This article belongs to the Section Land Innovations – Data and Machine Learning)

Abstract

Accurate housing price forecasts enable investors to attain profits, and they tell stakeholders whether prices in a community are falling, stabilizing, or rising. Previous studies on housing price forecasting mostly used hedonic pricing and weighted regression methods, which neglect nonlinear relationships and limit explanatory power. Furthermore, housing price data are heterogeneous, which makes accurate forecasting difficult. Therefore, this study proposes an intelligent homogeneous model based on an enhanced weighted kernel self-organizing map (EW-KSOM) for forecasting house prices; that is, the proposed EW-KSOM algorithm clusters the collected data, and random forest, extra tree, multilayer perceptron, and support vector regression then forecast the house prices of the full, district, and apartment complex data. In the experimental comparison, we compare the performance of the proposed EW-KSOM with the listed clustering methods. The results show that the best forecast algorithm is the combination of EW-KSOM and random forest under the root mean square error and root relative squared error, and the proposed method effectively improves the forecasting of housing prices and clarifies their influencing factors in the full data and important districts. Furthermore, we find that the top five key factors influencing house prices are transferred land area, house age, building transfer total area, population percentage, and the total number of floors. Lastly, the research results can provide references for investors and related organizations.

1. Introduction

Housing is a basic human need, and everyone needs a place to live. Real estate is both a consumer product and an investment product [1], and it also has a significant impact on the macroeconomy. Housing allocation is increasingly important for sustainable economic development and social stability because it is related to people's lives, and the demand for residential land is also an important issue in real estate [2]. House price appraisal is one of the most important transaction decisions and affects national real estate policy [3]. Hence, stakeholders use various kinds of information about a house to assess its value-to-price ratio. Housing prices are affected by many factors, including the market economy, industrial policies, land prices, income levels in different regions, the living environment, public facilities, and the needs of the people [4]. Furthermore, changes in housing prices reflect various influencing factors, and the important factors can be identified through data analysis. When stakeholders understand how real estate buyers and sellers evaluate housing prices, better evaluation indicators can be produced, and the government can then propose corresponding policies to maintain the stability of the real estate market, such as adjusting the land supply structure and housing purchase policies.
Housing prices are a widespread public concern, and an effective tool for forecasting future housing price fluctuations is needed. Various methods have been applied to forecast housing prices. Multivariate statistical models can consider many factors that affect housing prices. However, they require assumptions about the data distribution, and their performance is easily limited by sample size and data integrity. Compared with statistical models, intelligent forecast models are now becoming mainstream because they can identify nonlinear relationships between variables and provide higher accuracy and a deeper interpretation of market behavior. The cost and efficiency of intelligent learning from external data and economic signals are no longer an obstacle, and more and more stakeholders are adopting intelligent models to forecast housing prices.
Housing price data are heterogeneous. The most obvious heterogeneity is that relatively low-priced properties are more likely to find buyers than higher-priced ones. For an inexperienced agent, raising the price above the inferred price decreases the probability of a sale more than lowering it below the inferred price increases it, and at any price, more experienced agents are more likely to sell. Forecasting housing prices correctly is therefore very difficult. However, we can reduce the forecast error by homogenizing a large amount of collected data, for example by using a clustering algorithm to group heterogeneous data into similar clusters. Therefore, this research proposes an enhanced weighted kernel self-organizing map (EW-KSOM) to cluster the collected attribute data of real estate transactions and forecasts housing prices using four intelligent models. The contributions of this research are as follows:
  • Collect data from the real price registration of real estate information in Taiwan and population statistics from the Civil Affairs Bureau of the Taichung City government. The originally collected data total 16,617 records with 26 attributes; based on the relevant literature, 12 important attributes that affect changes in housing prices were identified among them.
  • Use the proposed EW-KSOM to cluster the collected data and then compare it with the listed clustering algorithms.
  • Apply random forest, extra tree, multilayer perceptron regression, and support vector regression to forecast the house prices of the full data and important districts, and compare the performance of the proposed intelligent homogeneous model with the listed methods.
  • The results can be used as references for home buyers, real estate companies, and relevant government agencies.
The remaining sections of this paper are organized as follows. Section 2 introduces the related work, including the factors and forecasts of real estate prices, clustering techniques, and intelligent forecast methods. The proposed method and its computational steps are described in Section 3. Section 4 explains the dataset and experimental environment. Section 5 illustrates the experimental results and comparisons. Section 6 presents the discussion, and Section 7 concludes.

2. Related Work

This section introduces the factors and forecasts of real estate prices, clustering techniques, and intelligent forecast methods.

2.1. Forecasts and Factors of Real Estate Prices

The research trend in real estate price forecasting over the past 10 years has been to forecast prices with hedonic regression and artificial intelligence methods [3]. The location choice of dwellings and real estate prices are public concerns, and buyers' location choices affect real estate prices. Forecasting methods for real estate prices include spatial regression and hedonic models [5]. Kumar et al. [6] performed regression analysis on real estate prices based on various attributes of the residence (such as the number of rooms and bathrooms) to forecast future prices. Previous studies on housing prices are mostly based on the hedonic method, a traditional statistical estimation approach. The hedonic price model was developed by Rosen [7]; in it, implicit prices can be derived from the attributes that constitute goods or services. Hedonic price regression selects living environment indicators based on the price people are willing to pay for the living environment and uses regression analysis to explore the relationship between the value of the real estate and its related attributes. The spatial regression method incorporates spatial dependence into regression analysis, which avoids unstable statistical parameters and unreliable significance tests and provides information about the spatial relationships between variables. Taltavull de La Paz et al. [8] showed that there is spatial autocorrelation between housing prices in the local market and that spatial factors also affect the growth of housing prices. Therefore, the spatial regression model can be applied to explore the influence of residential spatial factors on housing prices.
Park and Bae [3] pointed out that machine learning algorithms can improve housing price forecasts and compared the performance of various classifiers. Truong et al. [9] used various machine learning techniques to provide optimal results for housing price forecasts, and their results showed that random forest had the lowest error on the training dataset. Random forest can be used to explore the relationship between housing prices and their factors, helping to determine important attributes and forecast housing prices [10]. Rico-Juan and Taltavull de La Paz [11] used random forest and hedonic regression to evaluate the accuracy of housing price forecasts, and their results showed that combining the techniques increases the information about the nonlinear relationships between housing prices and housing attributes in the real estate market. Kumar et al. [6] used k-means to determine the central locations of house landmarks and then selected the cluster closest to the household's preference to choose the best apartment. Furthermore, Kang et al. [12] used a deep learning model to extract features from street view images and house photos, and they used a machine learning model and geographically weighted regression at two spatial scales to establish a housing price model; the results showed that their framework could model housing price appreciation with high accuracy.
According to a 2016 survey conducted by the National Association of Realtors, one-third of home buyers are first-time buyers, and two-thirds are experienced buyers [1]. The vigorously developing real estate industry has therefore been regarded as an important economic sector, one highly correlated with economic driving forces [2]. From an economic perspective, the potential factors affecting housing prices span four dimensions: structure (size, age, and decoration of a house), neighborhood (schools, hospitals, and parks), location (city center), and landscape. The location of the real estate in the city is one of the most important factors, and it affects the pricing standard for the best locations [6,13]. Debrezion et al. [14] illustrated the positive impact of railway accessibility on housing prices using sales data from three metropolitan areas in the Netherlands. Some important housing price factors and methods are listed in Table 1.

2.2. Clustering Techniques

Cluster analysis is a multivariate statistical method [17] that involves grouping data so that the data within a cluster are more similar to one another (e.g., closer in distance) than to the data in other clusters. There are many clustering algorithms in the literature, such as k-means [18], kernel k-means [19], and self-organizing maps [20]. This section introduces only the clustering algorithms used in this research: k-means (KM), kernel k-means (KKM), self-organizing maps (SOMs), and kernel self-organizing maps (KSOMs).
1. KM
KM is a distance-based clustering algorithm used to find outliers and group similar data; the smaller the distance between objects, the greater their similarity [21]. When the number of clusters is fixed at k, we find k cluster centers and assign each object to the nearest center so as to minimize the squared distance to the cluster [22]. Determining the number of clusters is an important step in KM clustering, and the elbow method [23] and the average silhouette method [24] can be used for this purpose.
2. KKM
The KM algorithm assumes that all data points lie in a Euclidean space, so it cannot handle nonlinearly separable datasets. To solve this problem, a kernel is introduced into KM clustering [25]. KKM embeds the data points in a high-dimensional nonlinear manifold and uses a kernel-induced distance function to define their similarity and capture the nonlinear structure of the dataset. KKM can achieve higher clustering quality than KM, for example with a scalable KKM algorithm [26]; a minimal sketch of the kernel trick behind KKM is given at the end of this subsection.
3. SOM
SOM is an unsupervised competitive learning method developed by Kohonen [20] that performs clustering and nonlinear projection of a dataset at the same time. SOM uses a set of neurons arranged in a low-dimensional structure, and the original data are divided into as many homogeneous clusters as the model has neurons, such that close clusters contain close data points in the original space. The advantages of SOM are that it maps N-dimensional data to a two-dimensional space while preserving the topology of the data, and the number of clusters does not need to be assigned in advance.
4. KSOM
A kernel is a function that computes a dot product in a high-dimensional feature space. MacDonald and Fyfe [27] proposed a k-means-based kernel SOM in which each mean is described as a weighted sum of the mapping function. Another KSOM is based on the normalized gradient descent method [28,29]; this algorithm minimizes the mean square error between the input and the corresponding prototype in the mapping space.
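As a concrete illustration of the kernel trick behind KKM, the following is a minimal NumPy sketch, not the implementation used in this study: cluster assignments are updated from feature-space distances computed entirely from the Gram matrix, so the nonlinear mapping never has to be evaluated explicitly. The RBF kernel, the gamma value, and the random initialization are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, k, gamma=1.0, max_iter=100, seed=0):
    """Kernel k-means: k-means in the RBF-induced feature space.

    Feature-space distances use only the Gram matrix K:
    ||phi(x_i) - m_c||^2 = K_ii - 2*mean_{j in c} K_ij + mean_{j,l in c} K_jl.
    """
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, gamma=gamma)               # n x n Gram matrix
    n = len(K)
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(max_iter):
        dist = np.empty((n, k))
        for c in range(k):
            members = labels == c
            if not members.any():                # re-seed an empty cluster
                members[rng.integers(n)] = True
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stable: converged
            break
        labels = new_labels
    return labels
```

With a linear kernel in place of the RBF kernel, this procedure reduces to standard KM, which is why KKM is described as clustering KM through a kernel.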

2.3. Intelligent Forecast Methods

1. Random Forest (RF)
RF is a nonlinear method developed by Breiman [30] that combines many decision trees to solve regression and classification problems. RF is an ensemble learning method, and increasing the number of trees helps prevent overfitting. In regression, RF averages the results of all independent trees; in classification, it uses majority voting [31]. RF can handle both categorical and numerical variables in forecast problems, and it can rank the variables from most to least relevant, which adds value to attribute selection in data analysis [31,32].
2. Extra Tree (ET)
ET is an ensemble learning algorithm and an extension of RF proposed by Geurts et al. [33]. ET increases the randomness of the RF algorithm and operates very similarly to RF. The ET algorithm builds a series of unpruned classification or regression trees following the traditional top-down procedure. The two main differences between ET and other tree-based methods are that it splits nodes by choosing cut points at random and that it grows each tree on the whole learning sample. ET increases randomization, while the number of trees controls the stability of the performance and the resistance to overfitting the training dataset. Therefore, ET adds randomization but still prevents overfitting and provides better performance [33,34].
3. Multilayer Perceptron (MLP)
MLP is an artificial neural network basically composed of three kinds of layers: input, hidden, and output. It is a feedforward network in which neurons are connected from the input layer through the hidden layer to the output layer [35]. MLP uses backpropagation to adjust the errors and weights of the network [36]: the error produced at the output layer is fed back, the weights are adjusted, and the process is repeated until the error is minimized. With activation functions and backpropagation, MLP achieves a good learning model; it is a nonlinear model with good fault tolerance. MLP can be used in regression forecast problems, namely MLP regression (MLPR); for example, Bin et al. [37] used MLP to build a real estate appraisal model that produced more accurate forecasts of house prices.
4. Support Vector Regression (SVR)
The support vector machine is a supervised learning algorithm for classification and regression developed by Vapnik et al. [38]; in classification problems it is called a support vector machine (SVM), while in regression problems it is called SVR. The main purpose of SVR is to find the best separating hyperplane for the data, and it can solve nonlinear problems. SVM can handle high-dimensional attributes for classification, but it is inefficient for large numbers of samples and very sensitive to missing data. In addition, SVM requires a suitable kernel function, such as a linear, polynomial, sigmoid, or radial basis function. As an SVR application, Kamara et al. [39] used deep learning models, including SVR, to forecast the number of days on the market, giving sellers the ability to adjust their prices in response to expected sales. A combined sketch of the four forecast models follows this list.
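To make the roles of the four forecast models concrete, the following is a minimal scikit-learn sketch with illustrative hyperparameters (they loosely mirror the settings later listed in Table 4: an RBF kernel and epsilon for SVR, a 200-unit logistic hidden layer for MLPR, and 100 trees for RF). The randomly generated matrix is a placeholder standing in for the preprocessed transaction records, and the 66/34 split matches the protocol of Section 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Placeholder data: 12 attribute columns and a unit-price target.
rng = np.random.default_rng(0)
X, y = rng.random((500, 12)), rng.random(500) * 1e5

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "ET": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "MLPR": MLPRegressor(hidden_layer_sizes=(200,), activation="logistic",
                         learning_rate_init=0.001, max_iter=1000, random_state=0),
    "SVR": SVR(kernel="rbf", epsilon=0.001, C=1.0, gamma="auto"),  # gamma = 1/n
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)
for name, model in models.items():
    model.fit(X_tr, y_tr)                                   # train on 66% of the data
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:,.2f}")                    # test error on the 34% holdout
```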

3. Proposed Model

This section introduces the proposed EW-KSOM and intelligent homogeneous model based on the EW-KSOM algorithm.

3.1. EW-KSOM Algorithm

Hung [40] applied k-means to cluster heterogeneous data into homogeneous groups and presented a homogeneous-attribute intelligent model for forecasting house prices. MacDonald and Fyfe [27] proposed a k-means-based KSOM in which each mean is described as a weighted sum of the mapping function. Based on the gradient descent method, Andras [29] proposed a kernel-Kohonen network in which the inputs and weights are transformed into a high-dimensional kernel Hilbert space, and the Euclidean distance in that space drives the weight update. Another KSOM is based on the normalized gradient descent method [28]; this algorithm minimizes the mean square error between the input and the corresponding prototype in the mapping space. SOM and KSOM do not need the number of clusters to be assigned in advance; after the weight coefficients are trained by competitive learning, the center of each cluster is obtained automatically. Furthermore, Lau et al. [41] stated that these kernel extensions of the original SOM minimize an energy function and that a KSOM can provide slightly better performance than other methods. To enhance the clustering capability and reduce heterogeneity, this study proposes the EW-KSOM algorithm. First, we set the objective function and the enhanced weighted function. Let i(X′) be the index of the winning neuron with respect to the input X′. The objective function is
i(X′) = arg min_j Ew(j)‖X′ − Wj‖² (1)
The enhanced weighted function is Ew(j) = exp(−‖X′ − Wj‖²/σ²), with the parameter σ ∈ [0.1, 1].
Second, the EW-KSOM algorithm proceeds as follows (a NumPy sketch of the training loop is given after the steps):
1. Define the input and output layers.
Assume an input vector X′ = [x1, x2, x3, …, xn], which is an n-dimensional vector; the number of neurons in the input layer is thus n. The neurons in the output layer are arranged as a matrix in a two-dimensional space, and each neuron has a weight vector.
2. Normalize the input vector and the weight vectors:
X′ = X′/‖X′‖ (2)
wj = Wj/‖Wj‖ (3)
where ‖X′‖ and ‖Wj‖ are the Euclidean norms of the input and weight vectors, respectively.
3. Input the data into the neural network.
Compute the dot product of the normalized data and each weight vector; the output neuron with the largest dot product wins the competition.
4. Update the neurons in the topological neighborhood of the winning neuron.
The weight update for the winning neuron and its neighbors is
wj(t + 1) = wj(t) + Ew(j)η(j)hj∇J(wj) (4)
where J(wj) = ‖φ(X′) − φ(wj)‖² with gradient ∇J(wj) = 2wj − 2φ(X′), φ is a kernel mapping function, ∇ denotes the gradient, η(j) is the learning rate function at training step j, and the topological distance from the winning neuron is defined below.
The winning neighborhood function is
hj = exp(−d²/δ²(j)) (5)
where δ(j) = d × (jmax − j)/jmax is the width of the topological neighborhood function, d = ‖ri(X′) − rj‖² is the topological distance, rj is the location of neuron j, and ri(X′) is the location of the winning neuron.
5. Update the learning rate.
Update the learning rate η and the topological neighborhood N; η(j) = 1/j is generally taken as the reciprocal of the iteration number, so the learning rate and the neighborhood distance both decrease as training proceeds.
6. Determine whether the algorithm has converged.
If the learning rate η ≤ ηmin or the set number of iterations is reached, the algorithm ends.
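The following is a minimal NumPy sketch of steps 1–6 under stated assumptions: the winner is selected by the dot-product rule of step 3, the update moves each weight toward the input (the descent direction of J in Equation (4)) scaled by the enhanced weight Ew(j), and the neighborhood width shrinks linearly with the iteration count in the spirit of Equation (5). The grid size, σ, and stopping threshold are illustrative choices, not prescribed values.

```python
import numpy as np

def ew_ksom(X, grid=(2, 2), n_iter=3000, sigma=0.5, eta_min=1e-4, seed=0):
    """Minimal sketch of the EW-KSOM training loop (steps 1-6 above)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # step 2: normalize inputs
    n_out = grid[0] * grid[1]
    W = rng.random((n_out, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)         # step 2: normalize weights
    # step 1: fixed 2D grid positions r_j of the output-layer neurons
    locs = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)

    for t in range(1, n_iter + 1):
        eta = 1.0 / t                                     # step 5: eta(j) = 1/j
        if eta <= eta_min:                                # step 6: stopping rule
            break
        x = X[rng.integers(len(X))]                       # present one input vector
        win = int(np.argmax(W @ x))                       # step 3: largest dot product wins
        d2 = ((x - W) ** 2).sum(axis=1)                   # ||x' - w_j||^2 for all neurons
        ew = np.exp(-d2 / sigma**2)                       # enhanced weight E_w(j)
        topo = ((locs - locs[win]) ** 2).sum(axis=1)      # topological distance d^2
        width = max(grid) * (n_iter - t) / n_iter + 1e-9  # shrinking width, Eq. (5)
        h = np.exp(-topo / width**2)                      # neighborhood function h_j
        W += (ew * eta * h)[:, None] * (x - W)            # step 4: move toward input
        W /= np.linalg.norm(W, axis=1, keepdims=True)     # keep weights normalized

    return np.argmax(X @ W.T, axis=1), W                  # cluster index of each record
```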

3.2. Intelligent Homogeneous Model Based on the EW-KSOM Algorithm

Housing price data are heterogeneous, and accurate forecasting is difficult because there are many different data sources, including categorical and numeric values. This study follows Hung [40], who applied k-means to cluster heterogeneous data into homogeneous groups to improve housing price forecasting. Therefore, we propose the EW-KSOM algorithm to cluster the collected data and name each cluster. Furthermore, we apply RF, ET, MLPR, and SVR to forecast the house prices of the full data and important districts, and we compare the performance of EW-KSOM with the listed clustering methods. The procedure of the proposed intelligent homogeneous model based on EW-KSOM for forecasting house prices is shown in Figure 1. The procedure contains five steps: data collection, data preprocessing, clustering with the proposed EW-KSOM algorithm, forecasting house prices, and evaluation and comparison. Steps 1 and 2 are described in Section 4, and steps 3 and 4 are explained in this section. Here, we introduce only the root mean square error (RMSE) and root relative squared error (RRSE) of step 5, as follows.
1. RMSE
The RMSE represents the sample standard error of the difference between the forecasted and observed values [42]. It aggregates the sizes of the forecast errors into a single value that expresses forecast ability. The RMSE is a good measure of accuracy, but because it depends on the numerical range, it can only be used to compare the forecast errors of a specific variable between different models. The RMSE is defined in Equation (6):
RMSE = √(Σ_{i=1}^{n} (T(i) − F(i))²/n) (6)
where T(i) represents the actual house price, F(i) denotes the forecasted house price, and n is the number of data points.
2. RRSE
The RRSE [43] normalizes the total squared error by dividing it by the total squared deviation of the actual values from their mean. Taking the square root reduces the error to the same dimension as the forecasted quantity. It is defined in Equation (7), and both metrics are implemented in the sketch below:
RRSE = √(Σ_{i=1}^{n} (T(i) − F(i))² / Σ_{i=1}^{n} (T(i) − T̄)²) (7)
where T̄ = Σ_{i=1}^{n} T(i)/n.
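A direct translation of Equations (6) and (7) into NumPy is shown below; note that Tables 5–7 report the RRSE multiplied by 100, i.e., as a percentage.

```python
import numpy as np

def rmse(t, f):
    """Equation (6): root mean square error of forecast f against actual t."""
    t, f = np.asarray(t, float), np.asarray(f, float)
    return np.sqrt(np.mean((t - f) ** 2))

def rrse(t, f):
    """Equation (7): total squared error normalized by the spread of t."""
    t, f = np.asarray(t, float), np.asarray(f, float)
    return np.sqrt(((t - f) ** 2).sum() / ((t - t.mean()) ** 2).sum())
```

For example, rmse([100, 200], [110, 190]) returns 10.0, and 100 * rrse([100, 200], [110, 190]) returns 20.0, the percentage form used in the tables.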

4. Dataset and Experimental Environment

Taichung City has a population of about 2.82 million and is the second-most populous city in Taiwan. Taichung City has 29 administrative districts, but the Heping and Daan districts had no house transaction records. Therefore, the collected data covered 27 administrative districts of Taichung City from the real price registration of real estate information in Taiwan (https://pip.moi.gov.tw/V3/Z/SCRZ0211.aspx?Mode=PN02 (accessed on 1 March 2021)), and the population statistics were from the Civil Affairs Bureau of the Taichung City government (https://demographics.taichung.gov.tw/Demographic/index.html?s=12685860 (accessed on 1 March 2021)). The original data were real estate registration records from 2019 to 2020, from which we selected the residential transaction data with 26 attributes and 16,617 records. This study used the population percentages of the 29 administrative districts and the average house prices (TWD/m²) in Taichung City to express the impact of the community environment and economic development on housing prices, as shown in Table 2. After preprocessing, the retained attributes were the district population percentage, transfer floor, total number of floors, building construction type, house age, transferred land area, building transfer total area, bedroom, living room, bathroom, compartmented, management, and unit price per square meter. We then coded the categorical variables numerically for clustering and forecasting; the descriptions and values of the 12 independent variables and 1 dependent variable (house price) are shown in Table 3.
To visualize the data, we used the Geocoding API of the Google Maps service to obtain the latitude and longitude of the transaction houses based on their locations (addresses), and then we applied the Choropleth and HeatMap map visualizations of the Python tool to generate visual maps. Figure 2 shows only the 27 administrative districts with transaction records; it is a map visualization of the average house price/m² for each district, with deeper colors denoting higher house prices. From Table 2 and Figure 2, we observe that the Taiping district had the highest housing prices due to the new expressway (Taiwan Route 74), convenient transportation close to the urban area, a large land area, and new residential buildings.
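As a sketch of this visualization pipeline, the snippet below geocodes addresses with the googlemaps client and overlays a folium HeatMap on a base map centered on Taichung; the API key and the address list are placeholders, and a folium Choropleth layer of district averages would additionally require GeoJSON boundaries of the administrative districts, which are not shown here.

```python
import folium
import googlemaps
from folium.plugins import HeatMap

gmaps = googlemaps.Client(key="YOUR_API_KEY")   # hypothetical API key
addresses = ["..."]                             # transaction addresses (placeholder)

points = []
for addr in addresses:
    result = gmaps.geocode(addr)                # Geocoding API lookup
    if result:
        loc = result[0]["geometry"]["location"]
        points.append([loc["lat"], loc["lng"]])

# Base map centered on Taichung City with a heat map of transaction locations.
m = folium.Map(location=[24.15, 120.67], zoom_start=11)
HeatMap(points).add_to(m)
m.save("taichung_transactions.html")
```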
The experimental environment used Python 3.7.4 to compute all results and visualize the data on a 3.6-GHz Intel i7-7700 CPU running the Windows 10 operating system. Four forecasting algorithms (RF, ET, MLPR, and SVR) were used to forecast the housing prices. The parameter settings of the clustering and forecast algorithms are shown in Table 4.

5. Experimental Results and Comparisons

This section applies the proposed EW-KSOM algorithm to cluster the collected data and then uses RF, ET, MLPR, and SVR to forecast house prices. Furthermore, we compare the performance of the proposed EW-KSOM with KM, SOM, KKM, KSOM, and no clustering. The experimental data were partitioned into 66% for training and 34% for testing, and each experiment was repeated 100 times, with the averages reported, to find the best algorithm (the evaluation loop is sketched below). After data preprocessing, we followed steps 3–5 of the proposed procedure in Figure 1 of Section 3.2. First, this experiment ran the SOM family of algorithms, because SOM, KSOM, and EW-KSOM do not need the number of clusters assigned in advance; the best number of clusters obtained from the SOM family was four. Second, we ran the KM family of algorithms with the best number of clusters, k = 4, and then applied RF, ET, MLPR, and SVR to forecast house prices. The comparative results for the full data are shown in Table 5, which shows that the proposed EW-KSOM algorithm combined with RF gave the best forecast results in terms of the RMSE and RRSE metrics.
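A minimal sketch of this evaluation protocol for a single data subset (the full data or one cluster) is shown below; the RF settings follow Table 4, and in the full procedure the per-cluster errors are averaged as in Table 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def repeated_rmse(X, y, n_runs=100, test_size=0.34):
    """Average test RMSE over repeated random 66/34 splits."""
    errors = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=run)   # a fresh split per run
        model = RandomForestRegressor(n_estimators=100, random_state=run)
        model.fit(X_tr, y_tr)
        errors.append(np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2)))
    return float(np.mean(errors))
```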
To further examine the performance of the proposed EW-KSOM algorithm, we chose the Beitun and Xitun districts, which have the largest populations and transaction volumes, to forecast house prices. In addition, the Xitun district is the location of the Taichung City government and council and is the most prosperous commercial area. Based on the comparative results for the full data in Table 5, we used only RF to forecast the house prices of the Beitun and Xitun districts. The results, shown in Table 6, indicate that the forecast capability of the proposed EW-KSOM was better than the listed methods (including no clustering) in terms of the RMSE and RRSE metrics. Furthermore, KKM was the worst clustering method, with results poorer than no clustering.

6. Discussions

From the experimental results, the proposed EW-KSOM clustering algorithm was better than the listed algorithms under the RMSE and RRSE metrics. Several further findings are discussed below:
1. Key Factors
This study collected 13 factors (see Table 3) to forecast house prices, among which the population percentage was used to express the importance of the geographical location of a region and its urban development. Based on the experimental results, we used the RF forecast model to rank the importance of the factors for the full data, as shown in Figure 3 (a sketch of one way to reproduce this signed ranking is given after this list). We observed that the top four positive factors were transferred land area > house age > population percentage > total number of floors, and the top four negative factors were building transfer total area > bedroom > living room > compartment. Our results are consistent with the literature: (1) housing prices in geographical locations with large populations rise sharply [3,11]; (2) residential units on higher floors have relatively higher values, a positive impact [44]; and (3) house area is highly correlated with house price [12]. Furthermore, we took the apartment complex data (building construction type 5: all floors with elevator, as in Table 3) from the collected data and ranked the importance of its factors, as shown in Figure 4. The top four positive factors were population percentage > building transfer total area > total number of floors > transfer floor, and the top three negative factors were house age > transferred land area > bedroom.
2. Homogeneous Effect
To improve forecast capability, this study proposed an intelligent homogeneous model based on EW-KSOM clustering to forecast house prices. The results show that the proposed EW-KSOM algorithm combined with RF gave the best forecast results for the full and district data, as shown in Table 5 and Table 6. Furthermore, Kumar et al. [6] used k-means clustering to identify locations for real estate investment based on user preferences and selected condominium and apartment complex data to rank the influencing factors of an apartment complex. Therefore, we also selected the apartment complex data from the collected data and applied the proposed intelligent homogeneous model based on EW-KSOM clustering to forecast house prices. Here, we compared the proposed algorithm only with KSOM, because EW-KSOM is a modified form of KSOM clustering. The results show that the proposed EW-KSOM algorithm combined with RF produced the best forecast result, as shown in Table 7. In summary, clustering made the data homogeneous, the forecast effect was better than with no clustering, and the proposed method was better than the listed methods.
3. Ensemble Forecast Algorithm
From the experimental results in Table 5, RF was the best forecast algorithm compared with the listed methods because RF is an ensemble learning, nonlinear algorithm that reduces overfitting, and it is very stable for categorical and continuous data [30]. In regression, RF averages the results of all independent trees, while majority voting is used in classification [31]. RF can also rank the variables from most to least relevant, which adds value to attribute selection in data analysis [31,32]. In the full, district, and apartment complex data, the results show that the proposed EW-KSOM combined with RF produced the best forecast results.
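The following sketch shows one plausible way to reproduce such a signed factor ranking: RF importances give the magnitude, while the sign of each factor's Pearson correlation with the unit price gives the direction. Both the sign rule and the column name unit_price are assumptions for illustration; the paper does not state how the directions in Figures 3 and 4 were assigned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def signed_importance(df: pd.DataFrame, target="unit_price"):
    """Rank factors by RF importance, signed by correlation with the target.

    The sign assignment is a hypothesis: RF importances are non-negative,
    so the positive/negative direction is taken from the Pearson correlation.
    """
    X, y = df.drop(columns=[target]), df[target]
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    sign = np.sign([np.corrcoef(X[c], y)[0, 1] for c in X.columns])
    return (pd.Series(rf.feature_importances_ * sign, index=X.columns)
              .sort_values(key=np.abs, ascending=False))
```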

7. Conclusions

Housing is a basic condition of people's lives. When buying a house, one needs a variety of housing information, and the most important information affecting the purchase is the house's price [6]. The trend of housing prices is determined by a number of factors. This study collected 13 factors from the relevant literature and used the collected data of the 27 administrative districts of Taichung City to forecast house prices. Previous housing price forecasting mostly used hedonic pricing and weighted regression methods, which neglect nonlinear relationships and limit explanatory power. This research used the proposed EW-KSOM algorithm to homogenize the collected data, and then we applied the ensemble learning method RF to forecast house prices. The results show that the proposed intelligent homogeneous model had better forecast capability, as shown in Table 5, Table 6 and Table 7. The significance and contributions of this study are summarized as follows:
  • Finding the key factors of full, district, and apartment complex data using ensemble learning RF;
  • Proposing EW-KSOM to cluster the full and apartment complex data and then comparing their forecast capabilities with the listed methods;
  • Applying RF, ET, MLPR, and SVR to forecast the house prices of the full, district, and apartment complex data and then comparing the performance of the proposed intelligent homogeneous model with the listed clustering methods;
  • Providing the results to home buyers, real estate companies, and relevant government agencies for reference.
The limitation of this study is that the housing price factors depend on the sample used, and the important factors derived will vary with different community samples. Therefore, we used two different datasets to describe the positive and negative factors: one was the full dataset, and the other was the apartment complex dataset. Furthermore, we referred to the work of Kazak et al. [45] to describe the influencing housing price factors. In the full dataset with 15,955 samples, the four most important positive factors were transferred land area, house age, population percentage, and the total number of floors, while the four negative factors were building transfer total area, bedroom, living room, and compartment, as shown in Figure 3. In the apartment complex dataset with 6368 samples, the four positive factors were population percentage, building transfer total area, total number of floors, and transfer floor, while the three negative factors were house age, transferred land area, and bedroom, as shown in Figure 4.
In future work, we can collect a greater variety of city data, such as that of the top six Taiwanese cities, to find the significant factors and enhance forecasting capability. Furthermore, we can use text mining to build a knowledge graph of buyer behavior.

Author Contributions

Conceptualization, C.-H.C. and M.-C.T.; methodology, C.-H.C. and M.-C.T.; software, M.-C.T.; validation, C.-H.C. and M.-C.T.; formal analysis, M.-C.T.; investigation, M.-C.T.; resources, C.-H.C.; data curation, C.-H.C.; writing—original draft preparation, C.-H.C.; writing—review and editing, M.-C.T.; visualization, M.-C.T.; supervision, C.-H.C. and M.-C.T.; project administration, M.-C.T.; funding acquisition, C.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The collected data are open data from the real price registration of real estate information in Taiwan (https://pip.moi.gov.tw/V3/Z/SCRZ0211.aspx?Mode=PN02 (accessed on 1 March 2021)) and from the Civil Affairs Bureau of the Taichung City government (https://demographics.taichung.gov.tw/Demographic/index.html?s=12685860 (accessed on 1 March 2021)).

Acknowledgments

The authors appreciate Duan-Zi Hung for his valuable insights and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Das, P.; Füss, R.; Hanle, B.; Russ, I.N. The cross-over effect of irrational sentiments in housing, commercial property, and stock markets. J. Bank. Financ. 2020, 114, 105799.
  2. Li, Y.; Zhu, D.; Zhao, J.; Zheng, X.; Zhang, L. Effect of the housing purchase restriction policy on the Real Estate Market: Evidence from a typical suburb of Beijing, China. Land Use Policy 2020, 94, 104528.
  3. Park, B.; Bae, J.K. Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Syst. Appl. 2015, 42, 2928–2934.
  4. Liu, L.; Wu, L. Predicting housing prices in China based on modified Holt’s exponential smoothing incorporating whale optimization algorithm. Socio Econ. Plan. Sci. 2020, 72, 100916.
  5. Zhuge, C.; Shao, C.; Gao, J.; Dong, C.; Zhang, H. Agent-based joint model of residential location choice and real estate price for land use and transport model. Comput. Environ. Urban Syst. 2016, 57, 93–105.
  6. Kumar, S.; Talasila, V.; Pasumarthy, R. A novel architecture to identify locations for Real Estate Investment. Int. J. Inf. Manag. 2019, 56, 102012.
  7. Rosen, S. Hedonic prices and implicit markets: Product differentiation in pure competition. J. Political Econ. 1974, 82, 34–55.
  8. Taltavull de La Paz, P.; López, E.; Juárez, F. Ripple effect on housing prices. Evidence from tourist markets in Alicante, Spain. Int. J. Strateg. Prop. Manag. 2017, 21, 1–14.
  9. Truong, Q.; Nguyen, M.; Dang, H.; Mei, B. Housing Price Prediction via Improved Machine Learning Techniques. Procedia Comput. Sci. 2020, 174, 433–442.
  10. Čeh, M.; Kilibarda, M.; Lisec, A.; Bajat, B. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. ISPRS Int. J. Geo Inf. 2018, 7, 168.
  11. Rico-Juan, J.R.; Taltavull de La Paz, P. Machine learning with explainability or spatial hedonics tools? An analysis of the asking prices in the housing market in Alicante, Spain. Expert Syst. Appl. 2021, 171, 114590.
  12. Kang, Y.; Zhang, F.; Peng, W.; Gao, S.; Rao, J.; Duarte, F.; Ratti, C. Understanding house price appreciation using multi-source big geo-data and machine learning. Land Use Policy 2020, 111, 104919.
  13. D’Acci, L. Quality of urban area, distance from city centre, and housing value. Case study on real estate values in Turin. Cities 2019, 91, 71–92.
  14. Debrezion, G.; Pels, E.; Rietveld, P. The impact of rail transport on real estate prices: An empirical analysis of the Dutch housing market. Urban Stud. 2011, 48, 997–1015.
  15. Shih, M.; Chiang, Y.H.; Chang, H.B. Where does floating TDR land? An analysis of location attributes in real estate development in Taiwan. Land Use Policy 2019, 82, 832–840.
  16. Tubadji, A.; Nijkamp, P. Green Online vs Green Offline preferences on local public goods trade-offs and house prices. Socio Econ. Plan. Sci. 2017, 58, 72–86.
  17. Hair, J.F.; Anderson, R.E.; Black, W.C. Multivariate Data Analysis, 7th ed.; Pearson: Harlow, UK, 2014.
  18. Sinaga, K.P.; Yang, M.-S. Unsupervised K-Means Clustering Algorithm. IEEE Access 2020, 8, 80716–80727.
  19. Dhillon, I.S.; Guan, Y.; Kulis, B. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 551–556.
  20. Kohonen, T. Self-Organisation and Associative Memory; Springer: Berlin/Heidelberg, Germany, 1989.
  21. Zhu, C.; Idemudia, C.U.; Feng, W. Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Inform. Med. Unlocked 2019, 17, 100179.
  22. Forgy, E.W. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 1965, 21, 768–769.
  23. Ketchen, D.J.; Shook, C.L. The application of cluster analysis in strategic management research: An analysis and critique. Strateg. Manag. J. 1996, 17, 441–458.
  24. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
  25. Kim, D.-W.; Lee, K.Y.; Lee, D.; Lee, K.H. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognit. 2005, 38, 607–611.
  26. Wang, S.; Gittens, A.; Mahoney, M.W. Scalable kernel k-means clustering with Nyström approximation: Relative-error bounds. J. Mach. Learn. Res. 2019, 20, 431–479.
  27. MacDonald, D.; Fyfe, C. The kernel self-organising map. In Proceedings of the Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies (KES 2000), Brighton, UK, 30 August–1 September 2000; Volume 1, pp. 317–320.
  28. Pan, Z.S.; Chen, S.C.; Zhang, D.Q. A kernel-based SOM classifier in input space. Acta Electron. Sin. 2004, 32, 227–231. (In Chinese)
  29. Andras, P. Kernel–Kohonen networks. Int. J. Neural Syst. 2002, 12, 117–135.
  30. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  31. Callens, A.; Morichon, D.; Abadie, S.; Delpey, M.; Liquet, B. Using Random forest and Gradient boosting trees to improve wave forecast at a specific location. Appl. Ocean Res. 2020, 104, 102339.
  32. Simsekler, M.C.E.; Qazi, A.; Alalami, M.A.; Ellahham, S.; Ozonoff, A. Evaluation of patient safety culture using a random forest algorithm. Reliab. Eng. Syst. Saf. 2020, 204, 107186.
  33. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
  34. Sagi, O.; Rokach, L. Explainable decision forest: Transforming a decision forest into an interpretable tree. Inf. Fusion 2020, 61, 124–138.
  35. Relich, M.; Świć, A. Parametric Estimation and Constraint Programming-Based Planning and Simulation of Production Cost of a New Product. Appl. Sci. 2020, 10, 6330.
  36. Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; Spartan Books: Washington, DC, USA, 1961.
  37. Bin, J.; Gardiner, B.; Li, E.; Liu, Z. Multi-source urban data fusion for property value assessment: A case study in Philadelphia. Neurocomputing 2020, 404, 70–83.
  38. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  39. Kamara, A.F.; Chen, E.; Liu, Q.; Pan, Z. A hybrid neural network for predicting Days on Market, a measure of liquidity in the real estate industry. Knowl. Based Syst. 2020, 208, 106417.
  40. Hung, D.Z. A Homogeneous-Attribute Intelligent Model for Forecasting House Prices: Taking Taichung City as an Example. Master’s Thesis, Department of Information Management, National Yunlin University of Science and Technology, Yunlin, Taiwan, 2021.
  41. Lau, K.W.; Yin, H.; Hubbard, S. Kernel self-organising maps for classification. Neurocomputing 2006, 69, 2033–2040.
  42. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
  43. Rodrigues, É.O.; Pinheiro, V.H.A.; Liatsis, P.; Conci, A. Machine learning in the prediction of cardiac epicardial and mediastinal fat volumes. Comput. Biol. Med. 2017, 89, 520–529.
  44. Yilmazer, S.; Kocaman, S. A mass appraisal assessment study using machine learning based on multiple regression and random forest. Land Use Policy 2020, 99, 104889.
  45. Kazak, J.; Dziezyc, H.; Forys, I.; Szewranski, S. Indicator-based analysis of socially sensitive and territorially sustainable development in relation to household energy consumption. Eng. Rural Dev. 2018, 17, 1653–1661.
Figure 1. Procedure of the intelligent homogeneous model based on EW-KSOM for forecasting house prices.
Figure 2. Average house price/m² for each district.
Figure 3. Important factors influencing house prices (full data). Note: the blue bars denote positive factors, and the red bars denote negative factors.
Figure 4. Important factors influencing house prices (apartment complex data). Note: the blue bars denote positive factors, and the red bars denote negative factors.
Table 1. Some important housing price factors and methods.
Factor | Method | Reference
Residential type | Multiple regression, principal component analysis, and vector autoregression | Das et al. [1]
Population | Multiple regression and gradient boosting machine | Kang et al. [12]
Proximity to public facilities | Logistic regression | Shih et al. [15]
Living environment | Multiple regression | Tubadji and Nijkamp [16]
Location | Decision tree, principal component analysis, and k-means | Kumar et al. [6]
Floor of dwelling | Multiple regression | D’Acci [13]
The number of bedrooms | Multiple regression | D’Acci [13]
The number of bathrooms | Decision tree, principal component analysis, and k-means | Kumar et al. [6]
House age | Decision tree, principal component analysis, and k-means | Kumar et al. [6]
Building construction type | Decision tree, principal component analysis, and k-means | Kumar et al. [6]
Dining hall type | Decision tree, principal component analysis, and k-means | Kumar et al. [6]
Area of the house | C4.5, RIPPER, naïve Bayesian, and AdaBoost | Park and Bae [3]
Table 2. The 29 administrative districts and average house prices in Taichung City.
District | Population Percentage (%) | Number of Transactions | Transaction Percentage (%) | Average House Price (TWD/m²)
Beitun | 10.29 | 2954 | 18.51 | 59,495.57
Xitun | 8.21 | 3349 | 20.99 | 65,704.80
Dali | 7.54 | 249 | 1.56 | 64,986.94
Taiping | 6.94 | 208 | 1.30 | 81,757.73
Nantun | 6.27 | 2037 | 12.77 | 68,130.58
Fengyuan | 5.87 | 538 | 3.37 | 54,178.21
North | 5.15 | 2285 | 14.32 | 56,040.20
South | 4.49 | 914 | 5.73 | 50,818.44
West | 4.03 | 1045 | 6.55 | 59,443.05
Tanzi | 3.87 | 604 | 3.79 | 52,382.98
Daya | 3.40 | 258 | 1.62 | 49,804.59
Shalu | 3.39 | 238 | 1.49 | 49,960.79
Qingshui | 3.14 | 148 | 0.93 | 43,729.76
Longjing | 2.77 | 210 | 1.32 | 44,988.53
Wuri | 2.73 | 72 | 0.45 | 64,658.75
Dajia | 2.69 | 98 | 0.61 | 42,124.94
East | 2.68 | 267 | 1.67 | 51,379.72
Shengang | 2.31 | 87 | 0.55 | 47,095.86
Wufeng | 2.30 | 38 | 0.24 | 54,687.49
Wuqi | 2.11 | 75 | 0.47 | 53,980.48
Dadu | 2.01 | 76 | 0.48 | 36,590.17
Houli | 1.92 | 60 | 0.38 | 56,764.40
Dongshi | 1.73 | 75 | 0.47 | 35,527.01
Waipu | 1.13 | 18 | 0.11 | 34,527.22
Xinshe | 0.85 | 8 | 0.05 | 36,796.75
Daan | 0.66 | 0 | 0 | N/A
Central | 0.63 | 40 | 0.25 | 36,400.72
Shigang | 0.51 | 4 | 0.03 | 36,313.00
Heping | 0.38 | 0 | 0 | N/A
Note: Taichung’s population is 2,818,139, and the total number of transactions was 15,955.
Table 3. Operational definitions of the dependent and independent variables.
Variable | Description | Value
Population percentage | Population percentage of the 29 administrative districts in Taichung City | Numeric value
Transfer floor | Transaction floor status: “0” denotes the full floor; “i” denotes transfer of the i-th floor | Categorical value: 0, 1, 2, …, 22
Total number of floors | The total number of floors in the house | Categorical value: 1, 2, …, 38
Building construction type | Housing type: “1” denotes a small studio apartment; “2” an old condominium (below five floors, no elevator); “3” a house (townhouse, villa, mansion); “4” an apartment complex (below 10 floors with elevator); “5” an apartment complex (all floors with elevator) | Categorical value: 1, 2, 3, 4, 5
House age | The year and month of the residential transaction minus the year and month of construction | Numeric value
Transferred land area (m²) | The land area of the house | Numeric value
Building transfer total area (m²) | The total building transfer area | Numeric value
Bedroom | The number of bedrooms in the house | Integer value
Living room | The number of living rooms in the house | Integer value
Bathroom | The number of bathrooms in the house | Integer value
Compartmented | Does the house have compartments? | “1” denotes yes; “0” denotes no
Management | Does the house have a housing management committee? | “1” denotes yes; “0” denotes no
Unit price per square meter | Transaction price per square meter (TWD) | Numeric value
Table 4. Algorithms and their parameter settings.
Algorithm | Parameter Settings | Reference
KM | n_clusters k = 4; max iterations = 500; tolerance = 0.0001 | Sinaga and Yang [18]
KKM | kernel function: RBF; n_clusters k = 4; max iterations = 500; tolerance = 0.0001 | Dhillon et al. [19]
SOM | dimension = 2; iterations = 3000; neighborhood: Gaussian; topology shape: hexagonal and rectangular | Kohonen [20]
KSOM | iterations = 3000; neighborhood: Gaussian; kernel function: RBF | Pan et al. [28]
EW-KSOM | iterations = 3000; neighborhood: Gaussian; kernel function: RBF | Proposed
RF | iterations = 100; batch size = 100 | Breiman [30]
ET | iterations = 10; batch size = 100 | Geurts et al. [33]
SVR | kernel function: RBF; epsilon = 0.001; gamma = 1/n (n = number of variables); C = 1 | Cortes and Vapnik [38]
MLPR | activation function: sigmoid; loss function: squared error; learning rate = 0.001; hidden layer size = 200 | Relich and Świć [35]
Table 5. Forecast results of different clustering methods.
Metric | Method | Cluster | RF | ET | MLPR | SVR
RMSE | SOM | Cluster_0 | 8635.49 | 9093.29 | 11,452.97 | 11,874.24
RMSE | SOM | Cluster_1 | 39,250.37 | 40,342.30 | 58,133.15 | 46,796.35
RMSE | SOM | Cluster_2 | 8473.34 | 8884.66 | 12,011.54 | 10,951.73
RMSE | SOM | Cluster_3 | 9105.38 | 9577.47 | 11,889.33 | 12,264.61
RMSE | SOM | Mean | 16,366.15 | 16,974.43 | 23,371.75 | 20,471.73
RMSE | KM | Cluster_0 | 38,337.52 | 40,936.52 | 59,874.59 | 46,798.05
RMSE | KM | Cluster_1 | 9156.01 | 9766.18 | 12,915.27 | 12,706.98
RMSE | KM | Cluster_2 | 8548.91 | 9089.44 | 11,476.57 | 10,512.94
RMSE | KM | Cluster_3 | 8631.88 | 9119.04 | 14,483.75 | 11,830.69
RMSE | KM | Mean | 16,168.58 | 17,227.80 | 24,687.55 | 20,462.17
RMSE | KSOM | Cluster_0 | 8635.49 | 9093.29 | 11,452.97 | 11,874.24
RMSE | KSOM | Cluster_1 | 39,250.37 | 40,342.30 | 58,133.15 | 46,796.35
RMSE | KSOM | Cluster_2 | 8473.34 | 8884.66 | 12,011.54 | 10,951.73
RMSE | KSOM | Cluster_3 | 9105.38 | 9577.47 | 11,889.33 | 12,264.61
RMSE | KSOM | Mean | 14,207.27 | 16,974.43 | 23,371.75 | 20,471.73
RMSE | KKM | Cluster_0 | 7421.88 | 7773.32 | 10,283.38 | 9818.09
RMSE | KKM | Cluster_1 | 59,165.89 | 64,166.71 | 121,929.04 | 73,348.07
RMSE | KKM | Cluster_2 | 80,135.18 | 86,157.62 | 134,955.38 | 83,812.93
RMSE | KKM | Cluster_3 | 7496.00 | 7966.27 | 8397.17 | 8295.41
RMSE | KKM | Mean | 38,554.74 | 41,515.98 | 68,891.24 | 43,818.63
RMSE | EW-KSOM | Cluster_0 | 29,146.84 | 31,800.90 | 64,375.41 | 37,768.25
RMSE | EW-KSOM | Cluster_1 | 5898.41 | 6159.44 | 6809.36 | 6684.28
RMSE | EW-KSOM | Cluster_2 | 6111.40 | 6461.76 | 7895.54 | 7085.51
RMSE | EW-KSOM | Cluster_3 | 9488.51 | 9978.76 | 14,927.50 | 12,803.43
RMSE | EW-KSOM | Mean | 12,661.29 | 13,600.26 | 23,501.95 | 16,085.37
RMSE | No clustering | — | 22,547.16 | 24,212.59 | 33,442.44 | 30,682.63
RRSE | SOM | Cluster_0 | 66.84 | 70.40 | 88.64 | 91.89
RRSE | SOM | Cluster_1 | 67.07 | 69.10 | 100.68 | 80.32
RRSE | SOM | Cluster_2 | 56.52 | 59.27 | 80.15 | 73.06
RRSE | SOM | Cluster_3 | 66.89 | 70.39 | 87.42 | 90.18
RRSE | SOM | Mean | 64.33 | 67.29 | 89.22 | 83.86
RRSE | KM | Cluster_0 | 69.14 | 73.87 | 109.10 | 84.75
RRSE | KM | Cluster_1 | 65.71 | 70.12 | 92.85 | 91.10
RRSE | KM | Cluster_2 | 58.39 | 62.09 | 78.44 | 71.81
RRSE | KM | Cluster_3 | 56.66 | 59.86 | 95.10 | 77.65
RRSE | KM | Mean | 62.48 | 66.49 | 93.87 | 81.33
RRSE | KSOM | Cluster_0 | 66.84 | 70.40 | 88.64 | 91.89
RRSE | KSOM | Cluster_1 | 67.07 | 69.10 | 100.68 | 80.32
RRSE | KSOM | Cluster_2 | 56.52 | 59.27 | 80.15 | 73.06
RRSE | KSOM | Cluster_3 | 66.89 | 70.39 | 87.42 | 90.18
RRSE | KSOM | Mean | 64.33 | 67.29 | 89.22 | 83.86
RRSE | KKM | Cluster_0 | 74.51 | 78.04 | 103.19 | 98.56
RRSE | KKM | Cluster_1 | 95.73 | 104.14 | 190.42 | 119.51
RRSE | KKM | Cluster_2 | 90.75 | 98.77 | 159.11 | 90.39
RRSE | KKM | Cluster_3 | 85.41 | 90.77 | 95.69 | 94.52
RRSE | KKM | Mean | 67.97 | 92.93 | 137.10 | 100.75
RRSE | EW-KSOM | Cluster_0 | 60.60 | 66.47 | 136.19 | 79.00
RRSE | EW-KSOM | Cluster_1 | 42.32 | 44.21 | 48.91 | 47.98
RRSE | EW-KSOM | Cluster_2 | 38.34 | 40.54 | 49.60 | 44.46
RRSE | EW-KSOM | Cluster_3 | 53.85 | 56.72 | 84.46 | 72.58
RRSE | EW-KSOM | Mean | 48.78 | 51.99 | 79.79 | 61.01
RRSE | No clustering | — | 64.62 | 69.64 | 95.37 | 88.37
Note: The bold digits in the original denote the lowest error among the five clustering algorithms combined with the four forecast methods in the RMSE and RRSE metrics.
Table 6. District forecast results of the proposed clustering and no clustering.
District | Metric | No Clustering | KM + RF | KKM + RF | SOM + RF | KSOM + RF | Proposed + RF
Beitun | RMSE | 17,084.58 | 21,061.82 | 31,480.37 | 20,388.58 | 11,269.03 | 11,121.00
Beitun | RRSE | 56.18 | 60.83 | 86.74 | 60.41 | 50.97 | 50.33
Xitun | RMSE | 33,350.05 | 28,101.78 | 54,364.43 | 25,409.79 | 14,667.85 | 14,288.89
Xitun | RRSE | 68.25 | 66.70 | 83.08 | 64.34 | 50.75 | 50.27
Note: The bold digits in the original denote the lowest error among the proposed clustering and the listed methods in the RMSE and RRSE metrics.
Table 7. Comparison results of the proposed clustering with KSOM and no clustering (apartment complex data: all floors with elevator).
Metric | Cluster | KSOM + RF | EW-KSOM + RF
RMSE | Cluster_0 | 7496.87 | 7460.05
RMSE | Cluster_1 | 5227.51 | 5361.11
RMSE | Cluster_2 | 5966.73 | 5928.15
RMSE | Cluster_3 | 7452.87 | 7194.46
RMSE | Mean | 6536.00 | 6485.98
RMSE | No clustering | — | 8225.22
RRSE | Cluster_0 | 51.78 | 52.91
RRSE | Cluster_1 | 42.19 | 43.37
RRSE | Cluster_2 | 37.63 | 39.37
RRSE | Cluster_3 | 42.66 | 38.37
RRSE | Mean | 43.57 | 43.50
RRSE | No clustering | — | 53.88
Note: The bold digits in the original denote the lowest error among the proposed and listed methods in the RMSE and RRSE metrics.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

