With the advent of the “Web 2.0” era [1], Internet users have shifted from passive recipients of information to producers of and participants in it. Users now spontaneously generate large volumes of data that carry geographical locations, such as social media check-ins and geotagged photos. In geography research, including tourism geography, these data have gradually augmented or replaced geographic data collected in traditional ways. According to statistics from the World Travel & Tourism Council and the World Tourism Organization, the tourism industry accounts for over ten percent of global GDP [3]. Furthermore, trip volume increases year by year, showing that the tourism industry plays an increasingly important role in the global economy [4]. Beyond its growing scale, the mode of tourism is also changing. Independent travel has become the mainstream mode [5], which has created demand among tourists for personalized and intelligent travel services.
New tourism demand has also promoted a transformation of the data sources and research goals in tourism geography. The use of geotagged photos in these studies reflects this trend. Geotagged photo data contain a large amount of tourism information and reflect tourists’ real preferences more directly [6]. In addition, many studies on tourist attraction recommendation systems have emerged, which aim to meet tourists’ increasing demand for intelligent and personalized tourism and to solve the problem of tourist information overload [8]. Recommendation methods are generally divided into content-based and collaborative filtering (CF) methods. The content-based method uses the attributes of the items that a user prefers to recommend similar items to that user [9]. Such a method is robust against the cold-start problem, in which a recommendation system can hardly make accurate recommendations for new users or items [10]. Nevertheless, it relies heavily on structured and accurate features, and its recommendation accuracy is comparatively low [11]. The CF-based method collects other users’ feedback to filter or rate the recommended items [10]. It is fast and accurate, and thus it is widely used in recommendation systems. However, it cannot handle the cold-start and data sparsity problems well. Both recommendation methods therefore have disadvantages that lead to insufficient recommendation accuracy in some scenarios, and hybrid recommendation methods that fuse the advantages of both have gradually become a trend [12]. Meanwhile, embedding models from the machine learning field have gradually emerged in research on recommendation algorithms. Using such a simple and efficient method to fuse content and contextual information in tourist attraction recommendation allows the two kinds of information to complement each other and improve recommendation accuracy.
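The contrast between the two families of methods can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the attraction names, attribute vectors, and visit history below are hypothetical, and cosine similarity is only one common choice of similarity measure. The content-based score compares attribute vectors, whereas the item-based CF score compares which users visited each attraction.

```python
import math

# Hypothetical attribute vectors per attraction: [historic, nature, shopping].
ITEM_FEATURES = {
    "temple": [1.0, 0.2, 0.0],
    "shrine": [0.9, 0.3, 0.0],
    "mall": [0.0, 0.0, 1.0],
    "new_garden": [0.1, 1.0, 0.0],  # new item: nobody has visited it yet
}

# Hypothetical visit history: user -> set of visited attractions.
VISITS = {
    "u1": {"temple", "mall"},
    "u2": {"temple", "mall"},
    "u3": {"shrine"},
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def content_score(liked, candidate):
    """Content-based: compare the attribute vectors of two attractions."""
    return cosine(ITEM_FEATURES[liked], ITEM_FEATURES[candidate])


def visit_column(item):
    """Binary column of the user-item matrix for one attraction."""
    return [1.0 if item in VISITS[user] else 0.0 for user in sorted(VISITS)]


def cf_score(liked, candidate):
    """Item-based CF: compare which users visited the two attractions."""
    return cosine(visit_column(liked), visit_column(candidate))


if __name__ == "__main__":
    for candidate in ("shrine", "mall", "new_garden"):
        print(candidate,
              round(content_score("temple", candidate), 3),
              round(cf_score("temple", candidate), 3))
```

In this toy data, the unvisited attraction receives a CF score of zero because its column of the user-item matrix is empty, which is the cold-start situation described above, while the content-based score can still be computed from its attributes.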
New data sources and new methods have brought new opportunities to research on tourist attraction recommendation methods, but they have also brought challenges. For instance, how to select and represent appropriate contextual and content information is a question worth considering. This holds especially for the visual information of tourist attractions, which is easily ignored and is difficult to extract because geotagged photos contain noisy and redundant photos. Therefore, we propose a tourist attraction recommendation model fusing spatial, temporal, and visual embeddings (STVE) for geotagged photos. We use Flickr-geotagged photos as the dataset to validate our model. The STVE model is built after several preprocessing steps and consists mainly of two parts: the embeddings of temporal and spatial constraint information and the embeddings of visual information. The embeddings of temporal and spatial constraint information are obtained with the negative sampling strategy of Word2Vec. We then use matrix factorization and Bayesian Personalized Ranking, combined with the embeddings of the representative images, to obtain the interaction between users and visual embeddings. The gradient ascent method is used to train and update the parameters. A comparison with several other recommendation methods demonstrates that STVE achieves better results on recommendation quality and ranking indicators. The experiments also analyze how the components and main parameters of STVE influence the recommendation results. The main contributions of our study are summarized below:
Given the CF-based models’ cold-start problems and the content-based models’ low accuracy problems, we propose a hybrid recommendation model for tourist attractions that fuses spatial, temporal, and visual embeddings (STVE).
We modify Skip-gram’s objective function to model the sequential factors in STVE, which takes advantage of Skip-gram’s ability to handle sequential data well and is more in line with actual tourist attraction recommendation scenarios.
Because noisy and redundant photos may harm both the extraction of visual embeddings and the recommendation results, we propose a framework that automatically removes noisy and redundant photos and selects representative images from which to extract the visual embeddings of the tourist attractions for further use.
The remainder of the paper is organized as follows. Section 2 reviews the related work on tourist attraction recommendation for social media data. Section 3 introduces the preliminaries and the overall framework of the study, including data acquisition, data preprocessing, and model building and training steps. Section 4 presents the performance comparison with other methods, the parameter sensitivity analysis, and the component-wise study. Section 5 summarizes this paper and discusses further study.
2. Related Work
Tourist attraction recommendation can be regarded as a type of location recommendation research. Similar to recommendation methods in other fields, location recommendation methods for social media data comprise content-based and CF-based methods. Nevertheless, with the development of recommendation system techniques, an increasing number of methods combine the two, incorporate context and content into CF, or fuse advanced machine learning methods. Such methods can no longer be classified as content-based or CF methods and are collectively known as hybrid methods. The selection of contextual and content information for these methods has become a nontrivial issue in location recommendation research.
Regarding contextual information in location recommendation methods, sequential information is among the most commonly considered. It is generally modeled with the Markov model and its variations, which compute the transition probabilities from one location to another in a transition matrix and make recommendations accordingly [14]. In recent years, many researchers have leveraged embedding methods to model sequential information. For instance, Xie et al. learned the transition from one point of interest (POI) to another with Large-scale Information Network Embedding (LINE) [17] and generated the embedding of each POI to recommend the next POI [18]. Zhao et al. leveraged Skip-gram to model the POI visiting trajectory [19]. Another important type of contextual information is geographical distance, since a typical characteristic of location recommendation is that it is constrained by geographical distance. Previous studies modeled geographical distance constraints in two major ways. One is to establish a simple inverse relationship between a user’s preference and the geographical distance between locations, for instance, the power-law function [20], the Gaussian model [22], and other inverse functions [24]. The other is to set a cutoff distance and filter out locations whose distance from the currently visited location exceeds it [15]. Apart from sequential and geographical factors, other factors have also been considered in location recommendation research, including temporal factors [25] and the category of the locations [27]. The studies above considered one or two factors in their recommendation models, but few have fully integrated the various factors that may affect recommendation accuracy, let alone combined them with content information.
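The two ways of modeling geographical distance described above can be sketched as follows. This is an illustrative sketch only: the functional forms are generic textbook versions rather than those of the cited studies, and the parameter names and defaults (`a`, `b`, `eps`, `sigma_km`, `cutoff_km`) as well as the coordinates in the usage note are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def power_law_weight(d_km, a=1.0, b=1.0, eps=0.1):
    """Power-law decay a * (d + eps)^(-b); eps avoids division by zero at d = 0."""
    return a * (d_km + eps) ** (-b)


def gaussian_weight(d_km, sigma_km=2.0):
    """Gaussian decay exp(-d^2 / (2 * sigma^2)); equals 1.0 at d = 0."""
    return math.exp(-(d_km ** 2) / (2.0 * sigma_km ** 2))


def filter_by_cutoff(current, candidates, cutoff_km):
    """Keep only the candidates within cutoff_km of the current location."""
    lat, lon = current
    return [
        name
        for name, (c_lat, c_lon) in candidates.items()
        if haversine_km(lat, lon, c_lat, c_lon) <= cutoff_km
    ]
```

For example, `filter_by_cutoff((0.0, 0.0), {"near": (0.0, 0.01), "far": (0.0, 0.5)}, 5.0)` keeps only `"near"`. The decay functions instead return a soft weight that can be multiplied into a preference score, so distant locations are down-weighted rather than removed.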
Content information includes user characteristics [28], tags [31], and visual information. Visual information is considered less often because accurate visual information is difficult to extract and user-generated photos contain noisy visual content. Some researchers leveraged the Scale-Invariant Feature Transform (SIFT) or color histograms to extract visual information [33], but these hand-crafted features greatly limit the accuracy of visual information extraction. The rise of convolutional neural networks (CNNs) has greatly improved visual information representation, and CNNs have been applied in recommendation methods with visual content [21]. However, the imbalance in the number of photos across tourist attractions and the noise and redundancy in photos still affect the representativeness of visual information. Recommendation methods based solely on visual content have relatively low accuracy, so they still need to be combined with other contextual information.
In this paper, we propose a hybrid tourist attraction recommendation model that fuses spatial, temporal, and visual embeddings for Flickr-geotagged photos (STVE). In the preprocessing steps, we leverage a framework to automatically filter the noisy and redundant photos and select representative images of tourist attractions to extract visual embeddings as accurately as possible. To build the STVE model, we modify Skip-gram’s objective function and leverage Word2Vec’s negative sampling strategy to model the spatial and temporal factors. Then we use Matrix Factorization to fuse the tourist attractions’ visual embeddings and train with Visual Bayesian Personalized Ranking. We select Tokyo as the study area to evaluate our STVE model.
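To make the pairwise ranking step more concrete, the following is a minimal sketch of one Bayesian Personalized Ranking update by gradient ascent on a matrix factorization score with an added visual term. It is not the STVE implementation. The score form `P[u]·Q[i] + T[u]·V[i]`, the names `P`, `Q`, `T`, `V`, and the learning rate and regularization values are assumptions made for illustration, and the visual embeddings `V` are treated as fixed inputs.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_model(n_users, n_items, visual, k=4, seed=0):
    """Small random latent factors; `visual` holds one fixed embedding per item."""
    rng = random.Random(seed)

    def small(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    d = len(visual[0])
    return {
        "P": [small(k) for _ in range(n_users)],  # user latent factors
        "Q": [small(k) for _ in range(n_items)],  # item latent factors
        "T": [small(d) for _ in range(n_users)],  # user visual preferences
        "V": visual,                              # fixed item visual embeddings
    }


def score(model, u, i):
    """Assumed preference score: latent interaction plus visual interaction."""
    return dot(model["P"][u], model["Q"][i]) + dot(model["T"][u], model["V"][i])


def bpr_step(model, u, i, j, lr=0.05, reg=0.01):
    """One gradient-ascent step on ln sigmoid(x_ui - x_uj) minus L2 regularization.

    i is an attraction the user visited, j one the user did not visit.
    Returns the log-likelihood of the triple before the update.
    """
    P, Q, T, V = model["P"], model["Q"], model["T"], model["V"]
    x_uij = score(model, u, i) - score(model, u, j)
    g = 1.0 - sigmoid(x_uij)  # derivative of ln sigmoid(x) with respect to x
    p_u = P[u][:]             # keep the old user factors for the item updates
    for k in range(len(p_u)):
        P[u][k] += lr * (g * (Q[i][k] - Q[j][k]) - reg * P[u][k])
        Q[i][k] += lr * (g * p_u[k] - reg * Q[i][k])
        Q[j][k] += lr * (-g * p_u[k] - reg * Q[j][k])
    for k in range(len(T[u])):
        T[u][k] += lr * (g * (V[i][k] - V[j][k]) - reg * T[u][k])
    return math.log(sigmoid(x_uij))
```

Repeating `bpr_step` on sampled (user, visited, unvisited) triples pushes the score of the visited attraction above that of the unvisited one. Because the objective is maximized, the parameters move along the gradient rather than against it, which is why the update is a gradient ascent step.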
The comparison results show that our STVE model can relieve the low accuracy issue of content-based methods and the cold-start issue of CF-based methods. We also analyze the sensitivity of the main parameters and explore how each component influences the recommendation results. These results demonstrate the superiority of STVE in providing highly accurate recommendations and give us further motivation to pursue our research. In future work, we will continue to improve our recommendation models by adding more contextual information (such as weather and season) and user attributes (such as age and gender). Furthermore, we will try to implement our model in web-based applications or other platforms for actual use.