Article

Modeling the Distribution of Human Mobility Metrics with Online Car-Hailing Data—An Empirical Study in Xi’an, China

1 School of Civil and Hydraulic Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
2 Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, Shenzhen 518060, China
3 School of Architecture and Urban Planning, Huazhong University of Science and Technology, Wuhan 430074, China
4 Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen 518034, China
5 School of Geography and Tourism, Shaanxi Normal University, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Academic Editor: Wolfgang Kainz
ISPRS Int. J. Geo-Inf. 2021, 10(4), 268; https://0-doi-org.brum.beds.ac.uk/10.3390/ijgi10040268
Received: 27 February 2021 / Revised: 31 March 2021 / Accepted: 14 April 2021 / Published: 17 April 2021

Abstract

Modeling the distribution of daily and hourly human mobility metrics is beneficial for studying underlying human travel patterns. In previous studies, some probability distribution functions were employed in order to establish a base for human mobility research. However, the selection of the most suitable distribution is still a challenging task. In this paper, we focus on modeling the distributions of travel distance, travel time, and travel speed. The daily and hourly trip data are fitted with several candidate distributions, and the best one is selected based on the Bayesian information criterion. A case study with online car-hailing data in Xi’an, China, is presented to demonstrate and evaluate the model fit. The results indicate that travel distance and travel time of daily and hourly human mobility tend to follow Gamma distribution, and travel speed can be approximated by Burr distribution. These results can contribute to a better understanding of online car-hailing travel patterns and establish a base for human mobility research.
Keywords: mobility metrics; distribution fitting; Gamma distribution; Burr distribution; online car-hailing

1. Introduction

The modeling of human mobility is an emergent research area. Studying the regularity and characteristics of human spatiotemporal mobility is of great significance in many fields, such as urban planning [1,2], traffic forecasting [3], and epidemic prevention [4,5].
When modeling human mobility, it is common to consider the probability distribution function (PDF) of its metrics (e.g., travel distance, travel time, and travel speed). It is generally accepted that daily and hourly human mobility metrics have a representative distribution [6]. Modeling the distributions of these metrics is fundamental, necessary, and beneficial for studying underlying travel patterns and establishing a base for human mobility research.
Recently, with the rapid development of information and communication technology (ICT) and location-based service (LBS) applications, online car-hailing equipped with the Global Positioning System (GPS) has come to play an increasingly important role in people’s daily travel. As an important data source, online car-hailing platforms (e.g., Uber, Lyft, and Didi Chuxing) generate a large amount of accurate location data. Unlike traditional survey data, cell phone datasets, wireless network traces, and taxi location data, online car-hailing data are of high quality, high resolution, and large scale, and they reflect the detailed spatiotemporal trajectory and the actual origin and destination of each trip. Online car-hailing has therefore established a rich and solid data foundation for distribution modeling, bringing new opportunities and challenges for understanding people’s travel behavior and intra-urban mobility.
To our knowledge, previous studies have proposed quite a few patterns of human mobility, such as the Lévy flight model [7,8], power law distribution [9,10,11,12,13], exponential distribution [14,15,16,17,18], lognormal distribution [19,20,21,22], Weibull distribution [23], Pareto distribution [24], and Gamma distribution [22]. Brockmann et al. observed that human travel distance exhibits a power law distribution by analyzing the statistical properties of bank note circulation, and that human travel trajectories may be approximated as Lévy flights (heavy-tailed random walks) [7]. This observation was confirmed by Rhee et al. using GPS traces collected from volunteers, which showed a non-negligible probability of high-displacement trips and long pause times between trips [8]. Despite the randomness indicated by Lévy flight models, a power law with an exponential cutoff can be used to approximate the displacement distribution of human trajectories obtained from mobile phone datasets [9,11], GPS traces [12], and online location-based social networks [13]. However, Kang et al., Jiang et al., and Liang et al. pointed out that an exponential distribution, rather than a power law, can be used to approximate taxis’ travel displacement and travel time [14,15,16,17,18]. Furthermore, by analyzing a taxi-trace dataset, Wang et al. found that displacement tends to follow an exponential distribution, and travel time is approximated by a lognormal distribution [19].
Although the findings mentioned above provide a beneficial reference for mining human mobility, they mostly focus on a single model, which may not fit all data well. Zheng et al. found that a fusion function, based on an exponential power law and a truncated Pareto distribution, represents the travel time distribution best [24]. Bazzani et al. studied the GPS data of private cars in Florence, Italy, and found that the single-trip length follows an exponential behavior at short distance scales but favors a power law distribution for trips longer than 30 km [18]. Csáji et al. and Zhang et al. found that the exponential distribution is not appropriate for travel distances, and that the lognormal distribution provides reasonable fits [20,21]. Plötz et al. used Weibull, Gamma, and lognormal distributions to fit individual daily driving distances, and found that Weibull and lognormal most often perform better than Gamma, and that the Weibull distribution fits most data but not all [23]. Kou and Cai analyzed the distributions of travel distance and travel time, and found that both follow a lognormal distribution in larger bike sharing systems, while the distribution for smaller systems varies among Weibull, Gamma, and lognormal [22].
To sum up, many empirical studies based on various datasets have demonstrated that mobility metrics may be fitted with several meaningful distributions, such as Lévy flight models, exponential, power-law, lognormal, Gamma, Weibull, and Rayleigh. However, for a real large-scale car-hailing trajectory dataset, can a single or mixed model achieve a good fit for all data? This remains to be explored. In addition, the above studies focused on modeling the distribution of human mobility with simple or overall data, while ignoring the variability of the PDF with day of week and time of day. Is the distribution type of mobility metrics at different time granularities different from that of the overall data? If so, how does it vary, and can it be described by a general distribution? These questions motivate our research, which uses a real large-scale car-hailing trajectory dataset to gain insight into human mobility patterns.
To fill this gap, this paper aims to model the distribution of human mobility metrics in different time granularity. Specifically, three metrics (travel distance, travel time, and travel speed) are introduced to explore massive trajectory data collected in Xi’an, China. For each mobility metric, several candidate distributions are compared based on model selection criteria, and the best one is selected. The statistical distributions of daily trip data are analyzed first, and they show the characteristic of skewed distribution. More granularly, hourly distributions are further evaluated, and a general distribution for each mobility metric is determined.
The remainder of this paper is organized as follows. Section 2 briefly introduces the online car-hailing dataset and carries out the basic analysis. Section 3 describes the trip metrics, the fitting distributions, and the method of model selection. Section 4 presents the result and analysis. Section 5 discusses the findings. Finally, Section 6 provides conclusions and recommendations for further research.

2. Data Collection and Basic Analysis

2.1. Data Description

The adopted trajectory data were generated by about 18,000 online car-hailing trips in Xi’an, China, from 1 October 2016 to 30 November 2016. Vehicle trajectories were composed of high-resolution GPS points, which were recorded every 2–4 s. Accordingly, an online car-hailing trajectory is a sequence of GPS sampling points with five fields. The vehicle ID and order ID were desensitized to protect privacy. The “Timestamp” field gives the time, in UTC, at which the point was recorded. “Latitude” and “Longitude” give the location of the vehicle.
Let $Tr_i^j = (p_1^{i,j}, p_2^{i,j}, \ldots, p_N^{i,j})$ denote the trajectory of the $j$th trip of vehicle $i$, where $p_n^{i,j} = (x, y, t)_n^{i,j}$ is the $n$th point of the sequence ($n = 1, 2, \ldots, N$). $(x, y)_n^{i,j}$ denotes the location and $t_n^{i,j}$ the timestamp, respectively. Given a trajectory, $t_1^{i,j} < t_2^{i,j} < \cdots < t_N^{i,j}$. For a vehicle, the origin and destination (OD) locations are the first and last sampling points of a trip. It makes sense to define $p_O^{i,j} = p_1^{i,j}$ and $p_D^{i,j} = p_N^{i,j}$. Hence, each OD trip can be simplified to be a vector from $p_O^{i,j}$ to $p_D^{i,j}$.
A road network consists of a set of nodes, directed links, and allowed movements. Each node is a geographical location representing a network intersection, which can be either signalized or non-signalized. A link is defined as the road section from its tail node to its head node. The relative position of a sampling point is its position along the link, measured from the link start node as a fraction of the link, and ranges over $[0, 1]$. For example, relative positions of 0, 0.5, and 1 represent the beginning, middle, and end of a link.

2.2. Data Processing

To model the travel distance distribution, the map matching (MM) and the path inference algorithm proposed by Chen et al. were first used [25]. As shown in Table 1, the original latitude and longitude were converted into geodetic coordinates, which could be used directly to calculate the travel displacement. Secondly, the relative position of the sampling point on the link was also calculated, and the UTC time was converted to the time of day (0-86400 s). Thirdly, the pick-up and drop-off points were extracted to calculate the trip metrics (e.g., travel time, travel displacement, travel distance).
Finally, data cleaning is an essential task, because some of the trip records were not suitable for use in this study. Considering travel costs, few passengers travel by online car-hailing when the travel time and distance are very short or very long [26,27]. In addition, travel speed should be within a reasonable range. Therefore, trip records meeting any of the following conditions were excluded from the study data: (1) travel distance and displacement between origin and destination of less than 300 m; (2) travel time of less than 1 min or longer than 2 h; (3) average travel speed below 5 km/h or in excess of 80 km/h [28].
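The three exclusion rules can be sketched as a simple filter. This is a minimal illustration, not the authors' implementation: the `Trip` record and its field names are hypothetical, and rule (1) is read literally as "distance and displacement both below 300 m", which is an assumption about the intended meaning.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    """Hypothetical trip record; field names are illustrative only."""
    distance_m: float      # map-matched travel distance, in metres
    displacement_m: float  # straight-line OD displacement, in metres
    duration_s: float      # travel time, in seconds


def keep_trip(trip: Trip) -> bool:
    """Return True if the trip survives the three exclusion rules."""
    # Rule 1 (assumed reading): distance and displacement both under 300 m.
    if trip.distance_m < 300 and trip.displacement_m < 300:
        return False
    # Rule 2: travel time shorter than 1 min or longer than 2 h.
    if trip.duration_s < 60 or trip.duration_s > 2 * 3600:
        return False
    # Rule 3: average speed outside 5-80 km/h.
    speed_kmh = (trip.distance_m / 1000.0) / (trip.duration_s / 3600.0)
    if speed_kmh < 5 or speed_kmh > 80:
        return False
    return True
```

Because the rules are independent predicates, they can be applied in any order; checking duration before computing speed avoids dividing by a near-zero travel time.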
In terms of the total trips over the two months, 6,203,848 trips were retained from 6,584,397 original trips after data cleaning, which means that about 6% of the trips were filtered out. From the perspective of daily trips, the average order availability was 94.22%, fluctuating between 93.21% and 94.85%. At a finer granularity, the study period was discretized into 1464 (24 × 61) 1 h intervals for further analysis of residents’ hourly trips. The hourly trip quantity ranged from 192 to 8636, as shown in Figure 1. On the whole, the number of trips during the day was much higher than at night, which is consistent with people being more active during the day. In addition, although the number of hourly trips between 00:00 and 07:00 may be less than 2000, it was sufficient for distribution fitting.

3. Trip Metrics and Data Fitting

3.1. Trip Metrics

In the existing studies, due to the lack of map matching, travel distance is usually replaced by the displacement or Manhattan distance between the origin and destination. However, travel distance here refers to the length of the actual path traveled by the OD trip in the road network. A path is composed of a series of successive links, and its length is the sum of the lengths of the links included in the OD trip. It is worth noting that the links where the origin and destination (OD) are located may be only partially traversed. Based on the map matching and path inference results, the travel distance $d_i^j$ (TD) is calculated as:
$$d_i^j = (1 - r_O^{i,j})\, d_O^{i,j} + \sum_{k=2}^{M-1} d_k^{i,j} + r_D^{i,j}\, d_D^{i,j}$$
where $M$ is the number of links included in the trip. $r_O^{i,j}$ and $r_D^{i,j}$ denote the relative positions of the trip origin and destination with respect to the link start node, which range over $[0, 1]$. $d_O^{i,j}$ and $d_D^{i,j}$ are the lengths of the links on which the trip origin and destination are located.
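As an illustration of the travel distance formula, the following sketch sums the partial first link, the full intermediate links, and the partial last link. The function name and argument layout are assumptions made for this example; the case of an origin and destination on the same link ($M = 1$) is not covered by the formula and is rejected here.

```python
def travel_distance(link_lengths, r_origin, r_dest):
    """Path length for a trip over M >= 2 links.

    link_lengths: lengths of the links on the matched path, in travel order.
    r_origin, r_dest: relative positions in [0, 1] of the origin on the first
    link and of the destination on the last link.
    """
    if len(link_lengths) < 2:
        raise ValueError("the formula assumes the path has at least two links")
    d_origin, d_dest = link_lengths[0], link_lengths[-1]
    # Links 2..M-1 are traversed in full.
    middle = sum(link_lengths[1:-1])
    # Remaining part of the first link plus travelled part of the last link.
    return (1 - r_origin) * d_origin + middle + r_dest * d_dest
```

For example, a path over links of 400 m, 250 m, and 300 m, entered a quarter of the way along the first link and left halfway along the last, gives 300 + 250 + 150 = 700 m.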
Travel time is another important metric and is closely tied to travel distance. Travel time means time elapsed from the origin to destination, and is influenced by real-time traffic conditions, weather conditions, driver’s driving habits, etc. As an important indicator for analyzing human mobility, travel time reflects accessibility and traffic conditions. For a trajectory of the trip $j$ of vehicle $i$, travel time $t_i^j$ (TT) is defined as:
$$t_i^j = t_N^{i,j} - t_1^{i,j} = t_D^{i,j} - t_O^{i,j}$$
To understand the relation between the metrics described above, travel speed is another important feature. The average travel speed $v_i^j$ (TS) is defined as:
$$v_i^j = d_i^j / t_i^j$$
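A short sketch of the travel time and average speed definitions follows. The units (seconds, metres, km/h) and function names are assumptions chosen to match the thresholds used in data cleaning; they are not specified by the formulas themselves.

```python
def travel_time_s(timestamps):
    """Elapsed time between the first and last sampling points, in seconds."""
    return timestamps[-1] - timestamps[0]


def average_speed_kmh(distance_m, time_s):
    """Average travel speed: distance divided by travel time, in km/h."""
    if time_s <= 0:
        raise ValueError("travel time must be positive")
    return (distance_m / 1000.0) / (time_s / 3600.0)
```

For instance, a 5000 m trip whose first and last timestamps are 900 s apart has an average speed of 20 km/h.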

3.2. The Fitting Distribution

Fitting function selection aims to identify the most appropriate distribution supported by the actual trip data. In the existing literature, exponential, (truncated) power-law, lognormal, Gamma, Weibull, and Burr distributions are commonly used for fitting the above trip metrics [6,11,15,19,21,29,30,31,32]. However, not all of the above distributions apply to the data in this study. In order to narrow the range of candidate distributions, one day of the two-month data was randomly selected to analyze the characteristics of the trip metrics. Figure 2 shows the frequency distribution histograms and the cumulative distribution functions (CDF) of travel distance, travel time, and travel speed, respectively. The shape of these histograms shows a significant right skew, which is confirmed by the skewness values (1.05, 1.60, and 0.95). In addition, the CDF is very close to one well before reaching the maximum. For example, the probability of a travel distance within 15 km is as high as 99.96%, and 99.93% of trips have a travel time of less than 1 h. The excessive kurtosis indicates that the data are too concentrated, possibly due to the existence of extreme values.
To further understand the shape of the data in detail, the skewness and kurtosis of the daily data are shown in Figure 3. The daily skewness is greater than 0.6, ranging from 0.75 to 2.1. This indicates that all data show a right skew, especially travel time data. Meanwhile, a kurtosis greater than three indicates a steep distribution, which suggests that the spread range of the trip metrics can be narrowed. At the hourly level, the skewness and kurtosis are 1.07 and 5.83, respectively, which also indicates that the hourly trip data are highly likely to follow a skewed distribution.
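The shape statistics discussed above can be computed from central moments. The sketch below uses the population (biased) moment estimators and the Pearson definition of kurtosis, under which a normal distribution scores three; the source does not state which estimator was used, so both choices are assumptions.

```python
def central_moment(xs, order):
    """Population central moment of the given order."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** order for x in xs) / n


def skewness(xs):
    """Third standardized moment; positive values indicate a right skew."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Pearson kurtosis (fourth standardized moment); 3 for a normal law."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2
```

A sample with a few large values, such as `[1, 1, 1, 2, 10]`, yields a positive skewness, whereas a symmetric sample yields zero.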
Based on the above analysis, five common skewed distributions (lognormal, Gamma, Weibull, Burr, and Rayleigh) are selected as candidate distributions to fit the daily and hourly data. The probability density functions (PDF) of these distributions are defined in the following formulas. Assuming that the variable $x$ follows a lognormal distribution, its PDF can be expressed as
$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$
where $\mu$ and $\sigma$ denote the mean and standard deviation of the natural logarithm of the variable $x$. The mathematical expectation and the variance are respectively $E(x) = e^{\mu + \sigma^2/2}$ and $Var(x) = (e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}$.
The Gamma distribution with shape and scale parameters $\alpha$ and $\beta$ is
$$f(x) = \frac{1}{\beta^{\alpha}\, \Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta}$$
where $\Gamma(\cdot)$ represents the Gamma function, and $\Gamma(1) = 1$. The mean and variance are calculated with $E(x) = \alpha\beta$, $Var(x) = \alpha\beta^2$. It should be noted here that the exponential and $\chi^2$ distributions are special cases of the Gamma distribution. For example, when the shape parameter $\alpha$ is 1, the Gamma distribution is an exponential distribution with the parameter $1/\beta$.
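The exponential special case can be checked numerically. The following sketch evaluates the Gamma density in log space (via `math.lgamma`, for numerical stability) and compares it with an exponential density of rate $1/\beta$; the function names are illustrative.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta, for x > 0."""
    log_pdf = ((alpha - 1) * math.log(x) - x / beta
               - alpha * math.log(beta) - math.lgamma(alpha))
    return math.exp(log_pdf)


def gamma_moments(alpha, beta):
    """Mean and variance of the Gamma distribution."""
    return alpha * beta, alpha * beta ** 2


def exponential_pdf(x, rate):
    """Exponential density with the given rate, for x >= 0."""
    return rate * math.exp(-rate * x)
```

With $\alpha = 1$ and $\beta = 4$, `gamma_pdf(2.0, 1.0, 4.0)` and `exponential_pdf(2.0, 0.25)` agree to floating-point precision.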
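The Weibull distribution is defined as
$$f(x) = \frac{k}{\lambda} \left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}}$$
where $k > 0$, $\lambda > 0$ are the shape and scale parameters, respectively. The mean and variance are $E(x) = \lambda\, \Gamma\left(1 + \frac{1}{k}\right)$ and $Var(x) = \lambda^2 \left[\Gamma\left(1 + \frac{2}{k}\right) - \Gamma\left(1 + \frac{1}{k}\right)^2\right]$. The PDF of the 3-parameter version of the Burr distribution is
$$f(x) = \frac{ck}{\alpha} \left(\frac{x}{\alpha}\right)^{c-1} \left(1 + \left(\frac{x}{\alpha}\right)^{c}\right)^{-(k+1)}$$
where $\alpha > 0$ is a scaling parameter, $c > 0$ and $k > 0$ are shape parameters. The mean and variance are calculated by $E(x) = \alpha k\, \frac{\Gamma\left(k - \frac{1}{c}\right) \Gamma\left(1 + \frac{1}{c}\right)}{\Gamma(1 + k)}$ and $Var(x) = \alpha^2 k\, \frac{\Gamma\left(k - \frac{2}{c}\right) \Gamma\left(1 + \frac{2}{c}\right)}{\Gamma(1 + k)} - (E(x))^2$.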
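A small sketch of the Burr density and mean is given below. It also includes the standard Burr (type XII) CDF, $F(x) = 1 - (1 + (x/\alpha)^c)^{-k}$, which is not written out in the text but whose derivative is the PDF above. The mean only exists when $kc > 1$, so the function guards against other inputs; all names are illustrative.

```python
import math


def burr_pdf(x, c, k, alpha):
    """Burr (type XII) density with shapes c, k and scale alpha, for x > 0."""
    u = x / alpha
    return (c * k / alpha) * u ** (c - 1) * (1 + u ** c) ** (-(k + 1))


def burr_cdf(x, c, k, alpha):
    """Burr (type XII) cumulative distribution function."""
    return 1 - (1 + (x / alpha) ** c) ** (-k)


def burr_mean(c, k, alpha):
    """Mean of the Burr distribution; finite only when k * c > 1."""
    if k * c <= 1:
        raise ValueError("the mean is undefined unless k * c > 1")
    return (alpha * k * math.gamma(k - 1 / c) * math.gamma(1 + 1 / c)
            / math.gamma(1 + k))
```

As a sanity check, with $c = 1$, $k = 2$, $\alpha = 1$ the formula gives $2\,\Gamma(1)\Gamma(2)/\Gamma(3) = 1$.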
Rayleigh’s PDF has only one parameter to estimate, and this simplicity makes it popular for representing the distribution of trip metrics. It is a special case of the Weibull distribution in which the value of the shape parameter is 2. The PDF of the Rayleigh distribution with a parameter $\alpha$ is
$$f(x) = \frac{x}{\alpha^2}\, e^{-x^2/(2\alpha^2)}$$
The mean and variance are $E(x) = \sqrt{\frac{\pi}{2}}\, \alpha$ and $Var(x) = \frac{4 - \pi}{2}\, \alpha^2$, respectively. Following the literature, the parameters are estimated by maximum likelihood estimation (MLE); for the detailed derivation, see Clauset et al. [33].
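The relation between the Rayleigh and Weibull densities can be verified directly: a Rayleigh distribution with parameter $\alpha$ coincides with a Weibull distribution with $k = 2$ and $\lambda = \alpha\sqrt{2}$. The sketch also includes the closed-form Rayleigh MLE, $\hat{\alpha}^2 = \sum x_i^2 / (2n)$, as one example of an MLE that needs no numerical optimizer; the other candidates generally require iterative fitting, which is not shown here.

```python
import math


def weibull_pdf(x, k, lam):
    """Weibull density with shape k and scale lam, for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))


def rayleigh_pdf(x, alpha):
    """Rayleigh density with parameter alpha, for x > 0."""
    return x / alpha ** 2 * math.exp(-x ** 2 / (2 * alpha ** 2))


def rayleigh_mle(xs):
    """Closed-form maximum likelihood estimate of the Rayleigh parameter."""
    return math.sqrt(sum(x * x for x in xs) / (2 * len(xs)))
```

Evaluating `rayleigh_pdf(x, alpha)` and `weibull_pdf(x, 2, alpha * math.sqrt(2))` at any positive `x` returns the same value up to rounding.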

3.3. Model Selection

In order to evaluate how well a candidate distribution fits the actual data, two fundamentally different methods, null hypothesis tests and model selection criteria, are frequently used to select an appropriate model or the best model [34]. Theoretically, they can account for both model fit and complexity. The Kolmogorov–Smirnov (K–S) test is commonly employed to evaluate the goodness of fit of the candidate distributions [6,22,29,35]. However, the K–S test is better suited to small samples. When the sample is very large, the critical value for rejection is very small, and the test often rejects the null hypothesis. The K–S test may reject all the candidate distributions; it may also consider multiple distributions acceptable. Considering the large quantity of daily and hourly trips shown in Figure 1, the K–S test is not suitable for this situation.
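The sample-size sensitivity of the K–S test can be illustrated with the usual asymptotic critical value, $\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}/\sqrt{N}$ (about $1.36/\sqrt{N}$ at the 5% level). This is a sketch under the assumption of a fully specified null distribution; when the parameters are estimated from the same sample, the exact critical values differ, so the numbers are indicative only.

```python
import math


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: largest gap between empirical and model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the model CDF with the empirical CDF just after and before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_critical_value(n, alpha=0.05):
    """Asymptotic critical value of the K-S statistic for sample size n."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)
```

For the smallest and largest hourly samples in this study (192 and 8636 trips), the 5% critical values are roughly 0.098 and 0.015, so a deviation that is acceptable at night would be rejected in a busy daytime hour.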
In addition, the Akaike information criterion (AIC) can provide another decision-making method [15,16,19]. The AIC score is a function of its maximized log-likelihood ($\ln L_i$) and the number of estimated parameters ($K_i$) for each candidate model $i$, and is calculated by
$$AIC_i = -2 \ln L_i + 2 K_i$$
Generally, the model with the smallest AIC is preferred. In this study, the small-sample corrected AIC is not considered due to the large number of daily or hourly trips. The number of hourly trips varies from a few hundred to nearly ten thousand, so it is important to find the distribution applicable to different periods. Therefore, the Bayesian information criterion (BIC, also referred to as the Schwarz criterion, SC) is further used for model selection. The BIC is structurally similar to the AIC, but its penalty term depends on the sample size ($N$), and it tends to favor simpler models, particularly as the sample size increases.
$$BIC_i = -2 \ln L_i + K_i \ln N$$
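The two criteria differ only in the penalty term, as the following sketch shows. The function names and the numerical inputs in the usage note are illustrative and do not come from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion from a maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; the penalty grows with ln(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)
```

Since $\ln N > 2$ for every $N \geq 8$, the BIC penalizes each additional parameter more heavily than the AIC for any realistic hourly sample; with a log-likelihood of −100, two parameters, and $N = 1000$, the AIC is 204 while the BIC is about 213.8.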
The BIC is on a relative scale. The BIC difference $\Delta_i = BIC_i - BIC_{\min}$ (with $BIC_{\min} = \min_{i \in \{1, 2, \ldots, n\}} \{BIC_i\}$) allows for an immediate ranking of the $n$ candidate models [36]. The larger the BIC difference for a model, the less probable that it is the best model. More specifically, the Akaike weight $w_i$ [37] represents the normalization of the relative likelihood (i.e., $e^{-\Delta_i/2}$) of the models.
$$w_i = \frac{e^{-\Delta_i/2}}{\sum_{j=1}^{n} e^{-\Delta_j/2}}$$
The Akaike weights are very useful for assessing model selection uncertainty. The model with the largest Akaike weight should be selected as the best distribution.
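The weights can be computed from a list of BIC scores as sketched below. Subtracting the minimum before exponentiating is exactly the $\Delta_i$ of the definition and also keeps the exponentials from underflowing when the raw scores are large; the function name is illustrative.

```python
import math


def bic_weights(bic_scores):
    """Normalized relative likelihoods exp(-delta_i / 2) of candidate models."""
    best = min(bic_scores)
    relative = [math.exp(-(score - best) / 2.0) for score in bic_scores]
    total = sum(relative)
    return [r / total for r in relative]
```

The weights sum to one, two models with equal BIC each receive 0.5, and a model whose BIC exceeds the minimum by 10 receives less than 1% of the weight in a two-model comparison.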

3.4. Evaluation of Distribution Fitness

In order to show how close the theoretical and empirical frequency distributions are, accuracy is evaluated using the following statistical indicators: coefficient of determination ($R^2$), mean absolute error (MAE), mean absolute percentage error (MAPE), probability outside the predicted interval (POPI), probability outside the observed interval (POOI), and POPI-POOI-based criterion (PPC).
For many transportation applications, it is meaningful to construct an interval at a given confidence level from the fitting distribution. The accuracy of confidence interval represents the integrated accuracy of both the predicted mean and STD. In addition, percentiles are also an effective method of evaluating the accuracy of mean and STD. In this study, mean absolute error (MAE) and mean absolute percentage error (MAPE) of percentiles were extended as follows,
$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| pt_i^{obs} - pt_i^{fit} \right|}{pt_i^{obs}}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| pt_i^{obs} - pt_i^{fit} \right|$$
where $n$ is the number of percentiles, and $pt_i^{obs}$, $pt_i^{fit}$ are the percentiles obtained from the field survey (observed) and from the fitting distribution, respectively. Smaller $MAPE$ and $MAE$ indicate higher accuracy of the fitting distribution.
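The two percentile-based errors can be sketched as follows. The inputs are assumed to be two equally long lists of matching percentiles (for example the 10th, 50th, and 90th), and the example values in the usage note are made up for illustration.

```python
def percentile_errors(observed, fitted):
    """Return (MAPE in percent, MAE) between observed and fitted percentiles."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    n = len(observed)
    abs_err = [abs(o - f) for o, f in zip(observed, fitted)]
    mae = sum(abs_err) / n
    # Each absolute error is scaled by the observed percentile it refers to.
    mape = 100.0 * sum(e / o for e, o in zip(abs_err, observed)) / n
    return mape, mae
```

With observed percentiles `[2.0, 5.0, 10.0]` and fitted percentiles `[2.2, 5.0, 9.0]`, the MAE is 0.4 and the MAPE is about 6.67%.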
Two metrics were adopted to evaluate the accuracy of estimated or predicted distribution: probability outside the predicted interval (POPI) and probability outside the observed interval (POOI) [38,39]. The POPI measures the percentage of observed data outside the predicted confidence interval, while the POOI measures the percentage of predicted distribution outside the observed confidence interval. Let $l_{fit} = \Phi_{fit}^{-1}(\alpha/2)$ and $u_{fit} = \Phi_{fit}^{-1}(1 - \alpha/2)$ be the lower and upper bounds of the fitting distribution, respectively, at confidence level $1 - \alpha$, where $\Phi_{fit}^{-1}(\cdot)$ is the inverse cumulative distribution function (CDF) of the fitting distribution. Mathematically, POPI is defined as
$$POPI = \left(1 - \sum_{i=1}^{N} c_i / N\right) \times 100\%$$
where $c_i = 1$ if sample $i \in [l_{fit}, u_{fit}]$, otherwise $c_i = 0$. The POPI value ranges from 0 to 1. The smaller POPI indicates capture of larger proportion of observed data, i.e., higher accuracy of the fitting distribution. As noted by Shi et al. [38,39], this POPI metric is very useful, but tends to exhibit bias for situations of wide fitting intervals due to large STD errors.
As an alternative, the POOI metric is the percentage of the predicted distribution outside the observed interval. Let $l_{obs}$ and $u_{obs}$ denote the lower and upper bounds of the observed interval, respectively, at confidence level $1 - \alpha$. Then
$$POOI = \left(1 - \left(\Phi_{fit}(u_{obs}) - \Phi_{fit}(l_{obs})\right)\right) \times 100\%$$
where $\Phi_{fit}(\cdot)$ denotes the CDF of the fitting distribution. POOI also ranges over [0, 1], and a larger POOI indicates lower fitting interval accuracy, because a larger proportion is outside the observed interval. Therefore, the POPI and POOI metrics are complementary for evaluating the accuracy of the fitting distribution. The closer the two are to $\alpha$, the better the model fit. The bigger the POPI, the smaller the corresponding POOI. Therefore, a measure is required to balance the deviation of POPI and POOI from $\alpha$. With regard to this argument, the following POPI-POOI-based criterion (PPC) is proposed for the comprehensive evaluation of POPI and POOI:
$$PPC = \frac{1}{2} \left( \left| POPI - \alpha \right| + \left| POOI - \alpha \right| \right) \times 100\%$$
Generally, a smaller PPC indicates a better constructed confidence interval (and, simultaneously, a better model fit).
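The three interval-based indicators can be sketched as plain functions. Values are returned as fractions in [0, 1] rather than percentages, and the caller is assumed to supply the interval bounds and the fitted CDF; how the observed bounds are obtained (for example from empirical quantiles) is left open here.

```python
def popi(sample, l_fit, u_fit):
    """Fraction of observations falling outside the fitted interval."""
    inside = sum(1 for x in sample if l_fit <= x <= u_fit)
    return 1.0 - inside / len(sample)


def pooi(cdf_fit, l_obs, u_obs):
    """Fitted probability mass lying outside the observed interval."""
    return 1.0 - (cdf_fit(u_obs) - cdf_fit(l_obs))


def ppc(popi_value, pooi_value, alpha):
    """Mean absolute deviation of POPI and POOI from the nominal alpha."""
    return 0.5 * (abs(popi_value - alpha) + abs(pooi_value - alpha))
```

If both POPI and POOI equal $\alpha$, the PPC is zero; for example, a POPI of 0.2 and a POOI of 0.1 at $\alpha = 0.1$ give a PPC of 0.05.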

4. Results of the Best-Fit Distribution

This section reports the fitting results of the trip metrics using the candidate skewed distributions. The best-fit distributions of daily trip metrics are first shown. Then, we further analyze the best-fit results of hourly trip metrics, such as travel distance, travel time, and travel speed. Finally, we attempt to identify a general distribution for each trip metric, and to estimate its parameters.

4.1. Best-Fit Distribution of Daily Trip Metrics

Figure 4 shows the best-fit distributions of daily trip metrics for 61 days. Overall, only two of the candidate distributions, Gamma and Burr distributions (represented by blue and red), are suitable for fitting these trip metrics. Gamma distribution performs best for travel distance, and can uniformly fit all daily data, as shown in blue in Figure 4. However, Gamma distribution does not fit all travel time or speed data well. In total, 47.54% (29 out of 61) of travel time data and 63.93% (39 out of 61) of travel speed data follow the Burr distribution (depicted in red).
The fitting results for weekdays and weekends in Figure 4 are distinguished by two different markers (dot and star). For travel distance, weekday and weekend data can be well fitted with the same distribution. However, for weekend travel time data, a third follow the Gamma distribution and the rest follow the Burr distribution, while the opposite holds for weekend speed data. Meanwhile, only 37.21% (16 out of 43) of weekday travel speed data follow the Gamma distribution, a share that further decreases to 13.64% (3 out of 22) on November weekdays.
The above analysis also shows that, due to uniform and simple fitting distribution, travel distance is more straightforward for analyzing residents’ mobility patterns. In comparison, travel time and speed are relatively complex metrics due to uncertain distribution types.
In addition, the mean BIC weights of travel distance, travel time, and travel speed are 1, 0.9996, and 0.9960, respectively, which indicates low uncertainty in the fitting results. The smallest BIC weight occurs in the travel speed data on October 13, and the fitting distributions and the detailed parameters are shown in Figure 5. It can be seen from the figure that the Burr distribution is the most consistent with the speed data, followed by the Gamma distribution. It is noteworthy that the Burr distribution is more complex than the Gamma distribution. A more complex model is considerably more likely to achieve a better fit than a simpler one, so model fit and complexity should be considered together.
Table 2 shows more parameters in the model selection. The observed mean and standard deviation (STD) are very similar to several of the estimates. The commonly used evaluation indices, such as the mean absolute percentage error (MAPE) and the root mean square error (RMSE), also struggle to identify the dominant distribution. Furthermore, the log-likelihood of the Burr distribution is slightly larger than that of the Gamma distribution, which shows quantitatively that the Burr distribution has a better model fit. When the BIC takes model complexity into account, the gap narrows to 2 ($\Delta = 499{,}118 - 499{,}116 = 2$), which means that the benefit of improved model fit outweighs the cost of added model complexity. This tiny advantage is clearly distinguished in the BIC weight, which makes the model selection more explicit. It can be determined that the best-fit model is the Burr distribution.
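To see how a BIC difference of 2 translates into a weight, consider a simplified two-model comparison. This is an illustrative calculation only: it assumes that the Burr ($\Delta = 0$) and Gamma ($\Delta = 2$) distributions carry essentially all of the weight and that the remaining candidates contribute negligibly, which is not confirmed by the values quoted here.

```latex
w_{\text{Burr}} = \frac{e^{-0/2}}{e^{-0/2} + e^{-2/2}}
                = \frac{1}{1 + e^{-1}} \approx 0.731,
\qquad
w_{\text{Gamma}} = \frac{e^{-1}}{1 + e^{-1}} \approx 0.269
```

Under this assumption the Burr distribution is favored by a ratio of about $e \approx 2.7$ to 1, a clear but not overwhelming preference.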

4.2. Best-Fit Distribution of Hourly Travel Distance

In order to further investigate the hourly distributions of trip metrics, the study period is discretized into 1464 (24 × 61) 1 h intervals. Figure 6 shows the best-fit distributions of hourly travel distance. The best-fit distributions are distinguished by five different markers, of which the Gamma distribution accounts for 93.10%, far higher than the other four distributions. Between 07:00 and 24:00, the proportion of the Gamma distribution rises to 99.32%, which further demonstrates the advantage of the Gamma distribution in fitting travel distance data. Between 00:00 and 07:00, the optimal distribution varies with hour and day, with no clear pattern among the markers. Only 77.99% of the data still follow the Gamma distribution, while 14.52% follow the Weibull distribution, 6.79% the Rayleigh distribution, and less than 1% the lognormal distribution. This may be due to the small sample size at night.
In addition, the BIC weights represented in different colors range from 0.38 to 1. The darker the color, the smaller the BIC weight. It can be seen from Table 3 that most of the BIC weights between 07:00 and 24:00 are very close to 1. Their mean BIC weight is 0.9986, indicating high reliability in the model selection. However, the mean BIC weight from 00:00 to 07:00 is 0.8540, which indicates that there is some uncertainty in the fitting results. More specifically, the mean BIC weights of the lognormal, Weibull, and Rayleigh distributions are 0.8362, 0.7094, and 0.7210, respectively. The high uncertainty may be caused by the small sample size, because fewer people travel at night. Meanwhile, lower weights also mean that the best and second-best fitting distributions may both be applicable to the data. In conclusion, the Gamma distribution may also be applicable to all the travel distance data.

4.3. Best-Fit Distribution of Hourly Travel Time

Figure 7 shows the best-fit distributions of hourly travel time, which are distinguished by different markers. As shown by the black circles and red squares, the Gamma and Burr distributions are the dominant distribution types, with proportions of 76.23% and 20.56%, respectively. In Table 4, the larger BIC weights (0.9430 and 0.8982) of these two distributions show a significant advantage over the other three distributions when fitting 96.79% of the data. Meanwhile, the fitting results have higher reliability.
On the other hand, the other three distributions, shown by blue dots, green triangles, and red stars in Figure 7, account for less than 4%, and appear mainly at night (02:00–07:00). At the same time, their average BIC weights are only 0.7515, 0.6296, and 0.7853, respectively. These lower BIC weights indicate higher uncertainty for the lognormal, Weibull, and Rayleigh distributions in fitting the data, so they are likely to be replaceable by the Gamma or Burr distribution.
During the National Day holiday, 82.74% of travel time data follow Gamma distribution with a mean BIC weight of 0.9541. However, only 13.69% obey the Burr distribution, with an average BIC weight of 0.9352. This means that travel time data for the National Day holiday are more inclined to the Gamma distribution than the Burr distribution.

4.4. Best-Fit Distribution of Hourly Travel Speed

Figure 8 shows the fitting results of hourly travel speed. The best-fit distributions are distinguished by rhombus, circle, square, and star markers. Overall, the Burr distribution accounts for 85.11% (1246 out of 1464) of the best-fit distributions, with a mean BIC weight of 0.9830, showing a clear advantage and high reliability. The Lognormal and Gamma distributions also have high BIC weights but much lower shares. In contrast, only six 1 h intervals (0.41%) are best fitted by the Weibull distribution, with an average BIC weight of 0.7007, which suggests that the Weibull distribution is not suitable for speed data.
In addition, some clustering can be found among the non-dominant best-fit distributions. For example, 77.94% (53 out of 68) of the Lognormal fits occur during the daily evening peak, with a mean BIC weight of 0.9986. About 40% of the Gamma fits occur during the National Day holiday, with a mean BIC weight of 0.9246. In conclusion, the Burr distribution is dominant in fitting travel speed data.
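The selection procedure behind these shares (fit each candidate to one 1 h interval, then keep the one with the lowest BIC) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mapping of the five candidates to SciPy distributions is an assumption (in particular, Burr is taken to be Burr Type XII, `scipy.stats.burr12`), the location parameter is fixed at zero, and the speed sample is synthetic.

```python
import numpy as np
from scipy import stats

# Candidate right-skewed families; the SciPy mapping is an assumption.
CANDIDATES = {
    "Lognormal": stats.lognorm,
    "Gamma": stats.gamma,
    "Weibull": stats.weibull_min,
    "Burr": stats.burr12,  # assumes Burr Type XII
    "Rayleigh": stats.rayleigh,
}


def fit_and_score(sample):
    """Fit every candidate by maximum likelihood and return {name: BIC}."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    scores = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(sample, floc=0)  # location fixed at zero
        loglik = np.sum(dist.logpdf(sample, *params))
        k = len(params) - 1  # the fixed location is not a free parameter
        scores[name] = k * np.log(n) - 2.0 * loglik
    return scores


# Synthetic hourly speeds (km/h); NOT the Xi'an data.
rng = np.random.default_rng(0)
speeds = rng.gamma(shape=7.0, scale=2.8, size=2000)

bic = fit_and_score(speeds)
best = min(bic, key=bic.get)
for name, value in sorted(bic.items(), key=lambda item: item[1]):
    print(f"{name:10s} BIC = {value:.1f}")
print("best-fit distribution:", best)
```

Repeating this over all 1464 hourly intervals and counting how often each name is returned as `best` would yield shares of the kind reported for Figure 8.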

4.5. General Distribution Selection

Based on the above analysis, the Gamma distribution is dominant in fitting travel distance and travel time, while the Burr distribution is more suitable for travel speed. Trips following other distributions make up only a very small portion of the total. This raises the question of whether all the data for each metric can be fitted with its dominant distribution and, if so, how much worse the fit will be. To answer this question, the K–S test, the BIC difference, the mean absolute error (MAE), and the mean absolute percentage error (MAPE) are analyzed.
As shown in Table 5, for the three trip metrics, 101, 348, and 218 out of 1464 1 h intervals, respectively, have their best-fit distribution replaced by the Gamma or Burr distribution. For travel distance, the K–S test accepts both the alternative distribution (Gamma) and the best-fit distribution for about 90% of these intervals (93 and 90 out of 101, respectively). The alternative distribution performs better for travel time, because more intervals (76%) pass the K–S test. For travel speed, however, the opposite is true, which needs to be further explained by other indicators. Moreover, selection of the more complex model indicates that the benefit of improved model fit outweighs the cost of added model complexity.
For the first two trip metrics, the BIC difference between the alternative distribution and the best-fit distribution is relatively small, less than 10 in both cases. The BIC difference for travel speed is larger, possibly because the BIC values differ in magnitude between the metrics. The ratio $\Delta / \mathrm{BIC}_{\min}$ of the BIC difference to the BIC of the best-fit distribution is less than 0.5%, which also indicates that there is no significant difference between the two distributions from the perspective of BIC-based model selection. In addition, the MAPE and MAE between the fitted distribution and the sample distribution are calculated at the 10th, 50th, and 90th percentiles, so that both the mean and the variance are taken into account. A MAPE of less than 4% further demonstrates the feasibility of fitting all data with a dominant distribution.
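The percentile-based error indicators and the K–S check can be illustrated with a small sketch. It assumes that MAE and MAPE are averaged over the absolute differences between fitted and sample values at the 10th, 50th, and 90th percentiles, which is one plausible reading of the description above. The helper name `percentile_errors` and the travel time sample are hypothetical.

```python
import numpy as np
from scipy import stats


def percentile_errors(sample, frozen_dist, percentiles=(10, 50, 90)):
    """Return (MAE, MAPE in %) between fitted and sample percentiles."""
    observed = np.percentile(sample, percentiles)
    fitted = frozen_dist.ppf(np.asarray(percentiles) / 100.0)
    abs_err = np.abs(fitted - observed)
    mae = abs_err.mean()
    mape = (abs_err / observed).mean() * 100.0
    return mae, mape


# Synthetic travel times in minutes; NOT the Xi'an data.
rng = np.random.default_rng(1)
times = rng.gamma(shape=3.0, scale=6.0, size=1500)

# Fit the dominant (alternative) distribution with location fixed at zero.
shape, loc, scale = stats.gamma.fit(times, floc=0)
fitted_gamma = stats.gamma(shape, loc=loc, scale=scale)

mae, mape = percentile_errors(times, fitted_gamma)
ks = stats.kstest(times, fitted_gamma.cdf)

print(f"MAE = {mae:.3f} min, MAPE = {mape:.2f}%")
print(f"K-S statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}")
```

Note that a K–S p-value computed against a distribution whose parameters were estimated from the same sample tends to be optimistic, so it is better read as a relative indicator than as an exact test.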

5. Discussion

Based on the analysis in the previous section, the direct statistics of hourly trips (i.e., travel distance and travel time) can be described by the Gamma distribution, while the indirect statistic (i.e., travel speed) can be described by the Burr distribution. Table 6 lists the average indicators of the trip metrics, which reflect the performance of the Gamma and Burr distributions. The larger the R² is, the better the goodness-of-fit; conversely, smaller MAE, MAPE, and PPC values also indicate a better goodness-of-fit. Three key percentiles, the 10th, 50th, and 90th, were adopted to calculate the MAE and MAPE. A confidence level of 80% (i.e., $\alpha = 0.2$, $90\% - 10\% = 80\%$) was adopted to construct the confidence interval.
According to Table 6, the mean R² values of the three trip metrics exceed 0.98 and reach 0.9915 for travel time. This shows that the fitted distributions explain the data well. The MAPE and MAE indicators support this: the MAE is less than 0.1 km for travel distance, lower than 0.2 min for travel time, and about 0.21 km/h for travel speed. MAPEs of less than 3% further illustrate that the fitted distributions are quite consistent with the observed data.
The mean POPIs are slightly worse than the target (20%), which indicates that less than 80% of the observed data are covered by the fitted confidence interval. A higher POPI means a lower POOI. As mentioned earlier, the fitted distributions work best when both POPI and POOI are close to the target value (20%). The PPC values of 4.31%, 2.54%, and 3.11%, respectively, show a low degree of deviation of POPI and POOI from the target, which also indicates that the fitted distributions of the three trip metrics have good accuracy. In summary, these results indicate that the Gamma distribution fits the direct trip metrics (travel distance and travel time) well, while the Burr distribution fits travel speed better.
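The idea of checking an 80% interval against the observations can be sketched as below. This is only an illustration of interval coverage under stated assumptions: it takes the interval between the 10th and 90th percentiles of the fitted distribution and reports the share of observations inside and outside it. It does not reproduce the POPI, POOI, or PPC definitions, which are given in an earlier section of the paper. The helper name `interval_coverage` and the distance sample are hypothetical.

```python
import numpy as np
from scipy import stats


def interval_coverage(sample, frozen_dist, alpha=0.2):
    """Return (lower, upper, share inside, share outside) for the
    central (1 - alpha) interval of a fitted distribution."""
    lower = frozen_dist.ppf(alpha / 2.0)        # 10th percentile for alpha = 0.2
    upper = frozen_dist.ppf(1.0 - alpha / 2.0)  # 90th percentile for alpha = 0.2
    sample = np.asarray(sample, dtype=float)
    inside = np.mean((sample >= lower) & (sample <= upper))
    return lower, upper, inside, 1.0 - inside


# Synthetic travel distances in km; NOT the Xi'an data.
rng = np.random.default_rng(2)
distances = rng.gamma(shape=2.5, scale=2.0, size=2000)

shape, loc, scale = stats.gamma.fit(distances, floc=0)
fitted_gamma = stats.gamma(shape, loc=loc, scale=scale)

lower, upper, inside, outside = interval_coverage(distances, fitted_gamma)
print(f"80% interval: [{lower:.2f}, {upper:.2f}] km")
print(f"inside = {inside:.2%}, outside = {outside:.2%} (target outside: 20%)")
```

For a well-fitted distribution the share outside the interval should be close to the 20% target, which is the same logic the text applies to the indicators in Table 6.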

6. Conclusions

This study models the distributions of human mobility metrics based on actual trajectory datasets, including about 18,000 online car-hailing rides, collected in Xi’an, China. Three trip metrics (travel distance, travel time, and travel speed) are highlighted in order to establish a base for human mobility research. The results provide several new insights into human mobility.
First, based on online car-hailing trajectory data, the mobility metrics tend to follow right-skewed distributions rather than the normal distribution. By analyzing the daily and hourly trip data, five of the most widespread right-skewed distributions in the scientific literature (i.e., Lognormal, Gamma, Weibull, Burr, and Rayleigh) were selected as candidate distributions. By leveraging the Bayesian information criterion (BIC), we analyzed the goodness of fit and complexity of the candidate distributions for each metric, thus obtaining the best-fitting distribution and suitable parameters. The empirical results based on the online car-hailing trajectory dataset in Xi’an, China, provide strong evidence that the mobility metrics follow right-skewed distributions.
Second, the distribution types of the mobility metrics vary with day of week and time of day, which means that a single distribution cannot fit all the daily and hourly data equally well. For the daily data, the Gamma distribution performs best among all candidate distributions for travel distance and can uniformly fit all daily data, whereas the Gamma or Burr distribution achieves a good fit for only part of the daily travel time or speed data. For the hourly data, the best-fit distribution varies among the candidates, especially at night. The Gamma distribution most often performs better than the other four distributions for both travel distance and travel time, while the Burr distribution performs best for travel speed.
Third, although uncertain distribution types exist in the daily and hourly data, a dominant distribution exists for each mobility metric. For example, the Gamma distribution can fit more than 90% of the hourly travel distance data, and the Burr distribution can fit 85% of the hourly travel speed data. Further analysis shows that it is feasible to fit all hourly data of each metric with its dominant distribution.
It is expected that the findings from this study can promote understanding of intra-urban human mobility and lay a solid foundation for human mobility research. However, we note several limitations of this research. Firstly, the candidate distributions are limited to five commonly used skewed distributions, and more may need to be considered. Secondly, only the distributions of daily and hourly trip data are fitted and analyzed; finer-grained intervals (e.g., 30 min, 15 min) are also of interest and need further study. Last but not least, the distributions may vary across datasets, so multi-source data need to be taken into account to confirm the conclusions.

Author Contributions

Conceptualization, Chaoyang Shi and Qingquan Li; methodology, Chaoyang Shi and Shiwei Lu; software, Chaoyang Shi; formal analysis, Shiwei Lu and Xiping Yang; writing—original draft preparation, Chaoyang Shi; writing—review and editing, Qingquan Li and Xiping Yang. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded in part by the National Natural Science Foundation of China under Grant 41901390 and 41901392, in part by the Fundamental Research Funds for the Central Universities under Grant 2019kfyXJJS142 (HUST), in part by the Natural Science Foundation of Hubei Province under Grant 2019CFB098, in part by Open Research Fund of State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University (No. 19S03), and in part by Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources KF-2020-05-005.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://gaia.didichuxing.com.

Acknowledgments

The authors are grateful to Didi Chuxing for providing the data used in this paper. The authors would like to thank the anonymous reviewers and editors for providing valuable comments and suggestions, which helped improve the manuscript greatly.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pan, G.; Qi, G.D.; Wu, Z.; Zhang, D.Q.; Li, S.J. Land-use classification using taxi GPS traces. IEEE Trans. Intell. Transp. Syst. 2013, 14, 113–123. [Google Scholar] [CrossRef]
  2. Soliman, A.; Soltani, K.; Yin, J.J.; Padmanabhan, A.; Wang, S.W. Social sensing of urban land use based on analysis of twitter users’ mobility patterns. PLoS ONE 2017, 12, e0181657. [Google Scholar] [CrossRef]
  3. Jiang, B.; Yin, J.J.; Zhao, S.J. Characterizing the human mobility pattern in a large street network. Phys. Rev. E 2009, 80, 021136. [Google Scholar] [CrossRef]
  4. Paolo, B.; Chiara, P.; Ramasco, J.J.; Michele, T.; Vittoria, C.; Alessandro, V. Human mobility networks, travel restrictions, and the global spread of 2009 H1N1 pandemic. PLoS ONE 2011, 6, e16591. [Google Scholar]
  5. Wesolowski, A.; Eagle, N.; Tatem, A.J.; Smith, D.L.; Noor, A.M.; Snow, R.W.; Buckee, C.O. Quantifying the impact of human mobility on malaria. Science 2012, 338, 267–304. [Google Scholar] [CrossRef] [PubMed]
  6. Song, H.Y.; Lee, J.S. Finding a simple probability distribution for human mobile speed. Pervasive Mob. Comput. 2016, 25, 26–47. [Google Scholar] [CrossRef]
  7. Brockmann, D.; Hufnagel, L.; Geisel, T. The Scaling Laws of Human Travel. Nature 2006, 439, 462–465. [Google Scholar] [CrossRef]
  8. Rhee, I.; Shin, M.; Hong, S.; Lee, K.; Kim, S.J.; Chong, S. On the Levy-Walk Nature of Human Mobility. IEEE ACM Trans. Netw. 2011, 19, 630–643. [Google Scholar] [CrossRef]
  9. González, M.C.; Hidalgo, C.A.; Barabási, A.L. Understanding individual human mobility patterns. Nature 2008, 453, 779–782. [Google Scholar] [CrossRef]
  10. Zhao, K.; Musolesi, M.; Hui, P.; Rao, W.; Tarkoma, S. Explaining the power-law distribution of human mobility through transportation modality decomposition. Sci. Rep. 2014, 5, 9136. [Google Scholar] [CrossRef] [PubMed]
  11. Calabrese, F.; Diao, M.; Lorenzo, G.D.; Ferreira, J.; Ratti, C. Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transp. Res. Part C Emerg. Technol. 2013, 26, 301–313. [Google Scholar] [CrossRef]
  12. Jia, T.; Jiang, B.; Carling, K.; Bolin, M.; Ban, Y. An empirical study on human mobility and its agent-based modeling. J. Stat. Mech. Theory Exp. 2012, 11, P11024. [Google Scholar] [CrossRef]
  13. Cho, E.; Myers, S.A.; Leskovec, J. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; p. 1082. [Google Scholar]
  14. Kang, C.G.; Ma, X.J.; Tong, D.Q.; Liu, Y. Intra-urban human mobility patterns: An urban morphology perspective. Phys. A Stat. Mech. Appl. 2012, 391, 1702–1717. [Google Scholar] [CrossRef]
  15. Jiang, S.X.; Guan, W.; Zhang, W.Y.; Chen, X.; Yang, L. Human mobility in space from three modes of public transportation. Phys. A Stat. Mech. Appl. 2017, 483, 227–238. [Google Scholar] [CrossRef]
  16. Liang, X.; Zheng, X.D.; Lv, W.F.; Zhu, T.Y.; Xu, K. The scaling of human mobility by taxis is exponential. Phys. A Stat. Mech. Appl. 2012, 391, 2135–2144. [Google Scholar] [CrossRef]
  17. Liang, X.; Zhao, J.; Dong, L.; Xu, K. Unraveling the origin of exponential law in intra-urban human mobility. Sci. Rep. 2013, 3, 2983. [Google Scholar] [CrossRef]
  18. Bazzani, A.; Giorgini, B.; Rambaldi, S.; Gallotti, R.; Giovannini, L. Statistical laws in urban mobility from microscopic GPS data in the area of Florence. J. Stat. Mech. Theory Exp. 2010, 2010, P05001. [Google Scholar] [CrossRef]
  19. Wang, W.J.; Pan, L.; Yuan, N.; Zhang, S.; Liu, D. A comparative analysis of intra-city human mobility by taxi. Phys. A Stat. Mech. Appl. 2015, 420, 134–147. [Google Scholar] [CrossRef]
  20. Csáji, B.C.; Browet, A.; Traag, V.; Delvenne, J.C.; Huens, E.; Van Dooren, P.M.; Smoreda, Z.; Blondel, V.D. Exploring the mobility of mobile phone users. Phys. A Stat. Mech. Appl. 2013, 392, 1459–1473. [Google Scholar] [CrossRef]
  21. Zhang, S.; Tang, J.J.; Wang, H.X.; Wang, Y.H.; An, S. Revealing intra-urban travel patterns and service ranges from taxi trajectories. J. Transp. Geogr. 2017, 61, 72–86. [Google Scholar] [CrossRef]
  22. Kou, Z.Y.; Cai, H. Understanding bike sharing travel patterns: An analysis of trip data from eight cities. Phys. A Stat. Mech. Appl. 2019, 515, 785–797. [Google Scholar] [CrossRef]
  23. Plötz, P.; Jakobsson, N.; Sprei, F. On the distribution of individual daily driving distances. Transp. Res. Part B Methodol. 2017, 101, 213–227. [Google Scholar] [CrossRef]
  24. Zheng, Z.; Rasouli, S.; Timmermans, H. Two-regime pattern in human mobility: Evidence from GPS taxi trajectory data. Geogr. Anal. 2016, 48, 157–175. [Google Scholar] [CrossRef]
  25. Chen, B.Y.; Yuan, H.; Li, Q.Q.; Lam, W.H.K.; Shaw, S.L. Map-matching algorithm for large-scale low-frequency floating car data. Int. J. Geogr. Inf. Sci. 2014, 28, 22–38. [Google Scholar] [CrossRef]
  26. Krause, C.M.; Zhang, L. Short-term travel behavior prediction with GPS, land use, and point of interest data. Transp. Res. Part B Methodol. 2019, 123, 349–361. [Google Scholar] [CrossRef]
  27. Xia, F.; Wang, J.Z.; Kong, X.J.; Wang, Z.B.; Li, J.X.; Liu, C.F. Exploring human mobility patterns in urban scenarios: A trajectory data perspective. IEEE Commun. Mag. 2018, 56, 142–149. [Google Scholar] [CrossRef]
  28. Zhang, B.; Chen, S.Y.; Ma, Y.F.; Li, T.Z.; Tang, K. Analysis on spatiotemporal urban mobility based on online car-hailing data. J. Transp. Geogr. 2020, 82, 102568. [Google Scholar] [CrossRef]
  29. Lin, M.; Hsu, W.J. Mining GPS data for mobility patterns: A survey. Pervasive Mob. Comput. 2014, 12, 1–16. [Google Scholar] [CrossRef]
  30. Liu, Y.; Kang, C.G.; Gao, S.; Xiao, Y.; Tian, Y. Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 2012, 14, 463–483. [Google Scholar] [CrossRef]
  31. Tang, J.J.; Liu, F.; Wang, Y.H.; Wang, H. Uncovering urban human mobility from large scale taxi GPS data. Phys. A Stat. Mech. Appl. 2015, 438, 140–153. [Google Scholar] [CrossRef]
  32. Taylor, M.A.P. Fosgerau’s travel time reliability ratio and the burr distribution. Transp. Res. Part B Methodol. 2017, 97, 50–63. [Google Scholar] [CrossRef]
  33. Clauset, A.; Shalizi, C.R.; Newman, M.E.J. Power-law distributions in empirical data. SIAM Rev. 2007, 51, 661–703. [Google Scholar] [CrossRef]
  34. Johnson, J.; Omland, K. Model selection in ecology and evolution. Trends Ecol. Evol. 2004, 19, 101–108. [Google Scholar] [CrossRef]
  35. Zhang, K.P.; Ning, J.; Zheng, L.; Liu, Z.J. A novel generative adversarial network for estimation of trip travel time distribution with trajectory data. Transp. Res. Part C Emerg. Technol. 2019, 108, 223–244. [Google Scholar] [CrossRef]
  36. Posada, D.; Buckley, T.R. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst. Biol. 2004, 53, 793–808. [Google Scholar] [CrossRef] [PubMed]
  37. Akaike, H. Information measures and model selection. Int. Stat. Inst. 1983, 50, 277–291. [Google Scholar]
  38. Shi, C.Y.; Chen, B.Y.; Lam, W.H.K.; Li, Q.Q. Heterogeneous data fusion method to estimate travel time distributions in congested road networks. Sensors 2017, 17, 2822. [Google Scholar] [CrossRef]
  39. Shi, C.Y.; Chen, B.Y.; Li, Q.Q. Estimation of travel time distributions in urban road networks using low-frequency floating car data. ISPRS Int. Geo Inf. 2017, 6, 253. [Google Scholar] [CrossRef]
Figure 1. Distribution of hourly trip quantity.
Figure 2. Distribution of trip metrics on Thursday, November 3, 2016: (a) distribution of travel distance; (b) distribution of travel time; and (c) distribution of travel speed.
Figure 3. Daily variation of skewness and kurtosis for 61 days: (a) skewness; and (b) kurtosis
Figure 4. Best-fit distribution of daily trip metrics for 61 days.
Figure 5. Fitted distributions for the observed travel speed data on October 13.
Figure 6. The fitting distributions of hourly travel distance.
Figure 7. The fitting distributions of hourly travel time.
Figure 8. The fitting distributions of hourly travel speed.
Table 1. Data section after processing.
Vehicle IDOrder IDTimestampX-CoordinateY-CoordinateLink IDRelative Position
10000012000000125,414491,832.853,787,627.44429150.3116
10000012000000125,417491,848.403,787,654.13429150.8538
10000012000000125,420491,876.343,787,700.57205580.5930
Table 2. Model selection for travel speed data on October 13 ($\mu = 19.78$, $\sigma = 7.48$ km/h).
| Distribution | Log-Likelihood ($L_i$) | BIC | BIC weight | Mean | STD |
|---|---|---|---|---|---|
| Lognormal | −249,789 | 499,601 | 0 | 19.83 | 7.83 |
| Gamma | −249,548 | 499,118 | 0.2412 | 19.78 | 7.34 |
| Weibull | −253,174 | 506,369 | 0 | 19.78 | 7.74 |
| Burr | −249,541 | 499,116 | 0.7588 | 19.81 | 7.73 |
| Rayleigh | −259,304 | 518,620 | 0 | 18.75 | 9.80 |
Table 3. Statistics of best-fit distributions for travel distance.
| Period | Index | Mean | Lognormal | Gamma | Weibull | Burr | Rayleigh |
|---|---|---|---|---|---|---|---|
| 00:00–24:00 | Percentage | | 0.20% | 93.10% | 4.44% | 0.27% | 1.98% |
| | BIC weight | 0.9564 | 0.8362 | 0.9736 | 0.7207 | 0.7281 | 0.7210 |
| 00:00–07:00 | Percentage | | 0.70% | 77.99% | 14.52% | 0 | 6.79% |
| | BIC weight | 0.8540 | 0.8362 | 0.8927 | 0.7094 | 0 | 0.7210 |
| 07:00–24:00 | Percentage | | 0 | 99.32% | 0.29% | 0.39% | 0 |
| | BIC weight | 0.9986 | 0 | 0.9998 | 0.9539 | 0.7281 | 0 |
Table 4. Statistics of best-fit distributions for travel time.
| Period | Index | Mean | Lognormal | Gamma | Weibull | Burr | Rayleigh |
|---|---|---|---|---|---|---|---|
| 00:00–24:00 | Percentage | | 0.68% | 76.23% | 0.55% | 20.56% | 1.98% |
| | BIC weight | 0.9283 | 0.7515 | 0.9430 | 0.7446 | 0.8982 | 0.7853 |
| 02:00–07:00 | Percentage | | 2.34% | 71.43% | 1.17% | 18.27% | 6.79% |
| | BIC weight | 0.8525 | 0.7515 | 0.8800 | 0.6296 | 0.8669 | 0.7853 |
| National Day holiday | Percentage | | 0 | 82.74% | 0.60% | 13.69% | 2.98% |
| | BIC weight | 0.9459 | 0 | 0.9541 | 0.5342 | 0.9352 | 0.7854 |
Table 5. Fitting indicators of alternative distributions for three trip metrics.
| Indicators | | Travel Distance | Travel Time | Travel Speed |
|---|---|---|---|---|
| Alternative distribution | Type | Gamma | Gamma | Burr |
| | Number | 101 | 348 | 218 |
| K–S test | Best-fit | 90 (89%) | 46 (13%) | 137 (63%) |
| | Alternative | 93 (92%) | 264 (76%) | 0 |
| The BIC difference | $\mathrm{BIC}_{\mathrm{alternative}} - \mathrm{BIC}_{\min}$ | 4.92 | 9.88 | 47.13 |
| | $\Delta / \mathrm{BIC}_{\min}$ | 0.20% | 0.10% | 0.15% |
| MAPE (10th, 50th, 90th) | | 2.35% | 2.51% | 0.96% |
| MAE (10th, 50th, 90th) | | 0.10 | 0.24 | 0.17 |
Table 6. Evaluation of general distribution for three trip metrics.
| Indicators | Travel Distance | Travel Time | Travel Speed |
|---|---|---|---|
| General distribution | Gamma | Gamma | Burr |
| R² | 0.9859 | 0.9915 | 0.9864 |
| MAPE (10th, 50th, 90th) | 1.66% | 2.15% | 0.89% |
| MAE (10th, 50th, 90th) | 0.0537 | 0.1887 | 0.2097 |
| POPI | 20.85% | 20.00% | 20.12% |
| POOI | 19.21% | 19.95% | 19.86% |
| PPC | 4.31% | 2.54% | 3.11% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.