Travel Characteristics Identification Method for Expressway Passenger Cars Based on Electronic Toll Collection Data

Cai, Xiaoyu; Zhang, Yihan; Zhang, Xin; Peng, Bo

doi:10.3390/su151511619


by Xiaoyu Cai ¹,²,*, Yihan Zhang ³, Xin Zhang ⁴ and Bo Peng ¹,²

¹ College of Smart City, Chongqing Jiaotong University, Chongqing 400074, China

² Chongqing Key Laboratory of Mountain City Traffic System and Safety, Chongqing 400074, China

³ College of Traffic & Transportation, Chongqing Jiaotong University, Chongqing 400074, China

⁴ Cmcu Engineering Co., Ltd., Chongqing 400039, China

* Author to whom correspondence should be addressed.

Sustainability 2023, 15(15), 11619; https://0-doi-org.brum.beds.ac.uk/10.3390/su151511619

Submission received: 19 June 2023 / Revised: 22 July 2023 / Accepted: 23 July 2023 / Published: 27 July 2023


Abstract:

Passenger cars have emerged as a substantial segment of the vehicles traversing expressways, generating extensive traffic data on a daily basis. Accurately identifying individual vehicles and their travel patterns and characteristics is crucial in addressing the issues that impede the sustainable development of expressways, including traffic accidents, congestion, environmental pollution, and losses of both personnel and property. Regrettably, the utilization of electronic toll collection (ETC) data on expressways is currently not adequate, and data analysis and feature mining methods are underdeveloped, leading to the undervaluation of data potential. Focusing on ETC data from expressways, this study deeply analyzes the spatiotemporal characteristics of travel by passenger car users. Here, we propose an advanced user classification model by combining the traditional clustering algorithm with the feature grouping recognition model based on a back propagation neural network (BPNN) algorithm. Real-world data on expressway vehicle travel are used to validate our models. The results show a significant improvement in iteration efficiency of over 26.4% and a 23.17% accuracy improvement compared to traditional algorithms. The travel feature grouping recognition model yielded an accuracy of 95.23%. Furthermore, among the identified groups, such as “Public and commercial affairs” and “Commuting”, there is a notable characteristic of high travel frequency and concentrated travel periods. This indicates that these groups have placed significant pressure on the construction of a safe, efficient, and sustainable urban transportation system.

Keywords:

expressway; electronic toll collection data; feature identification; passenger car; K-means and BP neural network combination model

1. Introduction

Passenger cars are playing an increasingly prominent role on expressways. Traditionally, expressway management and operational services have treated passenger cars like any other individual vehicle, leading to inaccurate identification of the travel characteristics shared by similar groups and an imprecise understanding of the diverse travel needs of different individuals. This increases the risk of expressway traffic congestion and accidents, posing a threat to the safety of travelers and exacerbating environmental pollution, including exhaust emissions and noise pollution [1]. Consequently, it impacts the sustainable development of the expressway economy and society. To fully exploit the value of ETC data on expressways, a scientific and efficient method is necessary. An objective, accurate, and widely applicable algorithm model should be explored to classify and identify the heterogeneity of travel behavior among ETC passenger car users on expressways.

Expressways are an essential part of transportation infrastructure and play a significant role in meeting the high-speed and high-efficiency travel demands of society, with passenger cars representing the primary user group. The travel behavior of passenger car users directly affects the work mode of expressway management and service departments and has an impact on the operational status of expressways. Therefore, a fundamental approach to enhancing the operational environment of expressways, addressing the differentiated travel demands of diverse groups, and achieving sustainable development in urban transportation is to explore the travel characteristics and patterns of passenger car users on expressways, accurately and meticulously classify heterogeneous groups, and develop targeted management measures and service plans. This study contributes specifically to the literature in the following ways:

(1) This study explores ETC data on expressways, analyzes the spatiotemporal distribution patterns of vehicle travel, and identifies the travel characteristics of ETC passenger car users. The findings can serve as a reference for future research on the travel behavior of vehicles on expressways.

(2) We developed a precise feature identification model using machine learning methods to overcome the shortcomings of traditional clustering algorithms. The model accurately “partitions–recognizes” different types of passenger car travel groups. This model can provide data support for the design of governance programs for expressway traffic management departments and enable the identification of the needs of different user groups. This, in turn, can facilitate the development of personalized ETC value-added service products and improve overall satisfaction with the expressway travel services.

This paper is structured as follows. Section 2 provides an overview of the existing literature. Section 3 introduces the travel feature identification combination model, which includes the user classification model based on an improved traditional clustering algorithm and the group recognition model based on a back propagation neural network. Section 4 verifies and examines the results of the model. Finally, the conclusions and future prospects of the research are discussed in Section 5.

2. Literature Review

In recent years, digital and intelligent highways have rapidly developed, and big data have played an increasingly significant role in related research on highways. The kinds of data on highways mainly include construction, toll, traffic, facility maintenance, and traffic accident data. The research and application fields are mainly concerned with traffic flow prediction, traffic operation status analysis, and traffic safety risk assessment and prediction.

In terms of traffic flow prediction, He et al. [2] analyzed the distribution law of traffic flow through ETC toll data and used the Gaussian mixture regression (GMR) model to predict short-term traffic flow in ETC lanes. Shuai et al. [3] proposed a mixed two-layered model for predicting highway traffic flow and assessed its effectiveness by using data from 51 toll stations. In terms of traffic operation status judgment, Wang et al. [4] used actual highway data for parameter regression analysis and evaluated the traffic operation status between the mainline and ramp systems. Wang et al. [5] considered the sequential nature of traffic states and established an ordered Logit model to achieve traffic state prediction. In terms of traffic safety risk assessment and prediction, Yang et al. [6] proposed a method for accident detection and classification on highways based on deep convolutional neural networks using Global Navigation Satellite System (GNSS) positioning data. Jung et al. [7] used binary regression of random forests based on freeway tunnel accident data to evaluate the strategies recommended by the Korea Ministry of Land, Infrastructure and Transport (KMOLIT).

Studies of vehicle travel behavior mainly analyze how people choose travel modes, plan itineraries, and use transportation tools. Various methods have been proposed for studying vehicle travel behavior. These methods are generally divided into two categories: travel characteristics mining and travel characteristics identification.

Extracting travel characteristics usually involves processing large volumes of transportation data, such as vehicle travel trajectories, Global Positioning System (GPS) data, and Radio Frequency Identification (RFID) data, to identify rational and scientific indicators for characterizing travel behavior. Chen et al. [8] analyzed the number of trips, travel distances, and travel time of local vehicles, non-local vehicles, and ride-hailing vehicles to understand the travel characteristics of these three types of vehicles. Gong et al. [9] extracted the travel features of motor vehicles from license plate recognition system data, identifying four types of vehicles with unique structures while accounting for the number of trips and travel time. Lv et al. [10] used vehicle travel trajectory data to extract feature indicators including travel frequency, travel time, and travel distance. Cluster analysis, as the main research method of data mining, has become a focus of researchers' attention. Magnana et al. [11] analyzed path selection rules based on travel characteristics and applied a density clustering algorithm to GPS data to obtain the original candidate path set. Gao et al. [12] took the traveler's job type as a basis, analyzed trajectory data, and combined hierarchical clustering with a random forest algorithm to classify and predict travel purpose. Chang et al. [13] determined three characteristic parameters by analyzing vehicle license plate data, identified commuting vehicles using the K-means algorithm, and obtained the individual vehicles with commuting characteristics and the areas where they are concentrated.

Mode recognition is an important method for identifying vehicle travel characteristics. It is a mathematical method of summarizing and classifying information from data. Researchers usually use machine learning methods to construct feature recognition models for recognizing the travel modes of distinct groups. Xu [14] and Zhang [15] built recognition models with support vector machine (SVM) algorithms. Some studies also use the random forest model to identify travel modes [16,17]. Furthermore, existing studies indicate that using neural networks to identify travel characteristics tends to yield favorable outcomes [18,19].

We have also reviewed studies aimed at improving the K-means algorithm. The K-means algorithm [20] has three main limitations: sensitivity to the initial clustering centers, manual determination of the number of clusters, and sensitivity of the results to data outliers. Existing studies have identified the selection of initial clustering centers as a crucial aspect to improve [21,22]. Several methods have been proposed for determining the optimal number of clusters. Guo et al. [23] utilized sample density and crown algorithm to cluster sample data, which enabled them to obtain the number of clusters and cluster centers. Liu et al. [24] developed a clear hierarchical clustering objective function by combining Bayesian theory analysis with the K-means algorithm. Zhang et al. [25], on the other hand, identified the number of clusters based on the similarity in the data and applied iterative clustering to find the best cluster number. Three methods proposed in recent literature address the influence of outliers in K-means clustering. Zhang et al. [26] introduced a K-means local search algorithm with a relaxed objective function for outlier detection. Chen et al. [27] defined outliers as a new cluster for detection and elimination. Yu et al. [28] proposed a genetic algorithm-based three-layer and two-layer K-means algorithm that overcomes the influence of outliers, noise data, and initial cluster centers.

3. Materials and Methods

3.1. Overview of the Travel Feature Identification Combination Model

Figure 1 displays the technical roadmap of the combination model used for recognizing travel characteristics of ETC passenger car users on expressways.

This study analyzes the overall travel patterns of ETC passenger car users on expressways and proposes feature indicators that accurately describe differences in travel behavior. It combines a user feature grouping model based on an improved clustering algorithm with a feature grouping recognition model based on a BP neural network to achieve group classification and individual recognition of the travel behavior of ETC passenger car users on expressways. Identifying the travel characteristics of ETC passenger cars consists of three parts.

(1): Firstly, the ETC data should be pre-processed, which includes handling missing, duplicate, and abnormal values. The effective passing records can then be extracted, and the spatiotemporal characteristics of vehicle travel can be analyzed to select the travel characteristic indicators.
(2): Secondly, this part focuses on classifying user feature groups, primarily through the use of an improved K-means algorithm for clustering selected travel feature indicators.
(3): Finally, based on the clustering analysis results, the BP neural network algorithm can be utilized to learn and train the user classification data, thereby establishing a recognizable model for user travel feature groups.
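The pre-processing in part (1) can be illustrated with a minimal Python sketch. The record layout (`plate`, `entry_station`, `exit_station`, `entry_time`, `exit_time`) and the rule that a record is abnormal when the exit time is not later than the entry time are assumptions made for illustration; the actual ETC schema and the abnormal-value rules used in the study are not given here.

```python
from datetime import datetime

# Hypothetical field names; the real ETC schema is not specified in the paper.
REQUIRED = ("plate", "entry_station", "exit_station", "entry_time", "exit_time")

def clean_records(records):
    """Drop records with missing fields, exact duplicates, or non-positive durations."""
    seen = set()
    cleaned = []
    for rec in records:
        # Missing values: a required field is absent or empty.
        if any(not rec.get(field) for field in REQUIRED):
            continue
        # Duplicate values: the same passage recorded more than once.
        key = tuple(rec[field] for field in REQUIRED)
        if key in seen:
            continue
        seen.add(key)
        # Abnormal values (assumed rule): exit time not later than entry time.
        t_in = datetime.fromisoformat(rec["entry_time"])
        t_out = datetime.fromisoformat(rec["exit_time"])
        if t_out <= t_in:
            continue
        cleaned.append(rec)
    return cleaned

raw = [
    {"plate": "A1", "entry_station": "S1", "exit_station": "S2",
     "entry_time": "2021-05-03T08:00:00", "exit_time": "2021-05-03T08:25:00"},
    {"plate": "A1", "entry_station": "S1", "exit_station": "S2",
     "entry_time": "2021-05-03T08:00:00", "exit_time": "2021-05-03T08:25:00"},
    {"plate": "B2", "entry_station": "S1", "exit_station": None,
     "entry_time": "2021-05-03T09:00:00", "exit_time": "2021-05-03T09:40:00"},
    {"plate": "C3", "entry_station": "S3", "exit_station": "S1",
     "entry_time": "2021-05-03T09:00:00", "exit_time": "2021-05-03T08:50:00"},
]
valid = clean_records(raw)
```

Only the first record survives in this toy input: the second is a duplicate, the third has a missing exit station, and the fourth has an exit time earlier than its entry time.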

3.2. Analysis of Temporal and Spatial Characteristics of Passenger Car Travel on Expressways

3.2.1. Data Preparation

The electronic toll collection (ETC) data used in this study were obtained from the Chongqing Expressway Group; the data are large in volume, diverse, and drawn from a wide range of sources. To conduct big data mining and research on passenger car users' travel behavior more effectively, ETC transaction, vehicle passage, and vehicle information data were chosen as the fundamental data for this research. The data span 4 months: May to June 2021 and September to October 2021. Before analysis, the data were preprocessed to handle missing values, duplicate values, and outliers, and to establish correlations among dataset variables.

3.2.2. Analysis of Travel Time Characteristics

(1): Analysis of vehicle travel days

In order to gain a comprehensive and clear understanding of the characteristics of vehicle travel behavior with respect to travel days, the data were divided into three distinct categories for analysis: monthly total travel days, workday travel days, and weekend and holiday travel days.

Figure 2 displays the number of vehicles associated with each number of travel days for ETC passenger car users in May and June 2021. The distribution of vehicles over the total number of monthly travel days is consistent for both months. About 40% of vehicles travel on only 1–2 days per month, and vehicles that travel on 5 or fewer days per month represent around 75% of the sample.

Approximately 65% of vehicles travel on three or fewer workdays. Furthermore, the percentage of vehicles remains largely unchanged once the number of travel days reaches around 10 days, as shown in Figure 3, suggesting a group of users who travel by expressway for a regular daily commute. This statistical pattern is consistent with the general understanding of commuter groups in daily life.

Figure 4 shows the highest proportion of vehicles traveling on weekends and holidays is for a one-day duration, and approximately 80% of vehicles travel for a duration of less than 5 days.

(2): Analysis of Vehicle Travel Frequency

In May and June 2021, the statistical results of the number of vehicles corresponding to various travel frequencies of the ETC passenger car users are shown in Figure 5. Most vehicles had fewer than four travel instances within a month, of which vehicles that traveled six or fewer times accounted for approximately 60% of the dataset. To furnish a better characterization of the uneven distribution among vehicles in terms of travel frequency, the Gini coefficient was introduced and employed to produce the Lorenz curve corresponding to the number of vehicles and toll collection data.

G = \frac{S_{1}}{S_{1} + S_{2}}

(1)

where G represents the Gini coefficient, S_{1} represents the area between the Lorenz curve and the line of equality, and S_{2} represents the area between the Lorenz curve and the horizontal axis, as shown in Figure 6.

Approximately 20% of the vehicles make more trips per month and use the highway more often; the toll records of these vehicles account for 60% of the total toll data. The calculated Gini coefficient of G = 0.59 indicates a markedly uneven distribution of travel frequency among vehicles.
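As an illustration of Equation (1), the following Python sketch builds a discrete Lorenz curve from per-vehicle trip counts and evaluates G = S_{1} / (S_{1} + S_{2}). It assumes the Lorenz curve is drawn on a unit square, so that S_{1} + S_{2} = 0.5, and approximates S_{2} with the trapezoidal rule; the function name and the toy inputs are hypothetical, not the study's data.

```python
def gini_from_counts(trip_counts):
    """Gini coefficient of per-vehicle trip counts via a discrete Lorenz curve."""
    values = sorted(trip_counts)
    n = len(values)
    total = sum(values)
    # Lorenz curve: cumulative share of trips against cumulative share of vehicles.
    lorenz = [0.0]
    cumulative = 0.0
    for v in values:
        cumulative += v
        lorenz.append(cumulative / total)
    # S2: area under the Lorenz curve (trapezoidal rule, step width 1/n).
    s2 = sum((lorenz[i] + lorenz[i + 1]) / 2 for i in range(n)) / n
    # S1: area between the line of equality and the Lorenz curve.
    s1 = 0.5 - s2
    return s1 / (s1 + s2)

equal_usage = gini_from_counts([3, 3, 3, 3])      # every vehicle travels equally often
skewed_usage = gini_from_counts([0, 0, 0, 10])    # one vehicle makes all the trips
```

With identical counts the Lorenz curve coincides with the line of equality and G is 0; the more the trips concentrate in a few vehicles, the closer G moves toward 1.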

(3): Analysis of Vehicle Travel Time

The difference in traffic flow between highways during workdays and weekends or holidays is noticeable in daily life. To understand the distribution of vehicle travel times more accurately, date types can be divided into workdays and weekends or holidays (non-workdays). For each category, the distribution of the count of vehicles corresponding to various travel times can be studied.

Figure 7 shows a notable bimodal pattern in the distribution of average vehicle counts across time periods on workdays, and the trend is similar across workdays. On Fridays and on the day before a holiday, many vehicles travel on highways in the afternoon, which significantly increases traffic demand. In contrast, on the first working day after a holiday, demand during the morning peak is much higher than during the evening peak.

Unlike workdays, weekends or holidays only have one obvious peak in the distribution of vehicle counts for different travel times. In particular, passenger cars traveling through highways have a distinct morning peak during weekends and holidays, which is delayed for about 1–2 h compared with that on workdays, as shown in Figure 8.

(4): Analysis of Vehicle Travel Duration

It is apparent from Figure 9a that, during working days, most vehicles make short trips. The majority of them complete a single trip within half an hour or less, whereas only a small fraction takes over two hours, accounting for only 10% of the total. On the other hand, during weekends and holidays, around 30% of vehicles spend over six hours travelling, illustrating a significant difference in the duration of vehicle travel between working days and non-working days.

3.2.3. Analysis of Travel Space Characteristics

(1): Analysis of Vehicle Travel Distance

This paper examines the spatial characteristics of overall travel for passenger cars by analyzing the distribution of the distance and trajectory repetition rate of ETC passenger car users on highways over two months.

As shown in Figure 10, on workdays, trips of less than 50 km accounted for approximately 42.71% of all trips, whereas mid-range trips from 50 km to 200 km accounted for approximately 45.67%. On non-workdays, although the proportion of short-distance trips was lower than on workdays, the percentage of mid-range trips was higher, reaching approximately 52%.

(2): Analysis of Vehicle Travel Trajectory Repetition Rate

In this paper, the trajectory repetition rate is defined as the ratio of the number of instances where vehicles follow the same travel path during the study period to the total number of travel instances for each individual vehicle. The determination of vehicle trajectories is based on the entry and exit toll station information derived from the traffic data. Figure 11 shows that there is a positive correlation between the vehicle travel trajectory repetition rate and the number of trips taken by the vehicle. ETC passenger car users on expressways mostly show low trajectory repetition rates, accompanied by a subset of vehicles exhibiting high trajectory repetition rates.
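A minimal sketch of this indicator for a single vehicle follows. The definition leaves some room for interpretation; the sketch assumes that a trip counts as repeated when its (entry station, exit station) pair occurs at least twice for that vehicle during the study period. The function name and station labels are hypothetical.

```python
from collections import Counter

def trajectory_repetition_rate(trips):
    """trips: list of (entry_station, exit_station) pairs for one vehicle."""
    if not trips:
        return 0.0
    path_counts = Counter(trips)
    # Assumed rule: a path is "repeated" if the vehicle used it more than once.
    repeated_trips = sum(count for count in path_counts.values() if count > 1)
    return repeated_trips / len(trips)

rate = trajectory_repetition_rate(
    [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
)
```

In the example, two of the four trips follow the same path (A to B), so the rate is 0.5. Treating A to B and B to A as distinct paths is also an assumption of this sketch.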

Table 1 illustrates that the trends of the four indicators remain consistent across different months as well as during weekends and holidays. Table 2 indicates the variations in travel time and distance between workdays and non-workdays (weekends and holidays).

In summary, there are significant variations in the travel patterns and durations of vehicles between workdays and non-workdays. On workdays, most trips are for commuting or temporary business affairs, typically involving short distances and travel times. Conversely, during weekends and holidays, travel activities usually involve longer distances and longer travel times, in contrast to commuting or temporary business affairs.

3.2.4. Travel Characteristic Index Extraction

From the data analysis, we identified six indicators that can describe the characteristics of passenger car travel on expressways: monthly number of travel days (X1), monthly travel frequency (X2), average travel distance per trip (X3), trajectory repetition rate (X4), travel preference during peak hours (X5), and travel preference on weekends and holidays (X6). We then performed a correlation analysis on these indicators, the results of which are presented in Table 3. The results revealed a significant positive correlation among monthly number of travel days (X1), monthly travel frequency (X2), and trajectory repetition rate (X4). Because these three indicators carry overlapping information, only monthly number of travel days (X1) was retained among them. The final travel characteristic indicators are therefore monthly number of travel days (X1), average travel distance per trip (X3), travel preference during peak hours (X5), and travel preference on weekends and holidays (X6).
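The screening logic can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's procedure: it uses the Pearson correlation coefficient, a hypothetical cut-off of 0.8, and a greedy rule that keeps an indicator only if it is not strongly correlated with any indicator already kept. The toy columns are invented.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def screen_indicators(columns, threshold=0.8):
    """Keep each indicator unless it correlates strongly with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Invented toy data: X2 rises almost linearly with X1, X3 does not.
toy = {
    "X1": [2, 5, 9, 14, 20],
    "X2": [3, 8, 15, 22, 31],
    "X3": [120, 35, 80, 40, 95],
}
selected = screen_indicators(toy)
```

Because the rule is greedy, the order of the columns decides which member of a correlated group survives; listing X1 first mirrors the choice of retaining the monthly number of travel days.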

3.3. Development of a User Classification Model using an Improved Clustering Algorithm

3.3.1. Canopy-K-Means Clustering Algorithm Construction

McCallum et al. [29] first proposed the Canopy algorithm in 2000. It is commonly used for clustering analysis of high-dimensional datasets, and its principle is illustrated in Figure 12. Unlike traditional clustering algorithms, the Canopy algorithm does not require the number of clusters to be specified in advance; it obtains the number of clusters after a single traversal of the sample dataset. The input of the algorithm is a dataset of n samples, and the output consists of k cluster centers.

The process of the improved K-means clustering algorithm based on the Canopy algorithm has two main parts [30,31]. The first part involves Canopy pre-clustering, which utilizes the Canopy algorithm to determine the number of clusters (‘k’), and obtain the initial clustering centers, also known as the centroids of each Canopy sub-group. The second part involves the iterative process of the K-means algorithm, which starts with the initial clustering centers obtained from pre-clustering, iterates the algorithm until the clustering centers converge and stop changing, and finally outputs the results.

Figure 13 is the flow chart of the Canopy-K-means algorithm. The detailed algorithm steps are as follows:

Step 1: Build the target data sample set S and use the Euclidean distance to calculate the distance sample set D.

Step 2: The distance distribution histogram is generated based on the numerical features of the distance sample set, and the initial distance thresholds T_1 and T_2 are obtained.

Step 3: Determine the sample mean point of the target data sample set and designate it as the initial clustering center for the Canopy algorithm. Output the resulting clustering center as well as the k-value obtained from pre-clustering.

Step 4: Employ the clustering result obtained from the Canopy algorithm as an input parameter for the K-means algorithm. Calculate the distance between each sample point and every clustering center based on Euclidean distance calculation formula. Then, assign every sample point to its closest clustering center category.

Step 5: After all sample points are assigned, recalculate the center of each cluster.

Step 6: Compare the newly obtained clustering center with the previous clustering center. If they are different, go to Step 4 to continue calculation. Otherwise, go to Step 7.

Step 7: Output the final results.
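The two-stage procedure in Steps 1 to 7 can be sketched in plain Python as follows. This is a simplified illustration, not the study's implementation: the thresholds are passed in directly instead of being read from a distance histogram (Step 2), each canopy is seeded with the first remaining point rather than the sample mean point (Step 3), and the function names and toy data are hypothetical. It assumes T_1 > T_2 > 0.

```python
from math import dist

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equally sized tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def canopy_centers(points, t1, t2):
    """Canopy pre-clustering; returns one centroid per canopy (so k = len(result))."""
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Loose threshold T1: these points belong to the current canopy.
        members = [p for p in remaining if dist(seed, p) <= t1]
        centers.append(mean_point(members))
        # Tight threshold T2: these points can no longer seed a new canopy.
        remaining = [p for p in remaining if dist(seed, p) > t2]
    return centers

def kmeans(points, centers, max_iter=100):
    """K-means iterations starting from the given initial centers."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean_point(c) if c else centers[k] for k, c in enumerate(clusters)]
        if new_centers == centers:  # centers stopped changing: converged
            break
        centers = new_centers
    return centers, clusters

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
initial_centers = canopy_centers(data, t1=3.0, t2=2.0)
final_centers, final_clusters = kmeans(data, initial_centers)
```

On this toy input the Canopy pass finds two canopies, so K-means starts with k = 2 and converges in its first iteration because the canopy centroids already coincide with the cluster means.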

Using the Canopy algorithm to enhance the K-means clustering method resolves two difficulties of the traditional method: the uncertainty of the initial centers and the difficulty of determining the number of clusters k. Nevertheless, this enhancement only affects the initial input parameters and does not optimize the internal framework of the K-means algorithm itself. Because the K-means algorithm uses a partitioning strategy, it cannot prevent anomalous data from affecting the clustering output, which can lead to local optima. Specific optimization schemes are therefore proposed in the following section to address this issue.

3.3.2. Construction of the Canopy-K-Means Clustering Algorithm Based on Ant Colony Optimization

In 1991, M. Dorigo [32] first introduced the ant colony algorithm and established the fundamental principles and core concepts through extensive research. Utilizing the stochastic search characteristic of the ant colony algorithm can enhance clustering outcomes, alleviate local optima problems, and enhance the overall accuracy of user segmentation models. Consequently, this paper suggests utilizing an ant colony algorithm optimization technique to create a mixed clustering method for Canopy-K-means.

Assuming that X = \{X_{i} \mid i = 1, 2, \dots, N\} is a set of data samples, where X_{i} = (x_{i1}, x_{i2}, \dots, x_{iZ}) is a Z-dimensional vector and the number of clusters is K, the objective function is defined as follows:

\min F = \sum_{k = 1}^{K} \sum_{i = 1}^{N} y_{ik} d_{ik}

(2)

d_{ik} = d(X_{i}, m_{k}) = \left( \sum_{j = 1}^{Z} |x_{ij} - m_{kj}|^{2} \right)^{\frac{1}{2}}

(3)

\text{s.t.} \begin{cases} \sum_{k = 1}^{K} y_{ik} = 1, & i = 1, 2, \dots, N \\ \sum_{i = 1}^{N} y_{ik} \geq 1, & k = 1, 2, \dots, K \end{cases}

(4)

y_{ik} = \begin{cases} 1, & X_{i} \in M_{k} \\ 0, & X_{i} \notin M_{k} \end{cases}

(5)

where d_{ik} symbolizes the distance between the sample point X_{i} and the cluster center m_{k}; y_{ik} denotes the affiliation of X_{i} to M_{k}, and M_{k} is the k-th class.

The calculation of pheromone concentration on each path is as follows:

τ_{ik}(t + 1) = (1 - ρ) τ_{ik}(t) + \sum_{l = 1}^{L} Δτ_{ik}^{l}

(6)

τ_{ik}(0) = \frac{1}{d_{ik}}

(7)

where τ_{ik}(t) represents the residual pheromone concentration from sample point X_{i} to clustering center m_{k} on the t-th iterative path. Δτ_{ik}^{l} = 1 / F_{l}, F_{l} is the minimum value of the objective function, L is a constant, and ρ is the volatility factor. τ_{ik}(0) = 1 / d_{ik} is the initialization of the pheromone matrix.

The detailed process of the Canopy-K-means clustering algorithm optimized by the ant colony algorithm is shown in Figure 14. The first part is the Canopy-K-means clustering algorithm described in the previous section (Steps 1–7). The second part is the ant colony algorithm, which optimizes the clustering results through the following steps:

Step 8: Initialize the pheromone matrix and set the following parameter values: ρ, q_{0}, L, P_{s}, K, A, t_{max};

Step 9: To determine the category of all samples in each ant, each ant generates a corresponding random number q, q ∈ (0, 1), where q_{0} is a preset value. If q \leq q_{0}, the ant calculates the mobility probability p using Equation (8) to assign sample X_{i} to M_{k}. If q > q_{0}, the ant randomly assigns sample X_{i} to M_{k} using the normalized probability Equation (9).

p = \max_{k} \{ p_{ik} \}, \quad p_{ik} = \frac{τ_{ik} η_{ik}}{\sum_{c = 1}^{K} τ_{ic} η_{ic}}, \quad η_{ik} = \frac{1}{d_{ik}}

(8)

p_{ik} = \frac{τ_{ik}}{\sum_{c = 1}^{K} τ_{ic}}, \quad k = 1, 2, \dots, K

(9)

Step 10: Obtain the cluster center by calculating the mean attribute values of each category of samples in each ant’s corresponding solution. Calculate the objective function value using Equation (2).

Step 11: After assigning sample points to their corresponding categories, arrange the value F for each ant in ascending order, and perform a simple local search on the top L ant solutions. For the random number r ∈ (0, 1) allocated to the element corresponding to the sample X_{i} in the solution, if r < P_{s}, assign sample X_{i} to another category. Recalculate F', and if F' < F, replace the original solution.

Step 12: Calculate and update the global pheromone concentration on each path according to Equation (6);

Step 13: If the number of iterations t = t_{max}, output the optimal solution; otherwise, set t = t + 1 and go to Step 9.
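The assignment rule of Step 9 and the pheromone update of Step 12 can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, all distances are assumed to be strictly positive, and the deposit rule (adding 1 / F_{l} only on the sample-to-cluster pairs used by each of the top L solutions) is an assumption, since Equation (6) does not spell out where the increment is applied.

```python
import random

def choose_cluster(tau_i, d_i, q0, rng):
    """Assign one sample to a cluster from its pheromone row and its distances."""
    q = rng.random()
    if q <= q0:
        # Exploitation, Equation (8): pick the largest tau * eta with eta = 1 / d.
        scores = [t / d for t, d in zip(tau_i, d_i)]
        return max(range(len(scores)), key=scores.__getitem__)
    # Exploration, Equation (9): sample with probability proportional to tau.
    return rng.choices(range(len(tau_i)), weights=tau_i, k=1)[0]

def update_pheromone(tau, best_solutions, rho):
    """Equation (6): evaporate by (1 - rho), then deposit 1 / F_l per top solution.

    best_solutions: list of (assignment, f_value) pairs, where assignment[i] is
    the cluster index of sample i (assumed deposit rule, see the lead-in).
    """
    new_tau = [[(1 - rho) * t for t in row] for row in tau]
    for assignment, f_value in best_solutions:
        for i, k in enumerate(assignment):
            new_tau[i][k] += 1.0 / f_value
    return new_tau

rng = random.Random(0)
greedy_choice = choose_cluster([1.0, 1.0], [2.0, 0.5], q0=1.0, rng=rng)
updated = update_pheromone([[1.0, 1.0]], [([0], 2.0)], rho=0.1)
```

Since the denominator in Equation (8) is the same for every cluster, taking the largest product τ_{ik} η_{ik} selects the same cluster as taking the largest p_{ik}, which is why the sketch skips the normalization in the exploitation branch.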

3.4. Development of a BP Neural Network Algorithm for the Identification of Travel Characteristic Groups

The BP (back propagation) neural network, also known as the error back propagation network, uses the error back propagation algorithm as its core rule for model training. Training a neural network essentially minimizes a loss function, and each optimization step corresponds to one iteration of the network. Learning proceeds by alternating forward propagation of information with backward propagation of error. Training stops when the model output error meets the preset error target or when the maximum number of training iterations is reached. The operating structure of the BP neural network model is shown in Figure 15.

The BP neural network recognition model in this study is constructed as follows:

The number of layers of the neural network (1 ≤ L ≤ n) and the number of hidden layer neurons (1 ≤ L ≤ m) are selected based on experience to obtain varied combinations of layer numbers and neuron numbers. The model takes four feature indicators, i.e., monthly number of travel days, average travel distance per trip, travel preference during peak hours, and travel preference on weekends and holidays as inputs. The samples are labeled based on the user group partitioning results of the classification model. A BP neural network is then constructed to identify travel feature groups. The expected model recognition error is set as a threshold and the iteration is stopped to complete the training when the error between the predicted output and the actual value is lower than the expected error. The parameter combination with the shortest training time is recorded as the optimum parameter combination. This determination finally defines the number of hidden layers and neurons in the BP neural network. The specific model design process is shown in Figure 16.
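To make the training loop concrete, the following is a minimal pure-Python sketch of a BP network with one hidden layer, four inputs, and one output per group. It is an illustration under assumptions, not the model used in the study: the sigmoid activations, squared-error loss, learning rate, stopping threshold, hidden layer size, and the four toy samples (already scaled to [0, 1], two groups) are all hypothetical choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass; every weight row ends with a bias weight, hence the + [1.0]."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0]))) for row in w_out]
    return hidden, output

def train_bpnn(samples, n_hidden, n_out, lr=0.5, target_error=0.01,
               max_epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    history = []  # mean squared-error loss per epoch
    for _ in range(max_epochs):
        total_error = 0.0
        for x, target in samples:
            hidden, output = forward(x, w_hidden, w_out)
            total_error += sum((t - o) ** 2 for t, o in zip(target, output)) / 2
            # Backward pass: output-layer deltas first, then hidden-layer deltas.
            delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
            delta_hidden = [
                h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(n_out))
                for j, h in enumerate(hidden)
            ]
            # Gradient-descent weight updates.
            for k in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    w_out[k][j] -= lr * delta_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= lr * delta_hidden[j] * v
        history.append(total_error / len(samples))
        if history[-1] < target_error:  # expected error reached: stop training
            break
    return w_hidden, w_out, history

# Toy samples: four scaled indicators per vehicle, one-hot group label.
samples = [
    ([0.9, 0.1, 0.8, 0.1], [1.0, 0.0]),
    ([0.8, 0.2, 0.9, 0.2], [1.0, 0.0]),
    ([0.1, 0.9, 0.2, 0.9], [0.0, 1.0]),
    ([0.2, 0.8, 0.1, 0.8], [0.0, 1.0]),
]
w_hidden, w_out, history = train_bpnn(samples, n_hidden=3, n_out=2)
```

The architecture search described above could be layered on top of this sketch by calling `train_bpnn` for each candidate hidden layer size and recording the training time of each run; that loop is omitted here.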

4. Empirical Analysis

4.1. Model Validation

The study collected 61 days of ETC data from September to October 2021, comprising 39 workdays and 22 weekend days and holidays, including two major statutory holidays, the Mid-Autumn Festival and National Day. The total number of target objects collected was 1,642,920, and the data contain 21,649,767 effective travel records for the two months. The data volume statistics for different time ranges are shown in Figure 17.

4.1.1. Model Validation for Classifying Expressway ETC Passenger Car Users

This article applies the Canopy algorithm for pre-clustering. We selected a random 2% sample from the dataset to obtain the sample distances. We drew a histogram based upon the sample distances and obtained the corresponding $T_{1}$ and $T_{2}$ values. To ensure sampling consistency, we conducted ten iterative samplings. After each iteration, we calculated the average value of the initial threshold values of $T_{1}$ and $T_{2}$. This average value served as the final initial threshold for the study. As shown in Table 4, the final initial thresholds are $T_{1} = 4.22$ and $T_{2} = 3.11$.
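The averaging step can be checked directly against the ten sampling rounds listed in Table 4. The short sketch below uses only the values from that table; the variable names are illustrative.

```python
from statistics import mean

# Threshold values read from the ten sampling rounds in Table 4.
t1_samples = [4.3, 4.5, 4.6, 3.9, 4.1, 4.1, 4.2, 4.3, 4.2, 4.0]
t2_samples = [2.9, 3.4, 3.5, 2.8, 2.9, 3.0, 3.3, 3.1, 3.2, 3.0]

# The final initial thresholds are the means over the ten rounds.
t1 = round(mean(t1_samples), 2)
t2 = round(mean(t2_samples), 2)
print(t1, t2)
```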

The model computes the initial threshold values of $T_{1}$ and $T_{2}$. The model outputs six Canopy subsets and their corresponding centers. These are shown in Figure 18 and Table 5; the result is a cluster number $k = 6$.
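For readers unfamiliar with Canopy pre-clustering, the following is a minimal sketch of the standard two-threshold procedure, not the authors' implementation. It assumes Euclidean distance, takes the first remaining point as each canopy seed (the usual description picks it at random), and returns seed points, whereas Table 5 reports subset centroids. The toy points are hypothetical.

```python
import math

def canopy(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight threshold t2."""
    if t2 <= 0 or t1 <= t2:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # assumption: first remaining point acts as the seed
        # Every remaining point within the loose threshold joins this canopy.
        members = [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer seed or join later canopies.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies

# Hypothetical one-dimensional toy data, using the thresholds reported in Table 4.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
canopies = canopy(toy_points, t1=4.22, t2=3.11)
print(len(canopies))
```

The number of canopies produced is what supplies the cluster number for the subsequent K-means step.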

We input the six initial clustering centers obtained by Canopy into the K-means clustering model for iterative computation. We obtained a final output of six clustering centers and the number of iterations when the model converges. The results are displayed in Figure 19 and Table 6.
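The seeding step can be illustrated with a plain K-means loop that starts from given centers instead of random ones. This is a sketch assuming Euclidean distance and exact convergence (centers unchanged between iterations); the function name, the toy data, and the default limit of 500 iterations (the limit mentioned in Table 12) are illustrative choices.

```python
import math

def kmeans_from_seeds(points, seeds, max_iter=500):
    """Run K-means from the given initial centers; return (centers, iterations used)."""
    centers = [tuple(c) for c in seeds]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            return centers, iteration  # converged
        centers = new_centers
    return centers, None  # did not converge within max_iter

# Hypothetical toy data with two seed centers, as a Canopy step might supply.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
final_centers, iterations = kmeans_from_seeds(toy_points, seeds=[(0.0,), (10.0,)])
print(final_centers, iterations)
```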

To further improve the accuracy of clustering results, the ant colony algorithm was used to optimize the Canopy-K-means clustering results. In this paper, the volatility factor was set to $\rho = 0.1$, the thresholds $q_{0}$ and $P_{s}$ were set to 0.9, and the constant $L = 50$. The number of ants ($A$) was set to 200. The maximum number of iterations ($t_{\max}$) was determined by analyzing and comparing the clustering results with different numbers of iterations, as shown in Table 7. We found that the clustering effect was optimal when the maximum number of iterations was set to 300.
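The choice of the maximum number of iterations amounts to picking the row of Table 7 with the largest CH value. A small check using the values from that table (the variable names are illustrative):

```python
# CH values from Table 7, keyed by the maximum number of iterations.
ch_by_iterations = {
    100: 8185487.3264,
    200: 8398752.5481,
    300: 8433489.2158,
    400: 8296325.589,
    500: 8133256.6584,
}

# A larger CH value indicates a better clustering effect.
best_t_max = max(ch_by_iterations, key=ch_by_iterations.get)
print(best_t_max)
```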

Table 8 shows the parameter settings of the ant colony algorithm used in this study. Figure 20 and Table 9 display the optimized clustering results.

To clarify the actual meaning of each feature index of the clustering centers and study the feature performance of different categories, we performed data reduction on the clustering centers to obtain the actual value of the feature clustering centers as presented in Table 10. Figure 21 provides an intuitive comparison of the differences in travel characteristics between groups.
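The source does not spell out how the data reduction is carried out. The sketch below assumes that the features were z-score standardized, so that an actual value is recovered as mean plus standard deviation times the standardized value. Under that assumption, the mean and standard deviation of the monthly travel frequency can be back-solved from two groups in Tables 9 and 10 and then checked against a third. The function names are illustrative.

```python
def fit_inverse(z_a, x_a, z_b, x_b):
    """Solve x = mean + std * z from two (standardized, actual) value pairs."""
    std = (x_b - x_a) / (z_b - z_a)
    mean = x_a - std * z_a
    return mean, std

# Monthly travel frequency of Groups 0 and 4: Table 9 (standardized) and Table 10 (actual).
mean, std = fit_inverse(-0.5338, 1.5222, 2.7889, 16.5405)

def restore(z):
    """Map a standardized clustering-center value back to its actual value."""
    return mean + std * z

# Groups 3 and 2, which were not used for fitting; Table 10 lists 9.226 and 3.7935.
print(round(restore(1.1706), 3), round(restore(-0.0313), 3))
```

The restored values for the held-out groups agree with Table 10 to within rounding, which is consistent with, but does not prove, the z-score assumption.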

Group 0 has a prominent preference for traveling during weekends and holidays, with fewer average travel days per month, as shown by the results. We define this group as the “visiting and traveling” group.

Group 1 is defined as the “long-distance” group, as its most significant characteristic is the long average distance of a single trip.

Having an average of 3.79 travel days per month, Group 2 primarily concentrates on short to medium-distance trips with no apparent preference for weekend and holiday travel. Thus, we identify Group 2 as the “official business” group.

Group 3 has a higher average number of monthly travel days and no apparent preference for travel on weekends and holidays; its travel characteristics are similar to those of official or commercial vehicles. Therefore, we define Group 3 as the “public and commercial affairs” group.

Group 4 is identified as the “commuting” group due to its high travel frequency during workday peak periods, relatively short travel distances, and little travel on weekends and holidays.

Group 5 has few travel days per month and rarely travels on weekends or holidays. Their trips rarely involve highways, which implies more city-based daily travel. Hence, we define this group as the “sporadic” group.

Figure 21 reveals that the “public and commercial affairs” and the “commuting” groups have higher travel frequency and shorter travel distance, especially on working days. The “commuting” group is particularly concentrated during morning and evening peak hours on working days, causing the most traffic pressure on expressways and easily resulting in traffic congestion. This will increase commuting time and costs, leading to environmental pollution and energy waste. The expressway management department can formulate effective traffic control measures specifically for this group to improve the operational efficiency and sustainability of expressways.

4.1.2. Model Validation for Identifying Travel Characteristics Groups

Based on the results of a user classification model, classification tags are assigned to all expressway ETC passenger car users. First, 80% of randomly selected individuals are assigned as the training set, and a BP neural network is trained to identify the travel characteristics of the expressway ETC passenger car user group. The remaining 20% of individuals are designated as the test set to calculate the identification accuracy of the recognition model. The construction of the characteristic group identification model is illustrated in Figure 22.
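A random 80/20 split of this kind can be sketched as follows. The function name and the fixed seed are illustrative assumptions; the source does not state how the random selection was implemented or whether it was stratified by group.

```python
import random

def split_users(user_ids, train_fraction=0.8, seed=0):
    """Shuffle user IDs reproducibly and split them into training and test sets."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy example with ten hypothetical user IDs.
train_ids, test_ids = split_users(range(10))
print(len(train_ids), len(test_ids))
```

With the 1,642,920 users reported in Section 4.1, a 20% test set holds 328,584 users, which equals the total number of samples in the confusion matrix of Table 14.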

The detailed parameters of the neural network model are set as shown in Table 11.

By comparing the predicted results of the model with the actual values, the recognition results of each characteristic group are determined, as shown in Figure 23.

4.2. Effect Test

The validity of cluster analysis can be evaluated using commonly used indicators, such as the SSE (Sum of Squared Errors), CH (Calinski–Harabasz Index), and DB (Davies–Bouldin Index) [33]. The SSE index sums the squared distances between each sample and the centroid of its cluster; a smaller value indicates more compact clusters. The CH index is defined as the ratio of inter-cluster dispersion to intra-cluster dispersion. It can be obtained by calculating the between-class variance and within-class variance. A larger CH value indicates a better clustering effect. The DB index, also known as the classification appropriateness index, is obtained by taking, for each cluster, the largest ratio of the sum of the average intra-class distances of two clusters to the distance between their cluster centers, and then averaging these ratios over all clusters. A smaller DB value indicates a better clustering effect due to a smaller intra-class distance and a larger inter-class distance between the categories.

I_{SSE} = \sum_{i = 1}^{k} I_{SSE}(i)

(10)

where $k$ represents the number of clusters, and $I_{SSE}(i)$ represents the sum of squared distances between the data samples in the i-th cluster and the cluster centroid.

I_{CHI} = \frac{BGSS / (k - 1)}{WGSS / (N - k)}

(11)

WGSS^{(k)} = \sum_{i \in k} \| x_{i}^{k} - Z_{k} \|^{2}

(12)

BGSS^{(k)} = \sum_{i = 1}^{k} n_{i} \| Z_{i} - Z \|^{2}

(13)

where $N$ represents the total number of samples in the dataset, $k$ represents the number of clusters, $BGSS$ represents between-class variance, and $WGSS$ represents within-class variance.

DBI(k) = \frac{1}{k} \sum_{i = 1}^{k} \max_{j = 1, \dots, k,\ j \neq i} \left( \frac{W_{i} + W_{j}}{|C_{ij}|} \right)

(14)

W_{i} = \frac{1}{n_{i}} \sum_{x_{i} \in C_{i}} \sqrt{(x_{i} - Z_{i})^{2}}

(15)

|C_{ij}| = \sqrt{(Z_{i} - Z_{j})^{2}}

(16)

where $k$ represents the number of clusters, $C_{i}$ represents the i-th class object set, $|C_{ij}|$ represents the distance between the cluster centers of the i-th class and j-th class, $W_{i}$ and $W_{j}$ respectively represent the average distance between the sample points in the i-th class and j-th class to their respective cluster centers, and $n_{i}$ represents the number of samples in that class.
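To make Equations (10) to (16) concrete, the following sketch evaluates the three indices on a tiny hand-made example. It assumes Euclidean distance, takes $Z_{i}$ as the mean of cluster $i$ and $Z$ as the mean of all samples, and treats the SSE as the total within-class sum of squares; the function names and data are hypothetical.

```python
import math

def centroid(members):
    """Component-wise mean of a list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def validity_indices(clusters):
    """Compute SSE, CH, and DB for a list of clusters (each a list of points)."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    centers = [centroid(c) for c in clusters]
    overall = centroid([p for c in clusters for p in c])
    # Within-class and between-class sums of squares, Equations (12) and (13).
    wgss = sum(math.dist(p, z) ** 2 for c, z in zip(clusters, centers) for p in c)
    bgss = sum(len(c) * math.dist(z, overall) ** 2 for c, z in zip(clusters, centers))
    ch = (bgss / (k - 1)) / (wgss / (n - k))  # Equation (11)
    # Average distance of each cluster's samples to its center, Equation (15).
    w = [sum(math.dist(p, z) for p in c) / len(c) for c, z in zip(clusters, centers)]
    # Equation (14): average over clusters of the worst-case similarity ratio.
    db = sum(
        max((w[i] + w[j]) / math.dist(centers[i], centers[j]) for j in range(k) if j != i)
        for i in range(k)
    ) / k
    return {"SSE": wgss, "CH": ch, "DB": db}

# Two well-separated toy clusters in the plane.
toy_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
indices = validity_indices(toy_clusters)
print(indices)
```

For this example every sample lies at distance 1 from its cluster center and the centers are 4 apart, so the indices can be verified by hand: SSE = 4, CH = (16/1)/(4/2) = 8, and DB = (1 + 1)/4 = 0.5.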

The performance of the identification model for travel characteristic groups is tested by selecting TP (true positive), FN (false negative), FP (false positive), and TN (true negative) as evaluation indicators, based on the combination of each sample's actual category and predicted category [17]. The precision rate refers to the proportion of samples correctly identified as a travel group among all samples identified as that travel group.

P_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}

(17)

where $P_{i}$ denotes the precision rate of group $i$, $TP_{i}$ represents the number of samples correctly identified as group $i$ by the model, and $FP_{i}$ represents the number of samples incorrectly identified as group $i$.

Recall rate refers to the proportion of samples of a particular travel group that the model identifies correctly among all samples that actually belong to that group.

R_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}

(18)

where $R_{i}$ denotes the recall rate of group $i$, and $FN_{i}$ stands for the number of samples of group $i$ that the model incorrectly assigns to other groups.

F1\text{-}score = \frac{2 \times P_{i} \times R_{i}}{P_{i} + R_{i}}

(19)

Accuracy = \frac{TP_{i} + TN_{i}}{TP_{i} + FN_{i} + FP_{i} + TN_{i}}

(20)
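As a worked example of Equations (17) to (19), the sketch below computes the indicators from the confusion matrix in Table 14. It assumes that rows are actual groups and columns are predicted groups, and it computes the overall accuracy as the share of correctly identified samples (the diagonal sum divided by the total), which is an assumption about how the single accuracy figure in Table 15 was obtained. Under these assumptions the output agrees with Table 15 to within rounding.

```python
labels = ["Visiting and traveling", "Long-distance", "Official business",
          "Public and commercial affairs", "Commuting", "Sporadic"]

# Table 14; rows are assumed to be actual groups, columns predicted groups.
matrix = [
    [83631, 414, 665, 8, 6, 3548],
    [184, 16978, 21, 12, 3, 208],
    [832, 32, 99459, 3227, 31, 498],
    [67, 3, 2026, 45181, 326, 64],
    [5, 2, 6, 52, 6959, 3],
    [2409, 553, 433, 28, 5, 60705],
]

def per_class_metrics(m):
    """Return (precision, recall, F1) for each group of a square confusion matrix."""
    results = []
    for i in range(len(m)):
        tp = m[i][i]
        fn = sum(m[i]) - tp                  # group i samples assigned elsewhere
        fp = sum(row[i] for row in m) - tp   # other samples assigned to group i
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))

for name, (p, r, f1) in zip(labels, metrics):
    print(f"{name}: P={p:.2%} R={r:.2%} F1={f1:.2%}")
print(f"Accuracy: {accuracy:.2%}")
```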

Firstly, the efficacy of the method employed to determine the initial thresholds $T_{1}$ and $T_{2}$ of the Canopy algorithm in this study is validated by verifying the number of clusters generated by the model. The evaluation indicator values of different cluster numbers in the Canopy algorithm are demonstrated in Figure 24. When the number of clusters $k = 6$, the clustering effect is superior, as indicated by optimal values in all evaluation indicators, in line with the corresponding conclusion yielded by the Canopy algorithm.

The performance improvement provided by the Canopy-K-means algorithm is verified by comparing its clustering outcomes with those of the traditional K-means algorithm. The output results of the traditional K-means clustering model are presented in Table 12. Figure 25 displays a comparison of the iteration effects between the two models. After 368 iterations, the Canopy-K-means algorithm attains the optimal solution, whereas the traditional K-means algorithm, whose initial centers are chosen at random, has still not converged when the maximum of 500 iterations is reached.

In summary, using the Canopy algorithm for pre-clustering to obtain initial cluster centers not only sped up the convergence of the K-means algorithm, improving computation efficiency by over 26.4% (368 iterations versus the 500-iteration limit that the traditional K-means algorithm reached without converging), but also improved the accuracy of the results. As shown in Table 13 and Figure 26, the Canopy-K-means clustering algorithm reduced the clustering error by 13.48% compared to the traditional K-means algorithm (measured as the relative gain in the CH value), resulting in smaller intra-class distances and larger inter-class distances. As a result, individuals of the same category were more similar to one another, while the differences between individuals of different categories were more pronounced. After optimization by the ant colony algorithm, the error in the clustering results was reduced by a further 8.54% compared to the Canopy-K-means algorithm and by 23.17% compared to the traditional K-means clustering algorithm. The clustering effect was thus significantly enhanced.
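The percentages quoted in this summary can be reproduced from the reported figures. The sketch below uses the CH values in Table 13, the 368 iterations in Table 6, and the 500-iteration limit in Table 12; reading the efficiency gain as the share of iterations saved relative to that limit is an assumption that happens to match the reported 26.4%.

```python
# CH values from Table 13.
ch_kmeans = 6846927.0771
ch_canopy_kmeans = 7769583.5691
ch_aco = 8433489.2158

# Iterations to convergence (Table 6) and the iteration limit (Table 12).
canopy_iterations, max_iterations = 368, 500

def pct_gain(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100

iteration_saving = (max_iterations - canopy_iterations) / max_iterations * 100
gain_canopy = pct_gain(ch_canopy_kmeans, ch_kmeans)
gain_aco_vs_canopy = pct_gain(ch_aco, ch_canopy_kmeans)
gain_aco_vs_kmeans = pct_gain(ch_aco, ch_kmeans)

print(round(iteration_saving, 1))
print(round(gain_canopy, 2), round(gain_aco_vs_canopy, 2), round(gain_aco_vs_kmeans, 2))
```

Because the traditional K-means run had not converged at 500 iterations, the iteration saving is a lower bound, which is why the text reports the gain as "over" 26.4%.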

A confusion matrix of the recognition outcomes for the travel groups was compiled, and the evaluation indicators were calculated from it to assess the model’s recognition performance. Refer to Table 14 and Table 15 for details.

As shown in Figure 27, the results indicate that the highest F1-score value, reaching 96.94%, is that of the “commuting” group, followed by the “official business” group. The identification accuracy of the travel characteristic group recognition model for expressway ETC passenger car users is 95.23%. In general, the model performs well in recognition and can accurately identify various travel characteristic groups.

5. Conclusions

This study aims to assist expressway operation departments in formulating management measures and service plans for different travel groups. To achieve this, we extracted vehicle travel characteristic indicators from electronic toll collection (ETC) data, analyzed the temporal and spatial characteristics of passenger car travel, optimized the K-means clustering algorithm, and proposed a combined model for recognizing feature groups based on a neural network algorithm.

(1): In the traditional K-means algorithm, the problem of determining the initial clustering center and number of clusters is addressed through the use of the Canopy algorithm for pre-clustering. This improvement results in the K-means clustering algorithm being at least 26.4% more efficient.
(2): Optimization with the ant colony algorithm reduced the clustering error by 8.54% relative to the Canopy-K-means algorithm and by 23.17% relative to the traditional K-means clustering algorithm, significantly improving clustering accuracy.
(3): We utilized the group classification results as labels and performed neural network training to achieve efficient identification of various travel feature groups. The findings indicated a model recognition accuracy of 95.23%.

The results demonstrate that the model proposed in this paper accurately identifies the travel characteristics of passenger cars, offering valuable insights for traffic management departments to develop effective traffic control measures, thereby contributing to the achievement of sustainable development in expressway traffic. For instance, official vehicles are prone to wear and tear due to their frequent usage. To prevent vehicle-related traffic accidents, traffic management departments can disseminate network information [34] to remind users to inspect their vehicles daily before driving. Additionally, highway law enforcement agencies should intensify their oversight of social commercial operating vehicles, enhance the scrutiny of their operational qualifications, standardize their driving behavior, ensure the safety of drivers and passengers, and elevate the level of highway safety. In terms of the commuter travel group, implementing time-sharing and regional travel guidance can mitigate the concentration of vehicle travel, thereby reducing traffic congestion at expressway toll stations during peak morning and evening hours. This approach ensures the smooth operation of the expressway and enhances overall vehicle traffic efficiency.

The research data used in this paper are mainly ETC traffic data and user basic information, which are offline historical data. In the future, we will incorporate the following two aspects:

(1): To achieve secondary division of ETC passenger car user groups, we will utilize additional ETC internet operation data and ETC consumption data to further investigate the diversity of user group characteristics and needs.
(2): To continuously improve the division criteria for different user groups with distinctive characteristics, an interface for real-time data transmission will be established, which utilizes data that are more current and relevant to update user feature identification models. This approach enables us to provide a more targeted and refined service to meet user needs, resulting in the improvement of efficiency in expressway management and service levels.

Author Contributions

Methodology, X.C. and Y.Z.; validation, B.P.; writing—original draft preparation, X.Z.; writing—review and editing, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by Science and Technology Research Major Project of Chongqing Municipal Education Commission: Research and application of AI-driven traffic congestion mechanism and core model algorithm of circle layer intelligent control in super-large mountainous cities (KJZD-M202300702).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The numerical data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ge, W.; Zhang, G. Resilient Public Transport Construction in Mega Cities from the Perspective of Ecological Environment Governance. J. Environ. Public Health 2022, 2022, 9143618.
2. He, M.; Gao, L.; Shuai, C.; Lee, J.; Luo, J. Distribution Analysis and Forecast of Traffic Flow of an Expressway Electronic Toll Collection Lane. J. Transp. Eng. Part A Syst. 2021, 147, 04021043.
3. Shuai, C.; Wang, W.; Xu, G.; He, M.; Lee, J. Short-Term Traffic Flow Prediction of Expressway Considering Spatial Influences. J. Transp. Eng. Part A Syst. 2022, 148, 04022026.
4. Wang, Y.; Fu, Q.; Wang, X. A Traffic Status Evaluation Method of Expressway Merging Area Based on Improved Coupling Theory. Mod. Phys. Lett. B 2022, 36, 2150616.
5. Wang, K.; Wang, L.; Ma, W. Real-Time Traffic State Prediction and Congestion Mechanism Analysis for Expressways. In Proceedings of the CICTP 2022, Changsha, China, 8–11 July 2022.
6. Yang, D.; Wu, Y.; Sun, F.; Chen, J.; Zhai, D.; Fu, C. Freeway Accident Detection and Classification Based on the Multi-Vehicle Trajectory Data and Deep Learning Model. Transp. Res. Part C Emerg. Technol. 2021, 130, 103303.
7. Jung, S.; Qin, X. A Data-Driven Approach to Strengthening Policies to Prevent Freeway Tunnel Strikes by Motor Vehicles. Accid. Anal. Prev. 2021, 157, 106171.
8. Chen, Z.-C. Spatio-Temporal Analysis and Cost Modeling of Trip Data Based on License Plate Recognition BigData: An Case Study of Shenzhen City, China. Master’s Thesis, School of Civil and Traffic Engineering, Shenzhen University, Shenzhen, China, 2020.
9. Gong, Y.-G.; Li, J.; Wang, Y.; Ye, H. Analysis of Vehicle Travel Characteristics Based on License Plate Recognition Data. Traffic Transp. 2020, 36, 54–58.
10. Lv, M.; Chen, L.; Chen, T.; Zeng, D.; Cao, B. Discovering Individual Movement Patterns from Cell-Id Trajectory Data by Exploiting Handoff Features. Inf. Sci. 2019, 474, 18–32.
11. Magnana, L.; Rivano, H.; Chiabaut, N. Implicit GPS-based Bicycle Route Choice Model Using Clustering Methods and a LSTM Network. PLoS ONE 2022, 17, e0264196.
12. Gao, Q.; Molloy, J.; Axhausen, K.-W. Trip Purpose Imputation Using GPS Trajectories with Machine Learning. ISPRS Int. J. Geo-Inf. 2021, 10, 775.
13. Chang, Y.-J.; Yang, D.-Y. Recognition of Vehicles with Commuting Property Using License Plate Data. J. Transp. Syst. Eng. Inf. Technol. 2016, 16, 77–82.
14. Xu, Z.; Aghaabbasi, M.; Ali, M.; Macioszek, E. Targeting Sustainable Transportation Development: The Support Vector Machine and the Bayesian Optimization Algorithm for Classifying Household Vehicle Ownership. Sustainability 2022, 14, 11094.
15. Peng, H.; Wang, J.-P.; Zhang, N. Travel Mode Choice of Commuters in Corridor Valley Pattern City of Loess Plateau Based on SVM. J. Chongqing Jiaotong Univ. Nat. Sci. 2021, 40, 18–23.
16. Zhao, P.-J.; Cao, Y.-S. Identifying metro trip purpose using multi-source geographic big data and machine learning approach. J. Geo-Inf. Sci. 2020, 22, 1753–1765.
17. Lu, Z.; Long, Z.; Xia, J.; An, C. A random forest model for travel mode identification based on mobile phone signaling data. Sustainability 2019, 11, 5950.
18. Xia, Y.; Chen, H.; Zimmermann, R. A Random Effect Bayesian Neural Network (RE-BNN) for Travel Mode Choice Analysis Across Multiple Regions. Travel Behav. Soc. 2023, 30, 118–134.
19. Tang, Y.-L.; Jiang, C.; Zheng, B.-H.; Li, Q.-M. Taxi on Service Trip Characteristics Based on Multi-source Data Fusion: A Case of Yueyang. J. Transp. Syst. Eng. Inf. Technol. 2018, 18, 45–51.
20. MacQueen, J.; Plaut, D.; Blanchard, R. A Simplified Colorimetric Method for Serum Isocitrate Dehydrogenase. Am. J. Med. Technol. 1972, 38, 377–380.
21. Zhao, S.; Xiao, Y.; Ning, Y.; Zhou, Y.; Zhang, D. An Optimized K-Means Clustering for Improving Accuracy in Traffic Classification. Wireless Pers. Commun. 2021, 120, 81–93.
22. Kumar, K.-M.; Reddy, A.-R.-M. An efficient k-means clustering filtering algorithm using density based initial cluster centers. Inf. Sci. 2017, 418, 286–301.
23. Guo, X.-S.; Zhong, J. Optimisation of K-means Algorithm Based on Sample Density Canopy. Int. J. Ad Hoc Ubiquitous Comput. 2021, 38, 62–69.
24. Liu, Y.; Li, B. Bayesian Hierarchical K-means Clustering. Intell. Data Anal. 2020, 24, 977–992.
25. Zhang, Y.; Zhou, Y.; Guo, X.; Wu, J.; He, Q.; Liu, X.; Yang, Y. Self-Adaptive K-Means Based on a Covering Algorithm. Complexity 2018, 2018, 7698274.
26. Zhang, Z.; Feng, Q.; Huang, J.; Guo, Y.; Xu, J.; Wang, J. A Local Search Algorithm for K-Means with Outliers. Neurocomputing 2021, 450, 230–241.
27. Chen, C.; Wang, Y.; Hu, W.; Zheng, Z. Robust Multi-View K-Means Clustering with Outlier Removal. Knowl.-Based Syst. 2020, 210, 106518.
28. Yu, S.-S.; Chu, S.-W.; Wang, C.-M.; Chan, Y.-K.; Chang, T.-C. Two Improved K-Means Algorithms. Appl. Soft Comput. 2018, 68, 747–755.
29. McCallum, A.; Nigam, K.; Ungar, L.-H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000.
30. Zhang, G.; Zhang, C.; Zhang, H. Improved K-means Algorithm Based on Density Canopy. Knowl.-Based Syst. 2018, 145, 289–297.
31. Xia, D.; Ning, F.; He, W. Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform. J. Grid Comput. 2020, 18, 263–273.
32. Dorigo, M.; Maniezzo, V.; Colorni, A. Ant system: Optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. Part B Cybern. 1996, 26, 29–41.
33. Maulik, U.; Bandyopadhyay, S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654.
34. Ouallane, A.A.; Bakali, A.; Bahnasse, A.; Broumi, S.; Talea, M. Fusion of engineering insights and emerging trends: Intelligent urban traffic management system. Inf. Fusion 2022, 88, 218–248.

Figure 1. Technical roadmap of ETC passenger car user travel characteristics identification combined model on expressway.

Figure 2. Statistics of the number of vehicles corresponding to the total number of days of monthly travel. (a) May; (b) June.

Figure 3. Statistics of the number of vehicles corresponding to the number of days of travel on workdays. (a) May; (b) June.

Figure 4. Statistics of the number of vehicles corresponding to the number of travel days on weekends and holidays.

Figure 5. Statistics of the number of vehicles corresponding to the number of monthly trips: (a) May; (b) June.

Figure 6. Lorenz curve of the number of travel vehicles and toll records.

Figure 7. Traffic flow chart for different travel periods on workdays.

Figure 8. Traffic flow chart for different travel periods on weekends and holidays.

Figure 9. Statistics of the number of vehicles corresponding to travel duration. (a) Workdays; (b) weekends and holidays.

Figure 10. Statistics of the number of vehicles corresponding to travel distance. (a) Workdays; (b) weekends and holidays.

Figure 11. Statistics of the number of vehicles corresponding to trajectory repetition rate.

Figure 12. Schematic diagram of the principle of the Canopy algorithm.

Figure 13. Flow chart of Canopy-K-means algorithm.

Figure 14. Flow chart of Canopy-K-means clustering algorithm based on ant colony algorithm optimization.

Figure 15. BP neural network model operation diagram.

Figure 16. Design framework of BP neural network recognition model.

Figure 17. Statistics of effective traffic data.

Figure 18. Clustering results of the Canopy algorithm.

Figure 19. Clustering results of the Canopy-K-means algorithm.

Figure 20. Clustering results after optimization with ant colony algorithm.

Figure 21. Radar map of travel characteristics of expressway ETC passenger car users.

Figure 22. Architecture of identification model for travel characteristics group of expressway ETC passenger car users.

Figure 23. Identification results chart of different travel groups. (a) “Visiting and traveling” group, (b) “long-distance” group, (c) “official business” group, (d) “public and commercial affairs” group, (e) “commuting” group, and (f) “sporadic” group.

Figure 24. Testing of clustering effectiveness indicators under different cluster numbers using Canopy algorithm. (a) SSE (Sum of Squared Error), (b) CH (Calinski–Harabasz Index), and (c) DB (Davies–Bouldin Index).

Figure 25. Comparison chart of iterative effects. The dotted line is the CH value when Canopy-K-means converges.

Figure 26. Comparison of clustering CH index before and after optimization.

Figure 27. Evaluation of model recognition performance.

Table 1. Correlation analysis of characteristic indicators in different periods.

Indicator	Time Period		Pearson Correlation Coefficient
Monthly total travel days	May	June	0.9939
Travel days on workdays	May	June	0.9993
Monthly travel trips	May	June	0.9928
Travel periods	Weekends	Holidays	0.9683

Table 2. Summary of characteristic indicators in different periods.

Time Period	Mean Travel Duration (h)	Mean Travel Distance (km)
Workdays	2.25	100.03
Non-workdays	4.33	118.71

Table 3. Correlation analysis results of travel characteristic indicators.

	X1	X2	X3	X4	X5	X6
Pearson correlation coefficient
X1	1.000	0.909 **	−0.157	0.707 **	0.018	−0.160
X2	0.909 **	1.000	−0.108	0.604 **	0.088	−0.121
X3	−0.157	−0.108	1.000	−0.098	−0.040	−0.006
X4	0.707 **	0.604 **	−0.098	1.000	0.133	0.061
X5	0.018	0.088	−0.040	0.133	1.000	−0.041
X6	−0.160	−0.121	−0.006	0.061	−0.041	1.000
Significance (one-tailed)
X1		0.000	0.021	0.000	0.049	0.008
X2	0.000		0.018	0.000	0.022	0.010
X3	0.021	0.018		0.021	0.029	0.018
X4	0.000	0.000	0.021		0.009	0.024
X5	0.049	0.022	0.029	0.009		0.032
X6	0.008	0.010	0.018	0.024	0.032	

**. Correlation is significant at the 0.01 level (one-tailed).

Table 4. Sampling results of initial distance thresholds $T_{1}$ and $T_{2}$.

Sampling Times	T₁	Mean	T₂	Mean
1	4.3	4.22	2.9	3.11
2	4.5		3.4
3	4.6		3.5
4	3.9		2.8
5	4.1		2.9
6	4.1		3.0
7	4.2		3.3
8	4.3		3.1
9	4.2		3.2
10	4.0		3.0

Table 5. Clustering results of the Canopy algorithm (centroids of the Canopy subsets).

Cluster Categories	Monthly Travel Frequency	Average Travel Distance Per Trip	Travel Preference during Peak Hours	Travel Preference on Weekends and Holidays
0	−0.3745	−0.0697	−0.8747	−0.6868
1	−0.5675	4.9653	−0.0172	−0.2047
2	3.3059	−0.0677	0.9569	−0.7106
3	−0.7481	−0.2738	0.0210	−0.3106
4	−0.4742	0.0310	−0.0576	1.1288
5	2.0318	−0.3831	0.5309	−0.7557

Table 6. Clustering results of the Canopy-K-means algorithm.

Cluster Categories	Monthly Travel Frequency	Average Travel Distance Per Trip	Travel Preference during Peak Hours	Travel Preference on Weekends and Holidays
0	−0.1289	−0.1876	0.0285	−0.3056
1	−0.6874	3.6852	−0.0578	−0.1678
2	−0.8547	0.3256	−0.1103	1.2587
3	1.4587	−0.0985	0.1035	−0.7058
4	−0.9875	−0.5089	−0.0238	−0.5269
5	3.0167	−0.6712	1.7265	−0.9537
Number of iterations	368

Table 7. Comparison of clustering effects with different iterations.

Number of Iterations	Cluster Number	CH Value
100	6	8,185,487.3264
200	6	8,398,752.5481
300	6	8,433,489.2158
400	6	8,296,325.589
500	6	8,133,256.6584

Table 8. Parameter setting of ant colony algorithm.

Parameter	Value
Volatility factor ($\rho$)	0.1
Threshold 1 ($q_{0}$)	0.9
Constant ($L$)	50
Threshold 2 ($P_{s}$)	0.9
Cluster number ($K$)	6
Ant quantity ($A$)	200
Maximum iteration times ($t_{\max}$)	300

Table 9. Optimization results of ant colony algorithm.

Cluster Categories	Sample Size	Monthly Travel Frequency	Average Travel Distance Per Trip	Travel Preference during Peak Hours	Travel Preference on Weekends and Holidays
0	441,360	−0.5338	0.1866	−0.0887	1.1134
1	87,030	−0.4807	3.5016	−0.0397	−0.1508
2	520,395	−0.0313	−0.1295	0.0043	−0.2916
3	238,333	1.1706	−0.0022	0.0734	−0.5551
4	35,136	2.7889	−0.4816	1.6572	−0.9112
5	320,666	−0.6048	−0.0714	−0.0652	−0.7548

Table 10. Clustering centers of different feature groups.

Cluster Categories	Sample Size	Monthly Travel Frequency	Average Travel Distance Per Trip	Travel Preference during Peak Hours	Travel Preference on Weekends and Holidays
0	441,360	1.5222	123.6126	0.3588	0.9017
1	87,030	1.7623	531.9398	0.3713	0.4902
2	520,395	3.7935	84.6775	0.3825	0.4444
3	238,333	9.226	67.4591	0.4001	0.3586
4	35,136	16.5405	41.3079	0.8035	0.2427
5	320,666	1.2014	91.8284	0.3648	0.2936

Table 11. Neural network model parameter setting.

Parameter	Value
Number of layers in neural network	5
Number of neurons in hidden layer	7
Expected error	0.05
Learning rate	0.01
Momentum factor	0.9
Activation function	Sigmoid

Table 12. Clustering results of traditional K-means algorithm.

Cluster Categories	Monthly Travel Frequency	Average Travel Distance Per Trip	Travel Preference during Peak Hours	Travel Preference on Weekends and Holidays
0	2.5523	−0.3595	1.0325	−0.8512
1	−0.1864	−0.1146	0.2641	−0.3887
2	−0.5650	4.9311	−0.0402	−0.2561
3	−0.5004	−0.0963	0.0184	−0.6978
4	1.0772	−0.0656	−0.2013	−0.5878
5	−0.4516	−0.0471	−0.2379	1.0767
Number of iterations	When the maximum number of iterations was set to 500, the model did not converge.

Table 13. Comparison of clustering CH index before and after optimization.

	Traditional K-Means	Canopy-K-Means	Ant Colony Optimization-Based Canopy-K-Means
Cluster number	6	6	6
CH value	6,846,927.0771	7,769,583.5691	8,433,489.2158

Table 14. Confusion matrix of model recognition results.

	Visiting and Traveling	Long-Distance	Official Business	Public and Commercial Affairs	Commuting	Sporadic
Visiting and traveling	83,631	414	665	8	6	3548
Long-distance	184	16,978	21	12	3	208
Official business	832	32	99,459	3227	31	498
Public and commercial affairs	67	3	2026	45,181	326	64
Commuting	5	2	6	52	6959	3
Sporadic	2409	553	433	28	5	60,705

Table 15. Evaluation of model recognition performance.

Population	Precision	Recall	F1-Score	Accuracy
Visiting and traveling	95.99%	94.74%	95.36%	95.23%
Long-distance	94.42%	97.54%	95.95%
Official business	96.93%	95.56%	96.24%
Public and commercial affairs	93.14%	94.78%	93.96%
Commuting	94.94%	99.03%	96.94%
Sporadic	93.35%	94.65%	94.00%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
