1. Introduction
In recent years, motivated by the combination of the ubiquitous environmental sensing ability of wireless sensor networks (WSNs) with the powerful data storage and processing capabilities of cloud computing, sensor-cloud technology has been receiving growing attention in various application domains [1], such as target tracking [2], environment monitoring [3], smart cities [4], public safety systems [5], and precision irrigation [6]. It conveys the concept of providing physical sensors as a service (Se-aaS) [7]. In the sensor-cloud framework, various physical sensors and the corresponding cloud services are provided by sensor owners and cloud service providers, respectively. End-users can obtain information from these infrastructures without actually owning them. They only need to send their service requests to the sensor-cloud platform and pay for usage. Considering that the end-users in agricultural fields are usually low-income farmers, this pay-per-use model may relieve farmers of high costs and onerous farmland tasks [8]. Furthermore, the sensor-cloud framework enables multiple users to share the same infrastructure by using virtualization technology. Thus, the profits of the service providers can also be guaranteed.
The sensor-cloud framework is viewed as a paradigm shift from traditional WSNs. It decouples the physical sensor (data producer) from the data provider, which is conceptualized as a virtual sensor working in a virtual machine [9]. In some early works, researchers focused on the design of the system architecture. For instance, the authors in [9] described the design and the operation mode of the sensor-cloud framework. In [10], the authors gave a mathematical definition of each component of the sensor-cloud and of the interrelations among them. Similarly, the authors in [11] introduced a sensor-cloud infrastructure based on the IoT-cloud. They divided the whole system framework into three layers: the client layer, the middleware layer, and the physical layer. The client layer provides various interfaces to users, so that users can access the website and request the services provided by the sensor-cloud platform. When a user's request arrives, the middleware layer allocates suitable physical sensors to create a virtual sensor in response to the request.
The existing studies on virtual sensor provisioning have mainly focused on data aggregation or spatial correlation. In [12], all the physical sensors were activated to collect environmental data. Then, the collected sensing data were processed in the cloud. However, this method increases energy consumption and therefore shortens the lifetime of the underlying network. As a consequence, the sensor owners must redeploy their failed physical sensors to maintain the services, and the cost for users also increases. The authors in [13] reached the same conclusion. Thereafter, the authors in [14] proposed a cluster-based virtual sensor provisioning method. In that work, the physical sensors with spatial and measurement similarities were grouped into one cluster, and the ant algorithm was then exploited to select the physical sensors used to provision virtual sensors. In this way, numerous physical sensors were switched to the dormant state, avoiding extra energy consumption. However, this method only considers the feedback of the current sensing data. The historical data stored in the cloud server were not considered at all.
According to the existing studies, machine learning methods are widely used for data analysis in various fields, for instance, vehicular social networks [15], cyber security [16], and smart agriculture [17]. Leveraging the powerful storage and computing capability of cloud computing, massive sensing data can be stored and processed efficiently. As is well known, the data collected from WSNs have a certain spatial-temporal similarity [18]. Motivated by this, when implementing the virtual sensor provisioning task, the physical sensors with high similarity are grouped into one cluster. Then, several representative physical sensors are chosen from each cluster to provide data collection services. The remaining physical sensors are switched to the dormant state, which reduces the overall energy consumption and prolongs the network lifetime. However, physical sensors that are close to each other may still have different measurements. For instance, if there are some trees in the cropland, the temperature or light around the trees may be lower than in other places. Thus, in this paper, we focus on the similarity of sensing data and exploit machine learning methods to optimize the selection of physical sensors.
Classification and clustering are commonly used data analysis methods in machine learning. For a given sample set, also called a dataset, the goal of a clustering algorithm is to partition the whole dataset into several clusters (i.e., subsets) so that samples within the same cluster are more similar to each other than to those in different clusters. In this paper, we exploit the k-means clustering algorithm to cluster physical sensors according to their sensing data. Moreover, physical sensors with similar measurement values may have different changing trends. For instance, consider two temperature sensors that both currently measure 25 °C, where the temperature at the first sensor is rising and the temperature at the second is falling. In this condition, we argue that the two sensors should be placed in two different clusters. Thus, before implementing the clustering algorithm, the linear regression method is applied to analyze each physical sensor's changing trend according to the historical data. The main contributions of this article are outlined as follows:
- (1) We propose an energy-efficient virtual sensor provisioning scheme based on the similarity of sensing data. Unlike schemes based on spatial similarity, in our scheme the physical sensors whose measurement values are most highly correlated can be placed in one cluster, even if they are geographically far from each other.
- (2) To ensure the accuracy of the results, we first use the linear regression model to classify all the physical sensors into several classes according to the historical data, and then exploit the k-means clustering algorithm to optimize the selection. As a result, we use fewer physical sensors to provide a higher quality of service.
- (3) In addition to paying attention to the changing trends of physical sensors, we also consider the number of sensory parameters. The sensors chosen in our scheme can sense two kinds of parameters, unlike schemes that use single-parameter sensors.
The rest of this article is organized as follows: Section 2 introduces some related works together with a brief discussion. Section 3 describes the system model and the problem definition of virtual sensor provisioning. In Section 4, we detail the proposed machine learning based virtual sensor provisioning scheme. Then, Section 5 evaluates the performance of our proposed scheme and analyzes the simulation results. Finally, the conclusions and future work are presented in Section 6.
2. Related Works
Virtual sensor provisioning is an important task in sensor-cloud, which is similar to sensor allocation in traditional WSNs. In contrast to traditional WSNs, by exploiting virtualization technology, the physical sensors in the sensor-cloud framework can be shared among multiple users via virtual sensors. The main goal of virtual sensor provisioning is to reduce energy consumption and redundant data. With the same goal in traditional WSNs, the low energy adaptive clustering hierarchy (LEACH) protocol is widely applied [19,20]. In this method, the cluster head nodes are selected randomly with probability p, and the data packets from cluster members are sent to the sink node via the cluster head nodes. The cluster head nodes are reselected periodically so that the energy load of the entire network is distributed evenly across the sensor nodes. However, this method does not specifically consider requests from multiple users, so it is not suitable for the sensor-cloud environment.
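The random cluster-head election described above can be illustrated with a minimal sketch. It uses only the simplified rule stated in the text (each node becomes a head with probability p in a round) and not the full LEACH threshold function; the function name `elect_cluster_heads`, the fallback that forces at least one head, and the node identifiers are assumptions made for illustration.

```python
import random

def elect_cluster_heads(node_ids, p, rng):
    """Return the nodes that act as cluster heads in one round.

    Simplified rule taken from the text: every node is chosen
    independently with probability p. Full LEACH additionally
    prevents recent heads from being re-elected too soon.
    """
    heads = [node for node in node_ids if rng.random() < p]
    # Assumed fallback: force one head so members can always forward data.
    if not heads:
        heads = [rng.choice(node_ids)]
    return heads

rng = random.Random(0)            # fixed seed for a repeatable illustration
nodes = list(range(20))
for round_no in range(3):
    # Heads change from round to round, which spreads the energy load.
    print(round_no, elect_cluster_heads(nodes, p=0.2, rng=rng))
```

Because the election ignores which users requested which data, the sketch also makes the limitation noted above visible: nothing in the rule depends on user requests.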
The authors in [21] designed an agricultural sensor-cloud framework to provide multiple services for farmers. To ensure that event data packets can be forwarded to the sink quickly and reliably, the authors focused on the routing protocol of the physical layer and presented a priority-based data transmission technique. However, they tended to activate all the deployed physical sensors for virtual sensor provisioning. Although this method can utilize the powerful ability of cloud computing to process massive data packets efficiently, the lifetime of the WSN decreases as energy consumption increases. Similarly, the authors in [22] proposed a software-defined network (SDN) based edge-cloud network framework with load balancing and low response delay, which reduces redundant data and service response time.
In [12], the authors divided the whole geographical area into several regions. The data from the same region were sent to the middleware layer and processed with a hierarchical data aggregation method. The authors introduced four different virtual sensor configurations, namely one-to-many, many-to-one, many-to-many, and derived, which are the basic forms of physical sensor virtualization. However, network-level virtualization [23] does not fully utilize the advantages of the sensor-cloud infrastructure. Furthermore, physical sensors from the same region are not guaranteed to produce correlated data suitable for aggregation. As mentioned above, if there are some trees in the cropland, the temperature and humidity in the woods may be quite different from those outside.
Thereafter, some researchers took the spatial-temporal correlation into account when creating virtual sensors. They considered the dynamic correlation of the sensing data and proposed an active node selection algorithm to reduce energy consumption [24]. The values of dormant sensors can be predicted from the activated sensors with high spatial-temporal correlation. An integration model based on historical data prediction is proposed in [25]. The authors exploited the spatial-temporal correlation to predict and control the accuracy of the sensing data, which provides a trade-off between quality of service and energy efficiency.
The authors in [26] mainly focused on the similarity of sensing data. When the user initiates a service request, all the physical sensors are activated, and the collected data are then sent to the middleware layer via the sink nodes for further processing. The middleware clusters the physical sensors based on the similarity of the current sensing data. The generation of clusters is regulated by a predefined mean squared error. In this way, the physical sensors in the same cluster may be distributed in different areas. However, this scheme does not fully consider the changing trends of the environment (e.g., the example in Section 1 of two temperature sensors with equal readings but opposite trends). In this case, direct data aggregation may lead to inaccurate results. Thus, it is necessary to perform a classification step before clustering the physical sensors.
As for node selection algorithms, the authors in [27] proposed an adaptive clustering algorithm for multihop mobile networks. This algorithm partitions the whole network into several disjoint clusters according to each node's 1-hop neighbors. Similarly, in [28], a coalition-head selection algorithm is presented that helps trapped users form a coalition based on the users' transmission range. These two node selection algorithms are similar to the allocation scheme based on spatial correlation mentioned above. Thus, they can hardly distinguish physical sensors with different changing trends. Furthermore, these algorithms focus on mobile networks, in which the network topology is unstable, whereas our scheme mainly focuses on static networks.
From the above discussion, the existing works about virtual sensor provisioning are mainly focused on activating all physical sensors, network-level virtualization, spatial correlation, or data analysis for current sensing data. These methods may result in more energy consumption and redundant data. In addition, the role of historical data has not been fully considered.
4. Proposed Virtual Sensors Provisioning Scheme
In our scheme, we exploited the k-means algorithm to implement the virtual sensor provisioning. Moreover, considering that the physical sensors with similar values may have different changing trends, the linear regression model was applied to distinguish the changing trends.
4.1. Model
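Assume that there are $n$ physical sensors deployed in the monitoring area. Their sensing data compose a sample set $X = \{x_1, x_2, \ldots, x_n\}$, in which each sample can be denoted by a feature vector with $m$ dimensions, such as time, node identifier, and sensing parameter. The first step is to classify the $n$ samples into $l$ groups ($G$) according to their changing trends, in which:

$$G = \{G_1, G_2, \ldots, G_l\}, \quad G_i \cap G_j = \emptyset \ (i \neq j), \quad \bigcup_{i=1}^{l} G_i = X.$$

Then, each group $G_i$ is clustered into $k_i$ different clusters ($C$):

$$G_i = \bigcup_{j=1}^{k_i} C_{ij}, \quad C_{ij} \cap C_{ij'} = \emptyset \ (j \neq j').$$

In the final result, the $n$ samples are clustered into $k$ clusters. So, the relationship between $n$ and $k$ can be expressed as $k = \sum_{i=1}^{l} k_i$, in which $k_i \geq 1$ and $k \leq n$. The model of k-means clustering is a many-to-one function from samples to clusters.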
4.2. Strategy
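The linear regression model is applied to classify these physical sensors. For a sample $x_i$, there is:

$$f(x_i) = w x_i + b,$$

The objective of linear regression is to obtain the optimal values of $w$ and $b$ such that $f(x_i) \simeq y_i$. Here, $y_i$ is the actual value. The value of $w$ can be calculated by:

$$w = \frac{\sum_{i} y_i (x_i - \bar{x})}{\sum_{i} x_i (x_i - \bar{x})},$$

where $\bar{x}$ is the average of $x$. We also define a flag ($\delta_j$) to denote the changing trend of each attribute ($j$), where $w_j$ is the slope fitted for attribute $j$. It can be expressed as:

$$\delta_j = \begin{cases} 1, & w_j > 0, \\ 0, & w_j = 0, \\ -1, & w_j < 0. \end{cases}$$

Thus, for each sample $x_i$, the overall changing trend can be denoted by a flag set $\{\delta_1, \delta_2, \ldots, \delta_m\}$.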
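As a minimal sketch of the trend classification described above, the following code fits a least-squares slope to the history of one attribute and maps it to a trend flag. The encoding (+1 rising, -1 falling, 0 roughly constant), the tolerance `eps`, and the sample readings are assumptions for illustration; the slope formula is the standard closed form and assumes the time stamps are not all equal.

```python
def slope(xs, ys):
    """Least-squares slope w of y = w*x + b for one attribute."""
    x_bar = sum(xs) / len(xs)
    num = sum(y * (x - x_bar) for x, y in zip(xs, ys))
    den = sum(x * (x - x_bar) for x in xs)   # equals sum((x - x_bar)**2)
    return num / den

def trend_flag(xs, ys, eps=1e-6):
    """Assumed encoding: +1 rising, -1 falling, 0 roughly constant."""
    w = slope(xs, ys)
    if w > eps:
        return 1
    if w < -eps:
        return -1
    return 0

times = [0, 1, 2, 3]
sensor_a = [24.1, 24.4, 24.7, 25.0]   # currently 25.0 and rising
sensor_b = [25.9, 25.6, 25.3, 25.0]   # currently 25.0 and falling
print(trend_flag(times, sensor_a), trend_flag(times, sensor_b))   # 1 -1
```

Both sensors report the same current value, yet they receive different flags, so a subsequent clustering step would never merge them.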
After the initial classification, we apply the k-means clustering algorithm to select the optimal division ($C^*$) for each group by minimizing the loss function ($W(C)$). In this paper, we adopt the squared Euclidean distance (SED) as the index to evaluate the distance or similarity of samples:

$$d(x_i, x_j) = \sum_{q=1}^{m} (x_{qi} - x_{qj})^2 = \| x_i - x_j \|^2,$$

where $x_{qi}$ denotes the $q$th dimension of sample $x_i$. $W(C)$ is the sum of the distances between each sample and the center of the cluster to which it belongs; it denotes the similarity of the samples in the same cluster:

$$W(C) = \sum_{j=1}^{k} \sum_{C(i) = j} \| x_i - \mu_j \|^2,$$

where $\mu_j$ is the center of the $j$th cluster. Thus, k-means clustering can be viewed as an optimization problem with the objective:

$$C^* = \arg\min_{C} W(C).$$

The value of $W(C)$ reaches a minimum when similar samples are gathered into the same cluster. The task of dividing $n$ samples into $k$ clusters is a combinatorial optimization problem; the number of all possible clustering results can be calculated by:

$$S(n, k) = \frac{1}{k!} \sum_{j=1}^{k} (-1)^{k-j} \binom{k}{j} j^{n}. \tag{15}$$

From Equation (15), we can see that $S(n, k)$ grows exponentially with $n$, so an exhaustive search is infeasible; indeed, finding the optimal solution of the k-means clustering problem is NP-hard. An iterative method is therefore commonly applied to solve this problem.
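To make the loss function and the size of the search space concrete, the following sketch evaluates the within-cluster sum of squared Euclidean distances for two candidate divisions and counts the possible divisions with the standard Stirling number of the second kind. The function names and the toy samples are assumptions for illustration only.

```python
from math import comb, factorial

def sed(a, b):
    """Squared Euclidean distance between two m-dimensional samples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def loss(samples, labels):
    """Sum of SED between each sample and the mean of its own cluster."""
    clusters = {}
    for x, c in zip(samples, labels):
        clusters.setdefault(c, []).append(x)
    total = 0.0
    for members in clusters.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sed(x, center) for x in members)
    return total

def num_partitions(n, k):
    """Stirling number of the second kind: divisions of n samples into k non-empty clusters."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(1, k + 1))
    return total // factorial(k)

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(loss(samples, [0, 0, 1, 1]))   # similar samples grouped together: 4.0
print(loss(samples, [0, 1, 0, 1]))   # dissimilar samples grouped together: 100.0
print(num_partitions(4, 2))          # 7 possible divisions
print(num_partitions(20, 3))         # already hundreds of millions
```

Even for 20 samples and 3 clusters the count exceeds half a billion, which illustrates why exhaustive enumeration is replaced by an iterative procedure.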
4.3. Algorithm
Algorithm 1 details the entire virtual sensor provisioning scheme, which can be divided into two steps: classification and clustering. Initially, each group contains only one physical sensor. Subsequently, the linear regression model is applied to calculate the changing-trend flags according to the historical data. The physical sensors with the same flag values are grouped into one class. Then, in the clustering step, for each class, the first step is to assign each sample to its nearest centroid (i.e., the centroid with the smallest squared Euclidean distance to the sample), and the second step is to create new centroids by taking the mean of all the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value is less than a threshold. In other words, it iterates until the centroids do not move significantly.
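Algorithm 1 Virtual sensor provisioning algorithm.
Input: Sample set $X$
Step 1: Classification
    Calculate the changing-trend flag of each attribute for every sample in $X$.
    Initialize one group per sample: $G_i \leftarrow \{x_i\}$, $i = 1, \ldots, n$.
    for each pair of groups $G_i$, $G_j$ ($i \neq j$) do
        if $G_i$ and $G_j$ have the same flag set then
            Merge them: $G_i \leftarrow G_i \cup G_j$
        end if
    end for
    Return $G$
Step 2: Clustering (applied to each group in $G$)
    Set the number of clusters $k$.
    Set the convergence threshold.
    Set the maximum number of iterations.
    Select $k$ samples as the initial centroids.
    for each iteration do
        Assign each sample to its nearest centroid, then compute the mean of the samples assigned to each centroid.
        if the difference between the old and new centroids is not less than the threshold then
            Update the centroids
        end if
    end for
Output: The clusters of each group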
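The two steps of Algorithm 1 can be sketched end to end as follows. This is an illustrative reading of the scheme and not the authors' implementation: the history layout `(t, attr_1, ..., attr_m)`, the tolerance `eps`, seeding k-means with the first k samples, clustering on the latest reading of each sensor, and picking the member closest to each centroid as the representative are all assumptions.

```python
from collections import defaultdict

def slope(ts, ys):
    """Least-squares slope of ys against ts."""
    t_bar = sum(ts) / len(ts)
    return sum(y * (t - t_bar) for t, y in zip(ts, ys)) / sum(t * (t - t_bar) for t in ts)

def flag_set(history, eps=1e-6):
    """Trend flags (+1, 0, -1) for every attribute of one sensor."""
    ts = [row[0] for row in history]
    flags = []
    for j in range(1, len(history[0])):
        w = slope(ts, [row[j] for row in history])
        flags.append(1 if w > eps else -1 if w < -eps else 0)
    return tuple(flags)

def sed(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, tol=1e-9, max_iter=100):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each sample to its nearest centroid.
        labels = [min(range(k), key=lambda j: sed(p, centroids[j])) for p in points]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)] if members else centroids[j]
            )
        shift = sum(sed(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:   # centroids no longer move significantly
            break
    return labels, centroids

def provision(histories, k_per_class):
    """Step 1: group sensors by flag set. Step 2: k-means inside each group."""
    classes = defaultdict(list)
    for sensor_id, history in histories.items():
        classes[flag_set(history)].append(sensor_id)
    result = {}
    for flags, ids in classes.items():
        latest = [histories[s][-1][1:] for s in ids]
        k = min(k_per_class, len(ids))
        labels, centroids = kmeans(latest, k)
        reps = []
        for j in range(k):
            members = [(sed(p, centroids[j]), s) for p, s, lab in zip(latest, ids, labels) if lab == j]
            if members:
                reps.append(min(members)[1])   # member closest to the centroid
        result[flags] = reps
    return result

# Each row: (time, temperature, humidity); values are made up.
histories = {
    "s1": [(0, 24.0, 60.0), (1, 24.5, 59.0), (2, 25.0, 58.0)],
    "s2": [(0, 26.0, 56.0), (1, 25.5, 57.0), (2, 25.0, 58.0)],
    "s3": [(0, 24.2, 60.0), (1, 24.7, 59.0), (2, 25.2, 58.0)],
    "s4": [(0, 24.7, 60.0), (1, 25.2, 59.0), (2, 25.7, 58.0)],
}
print(provision(histories, k_per_class=1))
```

In this toy run, sensors s1 and s2 end at identical readings but fall into different classes because their trends are opposite, and only one representative per class stays active while the others could be switched to the dormant state.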
4.4. Computational Complexity
For the given dataset $X$, the computational complexity can be divided into two phases according to the algorithm's execution steps. In the classification phase, the time complexity is $O(nm^2 + m^3)$, and the space complexity is $O(nm)$; in the clustering phase, the time complexity is $O(tnkm)$, and the space complexity is $O((n + k)m)$. Here, $n$ is the number of samples, $m$ is the dimension of a sample, $k$ is the number of clusters, and $t$ is the number of iterations. The proof is given in the following.
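Proof 1. We use the least squares method to train the model. For a sample $x = (x_1; x_2; \ldots; x_m)$, the linear regression model can be defined as:

$$f(x) = w^{T} x + b, \tag{16}$$

in which $w = (w_1; w_2; \ldots; w_m)$. For simplicity, we set $\hat{w} = (w; b)$ and $\hat{x} = (x; 1)$. Then Equation (16) can be expressed as:

$$f(\hat{x}) = \hat{w}^{T} \hat{x}.$$

The mean square error is applied as the cost function:

$$J(\hat{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{w}^{T} \hat{x}_i - y_i \right)^2,$$

where $\hat{x}_i$ denotes the feature vector of the $i$th sample, $y_i$ denotes the actual value of the $i$th sample, and $n$ is the number of samples. Thus, let $X = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)^{T}$ and $y = (y_1, y_2, \ldots, y_n)^{T}$. The cost function can then be expressed as:

$$J(\hat{w}) = \frac{1}{n} (X\hat{w} - y)^{T} (X\hat{w} - y).$$

Calculating the derivative of $J(\hat{w})$ gives:

$$\frac{\partial J(\hat{w})}{\partial \hat{w}} = \frac{2}{n} X^{T} (X\hat{w} - y).$$

Setting $\partial J(\hat{w}) / \partial \hat{w} = 0$, we get:

$$\hat{w} = (X^{T} X)^{-1} X^{T} y.$$

The most computationally intensive part is computing $X^{T} X$, whose time complexity is $O(nm^2)$, and then inverting it, whose time complexity is $O(m^3)$. Thus, the overall time complexity is $O(nm^2 + m^3)$. Furthermore, only the dataset $X$ needs to be stored, so the space complexity is $O(nm)$. □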
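The two cost terms named in Proof 1 can be seen in a small, dependency-free sketch of the least-squares closed form. It builds the Gram matrix of the design matrix (the part that scales with the number of samples) and then solves the resulting linear system by Gaussian elimination, which has the same cubic order as an explicit inverse. The function name, the toy data, and the assumption that the Gram matrix is invertible are illustrative.

```python
def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y; every row of X already ends with a constant 1."""
    n, d = len(X), len(X[0])
    # Gram matrix X^T X: d*d entries, each a sum over n samples -> O(n d^2).
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(d)] for p in range(d)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(d)]
    # Gaussian elimination with partial pivoting -> O(d^3).
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the upper-triangular system.
    w = [0.0] * d
    for r in range(d - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, d))) / A[r][r]
    return w

# Toy data generated from y = 2*x1 - 1*x2 + 3 (last column is the constant 1).
X = [[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1], [2, 1, 1]]
y = [3, 5, 2, 4, 6]
print(fit_normal_equation(X, y))   # approximately [2.0, -1.0, 3.0]
```

Solving the system directly instead of forming the inverse is a common numerical choice; it does not change the asymptotic cost discussed in the proof.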
Proof 2. The implementation of the k-means clustering algorithm for each class can be divided into four steps:
- (1) Select $k$ samples as the initial centroids, $\mu_1, \mu_2, \ldots, \mu_k$;
- (2) Calculate the distance between each sample $x_i$ and each centroid $\mu_j$. Then assign $x_i$ to the cluster corresponding to the centroid with the smallest distance;
- (3) For each new cluster ($C_j$), calculate the new centroid $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$;
- (4) Repeat the above two steps until the centroids do not move significantly.
In sum, the time complexity of step 2 for a single sample is $O(km)$, in which $k$ is the number of clusters and $m$ is the dimension of a sample. Since there are $n$ samples and $t$ iterations, the overall time complexity is $O(tnkm)$. In addition, only the samples and the centroids need to be stored, so the space complexity is $O((n + k)m)$. □
6. Conclusions
Virtual sensor provisioning is one of the foremost tasks in sensor-cloud. The existing studies on sensor-cloud mainly select all physical sensors, which results in a massive amount of energy consumption. Inspired by machine learning methods, we propose a virtual sensor provisioning scheme based on data similarity analysis. First, all physical sensors are divided into l classes according to their changing trends. Then, the k-means clustering algorithm is applied to each class. Finally, representative physical sensors are chosen to create the corresponding virtual sensors. In summary, we use fewer physical sensors to provide higher quality services. Meanwhile, our scheme reduces energy consumption and prolongs the overall lifetime of the network. The experimental results show that our approach is efficient and suitable for virtual sensor provisioning tasks.
Nonetheless, some issues still need further study in future work, such as virtual sensor provisioning in heterogeneous environments and the selection algorithm for the representative physical sensors.