1. Introduction
The problem of reconstructing fine-grained spatial data from its coarse-grained aggregate observations of each subregions lies in the core of many real world applications. For example, the reconstruction of fine-grained spatial distribution of cell phone activities is of particular interest to telecommunication and information technology companies, where the recovered data can be used for device installation, capacity planning, the study of urban ecology [
1,
2,
3], population density estimation [
4,
5,
6], and human mobility prediction [
7,
8,
9,
10,
11]. However, the companies may only have access to the aggregate mobile traffic volumes on each base station, as either privacy issues or additional technical overhead is involved to get fine-grained spatial data of users. Similarly, it is also highly valuable if we can infer the spatial distribution of population (e.g., the population vote for a certain party) densities based on the total population recorded at polling stations that sparsely scattered at different subregions. Internet media providers or retailers, such as Google, Tencent, Amazon, Facebook, etc., may want to recover a fine-grained geographical distribution of their users based on the aggregated user counts observed at different points of presence (PoPs) or data centers. Note that, in all the above-mentioned cases, it is impossible or not allowed to track the position of each individual due to either privacy concerns or technical overhead. Therefore, reconstructing the spatial data from coarse aggregation will be highly useful in such cases.
In this paper, we study such spatial sparse recovery problem, that is, to infer the fine-grained distribution of certain spatial data in a region given the aggregate observations recorded for each of its subregions. However, it is an extremely challenging problem and has seldom been studied. A straightforward idea is assuming the density is uniformly distributed within each subregion. Based on the based on the obtained aggregate observation, we can calculate a patched piece-wise constant estimation for each subregion. However, the densities estimated by this method will jump between neighboring subregions and disregard the local continuity or similarity of the studied spatial distribution across subregion boundaries. In addition, the piece-wise constant spatial field given by this approach provides little value for applications such as hot spot discovery. Many spatial data presents local continuity, e.g., Internet activity or cell phone activity. This is because the data often highly depend on underlying factors which are usually smoothly changing, like area functionality, urban geographical features, population density and so on. To exploit the smoothness, we may utilize spatial smoothing techniques such as Thin Plate Splines [
12], Soap film smoothing [
13], Spline smoothing [
14], Bivariate Spline Regression [
15], or Spatial Spline Regression [
16] developed in statistics to smoothen the patched estimation. However, nearly all existing spatial smoothing techniques [
12,
13,
14,
15,
16] are designed to recover a spatial field of densities according to sampled observations, e.g., reconstruct a spatial field of temperatures based on the temperature records at some sample points. In contrast, our problem needs to recovery a spatial field based on coarse-grained aggregate observations. Therefore, existing spatial smoothing techniques are not directly applicable to our new problem. Without modification, these smoothing techniques will violate the necessary constraint that the estimated spatial data in each subregion must sum up to its corresponding aggregate observation in the first place, leading to systematic errors.
To overcome the difficulties mentioned above, in this paper, we propose a new technique named Constrained Spatial Smoothing (CSS) for the problem of spatial data reconstruction. Specifically, given a region, we aim to reconstruct a spatial field of densities over that region based on observed aggregate values in patched subregions. Our approach penalizes the “roughness” of the reconstructed spatial field subject to the constraint that the aggregation of discretized values of the spatial field in each patched subregion equals the aggregate observation made in that subregion. It is distinct from previous spatial smoothing techniques due to the additional constraint in our problem. We propose an Alternating Direction Method of Multipliers (ADMM) [
17,
18] algorithm to decouple the problem into the alternated minimizations of a quadratic program (QP) [
19] subproblem and a spatial smoothing subproblem, where we use the QP to iteratively enforce the observation constraints, while solving the spatial smoothing subproblem with a recently proposed finite element technique called Spatial Spline Regression (SSR) [
16]. In addition, our approach not only leverages the intrinsic smoothness from local continuity to reconstruct a spatial field, but is also able to incorporate additional external information, such as the number of schools, number of bus stops, population, etc., in the underlying geographical region as a regression add-on component to further enhance recovery performance. Last but not least, our algorithm can be applied to a variety of sparse recovery problem where intrinsic smoothness exists.
Another important contribution of the paper is that we conduct extensive evaluation to compare our proposed algorithms with a variety of baseline methods. In our evaluation, we are trying to reconstruct the mobile phone activity distributions in Milan, Italy from base station observations. The Telecom Italia Big Data Challenge dataset is a multi-source dataset that contains a variety of informations, including aggregation of telecommunication activities, news, social networks, weather, and electricity data from the city of Milan. With the important information about human activities contained in the dataset, especially the cellphone activity records, researchers utilized the data to study different problems, such as modeling human mobility patterns [
20,
21,
22], population density estimation [
4,
5], models the spread of diseases [
23,
24], modeling city structure [
3] and city ecology [
2], etc. Specifically, our evaluation is based on the Milan Call Detail Records (CDR) dataset, a part of the Telecom Italia Big Data Challenge dataset [
25] which contains the phone call and Short Message Service (SMS) activity records of two months in each grid square of 235 m × 235 m in the city of Milan, Italy.
Given the Milan Call Detail Records (CDR) dataset, we consider a region that consists of 2726 grid squares in an irregularly bounded region in the city of Milan. To stress-test the algorithm performance, we assume we only know the aggregate phone activities observed on 100 or 200 base stations and aim to recover the entire spatial field of phone activities. We also use another geographical attribute dataset available from the Municipality of Milan’s Open Data website [
1] as the additional external attribute data to improve performance. Extensive evaluation shows that our proposed approach achieves significant improvement, compared to various state-of-the-art baseline methods, including the spatial spline regression (SSR) [
16] approach. Our technique can recover the fine-grained cell phone activity distribution of 2726 data points only from 200 data points of base stations, with a mean absolute percentage error of 0.309, representing a 26.3% improvement from the SSR baseline scheme.
The remainder of this paper is organized as follows. In
Section 2, we formulate the problem of spatial field reconstruction from coarse aggregate observations. In
Section 3, we describe existing solutions, including a state-of-the-art Spatial Spline Regression (SSR) technique for spatial smoothing. In
Section 4, we propose our Constrained Spatial Smoothing method which respects both the local continuity in the spatial field and the aggregation constraints at the same time. In
Section 5, we conduct extensive evaluation in comparison with various other methods through a solid and extensive case study of cell phone activity density estimation in the city of Milan. We discuss related literature in
Section 6 and conclude the paper in
Section 7.
2. Problem Formulation
In this section, we formally introduce the problem of spatial field reconstruction from coarse aggregations observed at sparse scattered points in that field. Our problem can be formulated as a new type of sparse recovery problems. To ease the presentation, we may use cell phone activity recovery as an example.
Let denote an irregularly bounded domain, which is the entire region of interest in our problem. Usually, it excludes the uninhabited areas such as hills, ocean coasts, rivers, and so on. Suppose is a real-valued function that represents certain spatial densities field (e.g., cell phone activities), where denotes different geographical positions in . Let denote m observation points (e.g., base stations) that scattered in . Each point is located in a position and in charge of a subregion . In our problem, we are given the aggregated volume in that is in charge of. Our goal is to reconstruct the spatial field based on the observed aggregated volumes .
To give an instance, consider the problem of recover cell phone activity distribution. In this case, each user will connect to a base station (cell tower) that is closest to his/her cell phone. Therefore, we can observe the aggregated volume for each base station
where
denotes the subregion that
is in charge of, and is given by
Given the aggregated activity volumes recorded on m base stations, our goal is to reconstruct the entire cell phone activity densities distribution f, which is a spatial field in the domain . We may call base station volumes in this case. However, reconstructing a continuous spatial field is almost computationally infeasible as a personal computer can not handle the continuous nature of .
In reality, we only need to recover f to a certain granularity required by the operator (e.g., 235 m × 235 m squares in the dataset provided by Telecom Italia Mobile). To fix notations, suppose is discretized into n small grid squares , where , are the center positions of each square j in . We can assume the area of each square is without loss of generality. In addition, the number of aggregate observations is much smaller than the total number of squares to be reconstructed, therefore we have .
After domain discretization, we can get the aggregate volume on each base station
by
where the subregion that
represents is given by
Therefore, our goal is to reconstruct the underlying spatial field
f, and especially the activity densities
in all
n grid squares if the desired granularity is on a per-square level, with only access to the aggregated observations
in Label (
1).
The problem defined above is broadly applicable to characterize a variety of applications other than the recovery of cell phone activity density distribution, e.g., inferring a fine-grained geographical user distribution for a certain app or website based on aggregated user counts collected at sparsely distributed Presence of Points (PoPs) or data centers, and recovering the voter distribution for a certain party based on aggregate voting statistics at different polling stations. The nonessential difference is that the definition of subregion , from which volume is aggregated, is different for each specific application.
Constrained Spatial Smoothing Problem
Denote
. Since all
are predetermined, e.g., from Label (
2) for the problem of cell phone activity distribution recovery, and
are known, reconstructing spatial field
from (
1) is essentially solving a linear system of equations for
, i.e.,
where the matrix
is given by
Since , i.e., the number of equations is far smaller than the number of the unknowns, reconstructing from is essentially a sparse recovery problem.
Directly solving the linear system of Equation (
1) is infeasible, as it is an underdetermined system which has an infinite number of solutions. However, the spatial property of
f can be utilized as constraints to make the sparse recovery problem feasible and has a unique solution. We observe that spatial data usually exhibit local continuity or correlation within domain
. For example, in the problem of cell phone activity density recovery, the activity density of a certain location highly depends on the population and activity at that place, e.g., the downtown has more population and cell phone activity than suburban residential areas. In addition, the underlying area functionality and the spatial distributions of human activity density are often slowly changing over the domain
rather than suddenly jumping between different subregions.
Therefore, we can formulate our constrained spatial sparse recovery problem as the following:
by taking into account the non-negative property and the local spatial continuity (smoothness) of
f.
is the Laplacian of
f, and is utilized to encourage local similarity and penalize the roughness of the spatial field
f. It is worth noting that once
f is reconstructed, we have not only recovered the densities
at the square centers
, but can also recover the density
of any point
, e.g., between the centers of two neighboring grid squares, although such a fine-grained recovery may not be needed in every application.
To further improve the recovery performance, we can utilize additional external demographic or social features at each location. In the problem of cell phone activity density reconstruction, cell phone activities are often correlated with the underlying population density and social functionalities (e.g., the percentage of green area, the number of schools, the number of businesses/restaurants, the number of sport facilities, and the number of bus stops, etc.) of the considered regions.
Specifically, suppose
represents the feature vector consisting of
q external feature values of square
j. When
is available as additional input, we can estimate the spatial density data in square
j by
where
is an underlying spatial field functional that preserves local spatial continuity, while
is a linear regression part based on the attributes of square
that allows position-specific variation or jumps.
In the presence of attributes, we can formulate the constrained spatial sparse recovery problem as
Once we get the spatial field
and
, we can reconstruct
for all the squares using (
5). For example, we can calculate the cell phone activity at a specific place by the summation of an underlying smooth spatial field
and a linear regression of location attributes, where the add-on regression helps to model the jump between two subregions if the two regions are quite different and have distinct functionalities or attributes.
4. An ADMM Algorithm for Constrained Spatial Smoothing
Our spatial sparse recovery problem (
4) is different from (
8) from two aspects: the additional constraints and the loss function. As a consequence, we can not directly apply the previous SSR method to solve it. A new approach is needed to handle our new loss function with constraints.
In this section, we propose to utilize the Alternating Direction Method of Multipliers (ADMM) [
28], to decompose our constrained optimization problem into two sub-problems that can be solved effectively by SSR and Quadratic Programming (QP), respectively. Algorithm 1 presents the proposed ADMM algorithm to learn our model parameters.
Algorithm 1: Constrained Spatial Smoothing by ADMM |
- Input:
The m observed volume of base stations , smoothing parameter , penalty parameter , initialize , . - Output:
Spatial field and parameters . Estimation values on n locations .
|
- 1:
fordo - 2:
Update by solving ( 18) using Quadratic Programming. - 3:
Update by solving ( 19) using Spatial Spline Regression. - 4:
Update according to ( 17). - 5:
end for
|
First, we introduce the following indicator function
,
With the indicator function, the original problem (
4) is equivalent to
where
is a hyper parameter that controls the smoothness of
f.
Second, we introduce an auxiliary variable
that is defined as
This variable is utilized to split the convex optimization problem into two sub-convex problems. With
, we can formulate the problem as the standard ADMM format,
The
augmented Lagrangian for (
13) is
where
is the dual variable, and
is the penalty parameter in ADMM. Then, the ADMM consists of the following iterations:
For the
-update step in each iteration, Label (
16) is equivalent to
We can solve this convex problem efficiently by Quadratic Programming (QP).
For the
-update step in each iteration, Equation (
15) is equivalent to
which is exactly the form of (
8) with
and
, thus can be solved efficiently by SSR. It should be noted that
is the penalty parameter which controls the smoothness of
f. If it is small, we put little emphasis on the smoothness, and the estimated surface
f will be over fitted. If it is too big, the surface will be too smooth, which can cause underfitting.
For the case with attributes, the algorithm does not require major changes. We just need to replace
by
in (
19), where
represents the attributes and
is the corresponding contributions.
Our proposed ADMM training algorithm is able to efficiently reconstruct the spatial field and fit the covariates for our constrained spatial sparse recovery problem. In -update step, it enforces the constraints by solving a constrained QP with no need to worry about smoothing; in a -update step, it approximates the obtained with a smooth f using the SSR-based smoothing technique. In this way, we decouple the handling of smoothing and constraints which was not possible in pure SSR previously.
6. Related Work
The Telecom Italia Big Data Challenge dataset is widely used to study different problems [
2,
3,
4,
5,
20,
21,
22,
23,
24]. However, little research work has been done to estimate the spatial distribution of cellphone activity itself, despite the great value of this problem.
There are various tasks where the key problem is estimating a spatial field over a region based on observations of sampled points, such as house price estimation and population density estimation. Chopra et al. [
26] model the underlying surface of land desirability using kernel-based interpolation. However, it is hard to choose the form of kernel functions and tune a large number of hyper-parameters. Spatial Spline Regression technique is applied to the problem of population density estimation in Sangalli et al. [
16]. However, in our problem, we only get the accumulated activity density in base stations, rather than real densities in each base station location. In addition, BS locations distribution is highly sparse in our case.
Although a range of kernel-based methods [
26,
31,
32] can be applied to fit a spatial field, the common drawback of these approaches is that, by using uniformly damping weights in distance-based kernels, they tend to link weakly related data points across areas in a non-convex domain. Spatial spline regression [
16] on the other hand uses finite-element analysis approach to jointly solve for
f and
from the model described by Equation (
8) over any irregularly shaped domain
.
As it was discussed earlier, the fine-grained data for the distribution of the volume of calls and SMS are not usually available. A common type of data is the data collected by cell phone base stations. Sometimes, cell phone providers interpolate the data collected by the base stations as is discussed in Manfredini et al. [
33]. Some researchers interpolate the data to obtain fine grained distributions as in Ratti et al. [
29]. However, in Ratti et al. [
29], authors do not evaluate the performance of the interpolated distribution. To the best of our knowledge, there is no extensive work done in trying to obtain optimal reconstructions of fine grained cell phone data distribution. We are the first to apply the latest spatial functional analysis techniques to cellphone activity distribution modeling, assuming the activity densities consist of a regression part based on social or demographical statistic features and a spatial field that captures the underlying smoothness property of cellphone activities. In particular, we leverage the idea of spatial spline regression to handle any irregularly shaped geographic regions. We have developed a novel Constrained Spatial Smoothing approach and corresponding training algorithm to recover spatial distribution of cellphone activities from highly sparse observations.
7. Conclusions
In this paper, we study the problem of inferring the fine-grained spatial distribution of certain density data in a region based on the aggregate observations recorded for each of its subregions, which is extremely challenging and seldom visited before, and analyze the challenges of it. We propose the Constrained Spatial Smoothing (CSS) approach that exploits both the intrinsic smooth property of underlying factors and the additional features from external social or domestic statics. We further propose a training algorithm which combines the Spatial Spline Regression (SSR) technique and ADMM technique to learn our model parameters efficiently.
To evaluate our algorithm and compare it with various other approaches, we run extensive evaluations based on the Milan Call Detail Records dataset provided by Telecom Italia Mobile. The simulation results on the dataset show that our algorithm significantly outperforms other baseline approaches by a great percentage. (Note that cross validation and statistical testing are techniques that are usually applied in experiments. However, both techniques require sampling effectively from the sparse spatial data while keeping the intrinsic spatial structure, which is difficult in our problem.) This shows that jointly modeling the underlying spatial continuity and the local features that characterize the heterogeneity of different locations can effectively improve the performance of spatial recovery.
Although we use the data on cell phone activities to illustrate our methodology, our algorithm is not limited to solving the problem of inferring the distribution of cell phone activities, but is also applicable to a variety of problems where estimating an implicit or explicit smooth surface is required, such as inferring the spatial distribution of population densities based on the aggregate population observed at sparsely scattered polling stations, reconstructing a fine-grained geographical distribution of users for an Internet media provider or retailer only from aggregated user counts observed at certain datacenters or points of presence (PoPs), and so on.