In many fields, decision-making processes are increasingly based on intelligence gained from big data: complex datasets containing large amounts of data from which new information can be extracted. Although the use of big data is relatively new in criminology, it offers many opportunities to increase our knowledge and improve data-based applications [1]. This is particularly true for intelligence-led policing, with its focus on data-based, proactive policing [3]. Within the scope of intelligence-led policing, crime data analysis is used to objectively inform policy, policing strategies, and tactical operations in order to reduce and prevent crime [4]. In that respect, the use of big data offers an opportunity to improve the analysis and prediction of spatiotemporal concentrations of crime.
It is empirically well-established within environmental criminology that crime patterns show significant spatiotemporal variability, with crime concentrations at specific times (i.e., burning times) and specific places (i.e., hotspots) [5]. The areas and times under investigation differ in several ways, such as size, population characteristics, and number of visitors (e.g., work-related or tourists). To take those differences into account, crime rates or indexes are frequently used within criminological research. A crime rate is “a statistic often used to represent the risk of criminal events [and that] help[s] to reveal clusters of crime in space and/or time based on an underlying population at risk” [8] (p. 112). Crime rates allow for a more valid comparison of different spatiotemporal units (e.g., small city vs. metropolitan city), control for specific characteristics of the units of analysis, and reflect the population at risk, so that meaningful conclusions can be drawn regarding spatiotemporal patterns of crime and its predictors. A frequently used denominator is residential population. It is relatively easy to obtain from official bodies (and often also via open data platforms) and has been shown to have a strong correlation with crime in general. For the same reasons, residential population is also a commonly used (control) variable in statistical models used to predict or explain spatiotemporal patterns in crime.
However, using residential population in crime analysis has one main problem: as it is a static measure, it does not take into account the spatiotemporal mobility of perpetrators, victims, and guardians [9]. This is reflected in, for example, the effect of day and night cycles, holidays, weekends, and commuting hours [11] (p. 346). Similarly, uninhabited areas with a lot of comings and goings (e.g., parks or business areas) can generate or attract crime [12]. Consequently, particularly for crime types with mobile targets and/or perpetrators, the residential population is not always a valid representation of the actual population or targets at risk for a given place and time.
A possible alternative to residential population, which could better reflect this spatiotemporal mobility and therefore the actual population at risk, is the ‘ambient population’. The ambient population is the number of people present in a given area at a given time [13] and is typically estimated using big data such as mobile phone data. The first efforts to estimate the ambient population date from the mid-2000s, but, mainly due to the ubiquity of smartphones and social media, those efforts have intensified since circa 2014. Crowd and footfall dynamics have been related to crime, and the findings show that they have a substantial impact on crime rates, in line with the idea that “daily nonresidential activities distribute crime unevenly over space, beyond residential effects” [14] (p. 1). Using ambient population instead of residential population could therefore result in a more valid measure of the population-at-risk and consequently improve applications depending on this measure, such as crime rates/indexes and statistical models of spatiotemporal crime patterns, including those used in predictive policing.
In this study, we apply mobile phone data as a proxy of ambient population in two intelligence-led policing applications: crime rate analysis and crime risk prediction. In both cases, the performance of ambient population is compared with that of residential population. Our main research question (RQ) and sub-questions are as follows:
RQ: To what extent is there a stronger relationship between crime and a population measure when using ambient population compared to the residential population?
Sub-RQ1: To what extent do crime rates differ when calculated based on the ambient population compared to the residential population?
Sub-RQ2: Of the two population-at-risk measures (ambient population and residential population), which is the better predictor for the predictive analysis of crime events?
We hypothesize that, for crime types with a mobile target, the ambient population provides a more accurate denominator in calculating crime rates and improves spatiotemporal crime predictions.
3. Materials and Methods
3.1. Description of the Study Area and Spatial Units of Analysis
The study area for our crime and ambient population analysis was the city of Ghent. With 261,475 inhabitants in 2018, it is the second largest city in Belgium after Antwerp.
Two spatial units of analysis were used in this study: the statistical sector level and the grid level. Statistical sectors (N = 201 for Ghent) are generally the smallest meaningful units of analysis in Belgium for which demographic and socio-economic data are systematically collected and analyzed, and are comparable to census tracts in the United States and output areas in the United Kingdom. They are based on socio-economic, morphological, and land use characteristics [68]. They are commonly used in social-ecological research and policy and practice applications (e.g., crime statistics reports). To compare the crime rates when calculated with the ambient population versus the residential population (sub-RQ1), we conducted our analysis at the statistical sector level.
Grids or street segments are most commonly used in the spatial modeling of crime. In practice, a major application of the prediction of crime events is its use by police departments to optimize patrols (the ‘predictive policing’ approach) and therefore they require small spatial units. Our intent was to test the potential of using the ambient population specifically for this purpose, relative to the residential population. Therefore, to compare ambient and residential population as predictors for the predictive analysis of crime (sub-RQ2), a raster grid with a resolution of 200 by 200 meters was used as the spatial level of analysis (N grids = 4206). Grid cells with this resolution have also previously been applied successfully in empirical criminological research (e.g., [30
3.2. Data Sources and Measurement of Key Constructs
The data used for this analysis stem from three sources: crime data collected by the Ghent Local Police Force, administrative data on the residential population collected by the City of Ghent, and data on the ambient population collected by the mobile phone operator Proximus.
The Ghent Local Police Force provided crime data for three crime types from October to December 2018: aggressive theft, battery, and bicycle theft. Aggressive theft is defined as purse snatching or robbery using a weapon or threats, including attempts. Battery is defined as the intentional use of force or violence resulting in injuries; intra-familial violence is excluded. Bicycle theft is defined as simple theft of locked or unlocked bicycles in the public space, including attempts. It should be kept in mind that the willingness to report (on the part of citizens) and the willingness to register (on the part of the police) differ per crime type. Bicycle theft, for example, is among the most recorded crime types in Belgium and is actually the most recorded crime type in the region of Flanders [71]. However, the results from the most recent Belgian Security Monitor indicate that bicycle theft is among the crime types with the lowest citizens’ willingness to report: in the 12 months prior to the data collection, 10% of Belgian households reported being a victim of bicycle theft, but only 48.1% reported this to the police [72
We chose these crime types to include both violent and property crime in our analysis. For each of these crime types, we received the following data for each event in the study period: the location at the address level and the exact time or time range during which the crime was assumed to have taken place, based on the information given by the victim when registering the crime. If only a time range was available, the crime event was assumed to have taken place at the midpoint of this range in the following analyses. The crime data were geocoded by the researchers based on the official address reference database of Flanders (‘Centraal referentieadressenbestand’ or CRAB). If no full address was available (aggressive theft: 67.80% of the cases, battery: 35.06% of the cases, bicycle theft: 34.15% of the cases), a grid cell was randomly assigned from the grid cells overlapping the street. Crime events with no registered street were excluded (aggressive theft: 13.56% of the cases, battery: 1.56% of the cases, bicycle theft: 0.98% of the cases). After data cleaning and geocoding, our dataset contained 49 cases of aggressive theft, 293 cases of battery, and 571 cases of bicycle theft for further analysis.
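The two imputation rules described above (midpoint of a reported time range, random grid cell among those overlapping the street) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example timestamps, and the cell IDs are all hypothetical.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def assumed_event_time(start: datetime, end: Optional[datetime] = None) -> datetime:
    """Return the exact time, or the midpoint when only a time range is known."""
    if end is None:
        return start
    return start + (end - start) / 2


def assign_grid_cell(candidate_cells: Sequence[int], rng: random.Random) -> int:
    """Pick one grid cell at random from the cells overlapping the street."""
    return rng.choice(candidate_cells)


# Hypothetical incident reported as having happened between 22:00 and 06:00.
midpoint = assumed_event_time(datetime(2018, 10, 5, 22, 0), datetime(2018, 10, 6, 6, 0))
print(midpoint)  # 2018-10-06 02:00:00

# Hypothetical IDs of the grid cells that overlap the reported street.
rng = random.Random(42)  # fixed seed so the assignment is reproducible
cell = assign_grid_cell([101, 102, 103], rng)
```

Fixing the seed is a design choice for reproducibility; the text does not state how the random assignment was seeded.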
The crime data were then aggregated to crime counts for each crime type and for each month per statistical sector for the statistical sector analysis and to crime counts for each crime type and for each month per grid cell for the grid level analysis. Due to the high number of zero cells (more than 95%) and the low number of cells with more than one incident, the crime variable was additionally dichotomized for the grid level analysis (i.e., 0 = no incident for a given grid cell during a given period, 1 = one or more incidents happened in a given grid cell during a given period).
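As an illustration of the aggregation and dichotomization step, the sketch below counts incidents per crime type, month, and grid cell and then derives the binary outcome. The event records and the helper name `binary_outcome` are hypothetical and only serve to make the coding rule concrete.

```python
from collections import Counter

# Hypothetical geocoded events: (crime_type, month, grid_cell_id)
events = [
    ("bicycle theft", "2018-10", 17),
    ("bicycle theft", "2018-10", 17),
    ("battery", "2018-10", 17),
    ("bicycle theft", "2018-11", 42),
]

# Crime counts per crime type, month, and grid cell.
counts = Counter(events)


def binary_outcome(crime_type: str, month: str, cell_id: int) -> int:
    """1 if one or more incidents occurred in the cell that month, else 0."""
    return 1 if counts[(crime_type, month, cell_id)] > 0 else 0


print(counts[("bicycle theft", "2018-10", 17)])      # 2
print(binary_outcome("bicycle theft", "2018-10", 17))  # 1
print(binary_outcome("battery", "2018-11", 42))        # 0
```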
Data on the residential population were obtained from the City of Ghent. The city provided counts of inhabitants for each 200 by 200 meter cell of the grid we supplied, and for their respective statistical sectors, based on the most recent population register data available at the time (2018). For privacy reasons, the city masked grid cells with four or fewer (but not zero) inhabitants (6.28% of the grid cells). Those cells were shown to have four inhabitants (i.e., a count of four inhabitants means four or fewer inhabitants). Cells with zero inhabitants (48.19%) were not masked.
Finally, the ambient population was estimated using mobile phone data from Proximus as a proxy. Proximus is the largest mobile phone operator in Belgium, holding a market share of 39.10% [73]. Specifically, the mobile phone data consist of counts of individual (smart)phones (unique users) connected to the Proximus network and present in Ghent. To produce counts at a small spatial level, Proximus used a hexagonal grid (Thiessen polygons) centered on their cell phone antennas (with a total of 288 cells for the area of Ghent). The size of the individual cells depends on the population density: the higher the population density, the smaller the cells, as there are more antennas in those areas (see Figure 1
The number of phones present was counted per hour for each cell of the grid during a period of three months, for a total of 9,397,473 data points. For privacy reasons, Proximus excluded cells with thirty or fewer phones present (1.06% of the data points in the raw data; in the aggregated datasets used for our analysis, no cases had zero mobile population) and only allowed three months of data to be collected in total. The counts were then extrapolated to the total population proportional to the market share of Proximus in Belgium. Finally, the counts in the hexagonal grid were mapped to the rectangular 200 by 200 meter grid cells and assigned to their respective statistical sectors.
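The text does not give the exact extrapolation formula. A minimal sketch, assuming a simple division of the observed phone count by the operator's market share and treating privacy-masked cells as missing, could look like this (the function name and the example count are hypothetical):

```python
from typing import Optional

MARKET_SHARE = 0.3910    # Proximus market share reported in the text
PRIVACY_THRESHOLD = 30   # counts of thirty or fewer phones were excluded


def estimate_ambient(phone_count: int, market_share: float = MARKET_SHARE) -> Optional[float]:
    """Scale an hourly phone count to the total population (assumed formula).

    Returns None for counts at or below the privacy threshold, because those
    observations were not available in the data.
    """
    if phone_count <= PRIVACY_THRESHOLD:
        return None
    return phone_count / market_share


# Hypothetical hourly count for one antenna cell: roughly 1000 people.
print(estimate_ambient(391))
print(estimate_ambient(25))  # None
```

The subsequent mapping from antenna cells to the 200 by 200 meter grid is not sketched here, since the text does not specify how counts were distributed over overlapping cells.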
3.3. Data Analysis Methods
To investigate the relationship between crime and both the residential and ambient population in general, correlation coefficients were calculated for both the dataset with the statistical sector as the unit of analysis (which was then used for the crime rate analysis) and the dataset with the 200 by 200 meter grid cells as the unit of analysis. The correlation coefficients were calculated for each crime type and for each of the three months in the dataset to check for monthly variations. For the statistical sector dataset, the Pearson correlation coefficient was used, while for the grid dataset, the point-biserial correlation coefficient was used due to the binary nature of the crime variable in that case. In addition, the difference between the residential and ambient correlations with crime was tested using Zou’s confidence interval test of the difference between two correlations [74
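To make the two coefficients concrete, the sketch below computes the Pearson correlation and the point-biserial correlation on a small set of hypothetical grid cells. The point-biserial coefficient is numerically identical to Pearson's r when one of the variables is binary, which the example checks. Zou's confidence interval test is not covered here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(binary, values):
    """Point-biserial correlation from group means and the population SD."""
    n = len(values)
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    mean_diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    return mean_diff / sd * sqrt(len(g1) * len(g0) / n ** 2)


# Hypothetical grid cells: ambient population and binary crime indicator.
population = [120, 300, 80, 950, 40, 610]
crime = [0, 0, 0, 1, 0, 1]

r_pearson = pearson(crime, population)
r_pb = point_biserial(crime, population)
print(round(r_pearson, 4), round(r_pb, 4))  # both about 0.9224
```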
To investigate whether using the ambient population as the denominator of the crime rate would lead to different results than using the residential population as the denominator (sub-RQ1), crime rates were calculated for each statistical sector based on residential population and on ambient population. If a statistical sector had zero residential population, this sector was excluded, as it would not be possible to calculate a residential population crime rate. Only one sector was excluded for this reason, and for this particular sector, no crimes for any of the three crime types were registered during the study period (October to December 2018).
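The effect of the denominator can be illustrated with a short sketch. The sector names and numbers are hypothetical, and expressing the rate per 1000 persons is an assumption, as the text does not state the scaling used.

```python
from typing import Optional


def crime_rate(crimes: int, population: float, per: int = 1000) -> Optional[float]:
    """Crimes per `per` persons; None when the denominator is zero."""
    if population == 0:
        return None
    return crimes * per / population


# Hypothetical sectors: (crime count, residential population, ambient population)
sectors = {
    "shopping district": (12, 150, 6000),
    "residential area": (8, 4000, 3000),
}

for name, (crimes, residential, ambient) in sectors.items():
    print(name, crime_rate(crimes, residential), crime_rate(crimes, ambient))
```

In this made-up example the shopping district has by far the highest rate under the residential denominator (80 versus 2 per 1000), whereas the ambient denominator puts the two sectors much closer together, with the residential area slightly higher.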
To investigate whether the ambient population could be a better predictor of crime than the residential population (sub-RQ2), we used predictive analysis. Specifically, we used logistic regression to build two one-variable models: one with residential population as the predictor of the probability that a new crime event would happen in each grid cell and one with ambient population as the predictor of that probability. The available data were split into a training and a test set. The data from October to November were used as the training dataset to train the predictive model. The December data were used as the test set to evaluate the prediction performance, i.e., the crime locations (grid cells) for the month of December were predicted. To compare the residential and ambient population models, both models predicted the same fixed number of crime events, which depended on the average monthly number of crimes.
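The modeling setup can be sketched with a one-variable logistic regression fitted by plain gradient descent. This is an illustration under stated assumptions rather than the authors' implementation: the data are hypothetical, the population is expressed in thousands to keep the optimization stable, and "predicting a fixed number of crime events" is interpreted here as flagging the k cells with the highest predicted probability.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        g0 = sum(pi - yi for pi, yi in zip(p, y)) / n
        g1 = sum((pi - yi) * xi for pi, yi, xi in zip(p, y, x)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


def flag_cells(cells, b0, b1, k):
    """Return the k cell IDs with the highest predicted crime probability."""
    ranked = sorted(cells, key=lambda c: sigmoid(b0 + b1 * cells[c]), reverse=True)
    return ranked[:k]


# Hypothetical training data (October to November): population in thousands.
train_pop = [0.2, 0.5, 0.8, 1.5, 2.5, 3.0, 4.0, 0.3]
train_crime = [0, 0, 0, 1, 0, 1, 1, 0]
b0, b1 = fit_logistic(train_pop, train_crime)

# Hypothetical December populations per grid cell; flag the k = 2 riskiest cells.
december_pop = {"A": 0.4, "B": 3.5, "C": 1.0, "D": 2.8}
flagged = flag_cells(december_pop, b0, b1, k=2)
print(flagged)
```

Running the same procedure once with residential and once with ambient population as `train_pop` and `december_pop`, with the same k, mirrors the comparison described above.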
Prediction performance was evaluated using the following measures: recall, precision, F1-score, and AIC. Recall is the proportion of incidents predicted correctly versus the total number of incidents. Precision is the proportion of correctly predicted grid cells versus the total number of grid cells predicted at risk. Ideally, a well-performing model has both a high recall and a high precision. To reflect this, we also included the F1-score. The F1-score is the harmonic mean of recall and precision and therefore considers both in one measure. Finally, the Akaike Information Criterion (AIC) is a measure that allows comparison of different models on the same data, taking into account both goodness-of-fit and model simplicity. It estimates the relative amount of information lost by the model: the less information loss, the better the model. The model with the lowest AIC is therefore generally the better model. The more traditionally used Receiver-Operating Characteristic (ROC) curve and its associated Area Under the Curve (AUC) measure were not used here as they can be misleading when there is moderate to severe class imbalance [76], as was the case here due to the relatively low crime frequency.
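The evaluation measures can be written out compactly. The sketch below assumes the binary grid-level outcome, so that predictions and observations are sets of grid cell IDs; the cell IDs and the log-likelihood value are hypothetical. AIC is computed with its standard definition, 2k minus twice the log-likelihood.

```python
def evaluate(flagged: set, actual: set):
    """Return (precision, recall, F1) for flagged versus actual crime cells."""
    hits = len(flagged & actual)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: lower values indicate the better model."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical December results: four cells flagged, four cells with crime.
precision, recall, f1 = evaluate({1, 2, 3, 4}, {3, 4, 5, 6})
print(precision, recall, f1)  # 0.5 0.5 0.5

# One-variable logistic model: intercept plus slope, so two parameters.
print(aic(log_likelihood=-120.0, n_params=2))  # 244.0
```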
5. Discussion and Conclusions
There are many advantages in using ambient population instead of residential population for the analysis of crime rates. First, the measure of the ambient population is more dynamic than that of the residential population, allowing a more dynamic analysis of crime that takes into account, for example, monthly and seasonal variation. Second, the ambient population better reflects the population that is actually present at a certain time and place, which is especially important in areas with a low residential population but high footfall, such as shopping districts. Third, using the ambient population avoids possible data loss in areas where there is crime but no residential population, which is especially a problem when using small spatial units of analysis. Finally, for the studied crime types, the ambient population is a better reflection of the population-at-risk. Mobile phone data act as an accurate proxy for ambient population, as in the present day, mobile phone use is widespread and (smart)phones tend to be carried wherever we go. In addition, mobile phone data allow ambient population to be measured in real time, offering the opportunity to take even greater advantage of its dynamic nature. In our analysis, ambient population showed a stronger relationship with aggressive theft, battery, and bicycle theft than residential population. It should be noted, however, that our results should be treated with caution, as we were only able to obtain three months of data for our study, which also meant that we could study only a limited number of crime events and which limits conclusions with respect to possible seasonal effects. Nonetheless, our findings support those of previous research and can provide connection points to future research.
When looking at crime rates, the crime rate based on ambient population tends to highlight different areas than the crime rate based on residential population, which could have consequences for policy decisions or prevention initiatives aimed at problem areas. Additionally, using ambient population as a predictor instead of residential population resulted in more correct predictions of crime events. Including ambient population instead of residential population as a variable could improve applications of crime modeling such as predictive policing, especially considering that the ambient model seems to be better at predicting locations with high concentrations of crime events, which, in light of the law of crime concentration, is a very desirable property.
Despite the potential advantages, there are also some challenges that arise in using mobile phone data as a proxy for the ambient population. Although one of its main advantages is its dynamic nature, this also leads to a more complicated data collection process with several practical issues. Although smartphones are ubiquitous, mobile phone usage is not evenly distributed across the population. The ‘digital divide’, in this case the lower mobile phone use by specific groups (such as the elderly), affects the representativeness of mobile phone data as a proxy for ambient population. It should be noted that this observation with regard to the age of individuals composing the ambient population is less problematic than it seems, given the consistent finding within developmental criminology regarding the strong relationship between age and crime perpetration (also known as the ‘age–crime curve’; e.g., [78]), the empirical finding that the elderly are simultaneously less victimized than younger age categories [79], and the fact that smartphones are still omnipresent in the older age groups (e.g., 80% of Belgian inhabitants between 65 and 74 years use a mobile phone). Yet, the main limitation of mobile telecommunications data is related to the inaccuracies of the location estimates of individual devices [48]. A more practical issue is the difficulty in obtaining mobile phone data due to several factors, mainly the willingness of mobile phone operators to cooperate, restrictions imposed by the General Data Protection Regulation (e.g., the three-month limit in our study), and possible privacy issues. Another factor to consider is the market share of the mobile phone operator: the larger it is, the more representative the data. Extrapolation to the total population is, after all, only a limited solution, as it is, at its core, an estimation of the real number. Finally, although ambient population is more suitable for certain crime types, for other crime types, notably residential burglary, residential population might still be the more appropriate operationalization of the population variable.
Considering the possible differences between crime types regarding the use of ambient population instead of residential population (as a consequence of the distinction between mobile and immobile targets, and its specificity), crime types should be studied separately, instead of looking at ‘crime’ as a general measure. In line with this observation, future research should extend our analysis to other crime types as well as to other study areas, to see whether the same observations hold for other contexts as well. Additionally, it would be interesting to look more closely into the intermediate mechanisms between the ambient population and crime [80]. In this study, the focus was on the suitability of the ambient population, measured by means of mobile phone data, for estimating crime rates and for crime prediction. Future research should assess the impact of this predictor variable in conjunction with other relevant variables that potentially contribute to crime at micro places (e.g., land use features) [81]. Finally, new and emerging data sources and innovative data processing methods provide continuously evolving opportunities for future research. An integration with other proxies for the ambient population (e.g., Wi-Fi data, see [83]) or other potential big data sources (e.g., datasets from commercial businesses) could also be investigated more closely. Innovative data processing methods (e.g., convolutional neural networks, see for example [84]) enable scholars and practitioners to process data that are more voluminous, more varied, and with high velocity [85]. These methods also offer opportunities for validly measuring the most suitable denominator for calculating crime rates (e.g., counting the number of bicycles in geographic areas by means of computer vision for the purpose of estimating the crime rate for bicycle theft).
These endeavors regarding the use of new and emerging data sources are key to the development of the most suitable and accurate denominator in calculating crime rates and other risk measures or assessments. The results have implications for criminological research as well as for policy and practice. With these measures, scholars can dissect the mechanisms related to, for example, the spatiotemporal convergence of crime-prone individuals and criminogenic settings [86] or the spatiotemporal distribution of suitable targets. An important avenue for future research, since mobile phone data provide a valid alternative to existing measures, is to optimize the data source in line with state-of-the-art theoretical insights. We know that it matters what kinds of people make up the ambient population, as well as the activities they are engaged in [87]. Studies show that it is possible to distinguish, based on an estimation, between these kinds of people in mobile phone data [54]. For example, it might be more accurate to take into account the ‘exposed population’ instead of the ambient population [66]. The challenge lies in obtaining data that allow for making these distinctions and in using these distinctions (in a methodologically and theoretically sound way) to test integrated criminological theories. Future research should reveal which type of operationalization of the population-at-risk is the most appropriate and under what circumstances (e.g., for which specific type of crime, time-of-day differences).
At the same time, the crime denominator problem has important consequences for policy and practice too. Policy makers make wide use of crime rates to defend and evaluate their policies, but using the correct denominator might make a difference in determining problematic areas in terms of crime rates. Equally, law enforcement agencies using crime data analysis and predictive applications to send out patrols proactively could benefit from taking into account a more adequate population-at-risk. Considering the improvement in prediction performance when using ambient population, the use of this variable is especially of interest for predictive policing models, given that it can reflect micro-spatiotemporal fluctuations and therefore allow for more precise predictions on a micro-scale. Nevertheless, (the implementation of) intelligence-led policing practices still face numerous challenges and preconditions [88], which should be taken into account. Future research and applications of crime prediction that fully exploit the dynamic nature of ambient population should apply a truly predictive analytical strategy that uses the ambient population at a previous time point as a predictor of crime.
In conclusion, we argue that the ambient population better reflects the population-at-risk and the relevant mechanisms (e.g., regarding the nature of the target, either mobile or immobile) and therefore has a lot of potential in criminological research, theory testing, policy, and practice. Mobile phone data are a high-potential proxy in this regard, as they can provide a large amount of data on a fine-grained spatiotemporal scale (for our study, a total of 9,397,473 data points). In this study, we corroborated this potential empirically, and we are among the first to propose and investigate its use as an alternative to residential population for the purposes of calculating crime rates and predicting crime risks.