Logistic regression and neural network models were developed and evaluated based on their ability to accurately predict food crises while minimizing false positives and false negatives.
4.1. Logistic Regression Model
First, the full model was built with all the features listed in Table 1. In the full model, all features were found to be statistically significant (p < 0.05) except for NDVI (p = 0.288), NDVI anomalies (p = 0.357), and evapotranspiration anomalies (p = 0.051). From here, features were eliminated on the basis of p-value, with the goal of minimizing model complexity while maintaining performance. After paring down the model, the p-value selection model retained the following features with p < 0.001: month, food assistance, mean evapotranspiration, number of violent events, food price index, and cropland percent. The presence of humanitarian assistance was the feature most strongly associated with a food crisis (coefficient: 0.62), while month was the most negatively correlated feature (coefficient: −0.12). The remaining features had relatively small coefficients (<0.1) and less influence on the model. In Table 3, the accuracy (0.86), recall (0.16), and precision (0.81) of the model are indicative of a near-trivial model that rarely predicts food crises.
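For readers wishing to reproduce this procedure, a minimal sketch of p-value-based backward elimination follows, assuming a pandas DataFrame X holding the Table 1 features and a binary crisis label y (names hypothetical; the paper's actual preprocessing pipeline is not shown here):

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, threshold: float = 0.001):
    """Drop the least significant feature until all p-values fall below threshold."""
    features = list(X.columns)
    while features:
        model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()  # feature with the largest p-value
        if pvals[worst] < threshold:
            return model, features
        features.remove(worst)
    raise ValueError("No features met the significance threshold.")
```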
Then, recursive feature elimination (RFE) was conducted to create models ranging from 7 to 11 features. All RFE models had the following features in common: humanitarian assistance, mean evapotranspiration, number of violent events, and latitude. After testing each of the five models suggested by RFE, there was no significant difference in the ROC curve, accuracy, recall, or precision. Thus, the simplest model was selected, with the following features: latitude, month, humanitarian assistance, food price index, number of violent events, and cropland percent. All selected features had p < 0.001. The inferences from this model were that humanitarian assistance was most strongly positively correlated with food crises (coefficient: 0.73) and month was most strongly negatively correlated (coefficient: −0.11). Although the RFE model used different features, its performance was identical to that of the p-value selection model. These metrics are also shown in Table 3.
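The RFE step can be approximated with scikit-learn along the following lines, again assuming the X and y from the previous sketch; the 7–11 range mirrors the feature counts tested above:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Build one RFE model per candidate feature count and compare performance.
for n_features in range(7, 12):
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_features)
    rfe.fit(X, y)
    selected = list(X.columns[rfe.support_])
    recall = cross_val_score(LogisticRegression(max_iter=1000),
                             X[selected], y, scoring="recall").mean()
    print(n_features, f"recall={recall:.2f}", selected)
```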
The last feature selection technique utilized was select k-best. After creating models with 7–11 features, the following model features were found to be statistically significant with p < 0.001: year, month, mean NDVI, mean rainfall, mean evapotranspiration, number of violent events, food price index, cropland percent, log population, and log ruggedness index. The food price index and mean NDVI were the most positively correlated with food crises, while mean evapotranspiration and log population were the most negatively correlated. As seen in Table 3, the select k-best model, in comparison to the previous logistic regression models, slightly increased the model's AUC (0.59), recall (0.20), and accuracy (0.87) while slightly decreasing its precision (0.75).
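Select k-best has a direct scikit-learn counterpart; a sketch of the 7–11 feature sweep, under the same X/y assumptions:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature against the crisis label and keep the k highest-scoring.
for k in range(7, 12):
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    print(k, list(X.columns[selector.get_support()]))
```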
Additionally, Table 3 provides a set of hypothetical models that assist in evaluating the models developed in this work:
Chance: metrics that result if the model predicts randomly.
Always predicts crisis: a model that always predicts the minority class or food crisis.
Never predicts crisis: a model that always predicts the majority class or no food crisis. This is also known as the no information rate (NIR).
Goal/prior work: the performance metrics from the model developed by World Bank researchers (Andree, Chamorro, Kraay, Spencer, and Wang, Predicting Food Crises, Policy Research Working Paper No. 9412, 2020).
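The first three baselines can be generated mechanically, for example with scikit-learn's DummyClassifier; a sketch under the same X/y assumptions (the goal metrics come from the cited World Bank model and cannot be computed this way):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

baselines = {
    "chance": DummyClassifier(strategy="uniform", random_state=0),
    "always predicts crisis": DummyClassifier(strategy="constant", constant=1),
    "never predicts crisis": DummyClassifier(strategy="most_frequent"),  # NIR
}
for name, clf in baselines.items():
    pred = clf.fit(X, y).predict(X)
    print(f"{name}: acc={accuracy_score(y, pred):.2f} "
          f"recall={recall_score(y, pred):.2f} "
          f"precision={precision_score(y, pred, zero_division=0):.2f}")
```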
4.2. Neural Network Model
Five iterations of neural network modeling were conducted in this work. A summary of their performance metrics is located in Table 4 at the end of this section.
Iteration 1 baseline: The baseline neural network consisted of one input layer, a normalization layer, two 150-neuron ReLU hidden layers, one sigmoid output layer, and the RMSprop optimizer with a 0.0001 learning rate. Training was conducted with 1000 epochs, and the early stopping patience parameter was set to 50. The training curves for iteration 1 are shown in Figure 6; even though the training recall (left) continued to improve with additional training epochs, the recall for the validation set indicated overfitting, as it began to degrade after 25 epochs.
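A sketch of this baseline architecture in Keras, assuming tabular training arrays X_train/y_train and a validation split (hypothetical names; the quantity monitored for early stopping is also an assumption, as the text does not specify it):

```python
import tensorflow as tf

def build_baseline(n_features: int, X_train) -> tf.keras.Model:
    norm = tf.keras.layers.Normalization()
    norm.adapt(X_train)  # learn per-feature mean/variance from the training data
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        norm,
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(150, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall(name="recall"),
                           tf.keras.metrics.Precision(name="precision")])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(patience=50,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, epochs=1000,
#           validation_data=(X_val, y_val), callbacks=[early_stop])
```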
Iteration 2: In this iteration, single-hyperparameter sweeps were conducted to explore the hyperparameter space. The learning rate was swept from 0.00001 to 0.01, resulting in optimal recall and precision at a learning rate of 0.001. Then, the number of training iterations was swept from 10 to 200 using the RMSprop optimizer. This showed that precision increased linearly with training, while recall increased linearly before plateauing at approximately 100 iterations. The training iteration sweep was then repeated with the Adam optimizer, and the recall and precision increased logarithmically and plateaued at around 200 iterations. The Adam optimizer was ultimately selected because it achieved the highest recall and precision and was less prone to overtraining than RMSprop.
Sweeping the classification threshold from 0.10 to 0.90 revealed that the optimal classification threshold was approximately 0.30; a lower classification threshold favors recall, while a higher threshold favors precision. Then, the number of hidden layers (1–2) and the number of neurons (25–200) were varied, and a model with two hidden layers of 150 neurons each optimized the model's precision and recall. A model utilizing the optimized hyperparameters exhibited the following performance metrics for predicting food crises: accuracy (0.88), AUC (0.83), recall (0.76), and precision (0.60).
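The threshold sweep amounts to re-binarizing the sigmoid outputs at each candidate cutoff; a sketch assuming an array of predicted probabilities probs and validation labels y_val (both hypothetical names):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Lower thresholds favor recall; higher thresholds favor precision.
for threshold in np.arange(0.10, 0.91, 0.05):
    pred = (probs >= threshold).astype(int)
    print(f"{threshold:.2f}  recall={recall_score(y_val, pred):.2f}  "
          f"precision={precision_score(y_val, pred, zero_division=0):.2f}")
```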
The effects of L2 regularization on model performance were investigated due to the large neuron count of the model. As seen in Figure 7 (left), the model without regularization overtrained quickly. Next, L2 regularizers were added to the hidden layers, and the regularization factor was swept from 0.00001 to 0.01 to find the optimal value (0.0001). As shown in Figure 7 (right) and Table 4, the addition of L2 regularization appeared to stabilize training and prevent overfitting without substantially changing model performance. The model performance metrics were as follows: accuracy (0.88), AUC (0.83), recall (0.74), and precision (0.60).
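In Keras, adding L2 regularization of this kind is a per-layer change; a sketch using the 0.0001 factor found by the sweep:

```python
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)  # optimal factor found by the sweep
hidden = [layers.Dense(150, activation="relu", kernel_regularizer=l2),
          layers.Dense(150, activation="relu", kernel_regularizer=l2)]
```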
Iteration 3: Multidimensional hyperparameter sweeps: Significant gains in model performance were attained with the multidimensional hyperparameter sweep. A total of 920 networks were trained in the high batch variation, and 660 networks were trained in the low batch variation.
Using the f1 metric of the test set as the dependent variable, hyperparameter correlation was examined and is presented in Figure 8. Within the hyperparameter space examined, neuron count had the largest linear impact on f1, followed by batch size and learning rate. It is notable that while model capacity (in terms of the number of weights) would be most affected by the number of layers, Figure 8 indicates that layer count had the smallest impact on model performance.
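A correlation view like Figure 8 can be produced by tabulating each trained network's hyperparameters together with its test f1 and correlating; a sketch assuming a list of per-network result dicts with hypothetical keys:

```python
import pandas as pd

# results: one dict per trained network, e.g.
# {"neurons": 150, "layers": 2, "batch_size": 512,
#  "learning_rate": 1e-3, "l2": 1e-4, "f1_test": 0.81}
df = pd.DataFrame(results)
print(df.corr()["f1_test"].drop("f1_test").sort_values(ascending=False))
```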
A wide range of modeling performance resulted when the counts of false negatives (FN) and false positives (FP) were examined. Many of the models arrived at a trivial solution, inflating either FP or FN in order to achieve a low value for its complement. In Figure 9 below, all model FN and FP results for the test set are reported, with a red line indicating the Pareto front, the best family of models; these models contain the lowest FN count for each value of FP.
There were 658 food crises in the test set, and a trivial model is shown on the left side of Figure 9, with 0 FP and 658 FN. At the other extreme, the lowest value of FN was 158 (of 3800); however, that model produced 324 FP, a count equal to roughly half the number of actual food crises.
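The Pareto front in Figure 9 can be extracted by sorting the (FP, FN) pairs by FP and keeping each model that strictly improves FN; a minimal sketch:

```python
def pareto_front(points):
    """Return the (FP, FN) pairs not dominated by any other model."""
    front, best_fn = [], float("inf")
    for fp, fn in sorted(points):  # ascending FP
        if fn < best_fn:
            front.append((fp, fn))
            best_fn = fn
    return front

# The two extremes reported above would both sit on the front:
print(pareto_front([(0, 658), (324, 158), (400, 200)]))  # -> [(0, 658), (324, 158)]
```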
The impact of individual hyperparameters on the f1 test metric was then examined, starting with neuron count. As shown in Figure 10, the best model performance was achieved at a neuron count of 250; however, only incremental improvement occurred once the neuron count surpassed 50. Moreover, the impact of batch size on performance can be clearly seen, with the red markers in Figure 10 showing the best performance for the low batch variation and the blue markers reporting performance for the high batch variation. The low batch variation (≤1024) surpassed the performance of the high batch variation (2048 or 4096) in every case.
Figure 11 reports the f1 test metric for the other hyperparameters. Local maxima were apparent for the batch size, L2 regularization parameter, and learning rate, while any layer count ≥1 yielded equivalent performance.
When the f1 metrics of the entire dataset were compared with those of the holdout dataset, a quasi-linear overfitting relationship was observed. This relationship is presented in Figure 12, along with an arrow indicating the best model that meets the 10% overfitting threshold.
Iteration 4: Multiple hyperparameter sweeps (tapered stack): Finally, a subset of the iteration 3 hyperparameter search was repeated using the architecture modification discussed in Section 3. While this iteration used the full dataset, its performance (f1 test = 0.84) was equivalent to that of the "no year" modeling, indicating that the tapered stack offered no advantage; these models are not discussed further.
Iteration 5: Multiple hyperparameter sweeps (no year): Similar to iteration 3, high batch and low batch variations of modeling the "no year" dataset were conducted, and the low batch variation had the best performance. Removing the year feature slightly degraded model performance, as shown in Figure 13. In the figure, the Pareto front from the "with year" dataset in iteration 3 is plotted in blue, along with two Pareto fronts from iteration 5: orange for high batch and green for low batch. It is notable that for the "no year" dataset, models with FP < 65 yielded performance equivalent to the "with year" dataset; however, as the number of FP increased, there was noticeable degradation in the FN count.
The final step in model selection was to select the best family of iteration 5 models and compare them to the holdout dataset. The models were ranked by the f1 metric on the entire dataset, and model performance was then evaluated for false negatives, which are a priority for this application. Eight models were selected. Table 4 shows the hyperparameters associated with these models and their f1 performance on the entire/test/holdout datasets.
Finally, models were excluded if their f1 metric on either the test or holdout set was more than 10% lower than the f1 metric on the entire dataset. This ensured that the model would generalize well to unseen data. The best model is denoted by bold text in Table 4, and while it did not have the highest f1 on the entire dataset, it performed well and met all criteria. Further metrics for the best model are presented in Table 5.
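The exclusion rule can be expressed compactly; a sketch assuming per-model dicts with hypothetical keys f1_all, f1_test, and f1_holdout:

```python
def passes_overfit_check(m: dict, tolerance: float = 0.10) -> bool:
    """Keep models whose test/holdout f1 stay within 10% of the full-dataset f1."""
    floor = m["f1_all"] * (1 - tolerance)
    return m["f1_test"] >= floor and m["f1_holdout"] >= floor

# Rank by f1 on the entire dataset, then apply the 10% screen.
finalists = [m for m in sorted(models, key=lambda m: m["f1_all"], reverse=True)
             if passes_overfit_check(m)]
```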
Additional information on the final model is presented in Figure 14, which displays the neural network structure, confusion matrix, and ROC AUC. Figure 14 also contains a statistical analysis of recall, the primary performance metric; additional background on that analysis follows. The validation and holdout datasets each consisted of a random 15% portion of the original dataset, as determined by a specified random seed. A sensitivity analysis was performed by recording the recall metric for the 15% splits produced by 1500 different random seed values. The resulting histogram is shown in Figure 14 (bottom left); it follows a normal distribution with a mean recall of 0.92 and a 95% CI of ±0.01. This gives confidence that outliers were relatively evenly distributed in the dataset; the CI would have been larger if that were not the case. As expected, the mean value of the 15% split histogram matched the recall of the entire dataset.
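A sketch of this seed-sensitivity analysis, assuming a trained Keras model, a chosen decision threshold, and the full X/y arrays (all hypothetical names):

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

recalls = []
for seed in range(1500):
    # Re-draw a random 15% split for each seed and score recall on it.
    _, X_split, _, y_split = train_test_split(X, y, test_size=0.15,
                                              random_state=seed)
    pred = (model.predict(X_split, verbose=0).ravel() >= threshold).astype(int)
    recalls.append(recall_score(y_split, pred))

recalls = np.array(recalls)
# Spread of the histogram; the paper reports 0.92 +/- 0.01 for its model.
print(f"mean={recalls.mean():.2f}  95% interval=+/-{1.96 * recalls.std(ddof=1):.2f}")
```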
Iteration 1–5 summary: A summary of all neural network modeling iterations is presented in Table 6, with all metrics measured on the test dataset. When comparing iterations 3 and 5, it is notable that removing year from the dataset had minimal effect.
4.3. Discussion
For all models, recall must be high (>0.80) to ensure that most food crises are predicted accurately, but without maintaining precision (>0.80), the significance of the model predictions would be diluted. When considering these goals, the performance metrics for the logistic regression models indicate that they are not useful. While the optimal select k-best model was statistically significant, its AUC (0.59) was only 9% better than that of the chance hypothetical model shown in Table 3. Additionally, the model's f1 (0.32) represented a 37% decrease from the goal model performance. The model's precision (0.75) indicated that 25% of the instances classified as food crises were false positives; furthermore, the model's recall (0.20) indicated that the model failed to classify 80% of the food crises in the dataset. These performance metrics suggest that the logistic regression models collapse to the trivial solution of not predicting food crises. This is further evidenced by the model's accuracy (0.87), which was only 2% better than that of a model that never classifies food crises (Table 3). Thus, the logistic regression model cannot be considered viable and does not meet the performance goals outlined for evaluating the model.
Although the logistic regression models did not prove meaningful for prediction, they yielded valuable inferences: the presence of humanitarian assistance and the food price index are potential indicators of food crises. These features may reflect a systemic government crisis and could prove meaningful for future study.
The neural network models vastly outperformed the logistic regression models. The most striking improvement was in the best neural network model’s AUC (0.98), which was 66% better than the logistic regression models and 96% better than a pure chance model. When considering performance on the unseen holdout dataset, the best NN model’s precision (0.85) and f1 (0.83) significantly improved upon the goal model performance by 139% and 68%, respectively, while maintaining recall (0.81) and accuracy (0.92).
Moreover, and of critical significance, the best neural network model was developed to be year agnostic and maintained an f1 on the test (0.84) and holdout (0.83) sets, suggesting that the model has the capability to extrapolate to future data. The main limitation of this work is the potential for the model to misclassify a food crisis. When this model is applied to future data, the 0.82 recall on the holdout dataset indicates that nearly 1 in 5 food crises will be missed. This makes it essential for governments and aid organizations to verify current conditions before making decisions based on this model.