## 1. Introduction

In recent years, precision agriculture has been gaining momentum [1]. Rice is one of the three major food crops in the world and covers the largest planting area in China [2]. Therefore, monitoring the phenotypic information of rice plays an important role in precision agriculture. In photosynthesis, chlorophyll is an essential pigment whose content is related to the phenology and health status of crops [3]. In addition, chlorophyll content is closely related to the nitrogen nutritional status of crops, since most of the nitrogen is contained in chlorophyll [4]. As a result, crop chlorophyll content can indirectly indicate soil nitrogen conditions and guide the application strategy of nitrogen fertilizer, which can increase overall crop profitability.

Previous studies have shown that remote sensing is a reliable and effective technology for obtaining crop biophysical and biochemical information [5,6,7]. In particular, hyperspectral remote sensing uses narrow, contiguous spectral bands that provide almost continuous spectra, which makes it more sensitive to specific vegetation properties such as nitrogen status, canopy biomass, and chlorophyll content [5,8]. Many spectral indices have been developed to estimate the chlorophyll content of crops from hyperspectral reflectance [9,10,11]. Although most of these spectral indices perform well in estimating crop chlorophyll content, their physical meaning is difficult to interpret because most of them take a normalized ratio or normalized difference form. In addition, the first derivative of a spectrum, which is physically interpretable as the rate of change in reflectance, has been widely used to evaluate crop chlorophyll content [12,13]. However, the first derivative can only reflect the rate of change in reflectance in the immediate vicinity of a specific wavelength. Other information, such as the rate of change in reflectance over a wavelength range, is ignored, although it may be more sensitive to chlorophyll content. Therefore, this study proposes the rate of change in reflectance between wavelengths ‘a’ and ‘b’ (RCRW_{a-b}), an extension of the first derivative of the spectrum that can reflect the rate of change in reflectance not only at a certain wavelength but also between any two wavelengths.
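As a minimal sketch of this idea, RCRW_{a-b} can be read as a difference quotient of reflectance over the interval from ‘a’ to ‘b’, which approaches the first derivative as ‘b’ approaches ‘a’. The formula (R_b - R_a)/(b - a), the linear interpolation between measured bands, the function names `reflectance_at` and `rcrw`, and the toy spectrum are all assumptions made for illustration and are not taken from this section.

```python
from bisect import bisect_left


def reflectance_at(wavelengths, reflectance, target):
    """Linearly interpolate reflectance at `target` nm.

    `wavelengths` must be sorted in ascending order (assumption).
    """
    if not wavelengths[0] <= target <= wavelengths[-1]:
        raise ValueError("target wavelength outside the measured range")
    i = bisect_left(wavelengths, target)
    if wavelengths[i] == target:
        return reflectance[i]
    w0, w1 = wavelengths[i - 1], wavelengths[i]
    r0, r1 = reflectance[i - 1], reflectance[i]
    return r0 + (r1 - r0) * (target - w0) / (w1 - w0)


def rcrw(wavelengths, reflectance, a, b):
    """Assumed definition: (R_b - R_a) / (b - a), in reflectance per nm."""
    if a == b:
        raise ValueError("a and b must differ")
    r_a = reflectance_at(wavelengths, reflectance, a)
    r_b = reflectance_at(wavelengths, reflectance, b)
    return (r_b - r_a) / (b - a)


# Toy spectrum (illustrative values only, not measured data).
toy_wavelengths = [550.0, 560.0, 570.0]
toy_reflectance = [0.10, 0.08, 0.05]

# Rate of change over the whole 550-570 nm range.
wide = rcrw(toy_wavelengths, toy_reflectance, 550.0, 570.0)
# Rate of change between two adjacent bands, close to a first derivative.
narrow = rcrw(toy_wavelengths, toy_reflectance, 550.0, 560.0)
```

Under this reading, choosing ‘a’ and ‘b’ as neighboring bands recovers a finite-difference first derivative, while wider intervals summarize the slope over a whole spectral feature.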

The processing techniques used for retrieving vegetation characteristics can be divided into two broad groups: physically-based algorithms and empirically-based algorithms [14,15]. Compared with physically-based algorithms, such as PROSAIL [16] and N-PROSAIL [17], empirically-based algorithms have low complexity, high computational efficiency, and fewer required variables, and they provide reliable results [15,18]. Empirically-based algorithms, especially machine learning algorithms such as random forest (RF) [19], support vector regression (SVR) [20,21], neural networks (NNs) [22,23], and the Gaussian process (GP) [24,25], have been widely used to evaluate the chlorophyll content of vegetation. However, few studies have investigated and compared the performance of different machine learning algorithms in estimating the chlorophyll content of vegetation. In addition, the optimization of key hyperparameters, which can greatly affect the performance of machine learning algorithms, was either not studied in depth or ignored in previous studies.

Therefore, this study has two major objectives: (1) to investigate the potential of the rate of change in reflectance between wavebands for estimating the chlorophyll content of rice; and (2) to compare the performance of four advanced machine learning techniques, GP, RF, SVR, and gradient boosting regression tree (GBRT), in estimating the chlorophyll content of rice.

## 5. Discussion

Many studies in the literature have estimated the chlorophyll content of rice plants using machine learning methods. However, few have compared the performance of various machine learning algorithms for this task. In addition, the parameter optimization process has been described in detail in only a few studies. In this study, four machine learning models, GPR-M, RFR-M, SVR-M, and GBRT-M, were optimized and established based on in situ hyperspectral data and the GS-CV parameter optimization algorithm. In addition, the performance of RCRW_{a-b}, the first derivative of spectra, and several indices for rice chlorophyll content estimation was compared. Based on the selected features and the prediction results, the following outcomes were observed.
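The GS-CV procedure (grid search with cross-validation) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's implementation: the model is a toy RBF kernel smoother standing in for the four algorithms, and the names `grid_search_cv` and `rbf_smoother`, the fold count, the interleaved fold layout, the RMSE scoring, the parameter grid, and the synthetic data are all hypothetical.

```python
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rbf_smoother(x_train, y_train, gamma):
    """Toy stand-in model: RBF-weighted average of the training targets."""
    def predict(x):
        weights = [math.exp(-gamma * (x - xi) ** 2) for xi in x_train]
        return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)
    return predict


def grid_search_cv(make_model, param_grid, x, y, k=5):
    """Return (best_params, best_score) minimizing mean validation RMSE."""
    names = sorted(param_grid)
    # Interleaved folds: sample i belongs to fold i % k.
    folds = [list(range(i, len(x), k)) for i in range(k)]
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for fold in folds:
            held_out = set(fold)
            x_tr = [v for i, v in enumerate(x) if i not in held_out]
            y_tr = [v for i, v in enumerate(y) if i not in held_out]
            predict = make_model(x_tr, y_tr, **params)
            scores.append(rmse([y[i] for i in fold], [predict(x[i]) for i in fold]))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Synthetic one-feature data (illustrative only).
x_demo = [i / 20 for i in range(40)]
y_demo = [math.sin(3 * v) for v in x_demo]
gamma_grid = {"gamma": [0.1, 1.0, 10.0, 100.0]}
best_params, best_score = grid_search_cv(rbf_smoother, gamma_grid, x_demo, y_demo)
```

The same loop applies to any model with tunable hyperparameters, such as C and gamma for SVR: every combination in the grid is scored on held-out folds, and the combination with the lowest mean validation error is kept.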

The RCRW_{a-b} proposed in this study has a clear physical meaning: it reflects how quickly or slowly reflectance changes between two wavelengths. RCRW_{a-b} values with adjacent ‘a’ or ‘b’ carry redundant information because they are collinear. In this study, the RCRW_{a-b} with the strongest correlation with the SPAD value was selected from each wavelength range (four wavelength distribution ranges in total; see Section 4.1), giving four selected features: RCRW_{551.0–565.6}, RCRW_{739.5–743.5}, RCRW_{684.4–687.1}, and RCRW_{667.9–672}. The performance of the four machine learning techniques shows that RCRW_{a-b} is a promising variable for estimating the chlorophyll content of rice, and the predicted results indicate that the four selected features are effective for estimating rice chlorophyll content with machine learning algorithms.
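The per-range selection described above can be sketched as picking, within each wavelength range, the candidate whose correlation with SPAD has the largest absolute value. This sketch assumes the Pearson correlation coefficient (the section does not name the coefficient); the function names, the range label, the second candidate name, and all numeric values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def select_per_range(candidates, spad):
    """Keep the feature with the largest |r| against SPAD in each range.

    candidates: {range_name: {feature_name: [values per sample]}}
    returns:    {range_name: (feature_name, r)}
    """
    selected = {}
    for range_name, features in candidates.items():
        scored = {name: pearson_r(values, spad) for name, values in features.items()}
        best = max(scored, key=lambda name: abs(scored[name]))
        selected[range_name] = (best, scored[best])
    return selected


# Illustrative values only; not data from the study.
spad_demo = [30.0, 35.0, 40.0, 45.0]
candidates_demo = {
    "green": {
        "RCRW_551.0-565.6": [-0.0040, -0.0035, -0.0030, -0.0025],
        "RCRW_550.0-560.0": [-0.0040, -0.0030, -0.0035, -0.0025],
    },
}
selected_demo = select_per_range(candidates_demo, spad_demo)
```

Keeping one feature per range is one way to limit the collinearity between RCRW_{a-b} values with adjacent endpoints.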

Among the four generated machine learning models, RFR-M yielded the highest accuracy on both the training set and the validation set. However, its training-set results are better than its validation-set results, which indicates that the generalization of RFR-M is relatively poor. This may be due to a natural limitation of the algorithm, which often requires a relatively large data set and is otherwise prone to over-fitting [30].

The performance of GBRT-M was second only to that of RFR-M, and its performance characteristics were similar: although GBRT-M showed an excellent goodness-of-fit, it provided relatively poor prediction results. However, in contrast to RFR-M, the MAE values of the training set and validation set of GBRT-M were very close, as were the RMSE values. Therefore, compared with RFR-M, GBRT-M generalizes better and is more stable.

Regarding SVR-M, the model had difficulty learning from high SPAD values and lacked sensitivity to them. It showed underfitting and therefore poor generalization: the R^{2} of the training-set predictions was small and lower than that of the validation set, and the RMSE and MAE of the training-set predictions were high and higher than those of the validation set. The poor prediction on the training set and the large difference between training and validation performance could be attributed to the following: (1) the parameters (C and gamma) selected by GS-CV are not optimal; (2) SVR has low potential to estimate the SPAD values of rice from RCRW_{a-b}; or (3) both (1) and (2). Therefore, other parameter optimization algorithms should be considered to find better values of the two parameters, or new variables need to be developed, when using the SVR algorithm to estimate the SPAD values of rice.

GPR-M shows good generalization and robustness in estimating the SPAD value of rice from the four selected RCRW_{a-b} when the data set is not large, as shown by the R^{2}, MAE, and RMSE of the training-set predictions being close to those of the validation-set predictions. Moreover, although the performance of GPR-M is not the best, it is very close to the best. Overall, GPR is a powerful machine learning technique for estimating the SPAD value of rice.
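The model comparisons above rest on computing R^{2}, RMSE, and MAE separately for the training and validation sets and then examining the gap between them. The following is a minimal sketch, assuming R^{2} is the coefficient of determination (this section does not state which definition was used); the function names and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2 (coefficient of determination), RMSE, and MAE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def generalization_gap(train_metrics, valid_metrics):
    """Validation minus training value for each metric.

    Small gaps suggest stable generalization. A clearly lower validation R^2
    with higher validation RMSE and MAE suggests over-fitting.
    """
    return {name: valid_metrics[name] - train_metrics[name] for name in train_metrics}


# Toy values (illustrative only, not results from the study).
train_demo = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
valid_demo = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
gap_demo = generalization_gap(train_demo, valid_demo)
```

Read this way, a model whose three gaps are all near zero behaves like GPR-M in the discussion above, while large gaps in favor of the training set match the pattern described for RFR-M.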

By comparing the predicted results of Four-RCRW_{a-b}, FD_{556.9}, and several vegetation indices based on the four machine learning algorithms (see Table 4), we found that: (1) regardless of the algorithm used, CI-green gives the worst result; (2) Four-RCRW_{a-b}, FD_{556.9}, MTCI, and Re-NDVI show similar performance when using GPR; (3) the R^{2} values of FD_{556.9} and Four-RCRW_{a-b} are similar when using SVR, but their RMSE values differ considerably; and (4) compared with the vegetation indices and FD_{556.9}, Four-RCRW_{a-b} gives the best results when using RFR, SVR, or GBRT. Therefore, compared with FD_{556.9} and the vegetation indices considered, Four-RCRW_{a-b} shows greater potential for estimating rice chlorophyll content, likely because it considers more bands, including the green, red, and red edge bands, which are more sensitive to chlorophyll.

Despite the potential of Four-RCRW_{a-b} for estimating rice chlorophyll content, its application to satellite data faces a major challenge because of the coarser spectral resolution of satellite sensors. The method is expected to work well when applied to unmanned aerial vehicle (UAV)-based hyperspectral imagery.

As the spectral measurements were performed on clear sunny days between 11:30 a.m. and 2:00 p.m., there may be some minor BRDF effects caused by the change in sunlight direction. Therefore, two sample plots with the same SPAD value, measured at different times on 21 July 2019 (labeled Sample1 and Sample2, with the solar altitude angle of Sample1 greater than that of Sample2), were selected to investigate the influence of these BRDF effects on SPAD estimation. The spectra of Sample1 and Sample2 are shown in Figure 18. When the solar altitude angle was larger, the reflectance increased significantly more in the near-infrared bands than in the visible bands. In addition, the RCRW_{a-b} values of Sample1 and Sample2 were compared (Table 5). As shown in Table 5, when the solar altitude angle increased, the positive RCRW_{a-b} values slightly increased, and the negative values slightly decreased.

Table 6 shows the predicted results for Sample1 and Sample2 from the four generated machine learning models. As seen in Table 6, the difference in predicted results between Sample1 and Sample2 is small. Therefore, we conclude that between 11:30 a.m. and 2:00 p.m., the BRDF effect caused by the change in solar altitude angle has little influence on SPAD estimation.