Radiomics has emerged as a potential aid to non-invasively characterize tumors using images [1]. Radiomics extracts quantitative features from medical images that describe lesion characteristics in detail, thereby complementing and supporting the radiologist's visual assessment. These quantitative features are then used to build models that can provide valuable clinical information to direct patient treatment. Multiple studies have shown that radiomics can aid in predicting cancer prognosis [5], a tumor's gene mutation status [9], and tumor recurrence [1]. However, current radiomics studies are limited in their ability to use large, multi-center data because heterogeneous computed tomography (CT) acquisition parameters can be confounding factors [15].
The literature shows that CT scanners, scanning techniques, reconstruction parameters, and other non-clinical variables can alter computed feature values in radiomics studies and thus influence their conclusions. A recent article comprehensively reviewed the sources of variation and potential strategies to reduce such variation in radiomics [16]. To enable multi-center studies and improve the generalizability of radiomic results, various techniques have been proposed: controlling image acquisition parameters, processing images (e.g., resampling or filtering) after acquisition and prior to feature extraction, converting images to a desired imaging setting, standardizing feature definitions, and harmonizing feature values statistically using the ComBat method [17]. Although many methods are being investigated to improve radiomics research, it is difficult to assess which one performs best. There are no published direct comparisons, and Mali et al. [28] and Ibrahim et al. [29] recently published review articles in which they both discuss the need for further investigation of harmonization methods for analyzing radiomics data using the available retrospective and unpaired imaging data from multiple centers.
To facilitate multi-center studies and utilize existing imaging data that can include a variety of CT scanners and scanning protocols, we sought a method to harmonize CT images acquired with different scanning protocols in order to improve radiomics studies. The reconstruction kernel setting is one of the key confounding variables we can strive to control in radiomics to help us draw correct and reproducible conclusions from our experiments [19]. Recently, Choe et al. [30] showed that a convolutional neural network (CNN) can convert CT image reconstruction kernels, reducing the effect of two different reconstruction kernels and improving the reproducibility of radiomic features in pulmonary nodules. The CNN learns the differences between CT images reconstructed with different kernels and applies this mapping to convert images from one kernel to another. They have made this CNN model publicly available for other researchers to apply to their research. However, this work was limited in that all the images came from one CT scanner with only two kernels (B30f and B50f), and their CNN model was not validated in a real-world clinical application.
In this study, we further fine-tuned this open-source CNN to convert the reconstruction kernels of thin-slice CT images. We then used the prediction of epidermal growth factor receptor (EGFR) status in lung cancer as an example, because lung cancer diagnosis and treatment are important topics of research and various tumor characteristics have diagnostic and prognostic value. For example, the treatment plan for lung adenocarcinoma has become tailored to the tumor's gene mutation status [10]. To determine tumor genotype, molecular testing of tissue biopsies is considered the gold standard; however, biopsies are invasive and limited to a small sample of the tumor [32]. As a result, it is difficult to fully characterize the tumor's spatial heterogeneity [33].
We show that a CNN can create a more harmonized dataset from a randomized set of mixed reconstruction kernels, verified by an improvement in feature reproducibility and in EGFR prediction performance. Furthermore, we aim to select the best reconstruction kernel to serve as the standard, maximizing the reproducibility of the features and the EGFR prediction performance derived from the newly harmonized dataset. To our knowledge, this is the first study to utilize both an artificial intelligence (AI) kernel conversion method to harmonize image settings and the converted images to predict clinical information directly after the AI-aided harmonization.
A method that enables researchers to use a large collection of multi-setting CT images will be beneficial for improving the statistical power and clinical applicability of radiomic studies. Currently, many radiomic studies are limited by relatively small sample sizes and the lack of external datasets for validation, in part because these studies require a dataset with relatively homogeneous CT acquisition parameters [42], which may not have been available.
In this study, we successfully retrained a CNN model developed by Choe et al. [30] on one dataset and tested it on an external dataset acquired using a different CT scanner to convert the reconstruction kernels of CT images from smooth to sharp and vice versa. We then showed that kernel harmonization via a CNN converter can increase the reproducibility of radiomics features. After kernel conversion to smooth, the proportion of the 1158 features with CCC > 0.85 (considered highly reproducible) increased from 20% to 40%, and the average CCC increased by 0.3 (p < 0.001). Furthermore, we observed an increase in clinical predictive performance for predicting the EGFR mutation status of lung cancer lesions after the kernel conversion to smooth (median AUC = 0.614, Z = 15.1, p < 0.001).
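For reference, Lin's concordance correlation coefficient (CCC) used throughout the reproducibility analysis can be computed in a few lines of NumPy. This is an illustrative sketch, not the study's actual pipeline: the function names and the reproducibility helper are ours, with the 0.85 cutoff mirroring the threshold used above.

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between two vectors of
    one feature's values (e.g., smooth-kernel vs. sharp-kernel images)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                    # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

def reproducible_fraction(feats_a, feats_b, cutoff=0.85):
    """Fraction of feature columns that are 'highly reproducible'
    (CCC > cutoff) between two paired feature matrices (lesions x features)."""
    cccs = [lin_ccc(feats_a[:, j], feats_b[:, j])
            for j in range(feats_a.shape[1])]
    return float(np.mean(np.array(cccs) > cutoff))
```

Unlike the Pearson correlation, the CCC penalizes systematic shifts and scale differences between the two kernel settings, which is why it is the standard choice for reproducibility studies of this kind.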
With an increasing number of studies showing the diagnostic and prognostic promise of radiomics in an era of personalized medicine [43], it is imperative that we improve the quality, reproducibility, and robustness of radiomics research. Some critics have raised the concern that radiomic features are not robust and are susceptible to small differences in CT acquisition parameters [18]. The results of this study are consistent with previous studies on how CT reconstruction kernels affect radiomic feature values and reproducibility [48]. In the comparison between the two original kernel settings (ori_smo vs. ori_shp), only 20% of the 1158 features had CCC > 0.85. This susceptibility is a hindrance to radiomics studies and shows that their datasets cannot have heterogeneous kernel settings.
To address the non-biological impact of kernel setting on radiomics results, Choe et al. [30] developed a CNN to convert the reconstruction kernels of retrospectively collected CT images acquired from Siemens scanners and showed promising results in improving feature reproducibility. However, their group trained their model only on kernels B10f, B30f, B50f, and B70f, and they did not have trained models for direct conversion between B30f and B70f, two of the most commonly used kernels in chest CT. In addition, the pretrained model's kernel conversion performance was poor when used on same-day repeat CT data acquired from GE scanners. Thus, we retrained this open-source network for kernel conversion using the same-day repeat CT data (development cohort) [34] and successfully validated the CNN on an external dataset (validation cohort) acquired from Siemens scanners.
Using the newly trained CNN kernel converter, we confirmed a similar improvement in feature reproducibility using our in-house feature extractor. In the development cohort, the average CCC improved significantly from 0.523 ± 0.314 to 0.763 ± 0.181 and 0.794 ± 0.178 for the smooth and sharp conversions, respectively. Furthermore, it is worth noting that the newly trained CNN kernel converter successfully converted the CT image data in the validation cohort, which had significantly different acquisition parameters from the development cohort used to train the network. Although we split the image kernels into two simple groups (smooth and sharp), these groups actually contain a variety of algorithms: the smooth group contains Standard/B30f/B31s/B31f, while the sharp group contains Lung/B60f/B70s/B70f/B80f. Our CNN, trained on CT images from GE with 1.25 mm slice thickness and standard/lung kernels, successfully converted external CT images from Siemens with 1 mm slice thickness and a wide range of kernels (B30f/B31f/B60f/B70f/B80f). This shows that our trained CNN does not require the input images to have exactly the same settings as the development cohort, and the CNN may be applicable to CT images from other vendors with similarly thin slices (around 1 mm) and similar smooth and sharp kernel settings.
In the first phase of the experiment with the development cohort, we observed that certain feature groups increased in CCC more than others after the kernel conversion. As seen in Figure 3, the CCC heatmap shows significant improvements in groups 18 and 19, which are composed of Intensity_Skewness_2D, Intensity_Skewness_3D, GLCM_Entropy, and GLCM_Diff_Entropy. Such second-order texture features currently show promise in the literature as highly predictive of EGFR mutation status: GLRLM, wavelet, LOG-sigma GLDM, LOG-sigma GLCM, skewness, and short-run low-grey-level emphasis [49]. Many of these studies cite in their limitations that their homogeneous sample sizes are not large enough for machine learning or deep learning models. To increase the sample size for training and testing these prediction models, our trained CNN may be of use in harmonizing kernel settings, allowing for larger dataset collection.
We applied the developed CNN to external clinical CT data from lung cancer patients with known EGFR status. The CNN kernel harmonization improved the reproducibility of many features, as seen in Figure 4, with over 40% of the 1158 features showing high reproducibility (CCC > 0.85) after CNN kernel conversion to the smooth kernel, a significant increase from 20% in the original set (p < 0.001). Harmonizing the image settings to the sharp kernel did not improve reproducibility: the percentage of features with CCC > 0.85 was similar to that of the original smooth vs. original sharp comparison, at approximately 20% (p > 0.05). The median CCC was also higher after the conversion to smooth, but not after the conversion to sharp. Our results agree with previous reports that the smooth kernel yields a higher number of radiomics features with high reproducibility [19]. A possible reason is that the sharp reconstruction kernel, while it may provide higher resolution, comes at the price of producing images with significantly more noise.
Our results from the second phase show that conversion to the smooth kernel may benefit clinical studies predicting EGFR mutation status. As previously mentioned, converting the kernel to smooth improved the reproducibility of the features. The univariate analysis results depicted in Figure 5 also show a significant improvement in the median AUC in the conv_mix_smo group, with the median AUC increasing from 0.595 for the original mixture group to 0.614 (Z = 15.1, p < 0.001) in the converted smooth (sharp → smooth) mixture group. Taking a closer look at the three features with the highest AUC values, as shown in Table 4 and Figure 6, we observe a significant improvement in CCC and a small improvement in AUC. The top three features were all texture-based: two Laplacian of Gaussian (LOG) and one GLCM. The LOG feature is an entropy-based quantification of image homogeneity computed with varying Gaussian filters; the top two Gaussian filters had sigmas of 1.5 and 2.5. The GLCM is a histogram of co-occurring greyscale values at a given offset, quantifying how often pairs of pixels with specific values occur in a specific spatial relationship in an image. This finding is consistent with the first phase of the experiment and with the literature. As previously mentioned, LOG and GLCM features have been found to have clinical significance [49], especially in predicting EGFR. In our study, these three texture features also performed best in predicting the EGFR status of the validation cohort, as measured by AUC.
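To make the two feature families concrete, the following is a minimal NumPy sketch of both computations. It is illustrative only: the kernel radius, the 8-level quantisation, and the single horizontal GLCM offset are our simplifying assumptions, not the settings of the in-house feature extractor.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def log_response(img, sigma=1.5):
    """Laplacian-of-Gaussian filtering with an explicit kernel; LOG
    radiomic features then summarise (e.g., take the entropy of)
    this band-pass response at each sigma."""
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    s2 = sigma ** 2
    k = (x * x + y * y - 2 * s2) / (s2 * s2) * np.exp(-(x * x + y * y) / (2 * s2))
    k -= k.mean()                                # zero-sum (band-pass) kernel
    windows = sliding_window_view(img, k.shape)  # 'valid' 2-D convolution
    return (windows * k).sum(axis=(-2, -1))

def glcm_entropy(img, levels=8):
    """GLCM entropy for horizontally adjacent pixel pairs: quantise grey
    levels, count co-occurring pairs, and take the Shannon entropy of
    the normalised co-occurrence matrix."""
    edges = np.linspace(img.min(), img.max(), levels + 1)[1:-1]
    q = np.digitize(img, edges)                  # grey levels 0..levels-1
    mat = np.zeros((levels, levels))
    for i, j in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        mat[i, j] += 1
    p = mat[mat > 0] / mat.sum()
    return float(-(p * np.log2(p)).sum())
```

Both quantities depend directly on local intensity statistics, which is exactly why they are sensitive to the reconstruction kernel: sharper kernels add high-frequency noise that changes the LOG response and scatters the co-occurrence counts.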
Some studies have proposed approaching harmonization statistically, as has been done in genomics using ComBat [26]. The advantages of the ComBat method are clear: it is easy to apply, it can be performed on the given datasets without manipulating large image files, and it successfully harmonizes data statistically while accounting for various non-biological factors. However, one major disadvantage of ComBat is that it is difficult to set a standard against which to compare new data; incoming data cannot be adjusted on its own and requires a set of data with which to be harmonized. In the case of a CNN, any CT image may be given as an input, and the output converted image can have its tumor features extracted and compared against a pre-set standard.
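To illustrate this contrast, a stripped-down location–scale harmonisation in the spirit of ComBat can be written in a few lines. This sketch is ours and omits what makes real ComBat useful in practice (empirical-Bayes shrinkage of the batch parameters and preservation of biological covariates); it only demonstrates why a new scan cannot be adjusted alone: the correction for each batch is estimated from the whole batch.

```python
import numpy as np

def combat_like(features, batch):
    """Shift and scale each batch's feature distribution to the pooled
    mean and standard deviation (features: lesions x features matrix;
    batch: one scanner/protocol label per lesion)."""
    features = np.asarray(features, float)
    out = features.copy()
    grand_mu = features.mean(axis=0)
    grand_sd = features.std(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0)
        sd = np.where(sd > 0, sd, 1.0)           # guard constant features
        out[rows] = (features[rows] - mu) / sd * grand_sd + grand_mu
    return out
```

Because `mu` and `sd` are batch statistics, a single incoming scan has no batch from which to estimate them; the CNN converter, by contrast, maps each image independently onto the chosen standard kernel.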
There are several limitations to our study. One limitation is that this study did not analyze individual lesion characteristics, so it is unclear whether these individual characteristics were harmonized; however, performance on our end goal of mutation status prediction did improve. Finally, our prediction model for the EGFR mutation status was a simple statistical analysis using the raw feature values in a univariate analysis. Univariate analyses are not comprehensive and are often utilized as an initial benchmark to assess a feature's potential in a more complex model. For instance, studies have shown that radiomic models using individual features perform worse than multivariate models that use machine learning or deep learning [23]. Further analyses with machine learning or deep learning models are needed to better assess how the CNN kernel converter can improve feature reproducibility and clinical predictive performance.
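For completeness, the univariate benchmark used here (one feature, one AUC) reduces to the Mann–Whitney U statistic and can be sketched as follows; the function name and toy data are illustrative, not taken from the study.

```python
import numpy as np

def univariate_auc(feature, label):
    """ROC AUC of a single radiomic feature against a binary label
    (e.g., EGFR mutant vs. wild-type), via the Mann-Whitney U statistic:
    the probability that a random positive case scores above a random
    negative case, with ties counted as one half."""
    x = np.asarray(feature, float)
    y = np.asarray(label, bool)
    pos, neg = x[y], x[~y]
    diff = pos[:, None] - neg[None, :]           # all positive-negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum())
                 / (pos.size * neg.size))
```

An AUC of 0.5 means the feature carries no information about mutation status, which makes the reported shift in median AUC from 0.595 to 0.614 a modest but measurable gain per feature.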