Lung Cancer Diagnosis System Based on Volatile Organic Compounds (VOCs) Profile Measured in Exhaled Breath

Shaffie, Ahmed; Soliman, Ahmed; Eledkawy, Amr; Fu, Xiao-An; Nantz, Michael H.; Giridharan, Guruprasad; van Berkel, Victor; El-Baz, Ayman

doi:10.3390/app12147165


¹ BioImaging Laboratory, Department of Bioengineering, University of Louisville, Louisville, KY 40292, USA

² Department of Chemical Engineering, University of Louisville, Louisville, KY 40292, USA

³ Department of Chemistry, University of Louisville, Louisville, KY 40292, USA

⁴ Department of Cardiovascular and Thoracic Surgery, University of Louisville, Louisville, KY 40292, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(14), 7165; https://0-doi-org.brum.beds.ac.uk/10.3390/app12147165

Submission received: 12 April 2022 / Revised: 11 July 2022 / Accepted: 14 July 2022 / Published: 16 July 2022


Abstract:

Lung cancer is one of the world’s most lethal diseases, and detecting it at an early stage is crucial but difficult. This paper proposes a computer-aided lung cancer diagnosis system using volatile organic compounds (VOCs) data. A silicon microreactor, which consists of thousands of micropillars coated with an ammonium aminooxy salt, is used to capture the volatile organic compounds (VOCs) in the patients’ exhaled breath by means of oximation reactions. The proposed system ranks the features using the Pearson correlation coefficient and maximum relevance–minimum redundancy (mRMR) techniques. The selected features are fed to nine different classifiers to determine if the lung nodule is malignant or benign. The system is validated using a locally acquired dataset containing data from 504 patients. The dataset is balanced and has 27 features of volatile organic compounds (VOCs). Multiple experiments were completed, and the best accuracy result is 87%, which was achieved using random forest (RF) either by using all 27 features without selection or by using the first 17 features obtained using maximum relevance–minimum redundancy (mRMR) while using an 80–20 train-test split. The correlation coefficient, maximum relevance–minimum redundancy (mRMR), and random forest (RF) importance agreed that $C_{4}H_{8}O$ (2-Butanone) ranks as the best feature. Using only $C_{4}H_{8}O$ (2-Butanone) for training, the accuracy results using the support vector machine, logistic regression, bagging and neural network classifiers are 86%, which approaches the best result. This shows the potential for these volatile organic compounds (VOCs) to serve as a significant screening test for the diagnosis of lung cancer.

Keywords:

lung cancer; volatile organic compounds (VOCs); maximum relevance–minimum redundancy (mRMR); random forest; computer-aided diagnosis (CAD)

1. Introduction

Lung cancer is the worldwide leading cause of death among all cancers, and one of the top four causes of overall mortality in people under 70 years of age in nearly every nation in the world outside of sub-Saharan Africa [1]. Lung cancer is the most common neoplasm in men (14.3%) and the third most common in women (8.4%), where its incidence is exceeded by breast (24.5%) and colorectal (9.4%) cancers. When treatment is available, mortality strongly depends upon the stage at which the cancer is detected. Early detection of lung cancer improves clinical outcomes, raising the five-year survival rate from 17.7% to 55.2% [2]. Unfortunately, the majority of patients present with lung cancer having already reached third or fourth stages, in which treatment options are limited, and the mortality rate is high [3].

A chest X-ray is most commonly used for lung cancer screening; thus, the first indication that cancer may be present is an abnormal finding on an X-ray. These findings are non-specific and must be followed up by other techniques. More sophisticated imaging modalities, such as computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET), are far more sensitive than the chest X-ray, and their usefulness has been enhanced by a number of computer-assisted detection (CAD) algorithms that automatically identify malignant lung nodules [4,5,6,7,8,9,10,11]. CT is the most informative modality for detecting lung cancer, and, if used for screening, it can decrease mortality by 20% [12]. The greatest drawback of its use as a screening tool is that it is quite expensive. Furthermore, detection of malignancy using a single CT scan has a high false-positivity rate. More accurate detection uses a track-and-trace approach that considers a nodule’s growth rate, which requires multiple CT scans over a period of time and thus incurs an even greater cost.

These considerations have motivated research into relatively inexpensive, non-invasive approaches to lung cancer screening. These look for the biomarkers of cancer in a patient’s blood (e.g., [13]), urine (e.g., [14,15,16]), saliva (e.g., [17,18]) or exhaled breath. Volatile organic compounds (VOCs) are a class of biomarker associated with cancer. The fluctuating redox environment within cancer cells is thought to cause oxidative stress, which increases the synthesis of numerous VOCs [19,20]. These are released into the extracellular environment and pass into the bloodstream. Exhaled breath analysis is a method for identifying VOCs at very low concentrations. Breath analysis studies have supported the hypothesis that the VOCs profile is altered in the presence of lung cancer [21,22,23,24,25]. The quantitative measurement of carbonyl VOCs in exhaled breath has recently been pursued as a method for early lung cancer detection [26,27,28,29,30].

Studies of note include that of Pesesse et al. [31], who used thermal desorption comprehensive two-dimensional gas chromatography time of flight mass spectrometry (TD-GC × GC-TOFMS) to analyze VOCs. Out of six VOC chemical families (aldehydes, alcohols, hydrocarbons, nitrogen-containing compounds, ketones and fatty acid methyl esters [FAME]), a principal component analysis and random forest (RF) identified ketones and FAME as the most indicative of cancer. The study dataset was quite small, however, with only 15 lung cancer patients and 14 healthy participants. Koureas et al. [32] used a set of 19 VOCs to diagnose lung cancer. Data from 51 lung cancer patients and 38 cancer-free patients with abnormal CTs were used to train a random forest classifier that achieved 84% accuracy and an area under the curve (AUC) of 0.94. Li et al. [28] looked at six carbonyl VOCs gathered from 34 individuals with benign pulmonary nodules, 85 lung cancer patients and 85 healthy participants. They tested several classification algorithms and employed dimensionality reduction via recursive feature elimination (RFE). The optimal model discriminated between benign and malignant nodules with 89% accuracy. Lastly, Tsou et al. [33] measured 116 VOCs from 168 healthy participants and 148 lung cancer patients. The 50 most significant VOCs identified via ordinary statistical tests (the Wilcoxon rank-sum and a two-sample t-test) were used to train an XGBoost classifier, which achieved 92% accuracy and an AUC of 0.98. The main concern regarding their results is that the lung cancer patients all had later-stage cancer and were significantly older than the control patients.

In summary, the exhaled breath analysis provides a number of advantages. It is non-invasive and does not require the employment of trained personnel to collect samples. It is also a low-cost, fast, painless and secure sampling method [34,35]. The main drawbacks to the current systems based on the concentrations of VOCs in the exhaled breath are: (a) development from a small dataset, (b) poor matching between lung cancer patients and controls and (c) use of late-stage cancer patients, precluding use of the system for early detection. The research presented here aims to create and evaluate a CAD tool built on a large data set of breath VOCs’ biomarkers that will improve lung cancer detection accuracy and speed while avoiding the aforementioned drawbacks.

2. Materials and Methods

2.1. Patients

Patients ranging in age from 30 to 96 years old were recruited, and a breath test was taken by our partners at the University of Louisville Hospital. The University of Louisville’s Institutional Review Board (IRB) authorized the study protocol, and all techniques were carried out in compliance with the appropriate rules and regulations. After obtaining informed consent from the patients, one liter of mixed tidal and alveolar breath was collected from each participant in a non-reactive Tedlar bag (Sigma Aldrich, St Louis, MO, USA).

The dataset’s details are shown in Table 1. The dataset contains a total of 504 patients. The dataset is balanced such that the number of patients with malignant lung nodules is equal to the number of patients with benign lung nodules; this number is 252. The nodules are determined to be malignant or benign, in the ground truth, based on either a biopsy for diagnostic conclusion or by following up for two years until a final diagnosis is determined based on multiple CT scans. In our methodology, the classification is based only on the VOCs features and is independent of the clinical history data of the patient or the patient’s family.

The dataset has 27 numeric VOC marker feature values for each patient: $CH_{2}O$, $C_{2}H_{4}O$, $C_{3}H_{6}O$, $C_{4}H_{8}O$, $C_{5}H_{10}O$, $C_{6}H_{12}O$, $C_{7}H_{14}O$, $C_{8}H_{16}O$, $C_{9}H_{18}O$, $C_{10}H_{20}O$, $C_{11}H_{22}O$, $C_{12}H_{24}O$, $C_{13}H_{26}O$, $C_{4}H_{8}O_{2}$, $C_{2}H_{4}O_{2}$, $C_{3}H_{4}O$, $C_{6}H_{12}O_{2}$, $C_{9}H_{16}O_{2}$, $C_{3}H_{4}O_{2}$, $C_{4}H_{6}O_{2}$, $C_{4}H_{6}O$, $C_{4}H_{4}O_{2}$, $C_{5}H_{8}O$, $C_{7}H_{6}O$, $C_{7}H_{11}O$, $C_{13}H_{22}O$ and $C_{15}H_{10}O$.

2.2. Methodology

Our methodology follows the pattern shown in Figure 1. The first step is to acquire the raw data in the form of a dataset, which contains both the VOC features and each patient’s diagnosis, either malignant or benign. The dataset is then reduced using dimensionality reduction techniques; in our methodology, a correlation coefficient and maximum relevance–minimum redundancy (mRMR) are used for this purpose. A classifier is then used to perform the classification. We used nine different classifiers in our methodology. The locally acquired dataset is split into training and testing datasets. The training dataset is used to train the classifier, while the testing dataset is used to evaluate the classifier’s performance on new data that are not seen in the training phase.

The data acquisition process is as follows. The exhaled breath collected in 1 L Tedlar bags was drawn through a microreactor chip by applying a vacuum, as shown in Figure 2. The micropillars of the microreactor chip are coated with 2-(aminooxy)-N,N,N-trimethylethanammonium (ATM) iodide, which chemoselectively captures the carbonyl compounds in the exhaled breath by means of oximation reactions. After the breath sample was entirely evacuated from the Tedlar bag, the ATM adducts in the microreactor chip were eluted with 100 µL of methanol from a gently pressurized small vial. The eluted solution was analyzed directly by Fourier-transform ion cyclotron resonance mass spectrometry (FT-ICR-MS). All breath samples were analyzed on a hybrid linear ion trap FT-ICR-MS (Finnigan LTQ FT, Thermo Electron, Bremen, Germany) equipped with a TriVersaNanoMate ion source (AdvionBioSciences, Ithaca, NY, USA) and an electrospray chip (nozzle inner diameter 5.5 µm). As an internal reference for measurement of ATM adducts, a known quantity of deuterated acetone fully reacted with ATM (ATM-acetone-d6) in methanol was added to the eluted solution. The relative abundance of each of the 27 carbonyl VOCs identified in the exhaled breath was compared to that of the added ATM-acetone-d6 to determine its concentration. Please see [26] for more details.

The only preprocessing applied to the features is the normalization of each feature to the range from zero to one. This normalization makes the learning process faster. The feature selection and classification steps are described in detail in the following subsections.
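The normalization step can be illustrated with a short sketch. It assumes plain min-max scaling per feature, which is one common way to map values to the range from zero to one; the source does not name the exact method, and the function name `min_max_normalize` and the sample values are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column (feature) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has zero range; map it to 0 instead of dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Made-up concentrations: 4 patients x 3 VOC features (not data from the study).
X_raw = np.array([[0.2, 10.0, 5.0],
                  [0.4, 30.0, 5.0],
                  [0.6, 20.0, 5.0],
                  [1.0, 50.0, 5.0]])
X_scaled = min_max_normalize(X_raw)
print(X_scaled)
```

In a train/test workflow, one would typically compute the minima and maxima on the training split only and reuse them for the test split; the source does not state which variant was applied.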

2.3. Feature Selection

To get the significant features that affect the diagnosis, feature ranking and selection methods are used. We detail the feature ranking techniques used in the proposed system below.

2.3.1. Correlation Coefficient

The Pearson correlation coefficient is a correlation statistic that is used to determine the degree and direction of a link between two variables. Define X and Y as the concentration of two compounds. The linear correlation between them is represented by the letter r and calculated using Equation (1) as follows:

r = \frac{N \sum X Y - (\sum X \sum Y)}{\sqrt{[N \sum X^{2} - {(\sum X)}^{2}] [N \sum Y^{2} - {(\sum Y)}^{2}]}}

(1)

where N is the number of samples. The coefficient r ranges from $-1$ to 1: positive values between 0 and 1 represent a positive linear correlation between X and Y, 0 represents no linear correlation between X and Y and negative values between 0 and $-1$ represent a negative linear correlation between X and Y [36].
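As a minimal sketch, Equation (1) can be evaluated term by term and checked against NumPy's built-in correlation routine. The function name `pearson_r` and the sample vectors are illustrative assumptions, not part of the original system.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed as written in Equation (1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                          (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return numerator / denominator

# Made-up vectors chosen to show the three regimes described in the text.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(x, 2.0 * x + 1.0)                           # perfectly increasing
r_down = pearson_r(x, -x)                                    # perfectly decreasing
r_noisy = pearson_r(x, np.array([2.0, 1.0, 4.0, 3.0, 5.0]))  # imperfect, positive

print(r_up, r_down, r_noisy)
```

The result of `pearson_r` should agree with `np.corrcoef(x, y)[0, 1]` up to floating-point rounding.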

2.3.2. Maximum Relevance–Minimum Redundancy (mRMR)

Maximum relevance–minimum redundancy (mRMR) works in several rounds. In each round, we want to pick the feature that has the most relevance to the target variable and the least redundancy with the features that were already chosen in earlier rounds. Assume there are m features in total. For each feature $X_{i}$ ($i \in \{1, 2, 3, \dots, m\}$), its feature importance under the mRMR criterion may be represented as [37]:

f^{\mathrm{mRMR}} (X_{i}) = I (Y, X_{i}) - \frac{1}{|S|} \sum_{X_{s} \in S} I (X_{s}, X_{i})

(2)

where Y denotes the target variable (patient diagnosis), S denotes the set of features chosen in earlier rounds, $|S|$ is the number of features in S, $X_{s} \in S$ is one of the features in S and $X_{i} \notin S$ signifies a feature that has not yet been chosen. The function $I (\cdot, \cdot)$ is the mutual information, which is calculated as follows:

I (Y, X) = \int_{\Omega_{Y}} \int_{\Omega_{X}} p (x, y) \log \left( \frac{p (x, y)}{p (x) p (y)} \right) d x \, d y

(3)

where $\Omega_{Y}$ and $\Omega_{X}$ are the sample spaces for Y and X, respectively; $p (x, y)$ is the joint probability density; and $p (\cdot)$ is the marginal density function. For discrete variables Y and X, the mutual information formula takes the following form:

I (Y, X) = \sum_{y \in \Omega_{Y}} \sum_{x \in \Omega_{X}} p (x, y) \log \left( \frac{p (x, y)}{p (x) p (y)} \right)

(4)
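The discrete form in Equation (4) can be sketched directly from empirical frequencies. The function name `mutual_information` and the toy label arrays are assumptions for illustration; the natural logarithm is used, so the result is in nats.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete arrays, per Equation (4)."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for yv in np.unique(y):
            p_y = np.mean(y == yv)
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:  # empty cells contribute nothing to the sum
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

labels = np.array([0, 0, 1, 1])
# A variable shares all of its information with itself: I = log(2) for a fair binary label.
mi_identical = mutual_information(labels, labels)
# Here every (x, y) combination is equally frequent, so the variables are independent: I = 0.
mi_independent = mutual_information(labels, np.array([0, 1, 0, 1]))

print(mi_identical, mi_independent)
```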

In mRMR, at each round of the feature selection procedure, the feature with the highest importance score, $\arg\max_{X_{i} \notin S} f^{\mathrm{mRMR}} (X_{i})$, is added to the selected feature set S.

After applying mRMR to our dataset, the ranking of the features is as follows: (1) $C_{4}H_{8}O$, (2) $C_{4}H_{8}O_{2}$, (3) $C_{13}H_{22}O$, (4) $C_{5}H_{10}O$, (5) $C_{2}H_{4}O$, (6) $C_{2}H_{4}O_{2}$, (7) $C_{6}H_{12}O$, (8) $C_{8}H_{16}O$, (9) $C_{9}H_{16}O_{2}$, (10) $C_{7}H_{6}O$, (11) $C_{3}H_{4}O_{2}$, (12) $C_{6}H_{12}O_{2}$, (13) $C_{10}H_{20}O$, (14) $C_{4}H_{6}O$, (15) $C_{7}H_{14}O$, (16) $C_{15}H_{10}O$, (17) $C_{4}H_{4}O_{2}$, (18) $C_{11}H_{22}O$, (19) $C_{9}H_{18}O$, (20) $C_{12}H_{24}O$, (21) $C_{5}H_{8}O$, (22) $CH_{2}O$, (23) $C_{3}H_{4}O$, (24) $C_{7}H_{11}O$, (25) $C_{3}H_{6}O$, (26) $C_{4}H_{6}O_{2}$ and, finally, (27) $C_{13}H_{26}O$.
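The greedy mRMR procedure of this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the continuous features are first discretized into equal-width bins so that the discrete mutual information applies, it uses scikit-learn's `mutual_info_score` as the estimator, and the helper names `discretize` and `mrmr_rank`, the bin count and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(X, n_bins=10):
    """Bin each column into equal-width bins (an assumption made for this sketch)."""
    X = np.asarray(X, dtype=float)
    binned = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)
        binned[:, j] = np.digitize(X[:, j], edges[1:-1])  # bin indices 0..n_bins-1
    return binned

def mrmr_rank(X, y, n_bins=10):
    """Return feature indices ordered by the greedy mRMR criterion."""
    Xb = discretize(X, n_bins)
    m = Xb.shape[1]
    # Relevance term I(Y, X_i) for every feature.
    relevance = np.array([mutual_info_score(y, Xb[:, j]) for j in range(m)])
    # Round 1: S is empty, so the most relevant feature is chosen.
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(m) if j != selected[0]]
    while remaining:
        scores = []
        for j in remaining:
            # Redundancy term: mean mutual information with already selected features.
            redundancy = np.mean([mutual_info_score(Xb[:, s], Xb[:, j]) for s in selected])
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (not from the study): f0 is informative, f1 nearly duplicates f0,
# f2 is weakly informative and f3 is pure noise.
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
f0 = y + 0.3 * rng.normal(size=n)
f1 = f0 + 0.01 * rng.normal(size=n)
f2 = y + 1.0 * rng.normal(size=n)
f3 = rng.normal(size=n)
X = np.column_stack([f0, f1, f2, f3])

ranking = mrmr_rank(X, y)
print(ranking)
```

Because of the redundancy penalty, a near-duplicate of an already selected feature would be expected to drop in the ranking even though it is highly relevant on its own.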

2.4. Classification

SVM, logistic regression (LR), k-nearest neighbors (kNN), naïve Bayes (NB), decision tree (DT), random forest (RF), bagging, AdaBoost and neural network (NN) classifiers are used to differentiate malignant lung nodules from benign ones using breath data.

Support vector machine
SVM [38] is a maximum-margin predictor that finds the best hyperplane in the feature vector space, thereby establishing a boundary that maximizes the margin between data samples in distinct categories, which results in strong generalization capabilities. In this study, a linear hyperplane classifier is used, which is defined by the support vectors closest to the decision boundary. An SVM with n support vectors $v_{1}, v_{2}, \dots, v_{n}$ and weights $w_{1}, w_{2}, \dots, w_{n}$ was used to estimate the classification given by:

$\mathrm{SVM} = \sum_{i = 1}^{n} w_{i} (v_{i}, x) + b$

(5)

where a feature vector is represented by x, a bias is represented by b and, for the linear classifier used here, $(v_{i}, x)$ denotes the inner product of $v_{i}$ and x.
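Equation (5) can be checked numerically against a fitted linear SVM. The sketch below assumes scikit-learn's `SVC` with a linear kernel and synthetic data; the helper name `svm_score` is hypothetical. In scikit-learn, `dual_coef_` holds the signed weights $w_{i}$, `support_vectors_` holds the $v_{i}$ and `intercept_` holds b.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (not from the study).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

def svm_score(x):
    """Evaluate Equation (5): sum_i w_i <v_i, x> + b."""
    inner_products = svm.support_vectors_ @ x
    return float(svm.dual_coef_[0] @ inner_products + svm.intercept_[0])

manual = np.array([svm_score(x) for x in X[:10]])
library = svm.decision_function(X[:10])
print(np.allclose(manual, library))
```

The sign of the score determines the predicted class: positive values map to the second entry of `svm.classes_` and negative values to the first.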
Logistic regression
In LR [39], the predictor variable is generated by a linear combination of the input variables. The values of this predictor variable are converted into probabilities using a logistic function. It uses the logit function to calculate the chance of an event occurring:

$\mathrm{logit} (p) = \log \left( \frac{p}{1 - p} \right)$

(6)

where p is the probability of the event occurring. Maximum likelihood is used to estimate the significance of input characteristics, and the coefficients of the model give this information.
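As a small worked example, the logit in Equation (6) and the logistic function are inverses of each other, which is how a linear predictor is converted into a probability. The function names and probability values below are illustrative assumptions.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p, as in Equation (6)."""
    return np.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: maps a linear predictor z to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)          # log-odds; zero at p = 0.5 and symmetric around it
p_back = logistic(z)  # recovers the original probabilities

print(z, p_back)
```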
kNN classifier
The k-nearest neighbors algorithm [39] is a non-parametric method for classification and regression. Its input consists of the k closest training examples in the feature space. For classification, the output of kNN is a class membership: the object is assigned to the most frequent class among its k nearest neighbors by majority vote. If k = 1, the item is simply assigned to the class of its closest neighbor. In the proposed system, we used k = 7, so the majority class among the seven closest training samples is selected as the classification result.
Naïve Bayes
NB is based on the Bayes rule and assumes that features (variables) are independent of one another given the class. For a training sample s with m VOC values $\{v_{1}, v_{2}, \dots, v_{m}\}$ for the m features, the posterior probability that $s$ belongs to a class $h_{k}$ is

$p (h_{k} | s) \propto \prod_{i \in s} p (v_{i} | h_{k})$

(7)

where $p (v_{i} | h_{k})$ are conditional tables (or conditional densities) that are constructed from training instances [38].
Decision tree (DT)
The generation of a decision tree is a very efficient approach for constructing classifiers from data. The most commonly used logic technique is the tree representation. A typical decision-tree learning system uses a top-down approach to find a solution in a specific section of the search space. It splits the working area into subparts and uses the Gini Index, Gain Ratio or Information Gain to verify the purity of a class division. These metrics are utilized in the decision tree that forms the classification and regression tree (CART), C4.5 and ID3, respectively. In this study, CART is used [39].
Random forest (RF)
Random forest [39] is an ensemble learning method for classification, regression and other tasks that works by creating a large number of decision trees during training and outputting the class that is the mode of the individual trees’ classes (classification) or the mean of their predictions (regression). Random forest averages a large number of deep decision trees that are trained on different parts of the same training set, with the goal of reducing variance. This comes at the cost of a small increase in bias and some loss of interpretability, but it usually results in a significant gain in the performance of the final model. In this study, RF is built using 1000 CART decision trees.
Bagging classifier
Bagging classifiers [38] are ensemble meta-estimators that apply basic classifiers to random subsets of the original dataset before aggregating their predictions (either by vote or average) to generate a final prediction. A meta-estimator that incorporates randomness into the building technique of a black-box estimator (e.g., a decision tree) can be used to reduce the variance of a black-box estimator. In this study, the base estimator is a linear SVM, and the number of base estimators is 10.
AdaBoost classifier
An AdaBoost classifier [39] is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified cases adjusted so that subsequent classifiers focus more on difficult cases. In this study, the base estimator used is the CART decision tree, and the number of base estimators is 100.
Neural networks (NN)
The multilayer perceptron (MLP) [39] is a type of artificial neural network that is used to model complicated functions. It has three or more layers of nodes, including an input layer, one or more hidden layers and an output layer, as shown in Figure 3. MLP, in contrast to conventional techniques, does not need any prior assumptions about the distribution of the training data, which eliminates the impact of data distribution on performance. The activation function converts input to output. A cost function is used to determine the best parameter values. To improve the model, the network is trained over many iterations. The backpropagation method aids in the learning of NN parameters. In this study, the Adam optimizer is used to determine the best parameters. In the hidden layers, the rectified linear unit (ReLU) activation function is employed, and, in the output layer, the sigmoid activation function is used to produce a prediction between 0 and 1, with a value larger than 0.5 indicating malignancy and a value less than 0.5 indicating benignity.
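The nine classifiers described in this subsection can be assembled with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the authors' configuration: the source gives k = 7, 1000 trees for RF, 10 linear-SVM base estimators for bagging, 100 CART base estimators for AdaBoost, and Adam with ReLU for the NN, while the Gaussian variant of naïve Bayes, the depth of the AdaBoost base trees, the hidden-layer size, the iteration limits, the random seed and the synthetic data are assumptions introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_classifiers(seed=0):
    """Return the nine classifier types named in the text, keyed by abbreviation."""
    return {
        "SVM": SVC(kernel="linear"),
        "LR": LogisticRegression(max_iter=1000),
        "kNN": KNeighborsClassifier(n_neighbors=7),
        "NB": GaussianNB(),  # assumption: the NB variant is not stated in the source
        "DT": DecisionTreeClassifier(criterion="gini", random_state=seed),  # CART-style tree
        "RF": RandomForestClassifier(n_estimators=1000, random_state=seed),
        # The base estimator is passed positionally so the call works across versions.
        "Bagging": BaggingClassifier(SVC(kernel="linear"), n_estimators=10, random_state=seed),
        "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # depth assumed
                                       n_estimators=100, random_state=seed),
        "NN": MLPClassifier(hidden_layer_sizes=(32,),  # layer size is an assumption
                            activation="relu", solver="adam",
                            max_iter=1000, random_state=seed),
    }

# Synthetic stand-in with the same shape as the study data (504 patients, 27 features).
X, y = make_classification(n_samples=504, n_features=27, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

classifiers = build_classifiers()
accuracy = {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in classifiers.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.2f}")
```

The accuracies printed here describe the synthetic data only and say nothing about the results reported for the breath dataset.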

3. Results

The proposed lung cancer diagnosis system’s performance is measured quantitatively using well-known metrics such as accuracy, sensitivity, specificity and F-score. These metrics are computed using true positive (TP), false positive (FP), true negative (TN) and false negative (FN) counts, as defined in Equations (8)–(12). TP is the number of patients with a malignant lung nodule that is correctly classified as malignant. FP is the number of patients with a benign lung nodule that is incorrectly classified as malignant. TN is the number of patients with a benign lung nodule that is correctly classified as benign. FN is the number of patients with a malignant lung nodule that is incorrectly classified as benign.

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}

(8)

Precision = \frac{TP}{TP + FP}

(9)

Sensitivity = \frac{TP}{TP + FN}

(10)

Specificity = \frac{TN}{TN + FP}

(11)

F\text{-}Score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}

(12)
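Equations (8)–(12) translate directly into a few lines of arithmetic. The function name `diagnosis_metrics` and the confusion counts below are made up for illustration and are not results from the study.

```python
def diagnosis_metrics(tp, fp, tn, fn):
    """Compute the metrics of Equations (8)-(12) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }

# Hypothetical counts for a 100-patient test set (malignant = positive class).
metrics = diagnosis_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 85/100 and the F-score equals 2TP/(2TP + FP + FN) = 80/95, which is an algebraically equivalent form of Equation (12).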

Table 2, Table 3 and Table 4 show the results with different holdout testing techniques. These different techniques are used to train our model by using different sizes of data to check the robustness of the model and to make sure that there is no overfitting to the data in any way.

Table 2 uses 80% of the data as the training set and 20% as the testing set. Table 3 uses 75% of the data as the training set and 25% as the testing set, while Table 4 uses 70% of the data as the training set and 30% as the testing set. The best results are achieved using the 80–20% train-test split with the random forest classifier (87%, 88%, 86% and 87% for accuracy, sensitivity, specificity and F-score, respectively). RF also outperformed the other models in the 75–25% train-test split, achieving 84% accuracy, 86% sensitivity, 83% specificity and 84% for the F-score. For the 70–30% train-test split, the SVM, RF and bagging classifiers all achieved 82% accuracy, but RF outperformed the other two with an 83% F-score. From the results of these three tables, it can be concluded that the model is robust and that the features distinguish between malignant and benign nodules regardless of the split used, as there is no large difference between the accuracy measures when different amounts of data are used for training.
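The three holdout protocols can be sketched as a loop over test-set fractions. The sketch uses synthetic data of the same shape as the study dataset and a random forest with 1000 trees; the stratified split, the random seed and the variable names are assumptions, since the source does not describe how the splits were drawn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 504 samples, 27 features, two roughly balanced classes.
X, y = make_classification(n_samples=504, n_features=27, n_informative=10, random_state=0)

results = {}
for test_size in (0.20, 0.25, 0.30):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    results[test_size] = {
        "n_test": len(y_test),
        "accuracy": accuracy_score(y_test, y_pred),
        "f_score": f1_score(y_test, y_pred),
    }

for test_size, row in results.items():
    print(f"test fraction {test_size:.2f}: n_test={row['n_test']}, "
          f"accuracy={row['accuracy']:.2f}, F-score={row['f_score']:.2f}")
```

Comparing the rows shows how sensitive the scores are to the amount of training data; repeating the loop over several random seeds would give a more stable picture than a single split per ratio.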

The correlation coefficient between each feature and the target diagnosis is calculated and ranked in ascending order.

The $C_{9}H_{16}O_{2}$, $C_{5}H_{8}O$, $C_{3}H_{4}O$ and $C_{3}H_{6}O$ features have negative correlations with the target diagnosis, while the rest of the features have positive correlations. $C_{9}H_{16}O_{2}$ has the largest negative correlation ($-0.066104$), while $C_{4}H_{8}O$ has the largest positive correlation ($0.465866$).

The absolute value of each correlation coefficient between a feature and the target diagnosis is taken so that all of the values are positive; the larger the value, the stronger the correlation. Then, we picked the features whose absolute correlations were greater than 0.1; this resulted in 10 features. These 10 features were selected to be used for classification, while the remaining 17 features were dropped. The results are shown in Figure 4. RF and kNN achieved 85% accuracy, but RF outperformed kNN with an 85% F-score. Figure 5 shows the results with features whose correlation is >0.25 (6 features). Here, RF again outperformed the other models, achieving 85% accuracy, 82% sensitivity, 88% specificity and 85% for F-score. Figure 6 shows the correlation coefficient between each feature and the target diagnosis in ascending order.
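The threshold-based selection can be sketched as follows, assuming the Pearson coefficient is computed between each feature column and the binary diagnosis label. The function name `select_by_correlation` and the toy data are hypothetical.

```python
import numpy as np

def select_by_correlation(X, y, threshold):
    """Return indices of features with |r| above the threshold, plus all r values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.flatnonzero(np.abs(r) > threshold)
    return keep, r

# Toy data (not from the study): column 0 follows the label, column 1 mirrors it
# and column 2 is uncorrelated with it.
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([
    y,                         # r = +1
    1 - y,                     # r = -1
    [0, 0, 1, 1, 0, 0, 1, 1],  # r = 0
])

keep, r = select_by_correlation(X, y, threshold=0.1)
print(keep, r)
```

Taking the absolute value keeps strongly negative features such as column 1, which a signed threshold would discard.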

For feature ranking and selection, mRMR is also used. Table 5 shows the results using 17 features. RF and kNN achieved 87% accuracy, but RF outperformed kNN with an 87% F-score. Table 6 shows the results using nine features selected with mRMR. Moreover, RF outperformed other models achieving 84% accuracy, 80% sensitivity, 88% specificity and 83% for F-score. All the system components are developed using the Python scikit-learn library.

4. Discussion

The random forest outperformed the other algorithms, according to this study’s findings. This classifier is a tree-based ensemble learner. Each tree predicts the target response independently, with the final estimate based on the average of the individual trees’ estimates.

It is a nonparametric technique that allows complicated interactions between predictor variables to be modeled automatically because no predetermined form of interaction between the features and the outcome variable is assumed. Because of these benefits, this classifier is one of the most commonly used in the construction of CAD systems and has been employed in a range of clinical situations. Since RF proved to be the best of the models, the most important features as determined by the RF classifier were examined. Figure 7 shows the feature importances calculated using the RF algorithm.

$C_{4}H_{8}O$ (2-Butanone) is the most important feature according to the RF importance measure. Ketones can be derived from amino acid metabolism. The rate of tissue protein catabolism is more or less constant under normal conditions. However, it has been found that the amount of ketones is increased by protein metabolism in cachexia, which is associated with advanced cancer and other diseases [34,40]. Since cachexia usually occurs in the final stages of lung cancer, it leads to the formation of elevated ketones. Elevated concentrations of 2-butanone were found in the breath of lung cancer patients at both early and advanced stages by several international research groups [23,27,41]. Elevated 2-butanone was also found in cancer cell culture headspace samples [42]. These results lead us to hypothesize that, even at the early stages, dysregulation of lung cancer cells and tissues causes oxidative stress and high protein catabolism in the local cancer environment, which results in elevated concentrations of 2-butanone in the exhaled breath of lung cancer patients.
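An RF importance ranking of the kind discussed above can be obtained from a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which only one column carries class information; the data, the seed and the variable names are assumptions, and the impurity-based importance used by scikit-learn may differ from the measure behind Figure 7.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (not from the study): 5 noise features, then shift column 2 by class.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
X[:, 2] += 2.0 * y  # only column 2 separates the two classes

rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)
importances = rf.feature_importances_     # impurity-based, normalized to sum to 1
ranking = np.argsort(importances)[::-1]   # most important feature first

print(ranking, np.round(importances, 3))
```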

Moreover, $C_{4}H_{8}O$ (2-Butanone) has the highest Pearson correlation coefficient value and is ranked first using the mRMR technique. This suggests that the value of this feature may be considered a good indicator for lung cancer diagnosis. Table 7 shows the results using only the values of the $C_{4}H_{8}O$ (2-Butanone) feature. The accuracy results using SVM, LR, bagging and NN classifiers are 86%, which approaches the best result obtained using the RF classifier when using all the features (87%). This leads us to the conclusion that $C_{4}H_{8}O$ (2-Butanone) can be used as a significant indicator for the diagnosis of lung cancer. The accuracy obtained using the 17 features selected with the mRMR technique is also 87%, so the remaining 10 features can be dropped, which reduces the complexity of training while retaining the highest accuracy achieved using all the features (87%). Finally, all results show that this technique is practical for use in medical applications, as it gives an accurate classification of lung cancer without any need for an invasive and expensive method, such as a biopsy, or for multiple CT scans, which are needed to check the nodule’s growth rate.

Author Contributions

Conceptualization, A.E., A.S. (Ahmed Shaffie), A.S. (Ahmed Soliman) and A.E.-B.; breath analysis, X.-A.F. and M.H.N.; methodology, A.E., A.S. (Ahmed Shaffie) and A.S. (Ahmed Soliman); validation, A.S. (Ahmed Shaffie) and A.S. (Ahmed Soliman); formal analysis, A.E., A.S. (Ahmed Shaffie), A.S. (Ahmed Soliman) and A.E.-B.; investigation, M.H.N., X.-A.F. and V.v.B.; resources, M.H.N., X.-A.F., V.v.B. and A.E.-B.; data curation, A.S. (Ahmed Shaffie), A.S. (Ahmed Soliman) and V.v.B.; writing—original draft preparation, A.S. (Ahmed Shaffie), A.S. (Ahmed Soliman), A.E. and X.-A.F.; writing—review and editing, A.S. (Ahmed Shaffie), A.S. (Ahmed Soliman), M.H.N., X.-A.F. and A.E.-B.; supervision, G.G., X.-A.F., M.H.N., V.v.B. and A.E.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by U.S. Department of Defense, grant number W81XWH-19-1-0799.

Institutional Review Board Statement

The research protocol was approved by the Institutional Review Board (IRB) at the University of Louisville (10.0642), and all methods were performed in accordance with the relevant guidelines and regulations.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The locally acquired dataset presented in this study is not publicly available, as it is protected under IRB number 10.0642.

Conflicts of Interest

The authors declare no conflict of interest.

References

Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249.
Midthun, D.E. Early diagnosis of lung cancer. F1000prime Rep. 2013, 5, 12.
Siegel, R.L.; Miller, K.D.; Fuchs, H.E.; Jemal, A. Cancer statistics, 2021. CA Cancer J. Clin. 2021, 71, 7–33.
Gordienko, Y.; Gang, P.; Hui, J.; Zeng, W.; Kochura, Y.; Alienin, O.; Rokovyi, O.; Stirenko, S. Deep learning with lung segmentation and bone shadow exclusion techniques for chest X-ray analysis of lung cancer. In Proceedings of the International Conference on Computer Science, Engineering and Education Applications, Kiev, Ukraine, 18–20 January 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 638–647.
Feng, P.H.; Chen, T.T.; Lin, Y.T.; Chiang, S.Y.; Lo, C.M. Classification of lung cancer subtypes based on autofluorescence bronchoscopic pattern recognition: A preliminary study. Comput. Methods Programs Biomed. 2018, 163, 33–38.
Hyun, S.H.; Ahn, M.S.; Koh, Y.W.; Lee, S.J. A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer. Clin. Nucl. Med. 2019, 44, 956–960.
Bębas, E.; Borowska, M.; Derlatka, M.; Oczeretko, E.; Hładuński, M.; Szumowski, P.; Mojsak, M. Machine-learning-based classification of the histological subtype of non-small-cell lung cancer using MRI texture analysis. Biomed. Signal Process. Control 2021, 66, 102446.
De Mesquita, V.A.; Cortez, P.C.; Ribeiro, A.B.; de Albuquerque, V.H.C. A novel method for lung nodule detection in computed tomography scans based on Boolean equations and vector of filters techniques. Comput. Electr. Eng. 2022, 100, 107911.
Da Nóbrega, R.V.M.; Rebouças Filho, P.P.; Rodrigues, M.B.; da Silva, S.P.; Dourado Júnior, C.M.; de Albuquerque, V.H.C. Lung nodule malignancy classification in chest computed tomography images using transfer learning and convolutional neural networks. Neural Comput. Appl. 2020, 32, 11065–11082.
Barros, A.C.; Ramalho, G.L.; Pereira, C.R.; Papa, J.P.; de Albuquerque, V.H.C.; Tavares, J.M.R. Automated recognition of lung diseases in CT images based on the optimum-path forest classifier. Neural Comput. Appl. 2019, 31, 901–914.
Rodrigues, M.B.; Da Nobrega, R.V.M.; Alves, S.S.A.; Reboucas Filho, P.P.; Duarte, J.B.F.; Sangaiah, A.K.; De Albuquerque, V.H.C. Health of things algorithms for malignancy level classification of lung nodules. IEEE Access 2018, 6, 18592–18601.
Jett, J. Screening for lung cancer: Who should be screened? Arch. Pathol. Lab. Med. 2012, 136, 1511–1514.
Liu, B.; Ricarte Filho, J.; Mallisetty, A.; Villani, C.; Kottorou, A.; Rodgers, K.; Chen, C.; Ito, T.; Holmes, K.; Gastala, N.; et al. Detection of Promoter DNA Methylation in Urine and Plasma Aids the Detection of Non–Small Cell Lung Cancer. Clin. Cancer Res. 2020, 26, 4339–4348.
Li, R.; Todd, N.W.; Qiu, Q.; Fan, T.; Zhao, R.Y.; Rodgers, W.H.; Fang, H.B.; Katz, R.L.; Stass, S.A.; Jiang, F. Genetic Deletions in Sputum as Diagnostic Markers for Early Detection of Stage I Non–Small Cell Lung Cancer. Clin. Cancer Res. 2007, 13, 482–487.
Hanai, Y.; Shimono, K.; Matsumura, K.; Vachani, A.; Albelda, S.; Yamazaki, K.; Beauchamp, G.K.; Oka, H. Urinary volatile compounds as biomarkers for lung cancer. Biosci. Biotechnol. Biochem. 2012, 76, 679–684.
Zhang, C.; Leng, W.; Sun, C.; Lu, T.; Chen, Z.; Men, X.; Wang, Y.; Wang, G.; Zhen, B.; Qin, J. Urine proteome profiling predicts lung cancer from control cases and other tumors. EBioMedicine 2018, 30, 120–128.
Li, C.; Hong, W. Research status and funding trends of lung cancer biomarkers. J. Thorac. Dis. 2013, 5, 698.
Bel’skaya, L.V.; Sarf, E.A.; Kosenok, V.K.; Gundyrev, I.A. Biochemical markers of saliva in lung cancer: Diagnostic and prognostic perspectives. Diagnostics 2020, 10, 186.
Taghizadeh-Hesary, F.; Akbari, H.; Bahadori, M. Anti-mitochondrial therapy: A potential therapeutic approach in oncology. Preprints 2022.
Janfaza, S.; Khorsand, B.; Nikkhah, M.; Zahiri, J. Digging deeper into volatile organic compounds associated with cancer. Biol. Methods Protoc. 2019, 4, bpz014.
Kort, S.; Tiggeloven, M.; Brusse-Keizer, M.; Gerritsen, J.; Schouwink, J.; Citgez, E.; de Jongh, F.; Samii, S.; van der Maten, J.; van den Bogart, M.; et al. Multi-centre prospective study on diagnosing subtypes of lung cancer by exhaled-breath analysis. Lung Cancer 2018, 125, 223–229.
Phillips, M.; Altorki, N.; Austin, J.H.; Cameron, R.B.; Cataneo, R.N.; Greenberg, J.; Kloss, R.; Maxfield, R.A.; Munawar, M.I.; Pass, H.I.; et al. Prediction of lung cancer using volatile biomarkers in breath. Cancer Biomark. 2007, 3, 95–109.
Bajtarevic, A.; Ager, C.; Pienz, M.; Klieber, M.; Schwarz, K.; Ligor, M.; Ligor, T.; Filipiak, W.; Denz, H.; Fiegl, M.; et al. Noninvasive detection of lung cancer by analysis of exhaled breath. BMC Cancer 2009, 9, 1–16.
Mazzone, P.J.; Wang, X.F.; Lim, S.; Jett, J.; Choi, H.; Zhang, Q.; Beukemann, M.; Seeley, M.; Martino, R.; Rhodes, P. Progress in the development of volatile exhaled breath signatures of lung cancer. Ann. Am. Thorac. Soc. 2015, 12, 752–757.
Gasparri, R.; Santonico, M.; Valentini, C.; Sedda, G.; Borri, A.; Petrella, F.; Maisonneuve, P.; Pennazza, G.; D’Amico, A.; Di Natale, C.; et al. Volatile signature for the early diagnosis of lung cancer. J. Breath Res. 2016, 10, 016007.
Bousamra, M., II; Schumer, E.; Li, M.; Knipp, R.J.; Nantz, M.H.; Van Berkel, V.; Fu, X.A. Quantitative analysis of exhaled carbonyl compounds distinguishes benign from malignant pulmonary disease. J. Thorac. Cardiovasc. Surg. 2014, 148, 1074–1081.
Fu, X.A.; Li, M.; Knipp, R.J.; Nantz, M.H.; Bousamra, M. Noninvasive detection of lung cancer using exhaled breath. Cancer Med. 2014, 3, 174–181.
Li, M.; Yang, D.; Brock, G.; Knipp, R.J.; Bousamra, M.; Nantz, M.H.; Fu, X.A. Breath carbonyl compounds as biomarkers of lung cancer. Lung Cancer 2015, 90, 92–97.
Schumer, E.M.; Trivedi, J.R.; van Berkel, V.; Black, M.C.; Li, M.; Fu, X.A.; Bousamra, M., II. High sensitivity for lung cancer detection using analysis of exhaled carbonyl compounds. J. Thorac. Cardiovasc. Surg. 2015, 150, 1517–1524.
Schumer, E.M.; Black, M.C.; Bousamra, M., II; Trivedi, J.R.; Li, M.; Fu, X.A.; van Berkel, V. Normalization of exhaled carbonyl compounds after lung cancer resection. Ann. Thorac. Surg. 2016, 102, 1095–1100.
Pesesse, R.; Stefanuto, P.H.; Schleich, F.; Louis, R.; Focant, J.F. Multimodal chemometric approach for the analysis of human exhaled breath in lung cancer patients by TD-GC×GC-TOFMS. J. Chromatogr. B 2019, 1114, 146–153.
Koureas, M.; Kirgou, P.; Amoutzias, G.; Hadjichristodoulou, C.; Gourgoulianis, K.; Tsakalof, A. Target analysis of volatile organic compounds in exhaled breath for lung cancer discrimination from other pulmonary diseases and healthy persons. Metabolites 2020, 10, 317.
Tsou, P.H.; Lin, Z.L.; Pan, Y.C.; Yang, H.C.; Chang, C.J.; Liang, S.K.; Wen, Y.F.; Chang, C.H.; Chang, L.Y.; Yu, K.L.; et al. Exploring volatile organic compounds in breath for high-accuracy prediction of lung cancer. Cancers 2021, 13, 1431.
Hakim, M.; Broza, Y.Y.; Barash, O.; Peled, N.; Phillips, M.; Amann, A.; Haick, H. Volatile organic compounds of lung cancer and possible biochemical pathways. Chem. Rev. 2012, 112, 5949–5966.
Kim, K.H.; Jahan, S.A.; Kabir, E. A review of breath analysis for diagnosis of human health. TrAC Trends Anal. Chem. 2012, 33, 1–8.
Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4.
Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66.
Asuntha, A.; Srinivasan, A. Deep learning for lung Cancer detection and classification. Multimed. Tools Appl. 2020, 79, 7731–7762.
Gupta, S.; Sedamkar, R. Machine learning for healthcare: Introduction. In Machine Learning with Health Care Perspective; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 1–25.
Murray, R.K.; Granner, D.K.; Rodwell, V.W. Harper’s Illustrated Biochemistry, 27th ed.; Lange Medical Books: New York, NY, USA, 2006.
Kischkel, S.; Miekisch, W.; Sawacki, A.; Straker, E.M.; Trefz, P.; Amann, A.; Schubert, J.K. Breath biomarkers for lung cancer detection and assessment of smoking related effects—Confounding variables, influence of normalization and statistical algorithms. Clin. Chim. Acta 2010, 411, 1637–1644.
Sponring, A.; Filipiak, W.; Mikoviny, T.; Ager, C.; Schubert, J.; Miekisch, W.; Amann, A.; Troppmair, J. Release of volatile organic compounds from the lung cancer cell line NCI-H2087 in vitro. Anticancer Res. 2009, 29, 419–426.

Figure 1. Methodology procedure.

Figure 2. (a) A snapshot of the breath collecting system; callouts point to an optical image of the DRIE-fabricated microchip, with fused silica tubes attached to the intake and exit ports, and to an SEM micrograph of the micropillar array within the preconcentrator; (b) setup diagram for capturing carbonyl VOCs from exhaled breath.

Figure 3. Simple neural network architecture.

Figure 4. Results using 10 features whose correlation > 0.1.

Figure 5. Results using 6 features whose correlation > 0.25.

Figure 6. Correlation coefficient between feature and diagnosis.

Figure 7. Random forest feature importance.
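
The thresholds in the captions of Figures 4 and 5 (correlation > 0.1 and > 0.25) can be illustrated with a minimal sketch of correlation-based feature filtering. It is an assumption-laden illustration, not the authors' code: the feature names (`voc_a`, `voc_b`, `voc_c`), the toy values, and the helper `select_features` are hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption that the captions do not confirm.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_features(
    features: dict[str, list[float]], labels: list[float], threshold: float
) -> list[str]:
    """Keep features whose |correlation with the label| exceeds the threshold.

    Using the absolute value is an assumption made for this sketch.
    The result is ordered from strongest to weakest correlation.
    """
    scores = {name: pearson(values, labels) for name, values in features.items()}
    kept = [name for name, r in scores.items() if abs(r) > threshold]
    return sorted(kept, key=lambda name: abs(scores[name]), reverse=True)


# Toy data (hypothetical, not from the study): 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "voc_a": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # strongly related to the label
    "voc_b": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # weakly (negatively) related
    "voc_c": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],  # unrelated to the label
}

print(select_features(features, labels, threshold=0.25))
```

Raising the threshold shrinks the selected set, which mirrors the drop from 10 features at 0.1 to 6 features at 0.25 reported in the captions.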

Table 1. Patients' clinical characteristics.

Age (years)	30–96
Malignant	252
Height (cm)	126–193
Weight (kg)	33–183
Active smoker	169
Previous smoker	232
Lifelong non-smoker	100
Personal history of lung cancer	61
Family history of lung cancer	125

Table 2. Results with 80–20% train-test split.

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.84	0.78	0.90	0.83
LR	0.82	0.74	0.90	0.80
kNN	0.80	0.70	0.90	0.78
NB	0.59	0.32	0.86	0.44
DT	0.68	0.58	0.78	0.64
RF	0.87	0.88	0.86	0.87
Bagging	0.84	0.82	0.86	0.84
AdaBoost	0.75	0.74	0.76	0.75
NN	0.69	0.66	0.73	0.68

Table 3. Results with 75–25% train-test split.

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.83	0.79	0.87	0.83
LR	0.81	0.78	0.84	0.80
kNN	0.79	0.68	0.89	0.76
NB	0.63	0.38	0.87	0.51
DT	0.68	0.67	0.70	0.68
RF	0.84	0.86	0.83	0.84
Bagging	0.83	0.81	0.86	0.83
AdaBoost	0.74	0.76	0.71	0.74
NN	0.68	0.68	0.68	0.68

Table 4. Results with 70–30% train-test split.

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.82	0.78	0.87	0.81
LR	0.81	0.78	0.84	0.80
kNN	0.77	0.66	0.88	0.74
NB	0.63	0.39	0.86	0.51
DT	0.69	0.66	0.72	0.68
RF	0.82	0.84	0.80	0.83
Bagging	0.82	0.80	0.84	0.82
AdaBoost	0.78	0.80	0.75	0.78
NN	0.69	0.67	0.71	0.68

Table 5. Results using first 17 features ranked using mRMR.

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.79	0.68	0.90	0.76
LR	0.78	0.66	0.90	0.75
kNN	0.87	0.78	0.96	0.86
NB	0.62	0.36	0.88	0.49
DT	0.74	0.68	0.80	0.72
RF	0.87	0.86	0.88	0.87
Bagging	0.79	0.66	0.92	0.76
AdaBoost	0.75	0.74	0.76	0.75
NN	0.69	0.58	0.80	0.65

Table 6. Results using first 9 features ranked using mRMR.

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.79	0.68	0.90	0.76
LR	0.78	0.66	0.90	0.75
kNN	0.81	0.76	0.86	0.80
NB	0.75	0.54	0.96	0.68
DT	0.65	0.60	0.71	0.63
RF	0.84	0.80	0.88	0.83
Bagging	0.82	0.72	0.92	0.80
AdaBoost	0.78	0.72	0.84	0.77
NN	0.73	0.68	0.78	0.72

Table 7. Results using C₄H₈O (2-Butanone).

Algorithm	Accuracy	Sensitivity	Specificity	F-Score
SVM	0.86	0.80	0.92	0.85
LR	0.86	0.80	0.92	0.85
kNN	0.77	0.68	0.86	0.75
NB	0.81	0.66	0.96	0.78
DT	0.65	0.58	0.73	0.62
RF	0.66	0.60	0.73	0.64
Bagging	0.86	0.80	0.92	0.85
AdaBoost	0.66	0.60	0.73	0.64
NN	0.86	0.80	0.92	0.85
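
The four columns reported in Tables 2–7 are related through the confusion matrix of a binary classifier. The sketch below shows the standard definitions, assuming that "F-Score" means the F1 score and that malignant is the positive class; neither is stated in the tables themselves. The counts are hypothetical values for a balanced 100-case test set, chosen only because they reproduce the RF row of Table 2; the paper does not report confusion matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # malignant cases predicted malignant
    fn: int  # malignant cases predicted benign
    tn: int  # benign cases predicted benign
    fp: int  # benign cases predicted malignant


def metrics(c: ConfusionCounts) -> dict[str, float]:
    """Accuracy, sensitivity, specificity and F1 from binary confusion counts.

    Assumes every denominator is non-zero (both classes are present and at
    least one case is predicted positive).
    """
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # recall on malignant cases
        "specificity": c.tn / (c.tn + c.fp),  # recall on benign cases
        "f_score": 2 * c.tp / (2 * c.tp + c.fp + c.fn),  # F1
    }


# Hypothetical counts (not from the paper): 50 malignant and 50 benign cases.
example = ConfusionCounts(tp=44, fn=6, tn=43, fp=7)
rounded = {name: round(value, 2) for name, value in metrics(example).items()}
print(rounded)
```

On a balanced test set, accuracy is the mean of sensitivity and specificity, which is why rows such as SVM in Table 2 (0.78 and 0.90) give an accuracy of 0.84.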

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

