Exploring Promising Biomarkers for Alzheimer’s Disease through the Computational Analysis of Peripheral Blood Single-Cell RNA Sequencing Data

Krokidis, Marios G.; Vrahatis, Aristidis G.; Lazaros, Konstantinos; Vlamos, Panagiotis

doi:10.3390/app13095553

by Marios G. Krokidis *, Aristidis G. Vrahatis, Konstantinos Lazaros and Panagiotis Vlamos

Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(9), 5553; https://0-doi-org.brum.beds.ac.uk/10.3390/app13095553

Submission received: 3 March 2023 / Revised: 7 April 2023 / Accepted: 27 April 2023 / Published: 29 April 2023

(This article belongs to the Special Issue Machine/Deep Learning: Applications, Technologies and Algorithms)

Abstract:

Alzheimer’s disease (AD) represents one of the most important healthcare challenges of the current century, characterized as an expanding, “silent pandemic”. Recent studies suggest that the peripheral immune system may participate in AD development; however, the molecular components of these cells in AD remain poorly understood. Although single-cell RNA sequencing (scRNA-seq) offers a sufficient exploration of various biological processes at the cellular level, the number of existing works is limited, and no comprehensive machine learning (ML) analysis has yet been conducted to identify effective biomarkers in AD. Herein, we introduced a computational workflow using both deep learning and ML processes examining scRNA-seq data obtained from the peripheral blood of both Alzheimer’s disease patients with an amyloid-positive status and healthy controls with an amyloid-negative status, totaling 36,849 cells. The output of our pipeline contained transcripts ranked by their level of significance, which could serve as reliable genetic signatures of AD pathophysiology. The comprehensive functional analysis of the most dominant genes in terms of biological relevance to AD demonstrates that the proposed methodology has great potential for discovering blood-based fingerprints of the disease. Furthermore, the present approach paves the way for the application of ML techniques to scRNA-seq data from complex disorders, providing new challenges to identify key biological processes from a molecular perspective.

Keywords:

feature selection; machine learning; deep learning; big data; Alzheimer’s disease; ensemble method

1. Introduction

Alzheimer’s disease (AD) is a chronic degenerative disease of the brain and is the most common cause of dementia. Its main clinical manifestation is memory impairment, but other brain functions are also affected, and thus, the patient’s social life is seriously impaired [1]. Exploring the molecular characteristics of blood cells using single-cell RNA-sequencing (scRNA-seq) data has emerged as a powerful approach to unravelling the molecular basis of AD at the cellular level [2]. It has enabled the identification of transcriptional changes in individual cells, providing unprecedented resolution and sensitivity to detect early changes in disease progression. Furthermore, scRNA-seq can capture the heterogeneity of cell types within the peripheral immune system, which is crucial for understanding the immune response to AD [3,4]. Due to these advantages, scRNA-seq analysis of liquid biopsies, such as blood, is considered a first choice for addressing the challenges of AD discovery through non-invasive, blood-based sampling [5]. Although this area may offer promising results, it is in its infancy, and only a limited number of studies have been published [6,7,8]. It is worth mentioning that machine learning (ML) analysis has not yet been applied to scRNA-seq data for the purpose of extracting valuable knowledge and identifying potential AD biomarkers. Therefore, there is a significant need for novel approaches to extract insights from such data and advance our understanding of AD.

One major challenge in scRNA-seq data lies in the fact that these are often generated in multiple batches, which can introduce batch effects that confound downstream analyses [9]. To address this issue, deep learning (DL) techniques have been developed to integrate scRNA-seq data from multiple batches [10,11,12]. These approaches can effectively remove batch effects, producing a more accurate representation of the underlying biological processes, and their application to scRNA-seq data integration has advanced significantly in recent years [13,14].

During single-cell RNA-sequencing studies, the initial FASTQ file output is transformed into a count matrix that summarizes the number of molecules of each gene detected in each cell in the dataset [15]. After preprocessing, the resulting count matrix contains tens of thousands of gene expression profiles (features) for each sample, leading to an extremely high-dimensional dataset. Two approaches are commonly employed to handle high-dimensional datasets: dimensionality reduction and feature selection [16]. The former transforms the initial high-dimensional space into a lower-dimensional space while keeping the distortion of pairwise distances between points below a specified threshold. The latter aims to identify the most relevant features that encapsulate the relevant information in the dataset. Because feature selection retains the original genes rather than constructing new composite features, it lends interpretability to the studied model.

In this context, we selected the study conducted by Xu et al. (2021) [6], as it provides the most indicative scRNA-seq approach for five subtypes of peripheral immune cells in AD, offering gene expression profiles for tens of thousands of selected cells. The authors employed a variety of analytical processes, including statistical tests, network-based analysis and functional enrichment analysis, and promising results were obtained, revealing abnormal changes in the immune cell composition and transcriptional state in AD. These findings suggest a disturbance in the immune infiltration of the peripheral immune environment at a single-cell level. While statistical approaches and enrichment analysis offer reliable outcomes, the use of ML methods can provide an alternative perspective and uncover hidden insights that may be missed by traditional methods. In addition, when processing scRNA-seq datasets from multiple groups, such as AD patients with different experimental batches in our case, batch effect issues need to be addressed [17]. Herein, we propose an ensemble pipeline that can extract the most significant gene signatures from scRNA-seq data for a case under study. Our method comprises two main steps: first, we employ a cutting-edge deep-learning-based method for scRNA-seq data integration, which removes batch effects; second, we employ an ensemble methodology that utilizes three ML-based feature selection algorithms and a voting scheme for ranking the leading genes. Further biological interpretation using enrichment analysis supported our findings and yielded valuable biological insights relating to the functions of a group of genes.

2. Materials and Methods

2.1. Dataset

To conduct our analysis, we utilized the peripheral blood cell dataset generated by Xu et al. (GEO accession number: GSE181279) [6]. This dataset includes 36,849 peripheral blood mononuclear cells (PBMCs) obtained from three AD patients with an amyloid-positive status and two cognitively normal controls with amyloid-negative status. Our objective was to identify the most dominant genes that differentiated AD patients from healthy individuals.

The major challenge here arose from the nature of single-cell technologies, which create batch effects. These are systematic technical variations that occur due to experimental factors such as differences in sample preparation, processing, and sequencing. Such effects can lead to false conclusions and mask true biological differences between the cells. Thus, it is important to address them during data integration in order to ensure accurate and reliable results.

2.2. Methodology

Our pipeline consisted of two primary steps: (a) the integration of scRNA-seq data by eliminating batch effects using a deep-learning approach and (b) utilizing an ensemble methodology that combined three ML-based feature selection algorithms and a voting scheme to identify the most dominant genes across single-cell RNA-seq data (Figure 1). Further details for each step are provided below. The data were pre-processed and visualized through the use of Python’s Scanpy library [18].

2.2.1. Deep Learning for Data Integration

The observed raw transcriptomic dataset was divided into five batches, with each batch labeled to indicate the source subject of the cells collected. To create a unified dataset from the five batches, we utilized a state-of-the-art tool for integrating single-cell data. We applied SCALEX [10], a deep-learning technique that integrates single-cell data by projecting cells into a batch-invariant, shared cell-embedding space without the need for model retraining. The core of the SCALEX approach is a variational autoencoder (VAE) that maps distinct dataset batches into a common low-dimensional space free from batch effects. To achieve this, the authors trained a batch-free encoder and a batch-specific decoder within the VAE framework. More specifically, the encoder component of SCALEX is batch-free and extracts biologically relevant latent features, while the decoder is batch-specific and uses batch information and a domain-specific batch normalization layer (DSBN) to reconstruct the original data from the latent features. The encoder maps individual cells to a common space using a generalized projection function that enables the integration of new data without requiring model retraining. Furthermore, SCALEX employs a mini-batch approach where data are sampled from all batches and subjected to batch normalization to adjust for deviation and align with the overall input distribution.
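The idea behind a domain-specific batch normalization layer can be illustrated with a short sketch: each batch is standardized using its own statistics, so that systematic offsets between batches are handled separately from the shared signal. This is a simplified, assumption-laden illustration in plain Python, not SCALEX's implementation; it omits the learnable scale and shift parameters and the rest of the VAE, and the function name and toy values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def domain_specific_batch_norm(rows, batch_ids, eps=1e-5):
    """Standardize each feature using statistics from the row's own batch only."""
    # Group row indices by their batch (domain) label.
    groups = defaultdict(list)
    for idx, batch in enumerate(batch_ids):
        groups[batch].append(idx)

    n_features = len(rows[0])
    out = [[0.0] * n_features for _ in rows]
    for idxs in groups.values():
        for j in range(n_features):
            column = [rows[i][j] for i in idxs]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            for i in idxs:
                out[i][j] = (rows[i][j] - mean) / sqrt(var + eps)
    return out


# Toy data (hypothetical): two batches with the same structure but a large offset.
rows = [[1.0, 10.0], [3.0, 14.0], [101.0, 50.0], [103.0, 54.0]]
batch_ids = ["batch_1", "batch_1", "batch_2", "batch_2"]
normalized = domain_specific_batch_norm(rows, batch_ids)
```

In this toy case the first cell of each batch ends up with the same normalized values, because the offset between the two batches is absorbed by the per-batch statistics.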

To integrate the data, we first classified the five batches based on the health status of the individuals from whom the cells were collected, categorized as “normal control” (NC) or “AD” (indicating Alzheimer’s disease). We then merged the batches into a single dataset and employed SCALEX to adjust for the batch effects. We used the default parameters for this analysis. After preprocessing and integration/correction, the dataset comprised 35,685 (22,775 AD and 14,074 NC) cells and 2000 genes. Figure 2 shows four tSNE plots depicting the data before and after integration/batch effect correction with SCALEX, with color-coded batch names and disease types. In addition, we applied three other cutting-edge techniques for integrating single-cell data for comparison, underscoring the capacity of SCALEX (see Supplementary Materials and Figures S1–S3).

2.2.2. An Ensemble Machine Learning Framework for Dominant Genes Identification

Following the integration of the data batches in the first step, we obtained a dataset containing 2000 genes (features). Although the dataset was not considered high dimensional, the large number of genes still posed a challenge for conducting a rigorous and reliable analysis. Feature selection was employed to address this issue.

Three different feature selection methods were utilized, each with a distinct purpose, and all of them shared a core perspective from a machine learning standpoint. The aim here was to exploit the results of three well-established feature selection methodologies, each designed to identify relevant genes in a distinct manner. More specifically, the first method leveraged differential expression analysis using a logistic regression approach to identify genes that exhibited significant differences in their expression levels across distinct groups [19]. The second approach (called triku) was based on the nearest neighbor search, aiming to identify genes closely associated with the groups of interest [20]. By utilizing the variable importance measure from the random forest algorithm [21], the third approach leveraged the inherent potential of tree-based methods to examine the significance of each gene in a machine learning prediction process. This technique demonstrated effectiveness in identifying relevant features across various datasets, including single-cell RNA-seq data [22]. In our analysis, we utilized the Mean Decrease Impurity (MDI), which measures the reduction in the impurity of the node caused by a variable. The importance of a variable can be calculated as the total reduction in impurity across all trees in which the variable was used.
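The impurity reduction underlying MDI can be made concrete with a small worked example. The sketch below computes the Gini impurity decrease produced by a single threshold split on one gene; in a random forest, MDI accumulates such decreases (weighted by the fraction of samples reaching each node) over all nodes that split on the variable and averages them across trees. The helper names, expression values and threshold are hypothetical and chosen only for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


# Toy expression values for four cells (hypothetical).
labels = ["NC", "NC", "AD", "AD"]
gene_a = [0.1, 0.2, 0.9, 1.1]  # separates the two groups at 0.5
gene_b = [0.1, 0.9, 0.2, 1.1]  # does not separate the groups at 0.5

decrease_a = impurity_decrease(gene_a, labels, threshold=0.5)
decrease_b = impurity_decrease(gene_b, labels, threshold=0.5)
```

Here the split on `gene_a` removes all impurity (a decrease of 0.5), whereas the split on `gene_b` removes none, so `gene_a` would receive the higher importance.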

In the next step, a set of 500 top genes was extracted from each feature selection method. These three lists were merged using a Borda-based aggregation technique [23] to produce a “consensus” list that included all the features identified as significant by the three feature selection methods utilized in this study. The Borda count is a voting system that assigns points to each candidate based on its position in each ranked list of preferences, with the top candidate receiving the highest number of points and subsequent candidates receiving progressively fewer. The points are summed across lists to give a total score for each candidate, and the candidate with the highest total is considered the winner. The Borda count method is commonly used in elections and decision-making processes to determine the most popular or preferred option.
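A minimal sketch of Borda-style rank aggregation is given here for illustration. It assumes one common scoring variant (a gene at zero-based position p in a list of length n receives n - p points, and a gene absent from a list receives none from it); the exact variant used in the study is not specified in the text, and the gene names and rankings are placeholders rather than results.

```python
def borda_aggregate(ranked_lists):
    """Merge several ranked lists into one consensus ranking by Borda score."""
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for position, gene in enumerate(ranking):
            # Top-ranked gene gets n points, the next n - 1, and so on.
            scores[gene] = scores.get(gene, 0) + (n - position)
    # Sort by descending score; break ties alphabetically for determinism.
    consensus = sorted(scores, key=lambda g: (-scores[g], g))
    return consensus, scores


# Placeholder rankings standing in for the three feature selection outputs.
logreg_rank = ["geneA", "geneB", "geneC"]
triku_rank = ["geneB", "geneA", "geneD"]
forest_rank = ["geneB", "geneC", "geneA"]

consensus, scores = borda_aggregate([logreg_rank, triku_rank, forest_rank])
```

With these placeholder lists, `geneB` collects 8 points and leads the consensus, followed by `geneA` (6), `geneC` (3) and `geneD` (1). Genes that appear in only one list still enter the consensus, which matches the description of a merged list containing all features selected by any method.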

3. Results and Discussion

To achieve a more robust outcome, we utilized three distinct machine learning-based feature selection methods in this study, each with its unique procedure and significance. Regarding the first method, the objective of conducting differential expression analysis was to discern the genes that exhibited differential expression levels across specific conditions. The identification of such genes could provide valuable biological insights into the underlying mechanisms affected by the conditions of interest. Logistic regression is a simple yet powerful algorithm that proves particularly useful for scRNA-seq analyses [19]. This is due to the substantial number of cells typically available in scRNA-seq experiments, as well as the ability to incorporate transcript information in gene-level testing. By utilizing logistic regression, the contribution of individual transcripts (isoforms) to the gene-level differential analysis can be identified, which enhances the interpretability of the results. Rather than adopting the conventional approach of using cell labels to explain gene expression, logistic regression is fitted for each gene to predict cell labels from the quantifications of its constituent transcripts. This approach capitalizes on the ability of logistic regression to find the optimal linear combination of isoform abundances for differential analysis. In addition, fitting a linear logistic regression is computationally efficient relative to other methods, which is particularly important as the scale of scRNA-seq experiments continues to grow.

Triku is a feature selection algorithm that belongs to the second category. It selects genes that have a localized expression and are expressed in neighboring cells that share similar characteristics (i.e., functions). The idea behind triku is that biologically important genes have localized expressions in groups of similar cells. It distinguishes between genes that are expressed in a few unrelated cells (uninteresting) and genes expressed in a set of similar cells (interesting) by looking at the expression of nearby cells. Triku calculated the distribution of gene expression values considering the expression of neighboring cells, and this helped distinguish between the two cases. Finally, the expected distribution of gene expression values was calculated using a random sample of k cells.

Variable importance (VI) in random forests is a well-known feature selection method that estimates the importance of each feature in a dataset by measuring how much each feature contributes to the accuracy of the model [21]. In the random forest algorithm, multiple decision trees are built on different subsets of the data, and the VI score of each feature can be calculated based on how much the accuracy of the model decreases when that feature is randomly permuted. Features with high VI scores are considered to be more important, as they have a larger impact on the accuracy of the model.
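The permutation-based score described in this paragraph can be sketched as follows. For simplicity, a hand-written decision rule stands in for a trained random forest; the function names, the toy data and the rule itself are hypothetical, and the sketch only illustrates how an accuracy drop after shuffling one feature is measured.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    # Stand-in for a fitted model: it uses feature 0 and ignores feature 1.
    return 1 if row[0] > 0.5 else 0


X = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
y = [1 if row[0] > 0.5 else 0 for row in X]
importances = permutation_importance(toy_predict, X, y)
```

Because the stand-in model never reads feature 1, shuffling that column leaves the accuracy unchanged and its importance is exactly zero, while shuffling feature 0 degrades the predictions.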

Regarding our outcomes, the tSNE plots clearly indicated a significant change in the scatter pattern of the data after integration with SCALEX, suggesting that the use of SCALEX effectively corrected the batch effects, making the data appropriate for subsequent analyses. Our study capitalized on the results of three well-established feature selection methodologies, each designed to identify relevant genes in a distinct manner. By exploiting the heterogeneity across these methods, we were able to highlight a diverse set of important genes, offering a more comprehensive understanding of the underlying biological mechanisms at play in the analyzed dataset. Specifically, we utilized a differential expression-based approach, the nearest neighbor search strategy, and an embedded method employing the variable importance measure of a random forest algorithm, each contributing unique insights to the overall analysis.

It is important to point out that our analysis involved five distinct single-cell RNA-seq batches (three AD and two normal control). To ensure the reliability of our analysis, we had to tackle batch effects that could have resulted in biased or incorrect results if these batches had simply been consolidated. Without the proper integration of scRNA-seq data, we may have observed a spurious separation between Alzheimer’s and normal cells, leading to the incorrect identification of dominant genes that may have only been relevant due to the batch effects. However, by performing a deep learning-based integration of the data using the cutting-edge SCALEX tool, we were able to enhance the reliability of our findings and extract valuable biological insights.

In addition, our feature selection approach integrated three different machine learning-based strategies to capture the dynamics of gene selection. This integration resulted in a more robust outcome by identifying genes that were dominant from multiple computational perspectives. Moreover, our ranking score, which was based on the Borda count, provided a fair prioritization of the top genes that could distinguish between Alzheimer’s disease and a healthy state. Our in silico pipeline, combined with a thorough biological interpretation, yielded candidate transcriptional biomarkers for AD. However, further in vitro and in vivo validation of these findings is necessary.

We conducted an enrichment analysis using Enrichr [24,25] to explore the high-level functions and utilities of the biological system according to the exported genes. The GO function and Reactome [26] pathway enrichment analyses were performed for most of the provided DEGs (Tables S1–S4). Reactome pathway analysis showed that DEGs were mainly enriched in pathways such as the adaptive and innate immune system, the translocation of the immunological synapse, PD-1 signaling and neutrophil degranulation (Figure 3A). The enriched GO terms were divided into the biological process (BP), molecular function (MF), and cellular component (CC) ontologies. The results of the GO analysis revealed that DEGs were mainly enriched in BP terms including the regulation of the immune system, interferon-gamma-mediated signaling, antigen processing, the cellular defense response, and the positive regulation of T cell-mediated immunity (Figure 3B). MF analysis indicated that the DEGs were significantly enriched in MHC class II receptor activity, MHC class II protein complex binding, serine-type peptidase activity, ferric ion binding, superoxide-generating NADPH oxidase activator activity, Tat protein binding and serine-type endopeptidase activity (Figure 3C). For CC, the DEGs were enriched in the MHC protein complex, the MHC class II protein complex, the luminal side of the endoplasmic reticulum membrane, the integral component of the luminal side of the endoplasmic reticulum membrane and the endoplasmic reticulum to Golgi transport vesicle membrane (Figure 3D).

More precisely, based on our analysis, EFhd2 encodes a conserved calcium-binding protein that is highly involved in neurodegeneration. It is widely expressed in the brain and central nervous system with cell-type specific functions, including microtubule transport, and is more abundant in cortical neurons [27]. Mielenz and Gunn-Moore found that the down-regulation of the protein could enhance synaptic delivery through kinesin-mediated transport and regulate synaptic plasticity through F-actin binding and Cdc42, Rac and Rho activity [28]. Previous studies identified how EFhd2 interacts with pathologically aggregated tau protein and accomplishes the formation of an amyloid structure [29], while recently, it was reported that EFhd2 in mice interacts with several intracellular components such as vesicle trafficking modulators, cellular stress response-regulating proteins, and metabolic proteins [30]. Pathway enrichment analysis showed that EFhd2 was enriched in the RHO GTPase cycle and in signaling by Rho GTPases and Miro GTPases (Table S5), which are involved in the regulation of cell migration, neuronal development and mitochondrial distribution with neuronal activity [31,32]. MF analysis showed enrichment in cadherin binding; cadherins participate in the development and functional organization of adult neural tissue, axon elongation and synaptogenesis [33]. The gene encoding ribosomal protein L11 (RPL11) was also detected. In a recent study, RPL11 was detected in the cerebral gray and white matter of AD and control donors [34]. Regarding TGFBR3, increased levels of this gene have been observed through functional genomic analysis in the temporal cortex of AD patients; in particular, its promoter methylation level was reduced and negatively related to the Aβ level. Promoter hypomethylation of the gene has been linked to Aβ accumulation through the enhancement of β- and γ-secretase activities, while mediators of the synaptic vesicle cycle, calcium signaling pathway and glutamatergic synapse pathways are associated with the expressed protein [35].

Ras Homolog Family Member C (RhoC), another high-scoring gene in our analysis, encodes a member of the Rho family of small GTPases, which are involved in the reorganization of the actin cytoskeleton, while the expression of some members of this family has been linked with the diminution of human brain malignancy [36]. This was supported by the pathway enrichment analysis in Table S6, which mainly shows Rho GTPase activation pathways as well as semaphorin interactions and Sema4D-induced cell migration. In neural stem/progenitor cells (NSPCs), the expression of RhoC was downregulated in the presence of Aβ42, demonstrating its contribution to the migration of these cell types and its value as an indicator of NSPCs’ migration capacity [37]. GO analysis demonstrated that RhoC was significantly enriched, in BP, in the regulation of lipase activity, cortical cytoskeleton organization and mitotic cytokinesis; in MF, in GTP and guanyl ribonucleotide binding; and in CC, in neuron projection, cytoskeleton, and intracellular membrane-bounded and non-membrane-bounded organelles (Tables S7–S9).

LMNA encodes lamins A and C, which are both produced from the LMNA gene through alternative splicing and form part of the nuclear lamina, a two-dimensional matrix of proteins located next to the inner nuclear membrane. Deficiencies in A-type lamins account for cardiomyopathy, muscular dystrophy, peripheral neuropathy and progeroid disorders, although the significance of lamin A in the brain is not clear, probably because A-type lamins are expressed at later stages when cells undergo differentiation. According to previous studies, the expression of lamin A may depend on the epigenetic regulation of LMNA. However, it should be stressed that the adult brain preferentially expresses lamin C rather than lamin A [38]. Although mutations in the gene have been related to an increasing number of diseases, often involving skeletal and cardiac muscle, Benedetti et al. indicated a correlation of the gene with patients affected by neuromuscular disease [39]. Pathway enrichment analysis showed that LMNA was mainly enriched in the depolymerization of the nuclear lamina, deregulated CDK5 neurodegenerative pathways in AD models, the defective intrinsic pathway for apoptosis and meiotic synapsis, while BP showed enrichment in the regulation of cell aging and mitotic nuclear membrane organization and reassembly (Tables S10 and S11).

Fc Epsilon Receptor Ig (FCER1G), another high-scoring gene, has been shown to be expressed in dissociated mouse dorsal root ganglia neurons, but the function of the neuronal protein is unknown. Liu et al. showed the significant upregulation of the protein in small-sized trigeminal ganglion neurons in the ovalbumin-sensitized mouse model [40]. The results of the Reactome pathway analysis revealed that FCER1G was enriched in platelet adhesion to exposed collagen, the dectin-2 family and neutrophil degranulation, while BP showed enrichment in neutrophil degranulation, neutrophil activation and myeloid cell activation involved in immune response (Tables S12 and S13). Fc Gamma Receptor IIIa (FCGR3A) encodes a receptor for the Fc portion of immunoglobulin G and participates in the removal of antigen–antibody complexes from the circulation, antibody-dependent cell-mediated cytotoxicity, and the antibody-dependent enhancement of virus infections. FCGR3A was recently detected as a significant contributor in peripheral blood nucleated cells derived from AD patients in an integrated pipeline based on single-cell analysis [41]. In BP, it was found to be significantly enriched in peptide cross-linking and the regulation of inflammatory responses.

PELI1 (Pellino E3 Ubiquitin Protein Ligase 1) is involved in the negative regulation of the necroptotic process as well as in protein polyubiquitination and promotes responses to lipopolysaccharide. Its expression can be detected in murine brain neural cells, with the highest expression level being in the microglia [42]. It is also involved in microglial Aβ phagocytosis by inhibiting the expression of scavenger receptor CD36, which has been demonstrated as a potent therapeutic target against neurodegenerative diseases [43]. According to GO analysis, PELI1 enrichment in BP is found in the Toll signaling pathway, the regulation of the necroptotic process and protein K63- and K48-linked ubiquitination (Table S14). The important role of E3 ubiquitin ligases and deubiquitinating enzymes in the pathogenesis mechanisms of neurodegenerative diseases has been extensively analyzed, such as their association with α-synuclein in Parkinson’s disease (PD), as well as their deficiency in mitophagy and mitochondrial dynamics in relation to Parkin’s response [44].

ZEB2, which encodes the Zinc Finger E-Box Binding Homeobox 2 protein, was significantly enriched in the regulation of the transforming growth factor beta receptor signaling pathway and the regulation of organelle organization (Table S15). Upregulated levels of this transcription factor have been detected in the prospective neural plate, where it induces the expression of the neural-specific genes Sox2 and Ncam; it also participates in the differentiation of granule neurons in the cerebellum and, together with miR-200c, maintains the balance between midbrain dopaminergic progenitor proliferation and neurogenesis [45].

The NR4A2 (nuclear receptor subfamily 4 group A member 2) gene encodes a member of the steroid-thyroid hormone-retinoid receptor superfamily, which acts as a nuclear receptor and a transcription factor. Mutations in this gene have been associated with disorders related to dopaminergic dysfunction, including PD, AD and multiple sclerosis. It has been linked with neuroinflammation, neuronal cell death as well as neuroprotection, and its expression is reduced in post-mortem brains of PD patients. It regulates dopaminergic neuronal differentiation and also localizes in microglia and astrocytes [46]. Nr4a2 is induced by neuronal activity and regulates hippocampal synaptic plasticity and hippocampal-dependent memory [47]. The results of GO analysis indicated that NR4A2 in BP was enriched in midbrain dopaminergic neuron differentiation, dopaminergic neuron differentiation, central nervous system neuron differentiation, regulation of apoptotic signaling pathway and neuron migration (Table S16).

The functional role of the integrin alpha4 subunit (ITGA4) in neural crest cell migration has been detected [48], and an anti-integrin therapeutic strategy against multiple sclerosis has been described [49]. ITGA4 was found to be significantly enriched in the integrin cell surface interactions, immunoregulatory interactions between a lymphoid and a non-lymphoid cell, and extracellular matrix organization pathways (Table S17). It also participates in T cell trafficking during various inflammatory responses and in CNS pathologies such as autoimmune encephalomyelitis, while improved cognitive function in mice has been induced by the inhibition of α4β1 integrin [50]. Further ontological analysis and gene-annotation cluster networks using GeneCodis4 were implemented, as shown in Figure 4 (and Figures S4–S7).

Alzheimer’s disease has a complex etiology involving the interaction of genetic and environmental factors. It can be divided into familial and sporadic forms. The appearance of the familial form is associated with mutations in APP, PSEN1 and PSEN2. The rare familial form of the disease accounts for 1–2% of all dementia cases and is inherited from generation to generation in an autosomal dominant manner; on the other hand, the sporadic form concerns 98–99% of cases and occurs in people over 65 years of age, with the main predisposing genes for Alzheimer’s disease being APOE, CLU, CR1 and PICALM [51,52]. Previous studies have used scRNA-seq analysis to explore differential cell subpopulations in the PBMCs of AD patients, indicating specifically expressed genes in B cells [8], as well as in post-mortem brain tissues of AD patients, whereas cross-study large-scale transcriptomic approaches were performed to reveal the role of aberrant gene expression in the disease [53,54]. These methods provide well-defined networks of genes that could be potential targets for early disease intervention.

4. Conclusions

Emerging next-generation sequencing technologies, such as scRNA-seq, enable the whole-genome analysis of an unprecedented number of cells, producing large datasets with ultra-high volume and complexity. However, the analysis of such data poses a significant challenge due to its sheer size and complexity. In order to address these challenges, DL and ML techniques can be implemented efficiently. These techniques are well-suited to handle big data due to their inherent nature and can produce robust and reliable outcomes when applied to large and complex datasets such as those generated by NGS technologies. In our study, we proposed an ensemble pipeline that extracted the most significant gene signatures from the scRNA-seq data of PBMCs, based on a cutting-edge deep-learning-based method for scRNA-seq data integration followed by three ML-based feature selection algorithms and a voting scheme with which to rank the leading genes. Compared to other state-of-the-art integration techniques, the applied pipeline, including SCALEX, demonstrated a significantly superior performance making it a reliable tool for large-scale scRNA-seq data integration and providing significant insights within individual cells, which could serve as potent transcriptional markers for unraveling the heterogeneity and complexity of complex diseases.

Supplementary Materials

The following supporting information can be downloaded at: https://0-www-mdpi-com.brum.beds.ac.uk/article/10.3390/app13095553/s1: Figure S1. tSNE embeddings of GSE181279 scRNA-seq dataset before and after integration with BBKNN; Figure S2. tSNE embeddings of GSE181279 scRNA-seq dataset before and after integration with Scanorama; Figure S3. tSNE embeddings of GSE181279 scRNA-seq dataset before and after integration with Harmonypy; Figure S4. GeneCodis ontological analysis and gene-annotation cluster networks for GO Biological Process; Figure S5. GeneCodis ontological analysis and gene-annotation cluster networks for GO Cellular Component; Figure S6. GeneCodis ontological analysis and gene-annotation cluster networks for GO Molecular Function; Figure S7. GeneCodis ontological analysis and gene-annotation cluster networks for Reactome; Table S1. The top 10 enriched Reactome pathways for DEGs; Table S2. The top 10 enriched biological processes for DEGs; Table S3. The top 10 enriched molecular functions for DEGs; Table S4. The top 10 enriched cellular components for DEGs; Table S5. The top 5 enriched Reactome pathways for EFhd2; Table S6. The top 10 enriched Reactome pathways for Rhoc; Table S7. The top 10 enriched biological processes for Rhoc; Table S8. The top 10 enriched molecular functions for Rhoc; Table S9. The top 10 enriched cellular components for Rhoc; Table S10. The top 10 enriched Reactome pathways for LMNA; Table S11. The top 10 enriched Reactome pathways for LMNA; Table S12. The top 10 enriched biological processes for FCER1G; Table S13. The top 10 enriched biological processes for FCER1G; Table S14. The top 10 enriched biological processes for PELI1; Table S15. The top 10 enriched biological processes for ZEB2; Table S16. The top 10 enriched biological processes for NR4A2; Table S17. The top 10 enriched Reactome pathways for LMNA.

Author Contributions

Conceptualization, M.G.K. and A.G.V.; methodology, A.G.V. and K.L.; software, A.G.V. and K.L.; validation, M.G.K. and A.G.V.; data curation, M.G.K. and A.G.V.; writing—original draft preparation, M.G.K., A.G.V. and K.L.; writing—review and editing, P.V.; supervision, M.G.K. and A.G.V.; funding acquisition, P.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call Regional Excellence (Research Activity in the Ionian University, for the study of protein folding in neurodegenerative diseases) (FOLDIT) MIS 5047144.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liu, P.P.; Xie, Y.; Meng, X.Y.; Kang, J.S. History and progress of hypotheses and clinical trials for Alzheimer’s disease. Signal Transduct. Target. Ther. 2019, 4, 29.
2. Phongpreecha, T.; Fernandez, R.; Mrdjen, D.; Culos, A.; Gajera, C.R.; Wawro, A.M.; Stanley, N.; Gaudilliere, B.; Poston, K.L.; Aghaeepour, N.; et al. Single-cell peripheral immunoprofiling of Alzheimer’s and Parkinson’s diseases. Sci. Adv. 2020, 6, eabd5575.
3. Chen, H.; Ye, F.; Guo, G. Revolutionizing immunology with single-cell RNA sequencing. Cell. Mol. Immunol. 2019, 16, 242–249.
4. Bettcher, B.M.; Tansey, M.G.; Dorothée, G.; Heneka, M.T. Peripheral and central immune system crosstalk in Alzheimer disease—A research prospectus. Nat. Rev. Neurol. 2021, 17, 689–701.
5. Geekiyanage, H.; Jicha, G.A.; Nelson, P.T.; Chan, C. Blood serum miRNA: Non-invasive biomarkers for Alzheimer’s disease. Exp. Neurol. 2012, 235, 491–496.
6. Xu, H.; Jia, J. Single-Cell RNA Sequencing of Peripheral Blood Reveals Immune Cell Signatures in Alzheimer’s Disease. Front. Immunol. 2021, 12, 645666.
7. Song, L.; Yang, Y.T.; Guo, Q.; ZIB Consortium; Zhao, X.M. Cellular transcriptional alterations of peripheral blood in Alzheimer’s disease. BMC Med. 2022, 20, 266.
8. Xiong, L.L.; Xue, L.L.; Du, R.L.; Niu, R.Z.; Chen, L.; Chen, J.; Hu, Q.; Tan, Y.X.; Shang, H.F.; Liu, J.; et al. Single-cell RNA sequencing reveals B cell–related molecular biomarkers for Alzheimer’s disease. Exp. Mol. Med. 2021, 53, 1888–1901.
9. Yang, Y.; Li, G.; Qian, H.; Wilhelmsen, K.C.; Shen, Y.; Li, Y. SMNN: Batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection. Brief. Bioinform. 2021, 22, bbaa097.
10. Xiong, L.; Tian, K.; Li, Y.; Ning, W.; Gao, X.; Zhang, Q.C. Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space. Nat. Commun. 2022, 13, 6118.
11. Lopez, R.; Regier, J.; Cole, M.B.; Jordan, M.I.; Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 2018, 15, 1053–1058.
12. Wang, T.; Johnson, T.S.; Shao, W.; Lu, Z.; Helm, B.R.; Zhang, J.; Huang, K. BERMUDA: A novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 2019, 20, 165.
13. Li, X.; Wang, K.; Lyu, Y.; Pan, H.; Zhang, J.; Stambolian, D.; Susztak, K.; Reilly, M.P.; Hu, G.; Li, M. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 2020, 11, 2338.
14. Wang, D.; Hou, S.; Zhang, L.; Wang, X.; Liu, B.; Zhang, Z. iMAP: Integration of multiple single-cell datasets by adversarial paired transfer networks. Genome Biol. 2021, 22, 63.
15. Hwang, B.; Lee, J.H.; Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 2018, 50, 96.
16. Townes, F.W.; Hicks, S.C.; Aryee, M.J.; Irizarry, R.A. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019, 20, 295.
17. Tran, H.T.N.; Ang, K.S.; Chevrier, M.; Zhang, X.; Lee, N.Y.S.; Goh, M.; Chen, J. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020, 21, 12.
18. Wolf, F.; Angerer, P.; Theis, F. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018, 19, 15.
19. Ntranos, V.; Yi, L.; Melsted, P.; Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods 2019, 16, 163–166.
20. Ascensión, A.M.; Ibáñez-Solé, O.; Inza, I.; Izeta, A.; Araúzo-Bravo, M.J. Triku: A feature selection method based on nearest neighbors for single-cell data. GigaScience 2022, 11, giac017.
21. Nicodemus, K.K.; Malley, J.D.; Strobl, C.; Ziegler, A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinform. 2010, 11, 110.
22. Pouyan, M.B.; Kostka, D. Random Forest based similarity learning for single cell RNA sequencing data. Bioinformatics 2018, 34, i79–i88.
23. Van Erp, M.; Schomaker, L. Variants of the borda count method for combining ranked classifier hypotheses. In Proceedings of the 7th International Workshop on Frontiers in Handwriting Recognition, Amsterdam, The Netherlands, 11–13 September 2000; pp. 443–452.
24. Xie, Z.; Bailey, A.; Kuleshov, M.V.; Clarke, D.J.; Evangelista, J.E.; Jenkins, S.L.; Lachmann, A.; Wojciechowicz, M.L.; Kropiwnicki, E.; Jagodnik, K.M.; et al. Gene set knowledge discovery with Enrichr. Curr. Protoc. 2021, 1, e90.
25. Kuleshov, M.V.; Jones, M.R.; Rouillard, A.D.; Fernandez, N.F.; Duan, Q.; Wang, Z.; Koplev, S.; Jenkins, S.L.; Jagodnik, K.M.; Lachmann, A.; et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016, 44, W90–W97.
26. Fabregat, A.; Sidiropoulos, K.; Viteri, G.; Forner, O.; Marin-Garcia, P.; Arnau, V.; D’Eustachio, P.; Stein, L.; Hermjakob, H. Reactome pathway analysis: A high-performance in-memory approach. BMC Bioinform. 2017, 18, 142.
27. Borger, E.; Herrmann, A.; Mann, D.A.; Spires-Jones, T.; Gunn-Moore, F. The calcium-binding protein EFhd2 modulates synapse formation in vitro and is linked to human dementia. J. Neuropath. Exp. Neurol. 2014, 73, 1166–1182.
28. Mielenz, D.; Gunn-Moore, F. Physiological and pathophysiological functions of Swiprosin-1/EFhd2 in the nervous system. Biochem. J. 2016, 473, 2429–2437.
29. Vega, I.E.; Traverso, E.E.; Ferrer-Acosta, Y.; Matos, E.; Colon, M.; Gonzalez, J.; Dickson, D.; Hutton, M.; Lewis, J.; Yen, S.H. A novel calcium-binding protein is associated with tau proteins in tauopathy. J. Neurochem. 2008, 106, 96–106.
30. Soliman, A.S.; Umstead, A.; Grabinski, T.; Kanaan, N.M.; Lee, A.; Ryan, J.; Lamp, J.; Vega, I.E. EFhd2 brain interactome reveals its association with different cellular and molecular processes. J. Neurochem. 2021, 159, 992–1007.
31. Lee, K.S.; Lu, B. The myriad roles of Miro in the nervous system: Axonal transport of mitochondria and beyond. Front. Cell. Neurosci. 2014, 8, 330.
32. Stankiewicz, T.R.; Linseman, D.A. Rho family GTPases: Key players in neuronal development, neuronal survival, and neurodegeneration. Front. Cell. Neurosci. 2014, 8, 314.
33. Yamagata, M.; Duan, X.; Sanes, J.R. Cadherins interact with synaptic organizers to promote synaptic differentiation. Front. Mol. Neurosci. 2018, 11, 142.
34. Suzuki, M.; Tezuka, K.; Handa, T.; Sato, R.; Takeuchi, H.; Takao, M.; Tano, M.; Uchida, Y. Upregulation of ribosome complexes at the blood-brain barrier in Alzheimer’s disease patients. J. Cereb. Blood Flow Metab. 2022, 42, 2134–2150.
35. Song, H.; Yang, J.; Yu, W. Promoter Hypomethylation of TGFBR3 as a Risk Factor of Alzheimer’s Disease: An Integrated Epigenomic-Transcriptomic Analysis. Front. Cell Dev. Biol. 2022, 9, 3944.
36. Forget, M.A.; Desrosiers, R.R.; Del Maestro, R.F.; Moumdjian, R.; Shedid, D.; Berthelet, F.; Béliveau, R. The expression of rho proteins decreases with human brain tumor progression: Potential tumor markers. Clin. Exp. Metastasis 2002, 19, 9–15.
37. Zhang, C.; Ge, X.; Lok, K.; Zhao, L.; Yin, M.; Wang, Z.J. RhoC involved in the migration of neural stem/progenitor cells. Cell. Mol. Neurobiol. 2014, 34, 409–417.
38. Jung, H.J.; Lee, J.M.; Yang, S.H.; Young, S.G.; Fong, L.G. Nuclear lamins in the brain—New insights into function and regulation. Mol. Neurobiol. 2013, 47, 290–301.
39. Benedetti, S.; Menditto, I.; Degano, M.; Rodolico, C.; Merlini, L.; D’Amico, A.; Palmucci, L.; Berardinelli, A.; Pegoraro, E.; Trevisan, C.P.; et al. Phenotypic clustering of lamin A/C mutations in neuromuscular patients. Neurology 2007, 69, 1285–1292.
40. Liu, F.; Xu, L.; Chen, N.; Zhou, M.; Li, C.; Yang, Q.; Xie, Y.; Huang, Y.; Ma, C. Neuronal Fc-epsilon receptor I contributes to antigen-evoked pruritus in a murine model of ocular allergy. Brain Behav. Immun. 2017, 61, 165–175.
41. Chen, Y.; Sun, Y.; Luo, Z.; Chen, X.; Wang, Y.; Qi, B.; Lin, J.; Lin, W.W.; Sun, C.; Zhou, Y.; et al. Exercise modifies the transcriptional regulatory features of monocytes in Alzheimer’s patients: A multi-omics integration analysis based on single cell technology. Front. Aging Neurosci. 2022, 14, 427.
42. Xiao, Y.; Jin, J.; Chang, M.; Chang, J.H.; Hu, H.; Zhou, X.; Brittain, G.C.; Stansberg, C.; Torkildsen, Ø.; Wang, X.; et al. Peli1 promotes microglia-mediated CNS inflammation by regulating Traf3 degradation. Nat. Med. 2013, 19, 595–602.
43. Xu, J.; Yu, T.; Pietronigro, E.C.; Yuan, J.; Arioli, J.; Pei, Y.; Luo, X.; Ye, J.; Constantin, G.; Mao, C.; et al. Peli1 impairs microglial Aβ phagocytosis through promoting C/EBPβ degradation. PLoS Biol. 2020, 18, e3000837.
44. Liu, N.; Lin, M.M.; Wang, Y. The Emerging Roles of E3 Ligases and DUBs in Neurodegenerative Diseases. Mol. Neurobiol. 2023, 60, 247–263.
45. Yang, S.; Toledo, E.M.; Rosmaninho, P.; Peng, C.; Uhlén, P.; Castro, D.S.; Arenas, E. A Zeb2-miR-200c loop controls midbrain dopaminergic neuron neurogenesis and migration. Commun. Biol. 2018, 1, 75.
46. Jakaria, M.; Haque, M.E.; Cho, D.Y.; Azam, S.; Kim, I.S.; Choi, D.K. Molecular insights into NR4A2 (Nurr1): An emerging target for neuroprotective therapy against neuroinflammation and neuronal cell death. Mol. Neurobiol. 2019, 56, 5799–5814.
47. Català-Solsona, J.; Miñano-Molina, A.J.; Rodríguez-Álvarez, J. Nr4a2 transcription factor in hippocampal synaptic plasticity, memory and cognitive dysfunction: A perspective review. Front. Mol. Neurosci. 2021, 14, 786226.
48. Kil, S.H.; Krull, C.E.; Cann, G.; Clegg, D.; Bronner-Fraser, M. The α4 Subunit of Integrin Is Important for Neural Crest Cell Migration. Dev. Biol. 1998, 202, 29–42.
49. Kawamoto, E.; Nakahashi, S.; Okamoto, T.; Imai, H.; Shimaoka, M. Anti-integrin therapy for multiple sclerosis. Autoimmune Dis. 2012, 2012, 357101.
50. Pietronigro, E.; Zenaro, E.; Bianca, V.D.; Dusi, S.; Terrabuio, E.; Iannoto, G.; Slanzi, A.; Ghasemi, S.; Nagarajan, R.; Piacentino, G.; et al. Blockade of α4 integrins reduces leukocyte-endothelial interactions in cerebral vessels and improves memory in a mouse model of Alzheimer’s disease. Sci. Rep. 2019, 9, 12055.
51. Lagisetty, Y.; Bourquard, T.; Al-Ramahi, I.; Mangleburg, C.G.; Mota, S.; Soleimani, S.; Shulman, J.M.; Botas, J.; Lee, K.; Lichtarge, O. Identification of risk genes for Alzheimer’s disease by gene embedding. Cell Genom. 2022, 2, 100162.
52. Kok, E.H.; Luoto, T.; Haikonen, S.; Goebeler, S.; Haapasalo, H.; Karhunen, P.J. CLU, CR1 and PICALM genes associate with Alzheimer’s-related senile plaques. Alzheimer’s Res. Ther. 2011, 3, 12.
53. Guennewig, B.; Lim, J.; Marshall, L.; McCorkindale, A.N.; Paasila, P.J.; Patrick, E.; Kril, J.J.; Halliday, G.M.; Cooper, A.A.; Sutherland, G.T. Defining early changes in Alzheimer’s disease from RNA sequencing of brain regions differentially affected by pathology. Sci. Rep. 2021, 11, 4865.
54. Williams, J.B.; Cao, Q.; Yan, Z. Transcriptomic analysis of human brains with Alzheimer’s disease reveals the altered expression of synaptic genes linked to cognitive deficits. Brain Commun. 2021, 3, fcab123.

Figure 1. Overview of the experimental computational workflow. FS: feature selection; scRNA-seq: single-cell RNA sequencing.

Figure 2. tSNE embeddings of the GSE181279 scRNA-seq dataset before (left) and after integration (right). (A) tSNE embeddings colored by batch. Before integration, the dense clusters follow batch membership, which could lead to inaccurate interpretations. After integration, the batches are more evenly mixed, indicating a reduction in the batch effect. (B) tSNE embeddings colored by disease status. In the original data, the two classes were easily separable, but this separability was caused by the batch effect and could lead to incorrect analyses. After integration, the classes show a more realistic distribution. This suggests that integrating the data with SCALEX helped mitigate the impact of the batch effect and led to a more accurate representation of the true biological signal.

Figure 3. GO term and pathway enrichment analysis performed using Enrichr on DEGs. (A) The top 10 enriched Reactome pathways for DEGs. (B) The top 10 enriched biological processes for DEGs. (C) The top 10 enriched molecular functions for DEGs. (D) The top 10 enriched cellular components for DEGs.

Figure 4. Gene-annotation cluster networks generated for the top 10 terms of each category from the identified DEG list: (A) GO biological process and (B) Reactome. Genes are shown in yellow and pathways in blue.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
