1. Introduction
The brain is a complex organ made up of more than 100 billion neurons grouped into many functional regions, that communicate with each other by means of electro-chemical signals. When referring to the brain, physical connectivity refers to the pattern of anatomical links constituted by the neurons’ axons and connected to the dendrites of post-synaptic neurons [
1]. The physical connections that link numerous groups of neurons constitute a network that at a larger scale constitutes the so-called anatomical brain connectivity.
It is shown in the literature that functional properties of neurons and neuronal systems depend on neural connectivity patterns [
2,
3]. This has long attracted the attention of neuro-anatomists, who dedicated their studies to the new field of science dealing with the assembly, mapping and analysis of the connectome [
4].
The anatomical connectivity in the brain is constituted by fibers that propagate from the neuronal bodies. These, in turn, contain the nucleus and all the nuclear components that contribute with their activity to the cellular differentiation and morphogenesis. Accordingly, the main factors influencing connectivity patterns have to be searched at the cellular scale, meaning that cellular activity influences physical brain connectivity patterns at the anatomical level. Hence, the analysis of cellular activity in the form of neuronal gene expression profiles may represent an effective way of understanding the physical connectome more in depth [
5,
6].
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product such as a protein. Gene regulation gives the cell control over structure and function, and is the basis for cellular differentiation, morphogenesis and adaptability of any organism.
Gene expression profiling is the measurement of the activity (i.e., expression) of genes. Sequence-based techniques such as RNA-Seq provide information on the sequences of genes, from which their expression level can be derived. Nonetheless, they extract information through a disruptive process of the tissue under investigation, providing gene expression levels averaged over the whole cellular population without any spatial information. On the other hand, Single-cell RNA-seq (scRNA-seq) [
7], relying on separation of single cells from tissue by enzymatic or mechanical dissociation, provide cell-specific information but even in this case with lack of information on spatial location and the micro-environment [
8]. Instead, using in-situ techniques it is possible to detect the spatial distribution of gene expression levels in the tissue. Fluorescence in situ hybridization (FISH or ISH) uses RNA or DNA complementary hybridization probes labeled to fluorescent molecules. Once the probes have hybridized the target in the fixed tissue, the transcript can be localized and quantified through fluorescence microscopy images. Thanks to this process, FISH allows to maintain both spatial and morphological information. On top of that, it generally generates better-quality images than other in situ techniques [
7], which makes it the ideal source of information for connectome studies.
Due to the crucial role played by anatomical brain connectivity, scientists generated and made available a number of brain atlases, modelling the axonal connections between different brain regions [
9,
10]. Upon these connectivity models, the scientific community conducted several studies, aimed at either detecting the existence of anatomical neural connections or spatial correlations between intrinsic properties of the brain tissue. Studies on the mouse brain based on visualization and clustering showed that gene expression and connectivity information have significant levels of spatial auto-correlation, which needs to be accounted for through integrative analysis [
11]. Based on the results of these studies, brain regions with similar expression profiles tend to have similar connectivity profiles. Likewise, brain regions which are anatomically connected to each other have gene expression patterns that are particularly similar [
11]. Some studies have also identified a set of genes that are responsible for the relationship between cellular activity and connectivity, as they are directly involved in neuronal development and axon guidance [
12]. With more in-depth investigations of the specific relationship between gene expression and connectivity in the mouse brain, it was shown that gene expression is predictive of the connectivity pattern when the connectivity signals are in a discrete form. In addition, most of the predictive power resides in the expression data from a relatively small number of genes, suggesting that very few genes are responsible for generating brain connectivity in each specific brain structure [
13].
All these findings stem from the analysis of data from the mouse brain. Nonetheless, a large number of genes in the mouse brain find a direct correspondence in the human brain, and regionally enriched genes were demonstrated to be conserved when shifting from one species to another [
14]. Based on this evidence, the combination of human and mouse single-cell trascriptomic profiles, through the application of feature selection and linear modeling, was used to provide better insights into human brain connectivity. Then, the combined data were used to demonstrate that gene expression is a better indicator of cellular localization than the location of cell nuclei, especially for cells with large and irregularly shaped cell bodies such as the neurons [
15].
Upon these considerations, gene expression data can be effectively used to automatically predict brain connectivity. While most of the works in the literature focus on either analysing the most relevant genetic signatures of neuronal connectivity [
16,
17], or on investigating the direct relationships between gene expression and brain functionality [
18,
19,
20], lesser attention is devoted to predicting anatomical connectivity at a cellular resolution, directly using the trascriptomic profile as the input baseline. The most representative works in this regard use model-based techniques (e.g., sparse linear models [
13,
21]), obtaining good prediction accuracy level (between 80% and 90%) at the well-known cost of difficult parametrization and non-obvious selection of the variables.
In this paper, we push forward the path of predicting the degree of anatomical connection of brain areas, by performing an integrative analysis of gene expression profiles and connectivity data. To do so, we interpret the prediction as a classification problem, where the input feature vectors describing gene expression profiles of brain region pairs are automatically grouped into multiple classes based on their level of physical connectivity. To solve this classification problem, we exploit neural networks, which, compared to traditional model-based techniques, have the advantage of being non-parametric, and do not require a priori definition of the mathematical relationships among variables. As such, our tool is developed on a case-study of mouse brain data, but it can be ideally applied to any other application of interest.
Our method implements all the stages that are essential to solve the connectivity classification problem in a fully automated way, including data collection, storage and pre-processing, as well as the in-depth analysis of the prediction outcome, aiming at the investigation of anatomical connections between the brain macro-regions. Based on the nature and complexity of the analysis to perform, we chose to implement a Multilayer Perceptron (MLP), a class of feed-forward artificial neural networks that is often used both for data classification and regression [
22]. This network is fed with feature vectors representing the gene expression profiles of two different brain regions. Each element of the feature vector corresponds to a gene, and more specifically to the expression level of that gene in a low-dimensional spot (i.e., a voxel) of a region. To obtain classification labels for the feature vectors, the available spatial connectivity data are aggregated, obtaining a unique value representing the connection between two regions. Then, we investigated the outcome of our network on two different tasks, respectively a multi-class classification task (with three classes corresponding to unconnected, weakly connected and strongly connected areas, respectively) and a simpler binary classification task (unconnected and connected). The analysis was performed on a very large dataset from the cortex and the cerebellum (58 regions in total). These specific regions were selected because the corresponding annotated datasets ensure a very broad representation of connectivity degrees.
4. Methods
Besides the MLP prediction model, we implemented a complete automated pipeline to handle dataset collection, as well as the organization and processing of the gene expression and connectivity data into Source-Target vectors with corresponding class labels, to be given as input to the MLP. The main steps of this pipeline, implemented exploiting the Knime framework [
28] and the SeqAn library [
29], are the following:
download of grid data from the available data sources;
processing of the raw grid data to integrate the gene expression and the connectivity information;
generation of a full and coherent dataset of Source-Target gene expression vectors and corresponding connectivity labels, ready to be cropped into training, validation and test sets for the MLP.
4.1. Download of Grid Data
The Allen Brain Atlas provides grid-data at different resolutions, consisting in 3D summaries of both the gene expression and connectivity data, re-sampled to a Common Coordinate Space of the 3D reference brain model [
30]. To enable spatially coherent processing of these two sets of data, the database provides a structural grid data annotation system at each resolution scale. This annotation allows to link mouse brain voxels to anatomical structures in the Common Coordinate Space.
Grid data is downloadable through an API service by queries. The queries were implemented through a web application (the RMA BUILDER), that is freely accessible on the Allen Brain Atlas’s API section.
4.1.1. Gene Expression
As mentioned in
Section 3, the Allen Institute Mouse Gene expression data consist of whole-brain in-situ hybridization data obtained from brains of 56-day old C57BL/6J male mice [
31]. The grid-data of the detected expression levels are provided for coronal and sagittal sections. Even though the sagittal section counts more than 20,000 genes, connectivity data are available only for the coronal section. Thus, for our study we focused on the 3318 gene expression grid-data corresponding to this specific section.
The main phases of the elaboration of gene expression data are represented in
Figure 5. The expression profile of each gene throughout the mouse’s brain is associated to a SectionDataSet, a specific data object of the Allen Brain Atlas framework where all the experiment’s information is stored. We first build a query to retrieve the SectionDataSet unique identifiers (IDs) for the gene expression experiments in the form of an XML document. Then, to retrieve the corresponding gene-expression grid-data, we build a query with the RMA BUILDER and obtain in return an energy.raw file for each of the 3318 gene expression experiments. This file contains a vector of 159,326 elements corresponding to the 3D voxels of the mouse brain model (67 × 41 × 58 voxels at 200
m resolution) that can be reconstructed leveraging the reference information provided by the database, as shown in
Figure 6 [
32]. At the end of the download procedure from the Allen Brain website [
33], we obtain a 3318 × 159,326 matrix of gene expression levels, with rows corresponding to genes and columns to 3D voxels. This matrix is stored into a single .csv file, as represented in
Figure 5.
4.1.2. Allen Connectivity
As outlined in
Section 3, The Mouse Brain Connectivity Atlas provides connectivity information in the form of axonal projections labeled by rAAV viral tracer and detected through two-photon tomography for more than 200 mouse brain regions in coronal section. Injection sites refer to the spots where the viral tracer is injected. The region where a certain injection site is placed and the region where the injection produced axonal projections are referred to as Source and Target regions, respectively. Injections involving a single region are called primary. Nonetheless, because of the small size of the mouse brain, a single injection can involve more than one region. These are called secondary injections.
In this work, we focused only on the primary injection sites, and considered connectivity data at 100 m resolution, that is the closest to the 200 m gene expression resolution among all the available ones (10, 25, 50, 100 m, respectively).
The main phases of the elaboration of connectivity data are represented in
Figure 7. Again, each primary injection site corresponds to a SectionDataSet. Hence, we first designed a query to retrieve the SectionDataSet IDs of injection experiments through the API service, which returns an XML document with 2333 primary injection IDS. Such IDs are exploited to build a query with the RMA BUILDER and retrieve the connectivity grid data in return. By doing so, we obtain 2333 .Nrrd files, each representing the axonal projections of a specific primary injection site. This provides a correspondence between the 2333 primary injections and their corresponding target regions. For connectivity data, the 3-D volumetric grid-level information at 100
m are provided in the form of a 13 × 80 × 114 numerical array, as represented in
Figure 6a. Maintaining the spatial reference provided by the Allen Brain Atlas, each 3D matrix was unpacked into a vector of 1,203,840 elements. This way, we obtained 2333 vectors in total, that were stored into a single .csv file along with the source region indication (see last phase of
Figure 7).
4.1.3. Structural Annotation File
An annotation volume is a 3D raster image that partitions the reference space into a number of structures, whose number of voxels depends on the size of the structure as well as on the specific resolution of the model. Each voxel is assigned to a specific brain structure by means of a region ID [
34]. Brain structures in the Allen reference spaces are arranged in trees, with leaf nodes representing very fine anatomical partitioning and nodes closer to the root corresponding to gross partitioning. The annotation file reports region IDs together with the details of the finest anatomical partitioning. Hence, gene expression and connectivity data can be mapped to several common reference spaces. To link each data voxel to the corresponding membership brain region, the Allen Brain Atlas provides a structural annotation file at different resolutions, where the
i-th annotation element allows to map the
i-th voxel in the data array to its brain structure.
Same as for the gene expression data, the annotation is provided at 200 m resolution, in the form of a vector of 159,326 elements. Likewise, the connectivity annotation (CA) is provided at 100 m resolution, reshaped in the form of a vector of 1,203,840 elements. Since, differently from the other data, the primary injection structure annotation is not provided at the finest annotation level, we implemented a procedure to trace both the annotations back to the same resolution level, in the annotation tree. To do so, we exploited a list of dictionaries provided by the Allen Brain Atlas, documenting brain structures and their hierarchical relationships in the form of a structure graph.
4.1.4. Brain Architecture Management System (BAMS)
As mentioned in
Section 3, additional neural circuitry data collected from BAMS were used as reference to decide which brain structures are most significant for our analysis. To date, BAMS includes about 45,000 connection reports between different gray matter regions, leveraging information on connections that were demonstrated by previous studies. The reports can be freely downloaded from the site of BAMS [
35] in the form of an interactive matrix (see
Figure 8), where each element
defines the existence and the intensity (encoded by a value in a
range) of the connection between two specific brain regions
i and
j, identified by the same universal acronyms used by the Allen Brain Atlas. Unknown connections are assigned a 0 value.
4.2. Generation Source-Target Vectors and Corresponding Connectivity Labels
The last step of the dataset generation consists in assigning a unique connectivity label to the Source-Target gene expression vectors, as follows.
All the connectivity values reported for a specific injection ID (i.e., experiment) and a specific Source-Target combination are first aggregated based on their median value. This solution is preferred to others (e.g., mean value), because the median value is inherently robust to the presence of outliers and noise. Nonetheless, as in the nature of this technology, a specific region may be a site of injection of multiple experiments. Hence, for each of these experiments, the connectivity of the axonal projections produced in the corresponding target regions will be stored in a specific SectionDataSet. Then, in case a specific source has targeted the same region in different experiments, that specific combination of Source-Target regions will correspond to more than one median value. To tackle this issue, we implemented a second level of aggregation, and obtained the final connectivity value as the maximum of all the multiple median values. This choice stems from the empirical observation that the connectivity network detected in each experiment (and hence, the corresponding connectivity value) is highly dependent on the specific position of the injection. Hence, using the maximum as the most representative value has a two-fold advantage: (i) it filters out small connectivity values possibly due to peripheral injection sites and (ii) allows to select the experiments with the best spatial conditions as the most representative of a specific source-target combination.
Based on the empirical connectivity thresholds defined in
Section 2, this connectivity value is transformed into a categorical label representing the strength of the connection: either
for multi-class classification, or
for binary classification.
To allow further processing and easy access of the data, in our solution the full and coherent dataset of Source-Target gene expression vectors and the corresponding connectivity labels were stored into four tables of an SQLite database shown in
Figure 9:
Table voxID2Annotation carries the spatial information, and contains the voxel ID and corresponding brain structure annotation.
Table voxID2GenExpr was obtained by filtering out the voxels with gene expression level value equal to 0. It is made of columns reporting gene expression value, voxel ID and gene ID, respectively.
Table injection2regionID was obtained by grouping all the voxels by Source and injection ID. Hence, it reports the Source region ID for each injection.
Table injection2target was obtained by grouping the connection values of each voxels by the Target ID. More specifically, all the voxels belonging to the same Target region were aggregated by the median of the values associated to each of these voxels. Then, the final table is composed of three columns: injection ID, median of the values obtained for a specific Target ID and its annotation ID, respectively.
This database solution allows the quick generation of custom datasets to be given as input to prediction models, avoiding the need to re-process the raw-data.
Leveraging such database, a custom dataset can be built as follows. First,
N Source-Target regions are selected, based on the specific analysis to perform. Gene expression and connectivity data of the selected pairs undergo the following pipeline, as represented in
Figure 1:
for each Source-Target pair, M voxels belonging to the source region and M voxels to the target regions are selected on the expression gene annotation.
for each selected voxel, a vector composed of 3318 elements is generated, where each element corresponds to the expression level of a specific gene. Hence, M vectors representing the gene expression profile of the Source and M vectors representing the gene expression profile of the Target are obtained.
A dataset is created by selecting P combinations among all possible Source-Target voxel combinations. More specifically, the gene expression vector corresponding to the Source voxel is concatenated with the gene expression vector corresponding to the Target voxel. Hence, the obtained dataset will be made of P vectors.
In the end, a unique categorical label representing the Source-Target connectivity is assigned to each combination.
These steps are repeated for all N number of Source-Target regions.
The obtained dataset undergoes a normalization process, by scaling input vectors in a range. Then, they can be divided into training, validation and test sets, to be fed into the predictive model.
4.3. MLP Predictive Model
As a predictive model, we designed a Multilayer Perceptron. In the following, we describe in detail the MLP architectures and corresponding design parameters that provided the best performance values for the multi-class and binary classification tasks discussed in
Section 2.
This MLP architecture, represented in
Figure 10, is composed of a hidden layer with 64 nodes and two hidden layers with 32 nodes each. The first hidden layer applies a ‘sigmoid’ activation function on the entries. In the following hidden layers, nodes apply the ‘ReLu’ (rectified linear unit) activation function on their inputs. Three Dropout layers are placed after the hidden layers to avoid the overfitting phenomenon, occurring when the MLP specializes too much on the training set losing its ability to generalize on the validation set. When the error on the validation set starts to increase, indicating possible overfitting, the dropout layers “drop out” random neurons, temporally removing their contribution to downstream the activation of neurons. This has been widely demonstrated to improve the generalization capabilities of the network [
36]. Notably, two options are given for the activation function of the output layer: softmax and sigmoid, respectively for multi-class and binary classifications.