1. Introduction
In recent years, with the rapid development of computer vision technology, medical image analysis has been widely used in disease diagnosis and treatment planning. Medical image segmentation, as a key step in the medical image analysis pipeline, has accordingly received increasing attention.
Medical image segmentation is often the first step in medical image analysis, and it plays an essential role in computer-aided quantitative analysis. Since CNNs were first applied to medical image segmentation tasks, U-shaped encoder-decoder structures and their variants [1,2,3,4] have shown excellent performance across a wide range of medical image segmentation tasks [5,6,7]. For example, U-Net [1] has achieved good results on a heart segmentation dataset acquired with Magnetic Resonance (MR) imaging [8], multi-organ segmentation datasets acquired with Computed Tomography (CT) [9,10,11], and a polyp segmentation dataset obtained from colonoscopy video [12]. To date, many of the best medical segmentation network architectures have been based on U-Net.
Although CNN-based medical segmentation architectures have achieved good results on different types of medical datasets, these networks share a common defect: it is difficult for convolutional kernels to model long-range dependencies when extracting semantic information from images. If the distance between different foreground elements in an image is considerable, relying solely on convolution kernels [13] for feature extraction makes it hard for the network to understand the interrelation between those foreground elements. Owing to this inherent limitation of convolution kernels, several researchers have attempted to mitigate the issue by employing dilated convolution kernels [14,15,16], which expand the receptive field. However, dilated convolution enlarges the receptive field by skipping elements, so some local information is inevitably discarded and the approach still leaves room for improvement.
In addition, some scholars have tried to combine transformers with CNNs to compensate for the insufficient long-distance modeling capability of convolutional kernels [17,18,19]. Transformers initially showed excellent performance in natural language processing (NLP) tasks [20,21]. This is because the self-attention module within a transformer can compute the correlation coefficient between each element of a feature map and all other elements. Through these relevance coefficients, the network can assign an appropriate weight to each element to enhance the importance of foreground information, giving transformers strong long-range modeling capabilities [22,23]. For example, CA-Net [24] builds on a transformer and V-Net and can learn contextual information from each slice to achieve automatic segmentation of the left atrium. Many scholars have therefore tried to mitigate the limited long-range modeling of convolutional kernels by using transformers in encoders [22,25,26]. Transformers can also be transferred to downstream applications [27]. In this context, some medical image segmentation works [26,28] have achieved satisfactory results, and pure transformer models [29] have emerged.
To sum up, both convolution kernels and transformers have their own advantages and disadvantages. The convolutional kernel excels at learning local details but has limited ability to model long-range features. The transformer, in contrast, has strong global modeling capabilities, but at the expense of local details. Moreover, the transformer often relies on large amounts of data to demonstrate its powerful global context modeling capability; because medical imaging data are often difficult to obtain in such quantities, this limits the performance of transformers in medical image segmentation tasks. Furthermore, in medical images, foreground information typically appears in the form of local patches, and the various types of foreground information tend to be spatially distant from one another. Thus, to enhance the network’s segmentation of foreground information, it is crucial for the network to consider correlations between the local details of the feature map and distant elements. To address these challenges, this paper proposes a novel module called remote feature exploration (RFE). This module uses remote elements to assist in the generation of local features, which, to a certain extent, provides the network with both local detail extraction and long-range modeling capability.
In summary, this paper delivers two main contributions:
- (1)
In this paper, a new multi-organ segmentation dataset is created, and the advantages and disadvantages of both the convolution operation and the transformer are verified.
- (2)
In this paper, we propose that the remote feature exploration layer can be used to assist the network in learning local elements using remote elements. This capability allows the network to not only capture local details but also model the relationships between distant elements.
2. Related Work
Early medical image segmentation methods included traditional algorithms that relied on machine learning and contour-based techniques. With the development of deep convolutional neural networks (ConvNets), U-Net [1] was proposed for medical image segmentation. U-Net [1] is a symmetric encoder-decoder structure that dominates the field of medical image segmentation with excellent performance. However, the skip connections in U-Net [1] directly combine the shallow features of the encoder with the deep features of the decoder, resulting in a semantic gap. Therefore, more robust variants of U-Net [1] were created. For example, UNet++ [12], which has nested and dense skip connections, alleviates the semantic gap between U-Net layers to a certain extent and produces significant performance gains over U-Net [1]. However, UNet++ [12] cannot capture semantic features at full scale. Attention U-Net [30] adds an integrated attention gate to U-Net [1], which can suppress the response of redundant ambiguity in skip connections. In addition, there are many other CNN-based models, such as R50 U-Net [31], R50 Att-UNet [31], DARR [32], and UNet3+ [33]. In medical image segmentation, U-shaped networks are also used for 3D image segmentation; for example, V-Net [34] uses a CNN trained end-to-end to directly process 3D MR volumes, while 3D U-Net [35] uses elastic deformation to augment data dynamically in real time so that the network can learn from more images during each training iteration.
The transformer originates from machine translation and text embedding, and has since achieved competitive performance in many computer vision tasks, such as image recognition, object detection, semantic segmentation, instance segmentation, image classification, and medical image segmentation. In the field of computer vision, a typical transformer-based network is the Vision Transformer (ViT) [22]. The ViT outperforms CNNs on recognition tasks, although this is achieved at the cost of requiring a large dataset. Chen et al. designed TransUNet [26] based on the ViT, which explored the potential of transformers in medical image segmentation by using a transformer as the last layer of the encoder. A series of architectures combining a transformer with a CNN in the encoder, such as MedT [36] and pmTrans [37], were subsequently proposed to achieve better feature modeling. However, directly using transformer-based encoders combined with a CNN for global modeling incurs significant computational complexity, which makes it difficult to apply transformers to high-resolution feature maps. To improve computational efficiency on high-resolution feature maps, SwinUNet [29] performs correlation modeling within a series of shifted windows. Although the above architectures reduce the computational complexity of the models, there is still significant local feature loss.
Tang et al. [38] designed hybrid self-supervised pretext tasks, including rotation prediction, instance contrastive learning, and inpainting, and demonstrated the effectiveness of the method through fine-tuning. Similarly, UNETR [39] uses a pure transformer as an encoder to learn sequential representations of the input volume and effectively capture global multi-scale information, while also following the U-shaped encoder-decoder structure. PHTrans [40] mixes a transformer and a CNN in parallel as its major components to generate hierarchical representations of global and local features and aggregate them adaptively. Unlike the above methods, our model can not only perform global modeling but also uses remote elements to help generate local information, achieving accurate segmentation.
4. Experiments and Analysis
Section 4.1 presents the datasets used in the experiments, including a detailed breakdown of the data quantity, access channels, and the specific foreground information for the two datasets. To confirm the credibility and reliability of the experiments, a comprehensive account of the experimental details is provided in Section 4.2, which describes the experimental environment and the various hyperparameters employed in this study. Furthermore, Section 4.3 and Section 4.4 provide an in-depth explanation of the loss functions and evaluation metrics used in the experiments, as well as the rationale behind them. Section 4.5 presents the comparative and ablation test results, and Section 4.6 details the findings of the analytical study. The visualization results are discussed in Section 4.7. To summarize, Section 4 uses concrete experimental results to validate the innovations presented in this paper. It is important to note that no patients were harmed during the experiments, and patient identities in the experimental image data were removed.
4.1. Datasets
To evaluate the proposed method, extensive experiments were conducted on a public multi-organ segmentation dataset (Synapse) and on our own dataset. The public dataset used was the Synapse multi-organ segmentation dataset, provided by the authors of TransUNet [26]; it is available at https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 (accessed on 1 November 2022). The multi-organization dataset (MOD) was derived from CT scans of different patients and was created to explore the performance of multi-organ segmentation methods. The MOD dataset comes from a previous collection of the hospital, in which all image data were amended to remove patient identity, collection time, and other private information. This paper respects the ethics of medical data; the research methods and experiments adopted do not cause any harm to patients and do not involve conflicts of interest. We will consider publishing the MOD dataset for further study by scholars at a later date. The quantities of data used for training and testing are detailed below.
The Synapse dataset is a collection of 3779 clinical CT images of the abdominal region, taken from 30 cases using CT scanners. Each image depicts one or more of eight abdominal organs; namely, the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. In line with previous studies [26,29], the dataset was split into a training set of 18 cases and a testing set of 12 cases.
The multi-organization dataset (MOD) is a dataset obtained from Xinjiang Urumqi Hospital of Traditional Chinese Medicine (Hongshan), China. The dataset includes 430 clinical images from 145 patients acquired from CT scanners. The CT images are a series of routine scanning slices from the upper abdomen to the mid abdomen, with a slice thickness of 7 mm. The 430 images were labeled by LabelImg, and the labeled samples were approved by professional doctors. Each sample’s foreground information comprises eight organs: the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. The 430 images were split randomly into a training set and a testing set, with a ratio of 7:3.
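The random 7:3 split described above could be implemented along the following lines (the file names, random seed, and rounding rule are illustrative assumptions, not taken from the paper):

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Randomly split a list of samples into train/test subsets."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy so the caller's list order is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# For the 430 MOD images, a 7:3 split yields 301 training and 129 testing samples.
train, test = split_dataset([f"slice_{i:04d}.png" for i in range(430)])
```

Fixing the seed makes the split reproducible across runs, which matters when comparing methods on the same partition.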
4.2. Implementation Details
For this experiment, we kept the original feature map resolution fixed at 224 × 224 and used a batch size of 24, except for cases where special instructions required otherwise. We trained our model using the SGD optimizer, setting the learning rate to 0.01, the momentum to 0.9, and the weight decay rate to 0.0001. The default training iteration number was 20,000. All experiments were conducted on Python 3.6 and Torch 1.6.0, with two NVIDIA Tesla V100 GPUs used for parallel training and testing.
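The update rule implied by these hyperparameters can be sketched in a few lines. This is a numpy illustration of PyTorch-style SGD with momentum and weight decay (the exact internals of the optimizer used in the paper are an assumption):

```python
import numpy as np

LR, MOMENTUM, WEIGHT_DECAY = 0.01, 0.9, 0.0001  # values used in the paper

def sgd_step(w, grad, velocity, lr=LR, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One PyTorch-style SGD update: the weight-decay (L2) term is folded into
    the gradient, then a momentum buffer accumulates the decayed gradient."""
    g = grad + weight_decay * w         # L2 regularization term
    velocity = momentum * velocity + g  # momentum buffer update
    w = w - lr * velocity               # parameter update
    return w, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_step(w, np.array([0.5, 0.5]), v)
```

With a zero-initialized momentum buffer, the first step reduces to plain SGD with weight decay; momentum only takes effect from the second step onward.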
4.3. Loss Function
Since a single loss function often does not accurately reflect the performance of medical image segmentation, a mixture of two loss functions (cross-entropy loss and dice loss) was used in our network model, in line with many current segmentation methods.
Cross-entropy loss is a popular loss function for medical image segmentation, especially for multi-class problems. The smaller the value of the cross-entropy, the better the model prediction, as given in Equation (5):

$L_{ce} = -\sum_{i} g_i \log p_i \quad (5)$

where $g_i$ stands for the ground-truth label and $p_i$ stands for the predicted value.
Dice loss is derived from the dice coefficient. The dice coefficient is a metric that evaluates the similarity of two samples, where a larger value means the predicted label is more similar to the ground-truth label. Equation (6) for the dice coefficient is as follows:

$Dice = \dfrac{2|X \cap Y|}{|X| + |Y|} \quad (6)$

where $|X \cap Y|$ denotes the number of elements in the intersection between $X$ and $Y$, and $|X|$ and $|Y|$ denote the number of elements in $X$ and $Y$, respectively. The dice loss is calculated as shown in Equation (7):

$L_{dice} = 1 - \dfrac{2|X \cap Y|}{|X| + |Y|} \quad (7)$

where $X$ stands for the ground-truth label and $Y$ stands for the predicted value.
The mixture loss of the network is as follows in Equation (8):

$L = \lambda_1 L_{ce} + \lambda_2 L_{dice} \quad (8)$

where $\lambda_1$ and $\lambda_2$ are the weighting coefficients of the cross-entropy loss and dice loss, respectively. As a matter of experience, we set $\lambda_1 = \lambda_2 = 0.5$ in this paper.
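The mixture of the two losses can be sketched in numpy as follows (the reduction over pixels and classes is our assumption; the paper's PyTorch implementation may differ in these details):

```python
import numpy as np

def cross_entropy_loss(probs, onehot, eps=1e-8):
    """Pixel-averaged multi-class cross-entropy, Equation (5).
    probs, onehot: arrays of shape (H, W, C)."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=-1)))

def dice_loss(probs, onehot, eps=1e-8):
    """Soft dice loss averaged over classes, Equation (7)."""
    inter = np.sum(probs * onehot, axis=(0, 1))                      # |X ∩ Y| per class
    union = np.sum(probs, axis=(0, 1)) + np.sum(onehot, axis=(0, 1))  # |X| + |Y| per class
    return float(np.mean(1.0 - 2.0 * inter / (union + eps)))

def mixture_loss(probs, onehot, lam_ce=0.5, lam_dice=0.5):
    """Weighted sum of the two losses, Equation (8), with lambda1 = lambda2 = 0.5."""
    return lam_ce * cross_entropy_loss(probs, onehot) + lam_dice * dice_loss(probs, onehot)
```

A perfect prediction drives both terms toward zero, while a completely wrong prediction drives the dice term to 1 and the cross-entropy term toward its eps-bounded maximum.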
4.4. Evaluation Metrics
The dice similarity coefficient (DSC) is a set similarity metric, usually used to evaluate the degree of similarity between two sets, with a value range of 0–1. DSC (%) is calculated as shown in Equation (9):

$DSC = \dfrac{2|G \cap P|}{|G| + |P|} \quad (9)$

where $G$ and $P$ represent the ground truth and the prediction, respectively.
The Hausdorff distance evaluates the similarity of any two sets in a metric space. HD95 (mm) is the 95th percentile of the boundary distances between the predicted and ground-truth segmentations, which suppresses the influence of a small number of outlier points. HD95 (mm) is calculated as presented in Equation (10):

$HD_{95}(G, P) = \max\left\{ \mathrm{per}_{95}\{\min_{p \in P} \|g - p\| : g \in G\},\ \mathrm{per}_{95}\{\min_{g \in G} \|p - g\| : p \in P\} \right\} \quad (10)$

where $G$ and $P$ represent the ground truth and the prediction, respectively, and $\mathrm{per}_{95}$ denotes the 95th percentile.
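Both metrics can be sketched in numpy as follows (how boundary points are extracted, and the percentile convention for HD95, are assumptions; production code typically uses a library implementation):

```python
import numpy as np

def dsc(gt, pred):
    """Dice similarity coefficient, Equation (9), for binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum())

def hd95(gt_pts, pred_pts):
    """95th-percentile Hausdorff distance, Equation (10), between two
    (N, d) and (M, d) arrays of boundary-point coordinates."""
    # pairwise Euclidean distances between all boundary points
    d = np.linalg.norm(gt_pts[:, None, :] - pred_pts[None, :, :], axis=-1)
    d_gp = np.percentile(d.min(axis=1), 95)  # directed gt -> pred distances
    d_pg = np.percentile(d.min(axis=0), 95)  # directed pred -> gt distances
    return max(d_gp, d_pg)
```

DSC rewards overlap regardless of boundary shape, while HD95 penalizes boundary deviations, which is why the two metrics are reported together.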
4.5. Experimental Results
Table 1 and Table 2 display the findings of the experiments conducted on the Synapse and MOD datasets. The superior performance value is indicated by bold text in the tables and will not be explicitly discussed further in the following text.
As shown in Table 1, the traditional CNN still performs well, and the performance of Att-UNet exceeds that of TransUNet. Nonetheless, our method exhibits remarkable superiority compared with CNN-based techniques such as U-Net, attention-mechanism-based methods such as Att-UNet, and transformer-based approaches such as TransUNet. On the Synapse dataset, the average DSC (%) of our method (RFE-UNet) reached 79.77%, the best average DSC (%). Compared with other CNN-based methods, RFE-UNet obtained optimal results for four organs (left kidney, right kidney, pancreas, and stomach). Our method can not only calculate the correlation of elements at short distances but also model elements at long distances; therefore, RFE-UNet outperforms CNN models, which only extract local feature information. Our method also outperforms other transformer methods for three organs (left kidney, right kidney, and liver). The reason is that a transformer-based model inevitably ignores the relevance of some local elements in the feature extraction stage, resulting in a certain degree of information loss. In general, RFE-UNet is more accurate and better considers the correlations among the different categories of foreground information.
Table 2 demonstrates that on the MOD dataset, our approach (RFE-UNet) achieved mean DSC (%) and HD95 (mm) values of 75.12% and 7.43, respectively. This represents a 0.54% improvement in DSC (%) over the baseline (U-Net). Notably, our method performs exceptionally well in segmenting the left kidney. On this new dataset, it is clear that CNN-based networks segment better than transformer-based networks. Compared with the CNN-based methods, we obtained the optimal value for the left kidney. In comparison with the transformer-based methods, we obtained the optimal values for four organs (aorta, gallbladder, left kidney, and stomach). Compared with all other methods, the overall segmentation effect of our method is optimal. This is because the receptive field of the convolutional block in CNN-based methods is limited, so long-term dependencies cannot be captured, while the transformer-based approaches tend to model the global context rather than local details. In contrast, RFE-UNet can perform local and global modeling at the same time, which strengthens the richness and relevance of the extracted features and leads to excellent segmentation performance.
4.6. Analytical Study
Experimental analysis of ablation based on the RFE-Layer.
As shown in Figure 5, we try to realize the information exchange of remote features using only a single block feature map (one of A, B, C, or D, corresponding to “RFE-A”, “RFE-B”, “RFE-C”, and “RFE-D”) as the basic unit, with the experimental results shown in Table 3 and Table 4.
As can be seen from Table 3, the effect of “RFE-A”, “RFE-C”, and “RFE-D” on the Synapse dataset is better than that of the four blocks working together (RFE-Layer). This is because not all labeled images in this dataset contain all the organs; some labels contain only a few organs that occupy a small area. Therefore, when only a single block feature map is used as the basic unit, the segmentation of organs with a small area is advantaged. On the MOD dataset (as shown in Table 4), since all the labeled images contain most organ categories, the segmentation effect is better when the four blocks work together (RFE-Layer).
Experimental analysis of ablation based on different input resolution.
By default, the input resolution for RFE-UNet is set to 224 × 224. However, we also trained RFE-UNet at a resolution of 256 × 256 on both the Synapse and MOD datasets, as shown in Table 5 and Table 6. Despite the increase in input resolution, the patch size remains constant at 16, leading to an increase in the sequence length fed to the remote feature exploration layer. Compared to the 224 × 224 input size, the DSC (%) on the Synapse and MOD datasets increases by 0.7% and 1%, respectively, at the price of a higher computational cost. The data in Table 5 and Table 6 illustrate that segmentation performance can be improved by increasing image resolution; however, higher resolution also means spending more computing resources for the gain in average DSC (%). Given our limited GPU memory, we opted to conduct the main experiments at a resolution of 224 × 224 to establish the validity and reliability of RFE-UNet while taking the computational costs into account. Table 5 and Table 6 report the mean DSC (%) for the different input resolutions on the two datasets and the segmentation accuracy for the eight organs, respectively. The influence of resolution on segmentation performance is consistent with previous reports.
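With the patch size fixed at 16, the token sequence length grows quadratically with the input side length, which is where the extra computational cost comes from. A quick check of the numbers:

```python
def seq_len(resolution, patch_size=16):
    """Number of non-overlapping patches (tokens) for a square input."""
    assert resolution % patch_size == 0, "resolution must be divisible by patch size"
    return (resolution // patch_size) ** 2

# 224x224 -> 14x14 = 196 tokens; 256x256 -> 16x16 = 256 tokens,
# i.e. roughly 31% more tokens per image at the larger resolution.
```

Since self-attention cost scales with the square of the sequence length, even this modest increase in tokens translates into a noticeably higher compute budget.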
4.7. Visualizations
Figure 6 and Figure 7 display a qualitative comparison of the segmentation performance of RFE-UNet on the Synapse and MOD datasets.
The visualized results on the Synapse dataset are shown in Figure 6, from which we can see that: (1) CNN-based methods (such as U-Net) show weaker performance than transformer-based methods (such as TransUNet) when segmenting organs that are far apart. For example, U-Net and R50 Att-UNet show significant false positives for the liver in the second and third rows, while the segmentation of the stomach in the first row and the spleen in the fourth row is incomplete. The main reason is that when the CNN uses convolution kernels for feature extraction, it mainly focuses on correlation modeling of local features while ignoring long-range correlations within the sample. CNN-type models therefore easily lose global information, resulting in unsatisfactory segmentation results. (2) The transformer-based approach improves the situation to some extent. For example, TransUNet shows less over-segmentation and under-segmentation than the pure CNN methods because of its global element correlation modeling ability. However, the transformer inevitably ignores the importance of local details when modeling global elements; since the structure of medical images is complex, both local details and global correlations are very important for segmentation results. (3) In comparison with the other network models, our RFE-UNet produces better segmentations. The main reason is that RFE-UNet uses the proposed remote feature exploration layer to complete the interaction between near and far elements and to realize the joint modeling of local and global correlations. The qualitative results in Figure 6 show that RFE-UNet achieves excellent segmentation performance by considering both local and global correlations.
On the MOD dataset, Figure 7 shows that: (1) Pure CNN methods, such as U-Net and Att-UNet, are more likely to produce over-segmentation of organs (e.g., the spleen in the first row, and the stomach and gallbladder in the fourth row) or under-segmentation (e.g., the liver in the fourth row). The reason is that although CNN-based methods have strong local modeling ability, the receptive field of the convolution is local; these methods therefore cannot effectively capture global element correlations in the sample and lose long-distance dependencies. (2) The addition of the transformer in TransUNet improves the situation to some extent: in terms of long-distance correlation modeling, TransUNet has a stronger encoding capability than the CNN. However, the overall segmentation effect of SwinUNet, which is based on a pure transformer architecture, is not ideal. This is because the transformer focuses on modeling global element associations and neglects local ones. In medical segmentation tasks, organs often appear as regional blocks, the connections between target organs are very close, and there are size differences between organs; therefore, lacking either local or global modeling may lead to unsatisfactory segmentation results. (3) Compared with the other methods, the remote feature exploration layer proposed in RFE-UNet can use remote elements to assist the generation of local features and realize the joint modeling of local and global correlations, so it achieves a better segmentation effect. Notably, the segmentation of the stomach in the third row exhibits no false positives, and the segmentation of the left kidney is notably superior to that of the other methods. The comprehensive results in Figure 7 show that the segmentation results of RFE-UNet are smoother, the target structures are more complete, and the output is closer to the label. This is because we consider the importance of both local and global dependencies when performing segmentation.
5. Conclusions
In this paper, we proposed the remote feature exploration layer based on the respective advantages and disadvantages of the CNN and the transformer. The CNN uses convolution kernels for feature extraction, which affords the network a strong local modeling ability; however, this feature extraction method also greatly reduces the network's long-range modeling ability. Unlike the CNN, the transformer can model global elements, but it inevitably loses some local details. Therefore, we proposed the remote feature exploration module, which assists the network in generating local details by using remote elements. Through this feature-learning method, the network develops, to a certain extent, the ability to jointly model local and remote elements. Experiments on the Synapse and MOD datasets showed that our model achieves better segmentation performance than convolutional segmentation models, convolutional-transformer hybrid segmentation models, and pure transformer segmentation models. In the future, we will explore how to convert the static blocks in the remote feature exploration layer into adaptive, dynamically selected blocks.