Article

Efficient Water Segmentation with Transformer and Knowledge Distillation for USVs

1 Research Institute of USV Engineering, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China
2 National Centre for Archaeology (NACA), Beijing 100013, China
3 Shanghai Cultural Heritage Conservation and Research Centre, Shanghai 200031, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(5), 901; https://0-doi-org.brum.beds.ac.uk/10.3390/jmse11050901
Submission received: 25 March 2023 / Revised: 11 April 2023 / Accepted: 18 April 2023 / Published: 23 April 2023
(This article belongs to the Section Ocean Engineering)

Abstract: Water segmentation is a critical task for ensuring the safety of unmanned surface vehicles (USVs). Most existing image-based water segmentation methods may be inaccurate due to light reflection on the water. Fusion-based methods combine paired 2D camera images and 3D LiDAR point clouds as inputs, resulting in a high computational load and considerable time consumption, which limits their practical application. Thus, in this study, we propose a multimodal fusion water segmentation method that uses a transformer and knowledge distillation to leverage 3D LiDAR point clouds in assisting the 2D image branch. A local and non-local cross-modality fusion module based on a transformer is first used to fuse the 2D image and 3D point cloud information during the training phase. A multi-to-single-modality knowledge distillation module is then applied to distill the fused information into a pure 2D network for water segmentation. Extensive experiments were conducted on a dataset containing various scenes collected by USVs on the water. The results demonstrate that the proposed method achieves an approximately 1.5% improvement in both accuracy and MaxF over classical image-based methods, and it is much faster than the fusion-based method, increasing the speed from 15 fps to 110 fps.

1. Introduction

Unmanned surface vehicles (USVs) are intelligent unmanned maritime devices used for several dangerous maritime tasks [1], including search and rescue missions and coastal facility inspections. They have been widely studied, with recent advancements in robotics and artificial intelligence. As a key perception technology for ensuring the safe navigation of USVs, stable and accurate water segmentation is crucial.
Existing water segmentation studies [2,3,4,5,6,7,8,9,10] are mainly based on the use of a monocular camera due to its low cost. Although images contain rich representational information, they are susceptible to light reflections and lack 3D spatial perception. Considering the complementarity of the data, LiDAR sensors are often used together with monocular cameras. Accordingly, the water segmentation method in [11] fuses the dense RGB semantic information from images with the sparse 3D spatial information from LiDAR point clouds to improve perception ability. However, this fusion-based method requires both a LiDAR point cloud and a camera image as input in the training and inference phases. Consequently, USVs must be equipped with two paired sensors, which increases the computational load during real-time perception.
To solve this problem, in this study, we propose a novel, efficient water segmentation process for USVs in which the 3D LiDAR point cloud is combined with the 2D camera image using a transformer and knowledge distillation. 2DPASS [12] demonstrated the effectiveness of combining 2D images with 3D point clouds for semantic segmentation. However, in water environments, the pulsed laser signal emitted by the LiDAR is not reflected from most of the water surface, so the resulting point cloud is sparse there. Therefore, the 3D LiDAR point cloud is used as an auxiliary to the 2D images for cross-modality fusion. The fused features are then used to guide the 2D modality for water segmentation.
A local and non-local cross-modality fusion module based on a transformer is first explored to obtain better fusion information. The common fusion architecture involves late fusion, whereby features from different modalities are concatenated (Figure 1a); this considers only local features and lacks non-local information. In contrast, the transformer block captures non-local dependencies through self-attention and more effectively performs feature matching between the 2D images and 3D point clouds. The fusion module with a transformer is shown in Figure 1b. In addition, the experimental results presented in Section 4.3.1 show that this local and non-local cross-modality fusion module is effective. Based on the obtained robust fusion features, a multi-to-single-modality knowledge distillation (KD) module is implemented in the model training phase to transform the multimodal fusion information into single-modal features. Consequently, the single-modal features extracted from the camera image can achieve performance similar to that of the multimodal fusion features. Finally, in practical perception, the proposed method takes only the camera image as input, and it is faster than the multimodal fusion method while achieving similar performance.
The main contributions of this paper are summarized as follows:
  • A novel local and non-local cross-modality fusion module based on a transformer is proposed to obtain better 2D image and 3D point cloud multimodal fusion information for water segmentation. The transformer block can obtain more non-local information by self-attention;
  • A multi-to-single-modality knowledge distillation (KD) module for the training phase of the water segmentation model is proposed to transform the multimodal fusion information into single-modal features. Through this module, the 3D-related network can be discarded during the inference phase, and the single-modal water segmentation performance can be improved without real-time loss;
  • Extensive experiments are conducted on the CMWS [11] dataset collected by USVs. The results demonstrate the effectiveness and efficiency of the proposed method, which achieves an approximately 1.5% improvement in accuracy and MaxF compared with other image-only segmentation methods. It is also much faster than the multimodal fusion methods.
The remainder of this paper is organized as follows. Section 2 introduces the related work on water segmentation. Section 3 details the proposed novel water segmentation process with a transformer and knowledge distillation for USVs. Section 4 presents the experimental results obtained on the CMWS [11] dataset. Section 5 discusses the limitations of the proposed method and its influencing factors, as well as some potential future work. Finally, Section 6 presents our main conclusions and future prospects.

2. Related Work

Water segmentation methods for USVs. Most existing methods are image-based, using 2D images from a monocular camera as the network input [2,3,4,5,6,7,8,9,10]. Some of these methods perform USV water segmentation using traditional visual cues, such as image resolution [6], image color spaces [7], and superpixel maps [8,10]. The authors of other studies [2,3,4,5] applied well-known semantic segmentation networks, such as FCN [13], SegNet [14], and U-Net [15], to water segmentation, leading to satisfactory results. For instance, the authors of [2,3] proposed modified versions of U-Net as the backbone to extract features from the image, making them suitable for water segmentation through additional refinement operations. Akiyama et al. [4] applied SegNet to automatically segment river water in images acquired by RGB sensors. The DeepLab family [16,17,18,19] proposed by Google is a series of networks for semantic segmentation. Among them, DeepLabV3+ [19] is the most representative and was used by the authors of [8] as a comparison benchmark for their proposed method. However, these image-only methods rely heavily on image quality, which can be degraded in complex water environments. In contrast, the proposed method leverages the 3D point cloud to assist the 2D image, which allows the model to learn more information in different dimensions and thus avoid the limitations of a single data type.
Fusion-based methods have also attracted significant attention due to the complementarity of camera and LiDAR vision sensors. Many researchers have attempted to fuse these two sensors to build multisensor semantic segmentation methods and apply them in different domains. Fusion-based methods [20,21,22,23] have developed extremely rapidly, especially in the field of unmanned ground vehicles (UGVs). However, for the specific water environments of USVs, few fusion-based water segmentation methods are available. Only the study presented in [11] proposed a camera–LiDAR fusion deep learning method for USV water segmentation. However, this fusion-based method only performs local feature aggregation by concatenation and lacks non-local information. In addition, it requires paired point cloud and image data as input in both the training and inference stages, which increases the computational load and reduces real-time performance. Therefore, in this study, we explore a local and non-local cross-modality fusion module with a transformer block to improve feature fusion. The effectively fused information is then distilled into the 2D network for water segmentation through a multi-to-single-modality knowledge distillation module.
Cross-modality knowledge distillation. The theory of knowledge distillation (KD) was first proposed by Hinton et al. [24]. KD allows a small student model to effectively learn from a large teacher model. In recent years, KD has been widely used in cross-modality applications [25,26,27,28], and KD networks have been continuously enhanced [29,30]; they transfer knowledge at both the prediction level and the feature level. In addition, knowledge transfer between different modalities can prevent the loss of modal information. Inspired by the studies presented in [29,30], in this paper, we propose a multi-to-single-modality knowledge distillation module that focuses on transferring multimodal fusion information to a single-modal 2D network for water segmentation. During the inference phase, the 3D-related network can be discarded, which effectively reduces the additional computational load.

3. Methods

This section details the proposed transformer and knowledge distillation method for USV water segmentation, which focuses on combining a 3D LiDAR point cloud with a 2D camera image to improve the segmentation performance of the 2D network. Figure 2 shows that the proposed method comprises three main components: (1) a model architecture with a two-branch network, (2) a novel local and non-local cross-modality fusion module, and (3) a multi-to-single-modality KD module. The method first passes images and point clouds independently through a two-branch network to generate multiscale features in parallel. This two-branch network contains the U-Net 2D segmentation network [15] and the PointNet++ 3D segmentation network [31]. The local and non-local cross-modality fusion module is then applied to fuse the 2D and 3D multiscale features. Afterwards, through the multi-to-single-modality KD module, the fused features guide the 2D network to obtain better 2D features for water segmentation. In the inference process, the proposed method discards the 3D branch to reduce the computational load. Moreover, the single-modal 2D method is much faster than the multimodal fusion methods, with similar performance. Therefore, in practical applications, USVs do not require expensive LiDAR devices but only a low-cost camera as the input of the proposed water segmentation network.
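To make this two-phase design concrete, a minimal PyTorch-style sketch is shown below. All names (`net2d`, `net3d`, `fusion`, `kd_loss`) are illustrative placeholders rather than the authors' implementation: during training both branches run and the fused features supervise the 2D branch, whereas inference executes only the image branch.

```python
import torch

def training_step(image, points, labels, net2d, net3d, fusion, kd_loss):
    """One hypothetical training iteration: both modalities are used."""
    feat_2d = net2d.encoder(image)            # dense 2D feature map from the image branch
    feat_3d = net3d(points)                   # point-wise 3D features from the point cloud branch
    fused = fusion(feat_2d, feat_3d, points)  # local and non-local cross-modality fusion
    scores_2d = net2d.decoder(feat_2d)        # 2D segmentation scores
    return kd_loss(feat_2d, fused, scores_2d, labels)  # distillation + segmentation losses

@torch.no_grad()
def inference(image, net2d):
    """Deployment on the USV: the 3D branch is discarded, only the camera image is used."""
    return net2d(image).argmax(dim=1)         # per-pixel water / non-water prediction
```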

3.1. Local and Non-Local Cross-Modality Fusion Module

The images and point clouds are generally represented by separate pixels and points, which makes it difficult to directly transfer information between the two modalities. Therefore, the point–pixel correspondence is first established through a projection operation. Point-wise 2D features $\hat{F}_{2D}$ are then generated using both the point–pixel mapping $O_{p2p}$ and the original 2D feature map $F_{2D}$. Afterwards, the point-wise 3D features $\hat{F}_{3D}$ from the 3D network and the point-wise 2D features $\hat{F}_{2D}$ are fed into the transformer block to obtain the fused features. The details are shown in Figure 3.

3.1.1. Point–Pixel Correspondence

This module aims to build the correspondence between the 3D points and 2D pixels to obtain paired features of the two modalities, which are further used for feature fusion. The process of point–pixel mapping is illustrated on the left side of Figure 3. Given the original image $I \in \mathbb{R}^{H \times W \times 3}$ as the input of the 2D network, the 2D feature map $F_{2D} \in \mathbb{R}^{H \times W \times C_I}$ containing color and texture information is obtained from the 2D encoder. Given the 3D LiDAR point cloud $P = \{p_i\}_{i=1}^{N}$, where $N$ is the number of points, similar to the process presented in [11,12], each 3D point $p_i = (x_i, y_i, z_i) \in \mathbb{R}^3$ is projected to a 2D image pixel $\tilde{p}_i = (u_i, v_i) \in \mathbb{R}^2$ based on the camera's intrinsic matrix $K \in \mathbb{R}^{3 \times 4}$ and extrinsic matrix $T \in \mathbb{R}^{4 \times 4}$. The point–pixel mapping is then obtained as
$O_{p2p} = \{(\lfloor u_i \rfloor, \lfloor v_i \rfloor)\}_{i=1}^{N} \in \mathbb{R}^{N \times 2}$,    (1)
where $\lfloor \cdot \rfloor$ is the floor operation.
If a pixel in the feature map is included in $O_{p2p}$, the corresponding point-wise 2D feature $\hat{F}_{2D} \in \mathbb{R}^{N \times C_I}$ is extracted from the 2D feature map $F_{2D}$.
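The projection and feature-gathering steps can be sketched as follows. This is a minimal, assumption-laden example: the tensor shapes, the validity check, and the handling of points that fall outside the image are not specified in the paper and are chosen here only for illustration.

```python
import torch

def point_pixel_mapping(points, K, T, H, W):
    """Project N LiDAR points (N, 3) with extrinsics T (4, 4) and intrinsics K (3, 4);
    returns floored (u, v) pixel indices per Eq. (1) and a validity mask."""
    N = points.shape[0]
    pts_h = torch.cat([points, torch.ones(N, 1)], dim=1)   # homogeneous coordinates (N, 4)
    cam = (K @ T @ pts_h.T).T                               # projected coordinates (N, 3)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)           # perspective division
    uv = torch.floor(uv).long()                             # floor operation of Eq. (1)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < H) & (cam[:, 2] > 0)
    return uv, valid

def gather_pointwise_2d(feat_2d, uv, valid):
    """Sample the 2D feature map (C, H, W) at projected pixels to obtain point-wise 2D features."""
    u, v = uv[valid, 0], uv[valid, 1]                       # u = column, v = row
    return feat_2d[:, v, u].T                               # (N_valid, C)
```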

3.1.2. Transformer Fusion Block

The transformer fusion block mainly contains a multimodal attention mechanism (MA) and a feed-forward network (FFN), as shown in Figure 4. MA allows the multimodal transformer to reinforce the 2D modality with features from the 3D modality by learning attention across the features of the two modalities, resulting in a robust and effective fusion strategy. Afterwards, the fusion results are fed into the FFN, which is composed of two fully connected layers. The FFN makes the fusion features more expressive and allows for a better representation of the relationship between the 2D pixels and the 3D points.
Inspired by the transformer blocks presented in [32,33], in this study, we propose a novel approach to fuse the information of the two modalities (i.e., 2D and 3D). The point-wise 2D features $\hat{F}_{2D} \in \mathbb{R}^{N \times C_I}$ and point-wise 3D features $\hat{F}_{3D} \in \mathbb{R}^{N \times C_P}$, where $C_{(\cdot)}$ represents the feature dimension of the pixels or points, are first obtained. The query, keys, and values are then defined as $Q_{2D} = \hat{F}_{2D} W_Q$, $K_{3D} = \hat{F}_{3D} W_K$, and $V_{3D} = \hat{F}_{3D} W_V$, respectively, where $W_{(\cdot)}$ denotes the weights: $W_Q \in \mathbb{R}^{C_I \times C_K}$, $W_K \in \mathbb{R}^{C_P \times C_K}$, and $W_V \in \mathbb{R}^{C_P \times C_V}$. Feature adaptation from 3D points to 2D pixels is performed using multimodal attention, $MA_{2D} := M_{3D \to 2D}(\hat{F}_{2D}, \hat{F}_{3D})$:
$MA_{2D} = \mathrm{softmax}\left( \frac{Q_{2D} \cdot K_{3D}^{\top}}{\sqrt{D_k}} \right) V_{3D}$,    (2)
$\hat{F}_{fuse} = \mathrm{Linear}(MA_{2D}) \in \mathbb{R}^{N \times C_f}$,    (3)
where $\cdot$ denotes the dot product, $\mathrm{softmax}$ denotes the softmax operation, and $D_k$ is used for scaling. As shown in Equation (3), $MA_{2D}$ is passed through a linear layer to output the fusion features. Finally, an FFN is applied to complete the multimodal attention, as shown in Figure 4.
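A minimal sketch of this block is given below, assuming single-head attention. The class and argument names, the residual connection, and the FFN hidden size are illustrative assumptions; only the query/key/value construction and Equations (2) and (3) follow the description above.

```python
import torch
import torch.nn as nn

class TransformerFusionBlock(nn.Module):
    """Multimodal attention (MA) from 2D queries to 3D keys/values, followed by an FFN."""
    def __init__(self, c_img, c_pts, c_k, c_v, c_fuse):
        super().__init__()
        self.w_q = nn.Linear(c_img, c_k, bias=False)   # Q_2D = F^_2D W_Q
        self.w_k = nn.Linear(c_pts, c_k, bias=False)   # K_3D = F^_3D W_K
        self.w_v = nn.Linear(c_pts, c_v, bias=False)   # V_3D = F^_3D W_V
        self.proj = nn.Linear(c_v, c_fuse)             # linear layer of Eq. (3)
        self.ffn = nn.Sequential(                      # two fully connected layers
            nn.Linear(c_fuse, c_fuse * 2), nn.ReLU(), nn.Linear(c_fuse * 2, c_fuse))

    def forward(self, f2d, f3d):
        q, k, v = self.w_q(f2d), self.w_k(f3d), self.w_v(f3d)        # (N, C_K) / (N, C_V)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # attention of Eq. (2)
        fused = self.proj(attn @ v)                                  # fusion features (N, C_f)
        return fused + self.ffn(fused)                               # FFN with assumed residual
```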

3.2. Multi-to-Single-Modality KD Module

In this study, the multi-to-single-modality KD module aims to transfer useful information from a trained 2D and 3D multimodal fusion teacher network to a single-modality 2D student network. Therefore, the distillation loss is designed at both the feature level and the prediction level, as illustrated in Figure 5. More precisely, two independent classifiers are first applied to the point-wise 2D features $\hat{F}_{2D}$ and the fusion features $\hat{F}_{fuse}$ in order to obtain the segmentation prediction scores $S_1^{2D}$ and $S_{fuse}$. The feature-level loss between the point-wise 2D features $\hat{F}_{2D}$ and the fusion features $\hat{F}_{fuse}$ before the softmax operation is computed using the $L_2$ norm:
$\mathcal{L}_{feat} = \| \hat{F}_{fuse} - \hat{F}_{2D} \|_2$.    (4)
Furthermore, the channel dependency of the feature representations ($\hat{F}_{2D}$ and $\hat{F}_{fuse}$) is considered in the final classification [34]. $\hat{F}_{2D}$ is fed into the classifier trained by $\hat{F}_{fuse}$ to obtain the output $S_2^{2D}$. Similar to the study presented in [30], the $\mathcal{L}_{SR}$ loss between the two outputs $S_2^{2D}$ and $S_{fuse}$ is computed as
$\mathcal{L}_{SR} = S_{fuse} \log S_2^{2D}$.    (5)
Afterwards, the KL divergence is used as the distillation loss $\mathcal{L}_{KL}$ between the 2D prediction $S_1^{2D}$ and the fusion prediction $S_{fuse}$. In addition, to avoid excessively difficult fusion prediction labels, which could result in poor 2D prediction learning, the $\mathcal{L}_{KL}$ loss is calculated with a distillation temperature of $T = 7$:
$\mathcal{L}_{KL} = D_{KL}^{T}\left( S_1^{2D} \,\|\, S_{fuse} \right)$.    (6)
Thus, the total loss is expressed as
$\mathcal{L}_{total} = \alpha \mathcal{L}_{feat} + \beta \mathcal{L}_{SR} + \gamma \mathcal{L}_{KL} + \lambda \mathcal{L}_{ce}$,    (7)
where $\alpha = 0.3$, $\beta = 0.2$, $\gamma = 0.3$, $\lambda = 0.2$, and $\mathcal{L}_{ce}$ is the cross-entropy loss used to measure the gap between the 2D prediction and the 2D ground truth.
Figure 5. Illustration of the multi-to-single-modality KD module.
The proposed KD module has several advantages. It reduces the gap at the prediction level and enhances the learning of the 2D features through the fusion features at the feature level. In addition, the fusion features provide rich spatial information that facilitates the learning of the 2D features without losing 2D-specific information, such as color and texture. Moreover, in practical applications, the proposed model directly performs water segmentation using the 2D network without the 3D network branch, thus avoiding any additional computational load.
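To make the loss composition concrete, a minimal PyTorch-style sketch of Equations (4)–(7) follows. The function and variable names are illustrative, and several conventions are assumptions borrowed from common distillation practice rather than details confirmed by the paper: the mean-squared-error form of the $L_2$ feature term, the softmax normalization and cross-entropy sign in the $\mathcal{L}_{SR}$ term, detaching the teacher scores, and the $T^2$ scaling of the KL term. It also assumes that $\hat{F}_{2D}$ and $\hat{F}_{fuse}$ share the same dimension so that $\hat{F}_{2D}$ can be fed to the fusion classifier.

```python
import torch.nn.functional as F

def kd_losses(f2d, f_fuse, cls_2d, cls_fuse, label, T=7,
              alpha=0.3, beta=0.2, gamma=0.3, lam=0.2):
    """Multi-to-single-modality distillation losses, sketching Eqs. (4)-(7)."""
    s1_2d = cls_2d(f2d)                        # 2D prediction scores S1_2D
    s_fuse = cls_fuse(f_fuse).detach()         # fusion (teacher) prediction scores S_fuse
    s2_2d = cls_fuse(f2d)                      # 2D features through the fusion classifier -> S2_2D

    l_feat = F.mse_loss(f2d, f_fuse.detach())                          # L2-style feature loss, Eq. (4)
    l_sr = -(s_fuse.softmax(-1) * s2_2d.log_softmax(-1)).sum(-1).mean()  # L_SR term, Eq. (5) (sign assumed)
    l_kl = F.kl_div(F.log_softmax(s1_2d / T, -1),
                    F.softmax(s_fuse / T, -1),
                    reduction='batchmean') * T * T                     # temperature-scaled KL, Eq. (6)
    l_ce = F.cross_entropy(s1_2d, label)                               # supervised 2D segmentation loss
    return alpha * l_feat + beta * l_sr + gamma * l_kl + lam * l_ce    # total loss, Eq. (7)
```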

4. Experiments

Section 4.1 briefly introduces the implementation details, including the parameter settings and evaluation metrics. In Section 4.2, the proposed method is evaluated by comparing it with three well-known 2D segmentation networks, and some qualitative results are visualized. In addition, the speed of the proposed method and that of the method presented in [11] are compared. Finally, the comprehensive analysis of the proposed module is presented in Section 4.3.

4.1. Implementation Details

Parameter settings. All the experiments were implemented using the PyTorch deep learning framework on the Ubuntu 18.04 operating system. The stochastic gradient descent (SGD) optimizer was used to update the network parameters via backpropagation. During training, the learning rate was initialized to 0.1 with a decay of 0.9 every 10 epochs. The number of samples per training iteration (i.e., the batch size) was 8, and 64 training epochs were applied. The same data partitions and the same samples were used in all the experiments. The experiments were conducted on a computer equipped with an NVIDIA GeForce RTX 3090 GPU.
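The stated configuration maps onto standard PyTorch components as sketched below; the placeholder network and the per-epoch scheduler call are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the segmentation network (illustrative only).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 1))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)                           # SGD, initial lr = 0.1
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)   # decay 0.9 every 10 epochs

for epoch in range(64):     # 64 training epochs; the data loader would use batch_size = 8
    # ... forward/backward passes and optimizer.step() over the training batches go here ...
    scheduler.step()        # apply the learning-rate decay once per epoch
```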
Evaluation metrics. Similar to the study presented in [11], we used the accuracy (AC), precision (P), recall (R), maximum F-measure (MaxF) of P and R, false-positive rate (FPR), and mean intersection over union (mIoU) as metrics to evaluate the performance of the proposed model. AC is the number of correct predictions divided by the total number of samples. P denotes the proportion of true positives among all samples predicted as positive. R, also known as sensitivity, is the number of correctly predicted positives divided by the number of all positive samples in the dataset. FPR denotes the proportion of negative samples that are incorrectly predicted as positive. MaxF is the maximum harmonic mean of precision and recall. For segmentation tasks, the mIoU value visually reflects how closely the predicted results match the ground truth. The corresponding results are reported in Table 1. Note that the higher the values of AC, P, R, MaxF, and mIoU and the lower the value of FPR, the better the performance of the model.
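As an illustration, these metrics can be computed from the binary water/non-water confusion counts as sketched below. The helper is hypothetical, and the F-measure is shown at a single operating point, whereas MaxF in the paper is the maximum F-measure over thresholds.

```python
import numpy as np

def water_metrics(pred, gt):
    """Compute AC, P, R, F, FPR, and IoU for a binary water mask (1 = water)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)          # water correctly predicted as water
    fp = np.sum(pred & ~gt)         # non-water predicted as water
    fn = np.sum(~pred & gt)         # water missed
    tn = np.sum(~pred & ~gt)        # non-water correctly rejected
    ac = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    p = tp / max(tp + fp, 1)                 # precision
    r = tp / max(tp + fn, 1)                 # recall (sensitivity)
    f = 2 * p * r / max(p + r, 1e-6)         # F-measure at this operating point
    fpr = fp / max(fp + tn, 1)               # false-positive rate
    iou = tp / max(tp + fp + fn, 1)          # IoU for the water class; mIoU averages over classes
    return ac, p, r, f, fpr, iou
```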

4.2. Comparative Experiment

The proposed model is a plug-and-play framework that can be applied to arbitrary segmentation networks. To demonstrate the generalization ability of this framework, it is plugged into three representative image-based segmentation networks: U-Net [15], FCN [13], and DeeplabV3+ [19]. Table 1 shows that the proposed method outperforms different basic segmentation networks. It adequately fuses the 2D image and 3D point cloud information and distills them into a 2D segmentation network. The experimental results show that fusing 3D point cloud information plays an important role in improving the image-based water segmentation methods. Moreover, the FCN [13] and DeeplabV3+ [19] networks outperform the lightweight U-Net [15] network. It can be clearly seen that the AC, MaxF, P, R, and mIoU metrics are all improved, and the FPR value is reduced, which indicates a lower false positive rate. Note that the specific evaluation metrics presented in Table 1, Table 2, Table 3 and Table 4 are detailed in Section 4.1.
The qualitative evaluation results of the proposed method and the U-Net [15] basic segmentation method are shown in Figure 6. Most of the water surface can be segmented using the basic 2D segmentation network. However, the basic method does not perform well for near-shore segmentation due to water reflection and bad weather. On the contrary, the proposed method accurately segments the water surface in different scenes, which is closer to the ground truth, especially in the near-shore areas.
Furthermore, to validate the efficiency of the proposed method, an experimental speed comparison between the fusion method presented in [11] and the proposed method was conducted. Considering that U-Net [15] is relatively simple while DeeplabV3+ [19] is more complex, FCN [13] was chosen as the baseline method, and the same hardware environment and dataset were used. Figure 7 shows that the proposed method achieved a significant speed improvement, from 15 fps to 110 fps, while maintaining accuracy comparable to that of the fusion method presented in [11]. This demonstrates that the proposed method achieves performance similar to that of the multimodal fusion method while being much faster.

4.3. Comprehensive Analysis

Some experiments and analyses were conducted to demonstrate the effectiveness of the local and non-local cross-modality fusion module and the multi-to-single-modality KD module in the proposed water segmentation method. All the experiments were based on the same baselines: U-Net [15] for 2D images and PointNet++ [31] for 3D point clouds.

4.3.1. Effectiveness of the Cross-Modality Fusion Module

The cross-modality fusion module (Section 3.1) comprises the point–pixel correspondence and the transformer block. The fusion effects of the 2D and 3D features under different fusion strategies were compared, as shown in Table 2. Two well-known fusion strategies were used for this comparison experiment: concatenation fusion and the CBAM mechanism [35]. In concatenation fusion, the features from the different modalities are simply concatenated. The CBAM mechanism [35] combines channel attention and spatial attention to enhance the expressiveness of the network features. Table 2 clearly shows that the proposed fusion module with a transformer block outperforms the concatenation and CBAM methods on all metrics. This shows that the transformer block plays an effective role in the fusion of the 2D and 3D features.

4.3.2. Comparison with Other KD Methods

To further validate the usefulness of the KD module, it was compared with some classical teacher–student KD architectures and with plain KL divergence (Table 3). KL divergence is usually used to measure the similarity between two distributions; however, in the considered scenario, using only the KL divergence as the distillation loss is not sufficient. In addition, building on the studies of Hinton et al. [24] and Yang et al. [30], the proposed KD module (Section 3.2) not only applies a distillation loss at the prediction level (Equation (6)) but also adds a distillation loss at the feature level (Equation (4)). Table 3 shows that the proposed KD achieves the highest MaxF of 95.50% and the highest precision of 95.25%.

4.3.3. Ablation Study

Ablation experiments were performed with different components (Table 4) to study the contribution of each component of the proposed method. Model A is the baseline with only U-Net and PointNet++. Model B adds the cross-modality fusion module, improving the AC and mIoU by 0.84% and 1.52%, respectively. Model C uses only the KD module, increasing the AC by 1.16% and the mIoU by 2.08% relative to Model A. Finally, Model D is equipped with both the local and non-local cross-modality fusion module and the multi-to-single-modality KD module, achieving 95.91% AC and 92.10% mIoU, which demonstrates the effectiveness of the proposed water segmentation method with the transformer-based cross-modality fusion module and the multi-to-single-modality KD module.

5. Discussion

In this article, we proposed a water segmentation method for USVs that uses a local and non-local cross-modality fusion module based on a transformer block and a multi-to-single-modality knowledge distillation module to leverage 3D LiDAR point clouds in combination with 2D images. It achieves improved single-modal water segmentation performance without sacrificing real-time performance. The effectiveness of the proposed method is demonstrated through the detailed experiments reported in Section 4. According to the experimental results, the proposed method achieves an approximately 1.5% improvement in both accuracy and MaxF over three classical image-based methods, and it is much faster than the fusion-based method [11], increasing the speed from 15 fps to 110 fps, as shown in Figure 7.
Considering the limitations of traditional image processing algorithms in terms of robustness and scalability, we chose a deep network with a more powerful learning ability for water segmentation. However, although the proposed method achieved good results in the experiments, we also identified some limitations and influencing factors. For example, datasets in the USV field are not as large as those in the UGV field, which may result in overfitting on the limited data. Moreover, the CMWS [11] dataset used in our experiments was collected by USVs in inland waterways, which may affect the generalization ability of our model to sea surface environments. Therefore, in the future, we will explore the potential of semisupervised methods to address dataset scarcity and further optimize the USV water segmentation task.
Several other directions for future work are also proposed. First, we will consider using other data sources, such as radar or sonar, to further improve the accuracy and efficiency of water segmentation methods for USVs. Secondly, the potential of transfer learning techniques will be explored to improve the performance of the proposed multimodal fusion water segmentation method when applied to different water scenes or environments. Thirdly, we will investigate the possibility of using reinforcement learning techniques to optimize the performance of the proposed water segmentation method in dynamic and changing water environments. Finally, we hope these techniques can be integrated into a real-time water segmentation system for practical USV applications in different scenarios, such as search and rescue missions or environmental monitoring.

6. Conclusions

In this paper, we proposed a method for USV water segmentation that applies a transformer and knowledge distillation to allow the 3D point cloud to assist the 2D image. The proposed method improves the performance of image-based methods while being faster than fusion-based methods. More precisely, a local and non-local cross-modality fusion module based on a transformer was used to effectively fuse the 3D point cloud and 2D image during the training phase. The fused information was then used to guide the 2D network to improve the water segmentation results through a multi-to-single-modality knowledge distillation module. Moreover, the 3D-related network can be discarded during the inference phase to reduce the computational load. The results of extensive experiments on a dataset of various water scenes collected by USVs demonstrate that the proposed method outperforms image-based baselines and is faster than the fusion method. In addition, considering the limitations of data collection, in future work, we aim to further optimize the USV water segmentation task from the perspective of semisupervised methods.

Author Contributions

Conceptualization, J.Z. and X.L.; methodology, J.Z. and J.G.; software, J.Z. and J.L.; validation, J.Z., J.L. and Y.W.; formal analysis, J.Z. and J.G.; investigation, J.Z.; resources, J.Z.; data curation, J.Z. and J.L.; writing—original draft preparation, J.Z., J.G. and Y.W.; writing—review and editing, J.Z., J.G. and X.L.; visualization, J.Z. and J.L.; supervision, X.L.; project administration, B.L.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, Research and Development of Key Technologies for Underwater Archaeological Exploration, grant number 2020YFC1521703; the National Outstanding Youth Science Foundation of China, grant number 62225308; and the National Natural Science Foundation of China, grant number 62073075.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bai, X.; Li, B.; Xu, X.; Xiao, Y. A Review of Current Research and Advances in Unmanned Surface Vehicles. J. Mar. Sci. Appl. (JMSA) 2022, 21, 47–58. [Google Scholar] [CrossRef]
  2. Xia, M.; Cui, Y.; Zhang, Y.; Xu, Y.; Liu, J.; Xu, Y. DAU-Net: A novel water areas segmentation structure for remote sensing image. Int. J. Remote Sens. 2021, 42, 2594–2621. [Google Scholar] [CrossRef]
  3. Ling, G.; Suo, F.; Lin, Z.; Li, Y.; Xiang, J. Real-time Water Area Segmentation for USV using Enhanced U-Net. In Proceedings of the 2020 IEEE Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; pp. 2533–2538. [Google Scholar]
  4. Akiyama, T.; Junior, J.M.; Gonçalves, W.; Bressan, P.; Eltner, A.; Binder, F.; Singer, T. Deep learning applied to water segmentation. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. (ISPRS Arch.) 2020, 43, 1189–1193. [Google Scholar] [CrossRef]
  5. Adam, M.A.M.; Ibrahim, A.I.; Abidin, Z.Z.; Zaki, H.F.M. Deep Learning-Based Water Segmentation for Autonomous Surface Vessel. In IOP Conference Series: Earth and Environmental Science (EES); IOP Publishing: Bristol, UK, 2020; Volume 540, p. 012055. [Google Scholar]
  6. Taipalmaa, J.; Passalis, N.; Zhang, H.; Gabbouj, M.; Raitoharju, J. High-resolution water segmentation for autonomous unmanned surface vehicles: A novel dataset and evaluation. In Proceedings of the 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, 13–16 October 2019; pp. 1–6. [Google Scholar]
  7. Taipalmaa, J.; Passalis, N.; Raitoharju, J. Different color spaces in deep learning-based water segmentation for autonomous marine operations. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3169–3173. [Google Scholar]
  8. Xue, H.; Chen, X.; Zhang, R.; Wu, P.; Li, X.; Liu, Y. Deep Learning-Based Maritime Environment Segmentation for Unmanned Surface Vehicles Using Superpixel Algorithms. J. Mar. Sci. Eng. 2021, 9, 1329. [Google Scholar] [CrossRef]
  9. Zhan, W.; Xiao, C.; Wen, Y.; Zhou, C.; Yuan, H.; Xiu, S.; Zhang, Y.; Zou, X.; Liu, X.; Li, Q. Autonomous visual perception for unmanned surface vehicle navigation in an unknown environment. Sensors 2019, 19, 2216. [Google Scholar] [CrossRef] [PubMed]
  10. Zhan, W.; Xiao, C.; Wen, Y.; Zhou, C.; Yuan, H.; Xiu, S.; Zou, X.; Xie, C.; Li, Q. Adaptive semantic segmentation for unmanned surface vehicle navigation. Electronics 2020, 9, 213. [Google Scholar] [CrossRef]
  11. Gao, J.; Zhang, J.; Liu, C.; Li, X.; Peng, Y. Camera-LiDAR Cross-Modality Fusion Water Segmentation for Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2022, 10, 744. [Google Scholar] [CrossRef]
  12. Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XXVIII; Springer: Cham, Switzerland, 2022; pp. 677–695. [Google Scholar]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  14. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  16. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  18. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  20. Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A simple and robust LiDAR-camera fusion framework. arXiv 2022, arXiv:2205.13790. [Google Scholar]
  21. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  22. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739. [Google Scholar] [CrossRef]
  23. El Madawi, K.; Rashed, H.; El Sallab, A.; Nasr, O.; Kamel, H.; Yogamani, S. RGB and LiDAR fusion based 3D semantic segmentation for autonomous driving. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 7–12. [Google Scholar]
  24. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  25. Hoffman, J.; Gupta, S.; Darrell, T. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 826–834. [Google Scholar]
  26. Zhao, M.; Li, T.; Abu Alsheikh, M.; Tian, Y.; Zhao, H.; Torralba, A.; Katabi, D. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7356–7365. [Google Scholar]
  27. Garcia, N.C.; Morerio, P.; Murino, V. Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  28. Thoker, F.M.; Gall, J. Cross-modal knowledge distillation for action recognition. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 6–10. [Google Scholar]
  29. Huang, Z.; Shen, X.; Xing, J.; Liu, T.; Tian, X.; Li, H.; Deng, B.; Huang, J.; Hua, X.S. Revisiting knowledge distillation: An inheritance and exploration framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3579–3588. [Google Scholar]
  30. Yang, J.; Martinez, B.; Bulat, A.; Tzimiropoulos, G. Knowledge Distillation via Softmax Regression Representation Learning. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021; Available online: https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/70425/Tzimiropoulos%20Knowledge%20distillation%20via%202021%20Accepted.pdf?sequence=2 (accessed on 25 March 2023).
  31. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 2019, p. 6558. [Google Scholar]
  34. Komodakis, N.; Zagoruyko, S. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. Illustration of different multisensor fusion methods. The solid line represents the training and inference phases, and the dashed line represents only the training phase.
Figure 2. Overall framework of the proposed transformer and knowledge distillation method for USV water segmentation. The system first inputs 2D images and 3D point clouds that pass through U-Net [15] and PointNet++ [31] to extract features in parallel. A local and non-local cross-modality fusion module then integrates the features from two modalities to obtain effective fusion features. Afterwards, the fusion features are transferred to guide the 2D network in order to extract better 2D features through a multi-to-single-modality KD module. Finally, these features are used to generate water segmentation scores via the 2D segmentation network.
Figure 3. Illustration of the local and non-local cross-modality fusion module.
Figure 4. Details of the transformer fusion block.
Figure 6. Qualitative evaluation results of the proposed method and the U-Net basic segmentation network.
Figure 7. Speed comparison between the fusion method presented in [11] and the proposed method based on the same baseline (FCN).
Table 1. Results of the comparison between the proposed method and some representative image-based segmentation methods.
Method | AC % | MaxF % | P % | R % | FPR % | mIoU %
U-Net [15] ¹ | 94.20 | 94.47 | 93.31 | 95.65 | 7.64 | 89.00
Ours-U-Net ² | 95.91 | 95.71 | 93.85 | 97.65 | 7.13 | 92.10
FCN [13] ¹ | 95.38 | 96.38 | 97.13 | 95.64 | 4.36 | 95.30
Ours-FCN ² | 97.76 | 97.38 | 97.14 | 97.62 | 3.20 | 95.61
DeepLabV3+ [19] ¹ | 96.33 | 96.69 | 96.54 | 96.83 | 3.17 | 95.12
Ours-DeepLabV3+ ² | 97.84 | 97.67 | 97.20 | 98.14 | 3.14 | 95.76
¹ Only the basic 2D segmentation network; ² the proposed plug-and-play module is added to the basic 2D segmentation network to further improve the segmentation performance.
Table 2. Comparison between different fusion strategies for 2D and 3D features. (Trans. denotes the transformer).
Fusion Strategy | AC % | MaxF % | P % | mIoU %
Concatenation | 95.11 | 94.79 | 92.93 | 90.63
CBAM [35] | 95.68 | 94.85 | 93.75 | 91.67
Ours-Trans. | 95.91 | 95.71 | 93.85 | 92.10
Table 3. Comparison with different KD methods.
Method | AC % | MaxF % | P % | mIoU %
U-Net (no KD) | 94.20 | 94.47 | 93.31 | 89.00
KL Divergence | 94.80 | 94.96 | 93.96 | 90.09
Hinton et al. [24] | 95.17 | 94.85 | 94.20 | 90.73
Yang et al. [30] | 95.38 | 94.90 | 93.58 | 91.14
Ours KD | 95.36 | 95.50 | 95.25 | 91.08
Table 4. Results of the different ablation experiments.
Model | Local and Non-Local Cross-Modality Fusion Module | Multi-to-Single-Modality KD Module | AC % | mIoU %
A | – | – | 94.20 | 89.00
B | ✓ | – | 95.04 | 90.52
C | – | ✓ | 95.36 | 91.08
D | ✓ | ✓ | 95.91 | 92.10
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
