1. Introduction
Synthetic aperture radar (SAR) has evolved into a crucial active microwave remote imaging sensor owing to its high resolution and its ability to operate around the clock and in all weather conditions [1,2]. SAR remote sensing images have a wide range of potential applications, including resource surveys, disaster evaluation, marine supervision, and environmental monitoring [3,4,5]. Ship detection is one of the important applications in maritime supervision, marine traffic control, and marine environmental protection. Since the creation and advancement of the convolutional neural network (CNN), numerous object detection techniques have achieved outstanding detection performance, and ship detection has become a major topic in SAR image processing. However, many issues remain to be resolved for CNN-based SAR object detection tasks because of the variety of ships and the interference from background and noise.
CNN-based object detection techniques can be subdivided into two categories: candidate region-based and regression-based methods, also known as two-stage and one-stage detection models, respectively. A candidate region-based method first extracts all the regions of interest (ROI) roughly and then performs accurate localization and classification of the object. In the early years, candidate region-based methods developed rapidly due to their high precision. To name just a few, aiming at the complex rotation motions of SAR ships, Li et al. [6] proposed a two-stage spatial-frequency feature fusion network, which obtains the ships' rotation-invariant features in the frequency domain through the polar Fourier transform. Zhang et al. [7] proposed a novel quad-feature pyramid network (Quad-FPN), which aims to efficiently emphasize multi-scale ship object features while suppressing background interference. Considering the interference of diversiform scattering and complex backgrounds in SAR imagery, Li et al. [8] designed an adjacent feature fusion (AFF) module to selectively and adaptively fuse shallow features into adjacent high-level features. To handle unfocused images caused by motion, Zhou et al. [9] proposed a multilayer feature pyramid network, which builds a feature matrix describing the motion states of ship objects via the Doppler center frequency and the offset of the frequency modulation rate.
Differing from the candidate region-based (two-stage) methods, the regression-based (one-stage) detection framework is relatively simple and realizes object localization and classification in one step. Therefore, one-stage methods are particularly appropriate for real-time object detection tasks. For instance, Li et al. [10] combined the architecture of YOLOv3 with spatial pyramid pooling to establish a one-stage network, aiming to break the scale restrictions of the object. Aiming at the challenge of the arbitrary orientations of ships, Jiang et al. [11] proposed a long-edge decomposition rotated bounding box (RBB) encoder, which takes the horizontal and vertical components obtained by the orthogonal decomposition of the long-edge vector to represent the orientation information. Unfortunately, detection accuracy decreases to varying degrees as detection speed increases. For this reason, scholars have explored ways to strike a balance between a model's accuracy and speed. You only look once (YOLO) is a representative detection framework with prominent accuracy and competitive speed, and it occupies an established place in the object detection field. On the basis of YOLO, new detection methods have been proposed in succession. For example, in view of the poor detection performance on small-scale objects, Gao et al. [12] developed scale-equalizing pyramid convolution (SEPC) to replace the path aggregation network (PAN) structure in the YOLOv4 framework. Considering the difference in scattering properties between ships and sea clutter, Tang et al. [13] first designed a Salient Otsu (S-Otsu) threshold segmentation method to reduce noise, and then proposed two main modules, i.e., a feature enhancement module and a land burial module, to heighten the ship features and restrain the background clutter, respectively. With the goal of detecting small-scale objects against complex backgrounds, Yu et al. [14] proposed a step-wise locating bidirectional pyramid network, which can effectively locate small-scale objects via global and local channel attention. For ease of deployment on satellites, Xu et al. [15] devised a lightweight airborne SAR ship detector based on YOLOv5, which introduces a histogram-based pure background classification (HPBC) module to filter out pure background samples.
Although the detection methods mentioned above have achieved relatively good performance, SAR ship detection remains an open problem due to the multi-scale, weakly salient target features and complex background noise. Recently, the emergence of the attention mechanism has provided scholars a useful means to enhance the effectiveness of ship detection in SAR imagery [16]. In particular, the attention mechanism is very effective at enhancing object features and suppressing background noise. According to their working principles, attention modules can be broadly categorized into three groups: spatial, channel, and hybrid attention, where hybrid attention simultaneously considers spatial and channel information. Some hybrid attention-based detection methods have emerged recently and further improved detection performance, such as YOLO-DSD [17], CAA-YOLO [18], and FEHA-OD [19]. However, computing resources, including memory capacity and computing power, are limited in practical applications, while the demand for inference speed is great. There is thus an urgent need for a lightweight framework with outstanding detection performance. More importantly, most existing attention-based detection methods employ common channel and spatial attention mechanisms that emphasize ships according to the global relations among features [20]. As a matter of fact, global attention modules are computationally more expensive than local attention ones. Consequently, a major obstacle in current SAR ship detection is how to achieve a compromise between accuracy and speed.
To address the aforementioned issues, this paper proposes an innovative lightweight radar ship detection framework with hybrid attention mechanisms for SAR imagery. In light of the merits of YOLOv5 [21], namely high precision and rapid inference speed, the designed method takes YOLOv5 as the baseline. First, a unique hybrid attention residual module is created to enhance the model's feature learning capabilities and ensure high detection precision. Second, an attention-based feature fusion scheme is introduced to further highlight the features of the object. At the same time, considering the uniformity and redundancy of the convolutions, a hybrid attention feature fusion module is used in the attention-based feature fusion scheme to ensure the model's practicability and efficiency. The vital contributions of this paper are as follows:
This paper designs a novel lightweight radar ship detector with multiple hybrid attention mechanisms, named the multiple hybrid attentions ship detector (MHASD). It is proposed to obtain high detection precision while achieving fast inference for SAR ship detection. Extensive qualitative and quantitative experiments on two benchmark SAR ship detection datasets reveal that the designed method strikes a better balance between speed and precision than some state-of-the-art approaches.
Considering the inconspicuous features of ship objects and the strong background clutter in SAR images, a hybrid attention residual module (HARM) is developed, which enhances the features of ship objects at both the channel and spatial levels by integrating local and global attention to ensure high detection precision.
To further enhance the discriminability of ship object features, an attention-based feature fusion scheme (AFFS) is developed in the model neck. Meanwhile, to constrain the model's computational complexity, a novel module called the hybrid attention feature fusion module (HAFFM) is introduced to keep the AFFS component lightweight.
2. Methodology
The proposed method obtains diversified features via CSP-DarkNet and fuses them to obtain multi-scale receptive fields using the feature pyramid network (FPN) [22]. According to the modules' functions, the proposed framework is composed of three components, i.e., feature extraction, feature fusion, and object prediction, as shown in Figure 1. Furthermore, this paper designs multiple novel attention-based modules for feature extraction and fusion to enhance ship features and ensure multi-scale ship detection competency in SAR images. Each component of the designed method is described in detail below.
2.1. Feature Extraction Module
Due to its excellent feature learning capabilities, the CNN has been extensively employed as an important model for feature embedding, but an obvious deficiency is that each position on the feature map is sampled with indistinctive weights via convolutions [23]. That is to say, during the convolution operation, it is impossible to suppress background interference while focusing on the object's region of interest. This deficiency prevents CNN-based SAR ship detection from effectively extracting the discriminative features of the object. In addition, a CNN extracts deep semantic information by stacking multiple convolution layers, while objects in large-scale SAR scenes exhibit a multi-scale phenomenon; a network with many layers risks losing small-scale target properties during information transmission and also becomes more complex. This work develops a novel hybrid attention residual module (HARM) to lessen the background clutter and enhance the objects' relevant information. As shown in Figure 2, the proposed HARM primarily consists of local channel and global spatial attention operations.
Considering that the shape information of objects in SAR images is inconspicuous, the proposed HARM first enhances the object features at the channel level. Generally, channel attention is realized through the global correlation among all channel features. It is a fact that the correlation between remote channels is weak, while the correlation between nearby channels is strong. Since the features of ship objects are inconspicuous in SAR images and the influence of noise is large, we argue that the correlation between long-distance channels may not be important for ship detection in SAR images. In the attention operation, the range of local channels depends on the channel dimension. Consequently, drawing on the idea of efficient channel attention (ECA) [24], the proposed HARM first focuses on vital channels according to local channel correlation. Assume that in a one-dimensional convolution process, k is the size of the convolution kernel, whose value is determined by a nonlinear mapping of the feature's channel dimension C. Mathematically, the channel selection expression is as follows [24]:

k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}},  (1)

where b and \gamma are coefficients set to 1 and 2, respectively, and |\cdot|_{\mathrm{odd}} means taking the odd number closest to the result of the operation.
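As a quick illustrative sketch (not the authors' implementation), the kernel-size selection of Equation (1) can be computed as follows; the helper name `eca_kernel_size` is hypothetical, with γ = 2 and b = 1 as stated above:

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1D-convolution kernel size k from the channel dimension C,
    following Equation (1): k = |log2(C)/gamma + b/gamma|_odd."""
    t = abs(math.log2(channels) / gamma + b / gamma)
    # take the odd number nearest to t
    return 2 * int(round((t - 1) / 2)) + 1

# typical backbone channel widths map to small local kernels
print([eca_kernel_size(c) for c in (64, 128, 256, 512)])  # [3, 5, 5, 5]
```

In other words, wider feature maps attend over a slightly larger local channel neighborhood, but the kernel stays small, which keeps the attention local and cheap.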
Furthermore, considering the small-scale characteristics of SAR objects in large-scale scenes, this paper draws on the idea of multi-head self-attention (MHSA) in the Transformer [25] to promote the effectiveness of feature extraction in the spatial dimension. Differing from local spatial attention, MHSA captures the correlation of each pixel with all other pixels. More significantly, this global spatial attention obtains multiple groups of parameters to learn the ships' features from various perspectives, which benefits SAR ship detection in both inshore and offshore situations. Notably, the outputs of this self-attention module can be computed in parallel, independent of prior outputs; that is to say, the operation is also fast.
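To make the global spatial attention concrete, here is a minimal single-head, NumPy-only sketch of the scaled dot-product self-attention that MHSA is built from; MHSA simply runs several such heads with separate parameter groups and concatenates their outputs. The names and shapes here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over flattened spatial positions.
    X: (N, d) feature map flattened to N = H*W positions of dimension d."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (N, N) attention map: every pixel attends to every other pixel (global)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))                       # 4x4 feature map, 8 channels
W = [rng.standard_normal((8, 8)) for _ in range(3)]    # stand-in learned projections
out = self_attention(X, *W)
print(out.shape)
```

Because every output row depends only on the input X, all positions can indeed be computed simultaneously, which is the parallelism noted above.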
Remarkably, the proposed HARM adopts a residual structure to promote the feature learning ability of the model. Differing from other residual modules, the proposed module replaces the original 3 × 3 convolution with an attention group, which makes the network focus on the ship and reduces the object information loss caused by convolutional sampling. In light of the abundant semantic features in the deep layers of the network, the proposed HARM is deployed in the deep layers of the feature extraction model.
2.2. Feature Fusion Module
The feature pyramid network (FPN) [22] is employed in the conventional feature fusion module to integrate multi-scale discriminative features. However, the FPN transfers information in only one direction, which limits its capability for feature fusion [26]. For this purpose, this paper develops a brand-new attention-based feature fusion scheme, which integrates the merits of attention and bidirectional transmission to efficiently fuse the multi-scale features from the feature extraction module.
In the bidirectional feature pyramid network, the first pyramid transfers and fuses strong semantic features through up-sampling operations, while the second pyramid achieves strong localization features through down-sampling operations. The feature fusion module is established on the bidirectional pyramid network and the attention mechanism, as shown in Figure 3. To restrict the complexity of the model, the lightweight attention module, i.e., the HAFFM, combines local channel and local spatial attention, as depicted in Figure 4. The channel attention first squeezes the input feature X \in \mathbb{R}^{C \times H \times W} into the feature Z \in \mathbb{R}^{C} using the global average pooling [27] operation shown in Equation (2):

z_c = F_{sq}(x_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j),  (2)

where x_c is the c-th channel feature of the input feature X, z_c is the c-th feature of Z, and F_{sq}(\cdot) is the squeeze mapping.
Then, the local channel attention obtains the weights \omega via a 1D convolution and the sigmoid function. The details are described as:

\omega = \sigma(\mathrm{Conv1D}_k(Z)),

where \mathrm{Conv1D}_k(\cdot) is the convolution operation whose kernel size is k, and k is described in Equation (1). Meanwhile, \sigma(\cdot) is the sigmoid function, denoted as:

\sigma(x) = \frac{1}{1 + e^{-x}}.

The output feature Y of the local channel attention is expressed as follows:

y_c = \omega_c \cdot x_c,

where y_c is the c-th channel feature of Y and \omega_c is the weight of the c-th channel feature.
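A NumPy-only sketch of the local channel attention path (squeeze, local 1D convolution, sigmoid, channel rescaling) may clarify the data flow; the fixed averaging kernel below stands in for the learned 1D-convolution weights, and all names are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_channel_attention(X, k=3):
    """ECA-style local channel attention. X: (C, H, W).
    Returns Y with each channel rescaled by its attention weight."""
    C = X.shape[0]
    z = X.mean(axis=(1, 2))                      # squeeze: global average pooling -> (C,)
    pad = k // 2
    zp = np.pad(z, pad, mode='edge')
    kernel = np.full(k, 1.0 / k)                 # stand-in for learned 1D conv weights
    conv = np.array([np.dot(zp[i:i + k], kernel) for i in range(C)])
    w = sigmoid(conv)                            # weights from local channel context only
    return X * w[:, None, None]                  # y_c = w_c * x_c

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4, 4))
Y = local_channel_attention(X)
print(Y.shape)
```

Note that each weight w_c depends only on the k neighboring channel descriptors, which is exactly what keeps this attention "local" and cheap compared with a fully connected channel-mixing layer.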
Mathematically, the local spatial attention operation is expressed as follows:

M_s(X) = \sigma\big(\mathrm{Conv}^{7 \times 7}([\mathrm{Avgpool}(X); \mathrm{Maxpool}(X)])\big),

where \mathrm{Conv}^{7 \times 7} denotes the convolution operation with a kernel size of 7 × 7, and \mathrm{Avgpool} and \mathrm{Maxpool} denote the average and maximum pooling operations [27], respectively.
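For illustration only, the local spatial attention path can be sketched as below; the learned 7 × 7 convolution is replaced by a simple local averaging stand-in, and the function name is hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_spatial_attention(X, ksize=7):
    """CBAM-style local spatial attention. X: (C, H, W).
    Channel-wise avg/max pooling, then a ksize x ksize conv + sigmoid map."""
    avg = X.mean(axis=0)            # average pooling over channels -> (H, W)
    mx = X.max(axis=0)              # max pooling over channels -> (H, W)
    stacked = np.stack([avg, mx])   # (2, H, W), the concatenated descriptors
    pad = ksize // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)), mode='edge')
    H, W = X.shape[1:]
    conv = np.zeros((H, W))
    for i in range(H):              # stand-in for the learned 7x7 convolution
        for j in range(W):
            conv[i, j] = padded[:, i:i + ksize, j:j + ksize].mean()
    M = sigmoid(conv)               # spatial attention map in (0, 1)
    return X * M[None]              # reweight every channel position-wise

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 8, 8))
out = local_spatial_attention(X)
print(out.shape)
```

Each spatial weight is driven by a 7 × 7 neighborhood of the pooled descriptors, so this branch complements the channel branch with purely local spatial context.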
2.3. Prediction Module
The classification loss L_{cls}, objectness loss L_{obj}, and localization loss L_{loc} make up the proposed MHASD's loss function, which is denoted as:

L = L_{cls} + L_{obj} + L_{loc}.

The classification and objectness losses are developed with BCEWithLogitsLoss, which combines the binary cross-entropy loss (BCELoss) function and the sigmoid function. The BCEWithLogitsLoss \ell(x, z) between the prediction x and the label z is described as:

\ell(x, z) = -\big[z \cdot \log \sigma(x) + (1 - z) \cdot \log(1 - \sigma(x))\big],

where \ell(x, z) denotes the loss value of x and z.
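For reference, a scalar version of BCEWithLogitsLoss can be sketched as follows, using the standard numerically stable rewriting of the formula above (the function name is illustrative):

```python
import math

def bce_with_logits(x: float, z: float) -> float:
    """BCEWithLogitsLoss for a single logit x and label z:
    -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))],
    rewritten as max(x, 0) - x*z + log(1 + exp(-|x|)) for stability."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

# a confident correct prediction yields a small loss,
# a confident wrong prediction a large one
print(bce_with_logits(5.0, 1.0))   # ~0.0067
print(bce_with_logits(-5.0, 1.0))  # ~5.0067
```

The stable form avoids computing log(sigmoid(x)) directly, which would underflow for strongly negative logits; this is the same trick used by common deep learning frameworks.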
Because the orientation of SAR ship objects is generally arbitrary, it is effective to consider the consistency of aspect ratios between the predicted and ground-truth boxes. Owing to its consideration of the box aspect ratio and its convergence speed, the complete intersection over union (CIoU) [28] is exploited as the localization loss. Figure 5 describes the CIoU loss of the predicted boundary A and the ground-truth boundary B. Theoretically, the localization loss is defined as [28]:

L_{loc} = L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,

where \rho(\cdot) stands for the Euclidean distance operation, c is the diagonal length of the smallest enclosing box that encompasses the two boxes, and b and b^{gt} stand for the center points of the predicted and ground-truth boxes. The IoU is denoted as [28]:

IoU = \frac{|A \cap B|}{|A \cup B|}.

The v term measures the consistency of the aspect ratio, as shown in [28]:

v = \frac{4}{\pi^2}\left(\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h}\right)^2,

where w and w^{gt} are the widths of the predicted and ground-truth boundaries, and h and h^{gt} are the heights of the predicted and ground-truth boundaries, respectively. The \alpha is a positive trade-off parameter, as follows [28]:

\alpha = \frac{v}{(1 - IoU) + v}.
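As a self-contained sketch of the CIoU loss described above (assuming axis-aligned boxes in (x1, y1, x2, y2) form; the small epsilon guards the division when IoU = 1 and v = 0):

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two axis-aligned boxes (x1, y1, x2, y2):
    1 - IoU + center-distance penalty + aspect-ratio penalty."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # IoU term
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # squared center distance over squared enclosing-box diagonal
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    rho2 = (acx - bcx) ** 2 + (acy - bcy) ** 2
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    # aspect-ratio consistency v and trade-off parameter alpha
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes -> 0.0
```

For identical boxes all three penalty terms vanish, while disjoint boxes still receive a useful gradient signal from the center-distance term even though their IoU is zero, which is precisely why CIoU converges faster than a plain IoU loss.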