Article

A Lightweight Man-Overboard Detection and Tracking Model Using Aerial Images for Maritime Search and Rescue

Navigation College, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Submission received: 26 October 2023 / Revised: 27 December 2023 / Accepted: 28 December 2023 / Published: 30 December 2023

Abstract

Unmanned rescue systems have become an efficient means of executing maritime search and rescue operations, ensuring the safety of rescue personnel. Unmanned aerial vehicles (UAVs), due to their agility and portability, are well-suited for these missions. In this context, we introduce a lightweight detection model, YOLOv7-FSB, and integrate it with ByteTrack for the real-time detection and tracking of individuals in maritime distress. YOLOv7-FSB is designed to optimize the use of the limited computational resources on UAVs and comprises several key components: FSNet serves as the backbone network, reducing redundant computations and memory access to enhance overall efficiency; the SP-ELAN module is introduced to maintain operational speed while improving feature extraction; and an enhanced feature pyramid structure makes the model highly effective at locating individuals in distress within aerial images captured by UAVs. By integrating this lightweight model with ByteTrack, we have created a system that improves detection accuracy from 86.9% to 89.2% while maintaining a detection speed similar to that of YOLOv7-tiny. Additionally, our approach achieves a MOTA of 85.5% and a tracking speed of 82.7 frames per second, meeting the demanding requirements of maritime search and rescue missions.

1. Introduction

In recent years, there has been a surge in maritime accidents worldwide, resulting in significant human and economic losses. Since 2014, the number of maritime accidents has steadily increased, averaging 2647 casualties and incidents annually [1]. Maritime search and rescue (SAR) operations play a vital role in national emergency response systems, with the primary challenge being swiftly and accurately locating and identifying objects at sea. The emergence of unmanned aerial vehicle (UAV) technology has revolutionized various fields, including robotics, security surveillance, intelligent transportation, wildlife conservation, and geospatial information [2,3,4,5,6,7]. Owing to their agility, portability, and aerial accessibility, UAVs offer rapid deployment, high data capacity, and outstanding spatial resolution, making them particularly effective for executing maritime SAR missions [8].
However, identifying individuals in maritime distress within UAV imagery presents unique challenges due to their small scale. Operator fatigue and distractions can also lead to missed sightings of man-overboard situations in aerial images. The application of deep learning-based object detection algorithms has significantly improved this situation. UAVs equipped with such algorithms can rapidly cover extensive maritime areas while promptly identifying individuals in distress, ensuring that they can be rescued faster, thereby increasing the probability of survival. Nonetheless, UAV image object detectors must strike a balance between lightweight design and real-time processing due to hardware limitations and specific application scenarios. This study focuses on enhancing the efficiency of identifying man-overboard situations in UAV imagery by addressing these challenges.
Several scholars have demonstrated the exceptional performance of the You Only Look Once (YOLO) series in the field of object detection, particularly in UAV-based detection systems [9,10,11]. Therefore, we have selected YOLOv7-tiny [12] as the UAV visual framework for maritime search and rescue operations involving man-overboard situations.
Our primary contributions to this work are as follows:
  • The introduction of a lightweight object detector, “YOLOv7-FSB,” tailored to efficiently identify man-overboard situations for rescue missions.
  • To address the challenge of small objects in images, we propose a lightweight backbone network named “FSNet” to enhance the model’s global information perception and improve its robustness.
  • In response to the limited feature information of submerged individuals in aerial images, we introduce the module design called “SP-ELAN.” This module integrates channel reconstruction units, reducing feature redundancy and enhancing algorithm efficiency.
  • To enhance the utilization efficiency of feature information for submerged individuals in aerial images, we implement an improved Bidirectional Feature Pyramid Network. This network performs a bidirectional fusion of features related to submerged individuals in aerial images, integrating both local and global features.
  • The selection of ByteTrack as the tracking component, combined with the improved detection algorithm, demonstrates real-time detection and tracking capabilities through experimentation.
The remainder of this paper is organized as follows: Section 2 reviews related research, Section 3 introduces the algorithm for identifying maritime man-overboard situations based on aerial imagery, and detailed information about the experiments and analysis is provided in Section 4. In Section 5, conclusions are drawn, and future research directions are outlined.

2. Related Works

2.1. Object Detection

In the early stages of computer vision, conventional object detection algorithms heavily relied on intricately handcrafted feature engineering. These methods required the meticulous design of complex feature representations and the incorporation of various acceleration techniques, all within an era characterized by a lack of truly effective image representations [13,14,15]. In subsequent years, progress in object detection research experienced a significant slowdown as the performance of handcrafted features reached a saturation point. However, a pivotal turning point occurred in 2014 when convolutional neural networks (CNNs) emerged as a solution to this challenge [16]. CNN-based object detection methods can be broadly categorized into one- and two-stage detectors.

2.1.1. Two-Stage Detectors

Two-stage algorithms typically follow a sequence where they first identify candidate regions within an image and then classify and precisely localize the target within these regions. This approach has evolved considerably in the field of object detection. Two-stage object detection can be traced back to the pioneering work of Girshick et al. [17], who introduced the region-based CNN (R-CNN) algorithm. R-CNN marked the first steps in applying deep learning to object detection and laid the foundational concepts for subsequent CNN-based detection techniques. He et al. [18] then made notable improvements to the CNN architecture by introducing spatial pyramid pooling (SPP) layers, a critical innovation that led to the SPP-Net detection algorithm and significantly improved the speed and efficiency of detection compared with the earlier R-CNN model. Building on this progress, Girshick further refined the methodology with the Fast R-CNN model [19], which discerns all potential candidate boxes directly from the feature maps extracted from the image; this not only enhanced the efficiency of training and detection but also marked a leap forward compared with R-CNN. Ren et al. [20] then presented the Faster R-CNN algorithm, which introduced a region proposal network to generate candidate regions, further optimizing training and detection speeds compared with the Fast R-CNN model.

2.1.2. One-Stage Detectors

In contrast to two-stage detectors, one-stage detectors rely on a global regression-based classification approach to directly predict the position and category of a target. These one-stage methods are well suited for real-time object detection, where speed is of utmost importance. Several influential algorithms have been developed in this category, each with its own set of features and trade-offs. One notable algorithm is the “You Only Look Once” detection algorithm introduced by Redmon et al. [21]. YOLO employs a single neural network to predict both the positions and categories of objects within images. While it excels in real-time detection, it tends to exhibit lower accuracy compared to some two-stage detectors. Liu et al. introduced the Single-Shot MultiBox Detector (SSD) algorithm [22], which improves upon YOLO in several respects. SSD utilizes the VGG-16 [23] deep convolutional neural network for extracting multiscale feature maps and directly outputs target positions. This refinement aims to strike a balance between speed and accuracy. The YOLO series, pioneered by Redmon, continued to evolve with YOLOv3 [24], which is characterized by high accuracy and speed. YOLOv4, introduced in 2020, incorporated CSPDarkNet-53 as the backbone network, further enhancing performance. YOLOv5 emphasizes engineering considerations, prioritizing model flexibility and ease of deployment, although it may involve certain performance trade-offs. Li et al. introduced YOLOv6 [25], which incorporates the RepBlock module inspired by the renowned RepVGG network [26], significantly enhancing training and inference speeds. YOLOv7 [12] introduces trainable bag-of-freebies, reparameterizes the model structure, and integrates “expansion” and “compound scaling” techniques to enhance inference speed and accuracy. YOLOv8 builds upon YOLOv5 by incorporating the latest idea of decoupled detection heads. Additionally, it enhances the backbone network and feature pyramid, resulting in improved performance. Furthermore, Lin et al. [27] proposed an innovative “focal loss” method to address class imbalance issues in dense object detectors. This innovation enabled the creation of a high-speed, high-precision single-stage detector named RetinaNet. Tan et al. [28] introduced a composite scale transformation method for object detection models, achieving superior results under different computational resource constraints compared to existing models.

2.1.3. Lightweight Detectors

To meet the demands of effective detection in industrial applications with limited memory and computational resources, researchers have introduced a range of lightweight object detection algorithms. These algorithms are crucial for scenarios where efficiency and real-time processing are paramount. One noteworthy approach was presented by Huang et al. [29] with their novel 3FL-Net. This algorithm significantly enhances the performance of lightweight object detectors, particularly under adverse weather conditions. 3FL-Net achieves this by closely integrating various components, including feature enhancement, feature extraction, feature adaptation, and lightweight detection subnetworks. It is a valuable solution for industrial applications that require reliable object detection in challenging environmental conditions. Wu et al. [30] introduced a lightweight backbone network and an efficient feature fusion network tailored for road damage object detection. They leveraged lightweight modules (LWCs), optimized attention mechanisms, and activation functions to build a system that efficiently identifies road damage, even with limited computational resources. Pang et al. [31] took a unique approach by designing structural reparameterization blocks (SRB) at the network module level. This innovation is particularly aimed at improving inference accuracy, especially for applications like satellite on-orbit computing where precision is critical. Liu et al. [32] addressed the challenges of lightweight object detection with their structurally reparameterized LightNet design. This approach not only improves feature extraction capabilities but also reduces model inference complexity. Furthermore, they employed knowledge distillation techniques to meet the accuracy and real-time requirements of industrial defect detection. These lightweight object detection algorithms are invaluable for a wide range of industrial applications where resource constraints, real-time processing, and the need for efficient yet accurate object detection are key considerations.

2.2. Object Tracking

Target tracking is a vital computer vision task that involves the real-time localization and tracking of specific objects within video sequences. This task comes with the added challenge of not only tracking objects but also preserving their identities over time. Unlike object detection, target tracking focuses on maintaining object identity throughout a sequence of frames. In the realm of multi-object tracking (MOT), the task typically involves two crucial components: a detection module responsible for identifying objects in each frame and a data association module that links the detections across frames. With advancements in object detection technology, “tracking by detection” has become a predominant framework in MOT. Two widely recognized tracking-by-detection frameworks are ByteTrack [33] and DeepSORT [34]. DeepSORT is a robust tracking framework that relies on the Kalman filter for object tracking. It consists of two essential components: SORT [35], which utilizes Kalman filtering for tracking, and a re-identification module that employs deep neural networks to extract unique features. DeepSORT’s straightforward yet effective architecture makes it a standout performer, particularly in high-frame-rate scenarios. On the other hand, ByteTrack excels in the data association part. It introduces a simple and efficient data association method known as BYTE. BYTE leverages the similarity between detection boxes and tracking trajectories. It retains high-scoring detection results while effectively eliminating background noise from low-scoring detections. This approach significantly improves the consistency of object tracking.
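To make the BYTE association described above concrete, the sketch below is a deliberately simplified, assumption-laden illustration: it uses greedy IoU matching instead of the Hungarian assignment and Kalman-filter motion prediction of the actual ByteTrack implementation, and the score thresholds are placeholders rather than values from the paper.

```python
def iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, iou_thr):
    """Greedy IoU matching (ByteTrack itself uses Hungarian assignment)."""
    matches, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_iou = None, iou_thr
        for di, d in enumerate(dets):
            if di not in used and iou(t["box"], d["box"]) > best_iou:
                best, best_iou = di, iou(t["box"], d["box"])
        if best is not None:
            used.add(best)
            matches.append((ti, best))
    unmatched_tracks = [i for i in range(len(tracks)) if i not in {m[0] for m in matches}]
    unmatched_dets = [i for i in range(len(dets)) if i not in used]
    return matches, unmatched_tracks, unmatched_dets

def byte_associate(tracks, detections, high_thr=0.6, low_thr=0.1, iou_thr=0.3):
    """One frame of BYTE-style association over {'box', 'score'} detections."""
    high = [d for d in detections if d["score"] >= high_thr]
    low = [d for d in detections if low_thr <= d["score"] < high_thr]
    # 1) Match existing tracks to high-score detections first.
    matches, leftover, leftover_high = greedy_match(tracks, high, iou_thr)
    for ti, di in matches:
        tracks[ti]["box"] = high[di]["box"]
    # 2) Try to recover still-unmatched tracks with low-score detections.
    remaining = [tracks[i] for i in leftover]
    for ti, di in greedy_match(remaining, low, iou_thr)[0]:
        remaining[ti]["box"] = low[di]["box"]
    # 3) Unmatched high-score detections start new tracks.
    next_id = max((t["id"] for t in tracks), default=0) + 1
    for di in leftover_high:
        tracks.append({"id": next_id, "box": high[di]["box"]})
        next_id += 1
    return tracks

tracks = [{"id": 1, "box": (100, 100, 140, 160)}]
dets = [{"box": (102, 104, 141, 162), "score": 0.35},   # low score, recovers track 1
        {"box": (300, 200, 340, 260), "score": 0.80}]   # high score, becomes track 2
print(byte_associate(tracks, dets))
```

The key design point mirrored here is the two-pass matching: low-scoring boxes are never discarded outright, but are only allowed to extend tracks that the high-scoring boxes could not explain.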
In summary, deep-learning-based object detection and tracking algorithms play a pivotal role in the field of computer vision. These algorithms enable the efficient utilization of UAVs for various applications. In our specific context, we combine the improved version of YOLOv7-tiny, which strikes a balance between accuracy and speed, with the straightforward and efficient ByteTrack framework for detecting and tracking individuals in maritime distress in aerial imagery. This combination leverages lightweight visual models, meeting the dual requirements of speed and accuracy in search and rescue operations using UAVs.

3. Materials and Methods

YOLOv7-tiny, like other detectors in the YOLO series, features a structured architecture comprising a backbone, neck, and head. The structure of YOLOv7-tiny is illustrated in Figure 1.
The backbone of YOLOv7-tiny includes multiple CBL modules, ELAN structures, and MP convolution layers. Each CBL module consists of convolution layers, batch normalization layers, and an activation function. YOLOv7-tiny employs LeakyReLU as its activation function, which is an evolution from the ReLU (Rectified Linear Unit) activation function. The ELAN structure enhances the network’s learning capacity by controlling the shortest and longest gradient paths, effectively extracting features. The downsampling structure is designed with MP convolution layers, incorporating both max-pooling and convolution layers to facilitate parallel feature extraction and compression.
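For reference, a CBL module as described above can be written in a few lines of PyTorch; this is a generic sketch (the channel counts, kernel size, and the 0.1 negative slope of LeakyReLU are common defaults, not values taken from this paper).

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the basic building block of YOLOv7-tiny."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a 640 x 640 RGB image downsampled by a stride-2 CBL.
x = torch.randn(1, 3, 640, 640)
print(CBL(3, 32, k=3, s=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```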
The neck of YOLOv7-tiny adheres to the PANet structure from YOLOv5. It involves operations for upsampling and downsampling features of different scales obtained from the backbone. This enables feature fusion, enhancing the model’s ability to capture information across scales. The neck comprises modules such as SPPCSPC and ELAN-Y. The SPPCSPC structure connects to the final layer of the backbone, introducing a substantial residual branch to optimize feature extraction, reduce computational load, and expand the receptive field.
The head module of YOLOv7-tiny encompasses detection heads operating at various scales, including large, medium, and small dimensions. These heads serve as the network’s classifier and regressor, managing classification and regression tasks for object detection. Ultimately, this module achieves the dual objectives of object classification and precise localization in the context of object detection.
YOLOv7 supports several bounding-box regression losses, including IoU, GIoU, DIoU, and CIoU. IoU is adequate for training sets with few samples, but given the large number of images in our dataset, it is not well suited to our setting. CIoU outperforms GIoU and DIoU by considering geometric factors such as the overlapping area, the center-point distance, and the aspect ratio, effectively addressing inaccurate and slow convergence. Hence, we employ CIoU as the model’s loss function to ensure accurate and rapid convergence.
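For reference, the CIoU loss has the standard form (restated from the literature rather than reproduced from this paper):

\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad
v = \frac{4}{\pi^2}\left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2, \qquad
\alpha = \frac{v}{(1 - IoU) + v}

where $b$ and $b^{gt}$ are the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both, and $w$, $h$ denote box width and height.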
Building upon this model, we introduce YOLOv7-FSB, tailored for real-time detection and tracking of individuals in maritime distress within aerial imagery captured by UAVs.

3.1. Improvement Based on FSNet

In aerial images captured by drones, the targets are often small, placing higher demands on the backbone network of a model. However, the onboard computational resources are limited, necessitating a lightweight yet powerful feature extraction backbone network. Therefore, we propose an innovative lightweight backbone network named FSNet, specifically designed for efficiently extracting features relevant to individuals in distress in aerial images captured by unmanned aerial vehicles.
FasterNet [36], a crucial component of FSNet, comprises an inverted residual block that includes a partial convolution (PConv) layer, two 1 × 1 convolution layers, batch normalization, and ReLU activation. PConv has proven to be highly effective in extracting spatial features of individuals in maritime distress within aerial imagery. It not only reduces redundant computations but also minimizes memory access. Traditionally, lightweight model design focused on reducing floating-point operations (FLOPs). However, it was observed that simply reducing FLOPs did not significantly improve speed, as memory access often became the bottleneck. The introduction of PConv effectively addresses this issue.
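The idea behind PConv can be sketched as follows; this is our own simplification of the FasterNet building block, and the 1/4 convolved-channel ratio is an assumption rather than a value stated in this paper.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: convolve only the first 1/div of the channels."""
    def __init__(self, channels, div=4, k=3):
        super().__init__()
        self.conv_ch = channels // div  # channels that actually get convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, k, 1, k // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        # The untouched channels are concatenated back unchanged, which is what
        # saves both FLOPs and memory access compared with a full convolution.
        return torch.cat((self.conv(x1), x2), dim=1)

x = torch.randn(1, 64, 80, 80)
print(PConv(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```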
The Simple Attention Module (SimAM) [37], incorporated into FSNet, introduces three-dimensional attention weights that consider both spatial and channel dimensions. Unlike previous attention modules, SimAM excels in capturing both channel and spatial features, offering enhanced flexibility and modularity without the need for additional parameters in the base network.
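A minimal sketch of SimAM, following its published energy-based formulation (the regularization constant `e_lambda` is the usual default and an assumption here):

```python
import torch

def simam(x, e_lambda=1e-4):
    """SimAM: a 3-D attention weight per activation, with no extra parameters."""
    b, c, h, w = x.shape
    n = h * w - 1
    # Squared deviation of each activation from its channel mean.
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    # Channel-wise variance estimate.
    v = d.sum(dim=(2, 3), keepdim=True) / n
    # Inverse energy: more distinctive neurons receive larger weights.
    e_inv = d / (4 * (v + e_lambda)) + 0.5
    return x * torch.sigmoid(e_inv)

print(simam(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 32, 40, 40])
```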
As illustrated in Figure 2, each FSNetBlock in FSNet cleverly integrates FasterNet with the SimAM attention module. Within an FSNetBlock, a combination of 3 × 3 PConv and 1 × 1 Conv is employed, where the use of 1 × 1 convolutional layers reduces the parameter count, accelerates the training speed, and enhances the model’s nonlinear fitting capability. However, these layers exhibit a limited receptive field, impeding the acquisition of global features. Leveraging the lightweight attention mechanism of the SimAM module addresses this issue, resulting in an improved receptive field for the model. The FSNet lightweight backbone network is ultimately composed of multiple FSNetBlocks. This design significantly enhances the extraction of relevant features for individuals in distress at sea, enabling FSNet to efficiently and rapidly extract these crucial features from aerial images.
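Putting the two previous sketches together, an FSNetBlock might look roughly like the following; the expansion ratio, normalization placement, and residual connection are our assumptions, since the exact wiring is given in Figure 2 rather than in the text (the `PConv` and `simam` definitions are reused from the sketches above).

```python
import torch
import torch.nn as nn

class FSNetBlock(nn.Module):
    """Sketch of an FSNetBlock: 3x3 PConv -> 1x1 expand -> 1x1 project -> SimAM."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)              # spatial mixing on part of the channels
        self.pw1 = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)

    def forward(self, x):
        y = self.pw2(self.pw1(self.pconv(x)))
        return simam(x + y)                       # residual connection, then SimAM attention

print(FSNetBlock(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```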
FSNet offers several advantages, including reduced parameters, lower computational requirements, and superior feature extraction efficiency. These attributes are particularly relevant for real-time feature extraction in the context of maritime distress detection using UAV-captured aerial imagery.

3.2. Improvement Based on SP-ELAN

When detecting individuals in aerial images, the submerged person is often only partially visible above the water surface, limiting the available features in the image. This poses a greater demand for the feature extraction capabilities of the model. To enhance feature extraction in computer vision applications within the context of aerial image analysis, we drew inspiration from ScConv [38] and PConv, leading to the development of the SP-ELAN module.
ScConv, following a split–transform–merge strategy, extracts features from multiple parallel branches with distinct roles and concatenates these outputs for the final result. It divides the input into two parts, each processed by branches dedicated to extracting diverse types of contextual information. One branch involves the adaptive calibration of input features through convolution filters, facilitating communication between the filters, while the other branch preserves the original spatial context. PConv, on the other hand, judiciously applies regular convolution to a subset of input channels, leaving the rest untouched. This approach optimizes computational resources and enhances the capability for extracting spatial features from aerial imagery.
As depicted in Figure 3, the SP-ELAN module leverages the advantages of ScConv and PConv. Specifically, we strategically replace certain Convs in the original ELAN with PConv, and the fused results undergo adaptive calibration through ScConv to yield the final output. This approach effectively integrates self-calibration with original spatial context information, generating highly discriminative output content while simplifying parameters. Such integration significantly enhances the precision of feature extraction, positioning it as a valuable complement to computer vision applications.

3.3. Improvement Based on BiFPN-S

As a critical component bridging the gap between the backbone and the head, the neck plays a pivotal role in processing and amalgamating features extracted from the main trunk to better suit the requirements of object detection tasks. While the Feature Pyramid Network (FPN) has been a fundamental element in recognizing objects of varying sizes, its traditional top–down structure is limited by the unidirectional flow of information. To address this limitation, the Path Aggregation Network (PAN) introduced a bottom–up aggregation path, which improved accuracy but added complexity in terms of parameters and computational requirements.
In aerial images, the feature information of man-overboard situations holds particular significance. Traditional methods of feature fusion face challenges such as limitations in unidirectional information flow and high search costs. To establish a lightweight feature pyramid that strikes a balance between efficiency and accuracy, this paper introduces an enhanced version of the Bidirectional Feature Pyramid Network (BiFPN) [28], referred to as BiFPN-S in this paper. The original concept of BiFPN aimed to enhance pathways through efficient bidirectional cross-scale connections and weighted feature fusion. However, higher-level feature layers offer limited assistance in detecting small targets in aerial images. To address this issue and achieve the goal of eliminating redundant feature extraction calculations without compromising the model’s ability to extract features related to man-overboard situations, we propose BiFPN-S.
In BiFPN-S, fusion is conducted as follows:
O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i

Here, a ReLU activation is applied to each weight $w_i$ to ensure $w_i \geq 0$, and $\epsilon = 0.0001$ is introduced to avoid numerical instability. Taking the fifth layer as an example:

P_5^{td} = \mathrm{Conv}\left( \frac{w_1 \cdot P_5^{in} + w_2 \cdot \mathrm{Resize}(P_6^{in})}{w_1 + w_2 + \epsilon} \right)

P_5^{out} = \mathrm{Conv}\left( \frac{w_1' \cdot P_5^{in} + w_2' \cdot P_5^{td} + w_3' \cdot \mathrm{Resize}(P_4^{out})}{w_1' + w_2' + w_3' + \epsilon} \right)

where $P_5^{td}$ represents the intermediate feature of level 5 in the top–down path and $P_5^{out}$ is the output feature of level 5 in the bottom–up path. Here, $\mathrm{Conv}(\cdot)$ denotes depthwise separable convolution, with batch normalization and an activation function added after each convolution. In our optimized version, we reduce the number of feature extraction layers, specifically eliminating the highest-level feature extraction layer, resulting in a lightweight model, as depicted in Figure 4. BiFPN-S provides an effective solution to enhance feature extraction in drone imagery without introducing unnecessary complexity.
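The weighted fusion above can be sketched as a small PyTorch module; the depthwise separable convolution, activation choice, and channel count below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse n same-shaped feature maps with learnable, ReLU-clamped weights."""
    def __init__(self, n_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        # Depthwise separable convolution applied after fusion (the Conv(.) above).
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, inputs):
        w = F.relu(self.w)                 # enforce w_i >= 0
        w = w / (w.sum() + self.eps)       # fast normalized fusion
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)

p5_in = torch.randn(1, 128, 40, 40)
p6_up = torch.randn(1, 128, 40, 40)        # P6 already resized to P5's resolution
p5_td = WeightedFusion(2, 128)([p5_in, p6_up])
print(p5_td.shape)  # torch.Size([1, 128, 40, 40])
```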

3.4. Tracking Model

Building upon our well-optimized object detection network, the next crucial step involves selecting an appropriate tracking method. In our evaluation, ByteTrack stands out as an exceptional choice, demonstrating outstanding performance and offering a streamlined solution for practical applications.
ByteTrack’s approach places a strong emphasis on low-scoring detection boxes. It efficiently reassigns these low-scoring boxes for matching with previously unassociated tracking trajectories once the high-scoring boxes have been matched. Furthermore, when ByteTrack encounters detection boxes with sufficiently high scores but cannot find a match, it initiates the creation of new tracking trajectories. During data matching, ByteTrack has a low dependency on ReID features for appearance similarity calculations. This is a crucial advantage, particularly in scenarios such as aerial images depicting individuals in water rescue situations, where the features available for identifying distressed individuals are notably limited.
Considering the specific challenges posed by water rescue scenarios in aerial images and the emphasis on efficient tracking, we have chosen ByteTrack as our tracking model. It aligns well with our objectives, and as a tracking-by-detection method, ByteTrack’s tracking effectiveness is highly contingent on the detector’s performance. When the detector performs well, it yields favorable tracking results.
Hereafter, we refer to the method combining FSNet, SP-ELAN, and BiFPN-S by the initials of its components, denoting it ‘YOLOv7-FSB.’ In summary, to achieve efficient and effective detection and tracking of individuals in water rescue scenarios, we leverage the optimized object detection algorithm presented in this paper. This involves the seamless integration of YOLOv7-FSB with ByteTrack to accomplish the tracking task. The collaboration between these two components results in a lightweight and efficient model, well suited to the demanding requirements of water rescue scenarios.

4. Experiments

4.1. Dataset

To validate the enhanced performance of YOLOv7-FSB, we conducted experiments using a meticulously curated dataset that combines selected portions of the MOBDrone [39] and SeaDronesSee [40] datasets.
The MOBDrone dataset comprises 49 videos captured with a DJI FC6310 camera mounted on a Phantom 4 Pro V2 drone. These videos portray various scenarios simulating individuals falling into water, encompassing both conscious and unconscious individuals, as well as other objects. The footage has been post-processed to a resolution of 1080p, and professional annotators manually labeled bounding boxes for objects in five categories: person, boat, surfboard, wood, and lifebuoy. The SeaDronesSee dataset showcases a diverse range of situations, with altitudes spanning from 5 m to 260 m and viewing angles varying from 0° to 90°; each frame is accompanied by the pertinent altitude, angle, and other metadata. This dataset was captured using multiple cameras, providing a wide range of scenarios, and its annotations cover various categories, such as swimmers, boats, jet skis, life-saving equipment, and buoys.
MOBDrone focuses on individuals in maritime man-overboard situations who are not wearing life jackets, whereas SeaDronesSee encompasses a wide range of scenarios related to an entire rescue process. These datasets have been amalgamated and processed for joint utilization in validating the proposed methodology. By combining the scenarios from MOBDrone and the diverse conditions from SeaDronesSee, we achieve a more comprehensive assessment of the methodology’s performance in maritime search and rescue scenarios.

4.2. Experimental Setup

This research was conducted using the Linux operating system. The configuration included an Intel(R) Xeon(R) Gold 6338 CPU @ 2.00 GHz with a minimum clock frequency of 0.8 GHz, complemented by 512 GB of memory. We harnessed the power of the NVIDIA A100-PCIE-40GB graphics processing unit (GPU) with 40 GB of memory capacity. To leverage GPU acceleration, the system ran on CUDA 11.7, and we primarily employed PyTorch 2.0.1 as the deep learning framework.
Alongside the hardware and software configuration, the hyperparameter settings strongly influence model performance. Image size affects both accuracy and speed; the learning rate sets the step size of each parameter update and must be tuned to the problem and the model; the learning rate decay schedule is adjusted to aid convergence; the batch size is chosen to make full use of GPU memory; the number of data-loading workers is increased to speed up data preparation while avoiding memory leaks; and the number of training epochs is selected according to the task and the complexity of the model. The experimental settings are summarized in Table 1.
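For concreteness, the hyperparameters of Table 1 can be collected in a small configuration dictionary such as the one below; the key names are our own and do not correspond to any particular training script.

```python
# Training configuration mirroring Table 1 (key names are illustrative).
train_config = {
    "img_size": 1280,      # input resolution 1280 x 1280
    "lr0": 0.01,           # initial learning rate
    "lr_decay": 0.1,       # learning rate decay setting (Table 1)
    "batch_size": 64,
    "workers": 32,         # data-loading worker processes
    "epochs": 100,         # maximum training epochs
    "device": "cuda:0",    # NVIDIA A100-PCIE-40GB in our experiments
}

if __name__ == "__main__":
    for key, value in train_config.items():
        print(f"{key}: {value}")
```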

4.3. Evaluation Metrics

Confusion matrices are fundamental tools in the assessment of deep learning models, particularly in the context of computer vision. They offer a comprehensive means of evaluating a model’s performance by quantifying correct and incorrect predictions for each class.
Typically, a confusion matrix comprises four quadrants: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). True positives represent instances where the model accurately predicts positive cases. False positives refer to cases where the model incorrectly classifies negatives as positives. True negatives represent accurate predictions of negatives, while false negatives signify incorrect classifications of positives as negatives.
To provide a more comprehensive and intuitive evaluation of target detection and tracking algorithms, advanced evaluation metrics have been developed based on the foundation established by confusion matrices. These metrics significantly enhance our ability to understand and assess the performance of these algorithms.

4.3.1. Detection Evaluation Metrics

The evaluation of object detection algorithms hinges on two critical aspects: accurate object localization and correct object classification. In the context of deep learning for maritime applications involving target detection and tracking, evaluation metrics play a pivotal role in assessing the performance of detection models. IoU is a simple measure, but it focuses solely on overlap areas, disregarding considerations of object size and shape. While accuracy is easy to comprehend, it can be misleading in unbalanced datasets. Precision emphasizes correct positive predictions while ignoring false positives, and recall prioritizes correct positive detections while disregarding false negatives, making both less suitable for unbalanced data. AP calculates single-class accuracy, and mean average precision (mAP) aggregates multiple-class AP.
The primary evaluation metrics based on these criteria are presented in Table 2. In the “Note” column of the table, “↑” indicates that a higher value corresponds to better model performance, while “↓” indicates that a lower value signifies better performance. “Perfect” denotes the theoretical value for optimal performance.
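As a self-contained illustration of how the precision, recall, and AP entries of Table 2 are computed from detection outcomes (the numbers in the example are made up):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scored_hits, num_gt):
    """AP over a score-ranked list of detections.

    scored_hits: list of (score, is_true_positive) sorted by descending score.
    num_gt: number of ground-truth objects.
    """
    tp = fp = 0
    points = []                                  # (recall, precision) curve samples
    for _, hit in scored_hits:
        tp += hit
        fp += not hit
        points.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_r = 0.0, 0.0
    for r, _ in points:
        # Interpolated precision: the best precision achieved at recall >= r.
        p_interp = max(p for rr, p in points if rr >= r)
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap

print(precision_recall(tp=80, fp=10, fn=20))     # (0.888..., 0.8)
dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(round(average_precision(dets, num_gt=4), 3))  # approximately 0.688
```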

4.3.2. Tracking Evaluation Metrics

Evaluating the performance of object tracking algorithms requires consideration of two fundamental principles. Firstly, it involves a thorough examination of the algorithm’s ability to accurately locate target positions. Secondly, it evaluates the algorithm’s effectiveness in maintaining the individual identities of each target over time. Based on these principles, several advanced evaluation metrics have been developed to provide a comprehensive assessment.
Bernardin and Stiefelhagen introduced CLEAR MOT [41], which serves as a metric for measuring the tracking model’s localization accuracy and association matching capabilities. Ristani et al. proposed the ID Score [42], focusing on the stability and durability of the tracker’s object tracking.
Table 3 summarizes these evaluation metrics.
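The MOTA definition in Table 3 can be checked with a short calculation over per-frame counts; the numbers below are purely illustrative.

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """MOTA = 1 - (sum of misses, false positives and ID switches) / (sum of ground truths)."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / sum(gt_per_frame)

# Three frames, one object missed in frame 2 and one ID switch in frame 3.
print(mota(fn_per_frame=[0, 1, 0],
           fp_per_frame=[0, 0, 0],
           idsw_per_frame=[0, 0, 1],
           gt_per_frame=[5, 5, 5]))   # 1 - 2/15, i.e. about 0.867
```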

4.4. Detection Results Discussion

4.4.1. Activation Function Comparison

We conducted a performance comparison of YOLOv7-tiny using two different activation functions: LeakyReLU and SiLU. SiLU is the default activation function used in YOLOv5 and YOLOv7, and it is recognized as an improved version of ReLU with a smoother behavior near zero. The experimental results, as shown in Table 4, reveal that LeakyReLU outperforms SiLU as an activation function. Based on these results, we have selected LeakyReLU as the activation function for our model.

4.4.2. Comparison of FasterNet and FSNet Performance

To validate the effectiveness of FSNet, we conducted a performance comparison between FSNet and FasterNet as backbone network models. The experimental results, as presented in Table 5, demonstrate that FSNet, being an enhanced version of FasterNet, excels in the detection of individuals in maritime distress in aerial imagery. This superiority is attributed to FSNet’s reduction in redundant computations and memory access while maintaining efficient computational processes and excellent feature extraction capabilities.

4.4.3. Ablation Test Comparison

To analyze the effectiveness of the different methods, a series of ablation experiments was designed. Each experiment used the same dataset, training parameters, and training procedure. YOLOv7-tiny served as the baseline for the initial experiment. Subsequent experiments introduced the following enhancements: YOLOv7-A integrated FSNet, YOLOv7-B used SP-ELAN to improve accuracy, YOLOv7-C incorporated BiFPN-S, YOLOv7-D combined FSNet and BiFPN-S, YOLOv7-E combined SP-ELAN and BiFPN-S, YOLOv7-F combined FSNet and SP-ELAN, and YOLOv7-FSB synergistically utilized all three methods.
As depicted in Figure 5 and detailed in Table 6, it is evident that, compared to the original YOLOv7-tiny model, the detection speed of each modular method has not been significantly affected, while detection accuracy has improved. This underscores the effectiveness of the lightweight techniques employed in this study for personnel overboard detection. Bold font indicates the best-performing results, and the same applies throughout.
  • The analysis of experiments 1–3 shows that each individual method improves detection accuracy over the baseline YOLOv7-tiny model while keeping detection speed at a comparable level. Among these methods, FSNet stands out with the highest speed of 98.1 frames per second and an mAP of 87.3%. The introduction of SP-ELAN effectively integrates self-calibration and original spatial context information, making more efficient use of the device’s computational capabilities and resulting in an mAP of 87.9% at 96.2 frames per second. The inclusion of BiFPN-S strengthens the model’s capability for extracting features of individuals in maritime distress, reaching an mAP of 87.5% at 92 frames per second.
  • Experiments 4–6 reveal that, compared with YOLOv7-tiny, the pairwise combinations FSNet + BiFPN-S, SP-ELAN + BiFPN-S, and FSNet + SP-ELAN boost the mAP by 1.3%, 1.8%, and 1.5%, respectively. The integration of all three modules improves model performance while maintaining detection speed, resulting in a more lightweight model without sacrificing accuracy.
  • Based on the summarized optimization methods, we developed a lightweight search and rescue algorithm tailored for personnel overboard scenarios. This algorithm combines FSNet as the backbone network, integrates SP-ELAN for model enhancement, and incorporates BiFPN-S for feature fusion. The proposed method maintains the same outstanding detection speed as the original YOLOv7-tiny model while enhancing the mAP by 2.3%.

4.4.4. Comparison of Different Object Detection Models

To further analyze the performance of the lightweight algorithm in detecting individuals in maritime distress in aerial imagery, a comparison was conducted on the test dataset between YOLOv7-FSB and other networks, including YOLOv8n, YOLOv7-tiny, YOLOv5s, RetinaNet, SSD, and EfficientDet. Given that two-stage detection models have longer inference times, which may not meet real-time requirements, we opted to compare them with faster single-stage detectors. The recognition results of each network model are presented in Table 7.
When comparing the lightweight model YOLOv7-FSB to other models, the results indicate significant differences. The SSD model experiences a 47.6% decrease in speed, a parameter increase of 17.8 million, and a 22.5% drop in the mAP. The RetinaNet model shows a 28.3% reduction in speed, a parameter increase of 30.5 million, and a 21.1% decrease in the mAP. The EfficientDet model’s speed decreases by 51.1%, with a parameter increase of 14.7 million and a 36.1% drop in the mAP. The YOLOv5s model’s speed drops by 12.9%, with a parameter increase of 1.21 million and a 4.1% reduction in the mAP. The YOLOv7-tiny model experiences a negligible reduction in speed of roughly 1.2%, an increase of 0.2 million parameters, and a 2.3% decrease in the mAP.
By integrating FSNet, our model achieves heightened non-linear fitting capabilities, improved receptive fields, and an optimized parameter count and training speed. The incorporation of the SP-ELAN module adeptly allocates computational resources, seamlessly integrating self-calibration and spatial contextual information. Even with streamlined parameters, the model sustains a high degree of discriminative power in its outputs. With support from the BiFPN-S module, the model reinforces pathways through efficient bidirectional cross-scale connections and weighted feature fusion, ensuring a delicate balance between enhanced performance and swift inference speeds. In the context of personnel overboard detection and tracking, our proposed YOLOv7-FSB method surpasses all other algorithms in terms of detection accuracy and speed, making it a practical choice for real-world applications. These results underscore the effectiveness of the lightweight techniques employed in this study, enabling real-time detection tasks based on UAV imagery of individuals in maritime distress.

4.4.5. Results and Visualization

To enhance transparency and facilitate a more intuitive evaluation and comparison of the proposed small-target detection methods, we incorporated the Grad-CAM (Gradient-weighted Class Activation Mapping) technique [43]. This approach visualizes the heat maps corresponding to detected objects, providing a visual representation of the network’s focus areas. Grad-CAM computes the gradients of the target class output based on the final convolutional layer’s feature maps. Subsequently, these gradients are leveraged to perform a weighted summation, resulting in activation maps that emphasize regions of interest. The visualization of these attention regions is crucial for understanding the network’s decision-making process, highlighting areas where the network is most confident about object detection or areas with high activation values.
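In outline, Grad-CAM can be reproduced with forward and backward hooks as in the generic sketch below; it uses a torchvision classifier as a stand-in, since applying it to a YOLO detection head requires model-specific choices about which score to back-propagate.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]                   # last convolutional block

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()                         # score of the top class
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)        # GAP over the gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized [0, 1] heat map
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```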
Figure 6 illustrates Grad-CAM images for YOLOv7-tiny and YOLOv7-FSB across various scenarios, while Figure 7 showcases corresponding Grad-CAM images for different enhancement methods in the same scene. In these images, brighter regions denote specific areas prioritized by the network. The enhanced models exhibit remarkable feature extraction capabilities, particularly in recognizing individuals in distress at sea and mitigating the impact of noise on the model.
As a result, the YOLOv7-FSB model demonstrates outstanding performance in the tasks of locating and rescuing individuals in maritime distress. We have employed this model as the detector in our tracking-by-detection visual framework.

4.5. Tracking Results Discussion

4.5.1. Comparison of Different Tracking Models

To validate the performance of the improved YOLOv7-FSB model when combined with ByteTrack and DeepSORT separately, we conducted experiments using three sets of image sequences. The results are presented in Table 8.
In sequences 1 and 3, ByteTrack outperformed DeepSORT in terms of MOTA by 14.9% and 8.7%, respectively. Throughout the entire tracking process, the number of ID switches when using ByteTrack was significantly lower than when using DeepSORT. Trackers need to run in tandem with the detector, and the number of frames per second achieved using the ByteTrack algorithm was around 82. In sequence 2, ByteTrack’s MOTA was 5.3% higher than DeepSORT’s, with a similar number of ID switches. Typically, real-time monitoring requires processing at least 30 frames per second, ensuring the system can keep up with the flow of data. When combining the improved YOLOv7-FSB model with ByteTrack for validating video sequences, the average processing speed reached approximately 82.7 frames per second. Based on the data in the table, it is evident that ByteTrack, which relies on motion features alone, is more suitable for tracking maritime individuals in distress, offering significant advantages in terms of tracking efficiency and ID switching reduction.

4.5.2. Comparison of Different Detection Models

To assess the impact of the improved YOLOv7-FSB model on tracking results, we conducted experiments by combining YOLOv7-tiny and YOLOv7-FSB separately with ByteTrack. The results are displayed in Table 9.
Across all the image sequences, when using YOLOv7-FSB as the detector for tracking, the number of ID switches was higher compared to when using YOLOv7-tiny as the detector. However, the MOTA value when combining YOLOv7-FSB with ByteTrack was higher than when using YOLOv7-tiny with ByteTrack. Typically, an increase in ID switches leads to a decrease in MOTA because it indicates a less stable tracking process. The situation where both the number of ID switches and the MOTA values increase is due to YOLOv7-FSB having fewer instances of missed detections, meaning it provides better object detection capabilities compared to YOLOv7-tiny. This further validates the effectiveness of the YOLOv7-FSB model proposed in this paper.
In conclusion, the lightweight solution presented in this paper, YOLOv7-FSB, reduces the number of parameters and computations, allowing it to overcome the constraints of UAV computing resources. When detecting individuals in maritime distress in aerial images, this algorithm achieves detection speeds comparable to YOLOv7-tiny while significantly improving detection accuracy. Combining YOLOv7-FSB with ByteTrack results in excellent tracking performance, meeting the practical engineering requirements for UAV-based search and rescue operations. The model proposed in this article can find people who have fallen into the water faster and improve the chances of their rescue, which is of great significance.

4.6. Future Research Directions

In the future, our research will continue to focus on improving detector accuracy by exploring multi-sensor fusion techniques. We aim to integrate data from multiple sources, such as visible light, thermal imaging, near-infrared, etc., to enhance the system’s detection robustness. Additionally, we plan to refine the architecture and expand functionalities to create a more powerful and comprehensive solution.

5. Conclusions

This study presents YOLOv7-FSB, a novel algorithm designed for the real-time detection and tracking of individuals who fall overboard at sea. The algorithm is built upon the YOLOv7-tiny framework with a focus on lightweight design to reduce detector size, maintain recognition speed, and enhance detection accuracy, thereby ensuring real-time recognition of individuals in maritime distress in aerial images. YOLOv7-FSB employs FSNet as the backbone network, reducing redundant computations and memory access. The SP-ELAN module makes better use of the device’s computational capabilities. Additionally, the enhanced feature pyramid structure, BiFPN-S, bolsters its feature extraction capability and inference speed. To validate the effectiveness of YOLOv7-FSB, rigorous testing was conducted using datasets selected from MOBDrone and SeaDronesSee as benchmarks, including ablation experiments and comparative trials. Subsequently, we combined the lightweight YOLOv7-FSB model with ByteTrack as a detection-based tracker, ensuring that the tracking performance meets the real-time and accuracy requirements for detecting and tracking individuals in maritime distress in aerial images. The visual model proposed in this paper can accurately perform real-time detection and tracking tasks, offering a suitable technological solution for large-scale and rapid search and rescue operations for individuals in maritime distress.

Author Contributions

Conceptualization, Y.Z., Y.Y. and Q.T.; formal analysis, Y.Z. and Y.Y.; funding acquisition, Y.Y. and Q.T.; investigation, Y.Z. and Q.T.; methodology, Y.Z., Y.Y. and Q.T.; project administration, Y.Z.; software, Y.Z. and Q.T.; supervision, Y.Y.; validation, Y.Z. and Y.Y.; writing the original draft, Y.Z. and Q.T.; writing—review and editing, Y.Z. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Ship Maneuvering Simulation in Yunnan Inland Navigation, grant number 851333J; the National Key R&D Program of China, grant number 2022YFB4300803; the National Key R&D Program of China, grant number 2022YFB4301402; and the Liaoning Provincial Science and Technology Plan (Key) project, grant number 2022JH1/10800096.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. European Maritime Safety Agency (EMSA). Annual Overview of Marine Casualties and Incidents. 2022. Available online: https://emsa.europa.eu/csn-menu/items.html?cid=14&id=4867 (accessed on 30 November 2022).
  2. Tomic, T.; Schmid, K.; Lutz, P.; Domel, A.; Kassecker, M.; Mair, E.; Grixa, I.; Ruess, F.; Suppa, M.; Burschka, D. Toward a Fully Autonomous UAV: Research Platform for Indoor and Outdoor Urban Search and Rescue. IEEE Robot. Automat. Mag. 2012, 19, 46–56. [Google Scholar] [CrossRef]
  3. Manyam, S.G.; Rasmussen, S.; Casbeer, D.W.; Kalyanam, K.; Manickam, S. Multi-UAV Routing for Persistent Intelligence Surveillance & Reconnaissance Missions. In Proceedings of the 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Miami, FL, USA, 13–16 June 2017; pp. 573–580. [Google Scholar]
  4. Jung, S.; Hwang, S.; Shin, H.; Shim, D.H. Perception, Guidance, and Navigation for Indoor Autonomous Drone Racing Using Deep Learning. IEEE Robot. Autom. Lett. 2018, 3, 2539–2544. [Google Scholar] [CrossRef]
  5. Ammar, A.; Koubaa, A.; Ahmed, M.; Saad, A.; Benjdira, B. Vehicle Detection from Aerial Images Using Deep Learning: A Comparative Study. Electronics 2021, 10, 820. [Google Scholar] [CrossRef]
  6. Yao, H.; Qin, R.; Chen, X. Unmanned Aerial Vehicle for Remote Sensing Applications—A Review. Remote Sens. 2019, 11, 1443. [Google Scholar] [CrossRef]
  7. Ravindran, R.; Santora, M.J.; Jamali, M.M. Multi-Object Detection and Tracking, Based on DNN, for Autonomous Vehicles: A Review. IEEE Sens. J. 2021, 21, 5668–5677. [Google Scholar] [CrossRef]
  8. Yang, T.; Jiang, Z.; Sun, R.; Cheng, N.; Feng, H. Maritime Search and Rescue Based on Group Mobile Computing for Unmanned Aerial Vehicles and Unmanned Surface Vehicles. IEEE Trans. Ind. Inf. 2020, 16, 7700–7708. [Google Scholar] [CrossRef]
  9. Bomantara, Y.A.; Mustafa, H.; Bartholomeus, H.; Kooistra, L. Detection of Artificial Seed-like Objects from UAV Imagery. Remote Sens. 2023, 15, 1637. [Google Scholar] [CrossRef]
  10. Zhao, X.; Xia, Y.; Zhang, W.; Zheng, C.; Zhang, Z. YOLO-ViT-Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection. Remote Sens. 2023, 15, 3778. [Google Scholar] [CrossRef]
  11. Wang, Y.; Zou, H.; Yin, M.; Zhang, X. SMFF-YOLO: A Scale-Adaptive YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes. Remote Sens. 2023, 15, 4580. [Google Scholar] [CrossRef]
  12. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors 2022. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  13. Viola, P.; Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. I-511–I-518. [Google Scholar]
  14. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  15. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A Discriminatively Trained, Multiscale, Deformable Part Model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  16. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  17. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  19. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. ISBN 978-3-319-46447-3. [Google Scholar]
  23. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition 2015. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  24. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement 2018. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  25. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications 2022. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  26. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar]
  27. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
  28. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection 2020. arXiv 2019, arXiv:1911.09070. [Google Scholar]
  29. Huang, S.-C.; Jaw, D.-W.; Hoang, Q.-V.; Le, T.-H. 3FL-Net: An Efficient Approach for Improving Performance of Lightweight Detectors in Rainy Weather Conditions. IEEE Trans. Intell. Transport. Syst. 2023, 24, 4293–4305. [Google Scholar] [CrossRef]
  30. Wu, C.; Ye, M.; Zhang, J.; Ma, Y. YOLO-LWNet: A Lightweight Road Damage Object Detection Network for Mobile Terminal Devices. Sensors 2023, 23, 3268. [Google Scholar] [CrossRef]
  31. Pang, Y.; Zhang, Y.; Kong, Q.; Wang, Y.; Chen, B.; Cao, X. SOCDet: A Lightweight and Accurate Oriented Object Detection Network for Satellite On-Orbit Computing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608115. [Google Scholar] [CrossRef]
  32. Liu, J.; Li, H.; Zuo, F.; Zhao, Z.; Lu, S. KD-LightNet: A Lightweight Network Based on Knowledge Distillation for Industrial Defect Detection. IEEE Trans. Instrum. Meas. 2023, 72, 3525713. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2022; Volume 13682, pp. 1–21. ISBN 978-3-031-20046-5. [Google Scholar]
  34. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  35. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  36. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  37. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. arXiv 2021, arXiv:2110.06534. [Google Scholar]
  38. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  39. Cafarelli, D.; Ciampi, L.; Vadicamo, L.; Gennaro, C.; Berton, A.; Paterni, M.; Benvenuti, C.; Passera, M.; Falchi, F. MOBDrone: A Drone Video Dataset for Man OverBoard Rescue. In Image Analysis and Processing—ICIAP 2022; Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2022; Volume 13232, pp. 633–644. ISBN 978-3-031-06429-6. [Google Scholar]
  40. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3686–3696. [Google Scholar]
  41. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  42. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In Computer Vision—ECCV 2016 Workshops; Hua, G., Jégou, H., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9914, pp. 17–35. ISBN 978-3-319-48880-6. [Google Scholar]
  43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
Figure 1. The general framework of YOLOv7-tiny.
Figure 2. The left side is the FSNetBlock structure, and the right side is the structure and details of FSNet.
Figure 3. The structure of SP-ELAN.
Figure 4. The details of BiFPN-S.
Figure 5. The average precision across various experiments.
Figure 6. Grad-CAM diagrams of YOLOv7-tiny and YOLOv7-FSB.
Figure 7. Grad-CAM diagram: (a) YOLOv7_A; (b) YOLOv7_B; (c) YOLOv7_C; (d) YOLOv7_D; (e) YOLOv7_E; (f) YOLOv7_F; (g) YOLOv7-tiny; (h) YOLOv7-FSB.
Table 1. Configuration.

| Configuration | Name | Type |
|---|---|---|
| Hardware | CPU | Intel(R) Xeon(R) Gold 6338 CPU @ 2.00 GHz |
| | GPU | NVIDIA A100-PCIE-40GB |
| | Memory | 512 GB |
| Software | CUDA | 11.7 |
| | PyTorch | 2.0.1 |
| Hyperparameters | Image Size | 1280 × 1280 |
| | Learning Rate | 0.01 |
| | Learning Rate Decay Frequency | 0.1 |
| | Batch Size | 64 |
| | Workers | 32 |
| | Maximum Training Epochs | 100 |
Table 2. Object detection evaluation metrics.

| Metric | Description | Formula | Note |
|---|---|---|---|
| IoU | The degree of overlap between the ground truth, $B_{GT}$, and the predicted box, $B_p$. | $\frac{\mathrm{area}(B_p \cap B_{GT})}{\mathrm{area}(B_p \cup B_{GT})}$ | ↑, Perfect = 1 |
| Precision | The proportion of true positive samples among the predicted positive samples. | $\frac{TP}{TP + FP}$ | ↑, Perfect = 1 |
| Recall | The proportion of true positive samples that are detected. | $\frac{TP}{TP + FN}$ | ↑, Perfect = 1 |
| AP | The area enclosed by the precision–recall curve and the coordinate axes, where $r_1, r_2, \ldots, r_n$ are the recall values at which precision is interpolated. | $\sum_{i=1}^{n-1} (r_{i+1} - r_i)\, p_{\mathrm{interp}}(r_{i+1})$ | ↑, Perfect = 1 |
| mAP | The mean of the per-class average precisions, where $K$ is the total number of target classes. | $\frac{1}{K}\sum_{i=1}^{K} AP_i$ | ↑, Perfect = 1 |
Table 3. Object tracking evaluation metrics.

| Metric | Description | Formula | Note |
|---|---|---|---|
| ID_SW | The number of times the identity of a target changes. | / | ↓ |
| MOTA | Measures the ability of a tracking algorithm to detect targets and maintain their trajectories, where $t$ indexes the video frames in the sequence. | $1 - \frac{\sum_t (FN_t + FP_t + ID\_SW_t)}{\sum_t GT_t}$ | ↑, Perfect = 1 |
Table 4. Results of the activation function comparison test.

| Detector | Precision (%) | Recall (%) | mAP (%) |
|---|---|---|---|
| YOLOv7_SiLU | 88.2 | 83.4 | 86.4 |
| YOLOv7_ReLU | 88.8 | 85.0 | 86.9 |
Table 5. Results of the backbone comparison test.

| Detector | Precision (%) | Recall (%) | mAP (%) |
|---|---|---|---|
| YOLOv7_FasterNet | 88.9 | 84.4 | 87.1 |
| YOLOv7_FSNet | 89.8 | 85.1 | 87.3 |
Table 6. Results of ablation experiments with different methods.

| Detector | FSNet | SP-ELAN | BiFPN-S | mAP (%) | Parameters/M | FPS |
|---|---|---|---|---|---|---|
| YOLOv7-tiny | | | | 86.9 | 6.02 | 95.3 |
| YOLOv7_A | ✓ | | | 87.3 | 5.64 | 98.1 |
| YOLOv7_B | | ✓ | | 87.9 | 6.18 | 96.2 |
| YOLOv7_C | | | ✓ | 87.5 | 6.02 | 92 |
| YOLOv7_D | ✓ | | ✓ | 88.2 | 5.66 | 94.5 |
| YOLOv7_E | | ✓ | ✓ | 88.7 | 6.19 | 93.1 |
| YOLOv7_F | ✓ | ✓ | | 88.4 | 5.81 | 97.4 |
| YOLOv7-FSB | ✓ | ✓ | ✓ | 89.2 | 5.82 | 96.5 |
Table 7. Comparison of detection performance for different methods.

| Detector | mAP (%) | Parameters/M | FPS |
|---|---|---|---|
| SSD | 66.7 | 23.6 | 50.6 |
| RetinaNet | 68.1 | 36.3 | 69.2 |
| EfficientDet | 53.1 | 20.5 | 47.2 |
| YOLOv5s | 85.1 | 7.03 | 84 |
| YOLOv7-tiny | 86.9 | 6.02 | 95.3 |
| YOLOv8n | 85.3 | 3.1 | 96.1 |
| YOLOv7-FSB | 89.2 | 5.82 | 96.5 |
Table 8. Comparison of tracking performance for different tracking methods.

| Video Sequence | Tracker | MOTA (%) | ID Switch | FPS |
|---|---|---|---|---|
| Sequence 1 | ByteTrack | 83.4 | 26 | 82 |
| | DeepSORT | 68.5 | 68 | 23.6 |
| Sequence 2 | ByteTrack | 87.6 | 18 | 83.4 |
| | DeepSORT | 82.3 | 26 | 24.5 |
| Sequence 3 | ByteTrack | 85.4 | 34 | 82.7 |
| | DeepSORT | 76.8 | 84 | 21.7 |
Table 9. Comparison of tracking performance for different detection methods.

| Video Sequence | Detector | MOTA (%) | ID Switch | FPS |
|---|---|---|---|---|
| Sequence 1 | YOLOv7-tiny | 76.3 | 21 | 80.5 |
| | YOLOv7-FSB | 83.4 | 26 | 82 |
| Sequence 2 | YOLOv7-tiny | 81.1 | 14 | 81.9 |
| | YOLOv7-FSB | 87.6 | 18 | 83.4 |
| Sequence 3 | YOLOv7-tiny | 79.3 | 28 | 82.1 |
| | YOLOv7-FSB | 85.4 | 34 | 82.7 |
