Deep Learning for Computer Vision and Pattern Recognition

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (31 May 2021) | Viewed by 45559

Special Issue Editor

Dr. Athanasios Voulodimos
School of Electrical and Computer Engineering, National Technical University of Athens, 9, Iroon Polytechniou st., 157 80 Athens, Greece
Interests: machine learning; image & signal processing; computer vision; artificial intelligence; multimedia analysis

Special Issue Information

Dear Colleagues,

Deep learning is a rich family of methods, encompassing neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. The recent surge of interest in deep learning methods is due both to their demonstrated ability to outperform previous state-of-the-art techniques in several tasks and to the abundance of complex data from different sources. A variety of models and techniques have been proposed in recent years based on convolutional neural networks (CNNs), the “Boltzmann family” including deep belief networks (DBNs) and deep Boltzmann machines (DBMs), stacked denoising autoencoders, deep recurrent neural networks (long short-term memory, gated recurrent units, etc.), generative adversarial networks, and other deep models. Deep learning has fueled great strides in a variety of computer vision problems, such as object detection, motion tracking, action and activity recognition, human pose estimation, face recognition, multimedia annotation, and semantic segmentation.

The purpose of this Special Issue is to present recent advances in deep learning for computer vision and pattern recognition, providing a forum for new academic research and industrial development. The Special Issue solicits original research papers in the field, covering new theories, algorithms, and systems, as well as new implementations and applications incorporating state-of-the-art deep learning techniques for computer vision and pattern recognition. Review articles and works on performance evaluation and benchmark datasets are also welcome.

Dr. Athanasios Voulodimos
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • computer vision
  • visual understanding
  • object detection
  • tracking
  • action recognition
  • pose estimation
  • semantic segmentation
  • convolutional neural networks

Published Papers (14 papers)


Research

17 pages, 5424 KiB  
Article
TMD-BERT: A Transformer-Based Model for Transportation Mode Detection
by Ifigenia Drosouli, Athanasios Voulodimos, Paris Mastorocostas, Georgios Miaoulis and Djamchid Ghazanfarpour
Electronics 2023, 12(3), 581; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics12030581 - 24 Jan 2023
Cited by 1 | Viewed by 2009
Abstract
Transportation mode detection aims to differentiate between various transportation modes and identify the means of transport an individual uses; it is a problem in intelligent transport that attracts researchers' attention because of its interesting and useful applications. In this paper, we present TMD-BERT, a transformer-based model for transportation mode detection based on sensor data. The proposed approach processes the entire data sequence, learns the importance of each part of the input sequence, and assigns weights accordingly using attention mechanisms to capture global dependencies in the sequence. The experimental evaluation shows the high performance of the model compared to the state of the art, demonstrating a prediction accuracy of 98.8%.
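
As a rough illustration of the attention-based sequence classification idea described in this abstract, the sketch below builds a tiny transformer encoder over a multi-channel sensor sequence in PyTorch. All names and sizes (10 sensor channels, 8 transportation-mode classes, layer widths) are illustrative assumptions, not the published TMD-BERT configuration.

import torch
import torch.nn as nn

class SensorTransformerClassifier(nn.Module):
    """Toy transformer encoder over a sensor sequence; not the published TMD-BERT."""
    def __init__(self, n_channels=10, d_model=64, n_heads=4, n_layers=2, n_classes=8, max_len=512):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)                  # embed each time step
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                           # x: (batch, time, channels)
        h = self.proj(x) + self.pos[:, : x.size(1)]
        h = self.encoder(h)                                         # self-attention weighs every time step
        return self.head(h.mean(dim=1))                             # pool over time, classify the mode

logits = SensorTransformerClassifier()(torch.randn(2, 128, 10))     # shape (2, 8)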

16 pages, 3681 KiB  
Article
Towards Hybrid Multimodal Manual and Non-Manual Arabic Sign Language Recognition: mArSL Database and Pilot Study
by Hamzah Luqman and El-Sayed M. El-Alfy
Electronics 2021, 10(14), 1739; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10141739 - 20 Jul 2021
Cited by 18 | Viewed by 2858
Abstract
Sign languages are the main visual communication medium between hard-of-hearing people and their societies. Similar to spoken languages, they are not universal and vary from region to region, yet they are relatively under-resourced. Arabic sign language (ArSL) is one of these languages and has attracted increasing attention in the research community. However, most of the existing work on sign language recognition systems focuses on manual gestures, ignoring non-manual cues, such as facial expressions, that carry additional linguistic information. One of the main reasons these modalities are not considered is the lack of suitable datasets. In this paper, we propose a new multi-modality ArSL dataset that integrates various types of modalities. It consists of 6748 video samples of fifty signs performed by four signers and collected using Kinect V2 sensors. This dataset will be freely available for researchers to develop and benchmark their techniques for further advancement of the field. In addition, we evaluated the fusion of spatial and temporal features of different modalities, manual and non-manual, for sign language recognition using state-of-the-art deep learning techniques. This fusion boosted the accuracy of the recognition system in the signer-independent mode by 3.6% compared with using manual gestures alone.
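
The fusion of manual and non-manual features evaluated in this paper can be pictured, in very reduced form, as concatenating one feature vector per modality before classification. The sketch below assumes pre-extracted 512-dimensional features per modality and the fifty-sign vocabulary mentioned in the abstract; everything else is an arbitrary stand-in rather than the authors' architecture.

import torch
import torch.nn as nn

class LateFusionSignClassifier(nn.Module):
    """Illustrative fusion of manual (hand) and non-manual (face/body) feature vectors."""
    def __init__(self, manual_dim=512, nonmanual_dim=512, n_signs=50):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(manual_dim + nonmanual_dim, 256), nn.ReLU(),
            nn.Linear(256, n_signs))

    def forward(self, manual_feat, nonmanual_feat):
        fused = torch.cat([manual_feat, nonmanual_feat], dim=1)    # feature-level fusion
        return self.classifier(fused)

scores = LateFusionSignClassifier()(torch.randn(4, 512), torch.randn(4, 512))  # (4, 50)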

14 pages, 2812 KiB  
Article
Object Identification and Localization Using Grad-CAM++ with Mask Regional Convolution Neural Network
by Xavier Alphonse Inbaraj, Charlyn Villavicencio, Julio Jerison Macrohon, Jyh-Horng Jeng and Jer-Guang Hsieh
Electronics 2021, 10(13), 1541; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10131541 - 25 Jun 2021
Cited by 12 | Viewed by 6501
Abstract
One of the fundamental advancements in the deployment of object detectors in real-time applications is improving object recognition against obstruction, obscurity, and noise in images. Object detection remains a challenging task, since it requires the correct detection of objects in images. Semantic segmentation and localization are important modules for recognizing an object in an image. Grad-CAM++ is widely used by researchers for object localization; it uses the gradients flowing into a convolutional layer to build a localization map of the important regions in the image. This paper proposes a method that combines Grad-CAM++ with a Mask Regional Convolutional Neural Network (GC-MRCNN) to both detect and localize objects in an image. The major advantage of the proposed method is that it outperforms its counterpart methods in the domain and can also be used in unsupervised environments. The proposed GC-MRCNN detector provides a robust and feasible ability to detect and classify the objects present, and their shapes, in real time. The proposed method performs effectively and efficiently on a wide range of images and provides a higher-resolution visual representation than existing methods (Grad-CAM, Grad-CAM++), as shown by comparisons with various algorithms.
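
To make the gradient-based localization idea concrete, the following minimal sketch computes a plain Grad-CAM map (the simpler relative of the Grad-CAM++ method used in the paper) on a toy CNN; it is a generic illustration, not the authors' GC-MRCNN pipeline.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Plain Grad-CAM: a localization map is built from the gradient-weighted activations of the
# last convolutional layer. The tiny CNN below is a stand-in for a real backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
classifier = nn.Linear(32, 10)

img = torch.randn(1, 3, 64, 64)                                # stand-in for a real image
feat = backbone(img)                                           # (1, 32, 32, 32) last conv activations
feat.retain_grad()                                             # keep gradients for the CAM weights
score = classifier(feat.mean(dim=(2, 3)))[0].max()             # top-class score after global pooling
score.backward()

weights = feat.grad.mean(dim=(2, 3), keepdim=True)             # channel importance from gradients
cam = F.relu((weights * feat).sum(dim=1, keepdim=True))        # weighted sum over channels
cam = F.interpolate(cam, size=img.shape[2:], mode="bilinear")  # heat map at image resolution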

11 pages, 6899 KiB  
Article
FASSD-Net Model for Person Semantic Segmentation
by Luis Brandon Garcia-Ortiz, Jose Portillo-Portillo, Aldo Hernandez-Suarez, Jesus Olivares-Mercado, Gabriel Sanchez-Perez, Karina Toscano-Medina, Hector Perez-Meana and Gibran Benitez-Garcia
Electronics 2021, 10(12), 1393; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10121393 - 10 Jun 2021
Cited by 2 | Viewed by 2022
Abstract
This paper proposes the use of the FASSD-Net model for semantic segmentation of human silhouettes. These silhouettes can later be used in various applications that require specific characteristics of human interaction observed in video sequences, such as understanding human activities or human identification; such applications are classified as high-level semantic understanding tasks. Since semantic segmentation is presented as one solution for human silhouette extraction, convolutional neural networks (CNNs) have a clear advantage over traditional computer vision methods, owing to their ability to learn feature representations appropriate for the segmentation task. In this work, the FASSD-Net model is used as a novel proposal that promises real-time segmentation of high-resolution images, exceeding 20 FPS. To evaluate the proposed scheme, we use the Cityscapes database, which consists of sundry scenarios that represent human interaction with the environment (these scenarios contain person segmentations that are difficult to solve, which favors the evaluation of our proposal). To adapt the FASSD-Net model to human silhouette semantic segmentation, the indexes of the 19 classes traditionally proposed for Cityscapes were modified, leaving only two labels: one for the class of interest, labeled as person, and one for the background. The Cityscapes database includes the category "human", composed of the "rider" and "person" classes, where the rider class contains incomplete human silhouettes due to self-occlusions caused by the activity or means of transport used. For this reason, we train the model using only the person class rather than the human category. The implementation of the FASSD-Net model with only two classes shows promising results, both qualitatively and quantitatively, for the segmentation of human silhouettes.
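
The two-class relabelling described above can be reproduced, in spirit, with a one-line mask over the Cityscapes train ids. The sketch assumes the standard Cityscapes convention in which the person class has train id 11; the rider class is deliberately left as background, mirroring the article's choice.

import numpy as np

PERSON_TRAIN_ID = 11   # standard Cityscapes train id for "person"; "rider" (12) stays background

def to_person_mask(label_map: np.ndarray) -> np.ndarray:
    """Map a (H, W) array of the 19 Cityscapes train ids to {0: background, 1: person}."""
    return (label_map == PERSON_TRAIN_ID).astype(np.uint8)

binary = to_person_mask(np.random.randint(0, 19, size=(256, 512)))  # toy annotation, real labels in practice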

18 pages, 4958 KiB  
Article
Spelling Correction Real-Time American Sign Language Alphabet Translation System Based on YOLO Network and LSTM
by Miguel Rivera-Acosta, Juan Manuel Ruiz-Varela, Susana Ortega-Cisneros, Jorge Rivera, Ramón Parra-Michel and Pedro Mejia-Alvarez
Electronics 2021, 10(9), 1035; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10091035 - 27 Apr 2021
Cited by 13 | Viewed by 3588
Abstract
In this paper, we present a novel approach that aims to solve one of the main challenges in hand gesture recognition tasks in static images: compensating for the accuracy lost when trained models are used to interpret completely unseen data. The model presented here consists of two main data-processing stages. A deep neural network (DNN) performs handshape segmentation and classification; multiple architectures and input image sizes were tested and compared to derive the best model in terms of accuracy and processing time. For the experiments presented in this work, the DNN models were trained with 24,000 images of 24 signs from the American Sign Language alphabet and fine-tuned with 5200 images of 26 generated signs. The system was tested in real time with a community of 10 persons, yielding a mean average precision of 81.74% and a processing rate of 61.35 frames per second. As a second data-processing stage, a bidirectional long short-term memory neural network was implemented and analyzed to add spelling correction capability to our system; it scored a training accuracy of 98.07% with a dictionary of 370 words, thus increasing the robustness on completely unseen data, as shown in our experiments.
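
A minimal sketch of the second stage, the character-level spelling corrector, is given below: a bidirectional LSTM that maps a noisy fingerspelled character sequence to per-position character predictions. Vocabulary and layer sizes are assumptions for illustration, not the authors' trained network.

import torch
import torch.nn as nn

class CharBiLSTMCorrector(nn.Module):
    """Toy bidirectional LSTM that maps a noisy character sequence to corrected characters."""
    def __init__(self, n_chars=27, emb=32, hidden=64):     # 26 letters + blank; sizes are arbitrary
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars)           # per-position character prediction

    def forward(self, char_ids):                            # char_ids: (batch, length)
        h, _ = self.lstm(self.emb(char_ids))
        return self.out(h)                                  # (batch, length, n_chars) logits

logits = CharBiLSTMCorrector()(torch.randint(0, 27, (2, 10)))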

22 pages, 3023 KiB  
Article
Multimodal Low Resolution Face and Frontal Gait Recognition from Surveillance Video
by Sayan Maity, Mohamed Abdel-Mottaleb and Shihab S. Asfour
Electronics 2021, 10(9), 1013; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10091013 - 24 Apr 2021
Cited by 8 | Viewed by 3034
Abstract
Biometric identification using surveillance video has attracted the attention of many researchers, as it is applicable not only to robust identification but also to personalized activity monitoring. In this paper, we present a novel multimodal recognition system that extracts frontal gait and low-resolution face images from frontal walking surveillance video clips to perform efficient biometric recognition. The proposed study addresses two important issues in surveillance video that did not receive appropriate attention in the past. First, it consolidates the model-free and model-based gait feature extraction approaches to perform robust gait recognition using only the frontal view. Second, it uses a low-resolution face recognition approach that can be trained and tested using low-resolution face information. This eliminates the need to obtain high-resolution face images to create the gallery, which is required by the majority of low-resolution face recognition techniques. Moreover, the classification accuracy on high-resolution face images is considerably higher. Previous studies on frontal gait recognition incorporate assumptions to approximate the average gait cycle, whereas we quantify the gait cycle precisely for each subject using only the frontal gait information. The approaches available in the literature also use high-resolution images obtained in a controlled environment to train the recognition system, whereas in our proposed system we train the recognition algorithm using low-resolution face images captured in an unconstrained environment. The proposed system has two components: one performs frontal gait recognition and the other performs low-resolution face recognition. Score-level fusion is then performed to combine the results of the frontal gait and low-resolution face recognition. Experiments conducted on the Face and Ocular Challenge Series (FOCS) dataset resulted in 93.5% Rank-1 accuracy for frontal gait recognition and 82.92% Rank-1 accuracy for low-resolution face recognition. The score-level multimodal fusion resulted in 95.9% Rank-1 recognition, which demonstrates the superiority and robustness of the proposed approach.
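
Score-level fusion of the two modalities can be sketched as a weighted sum of min-max-normalised match scores, as below. The equal 0.5 weighting and the 20-subject gallery are arbitrary illustrations; the paper's exact normalisation and weighting scheme may differ.

import numpy as np

def minmax(s):
    """Min-max normalise one modality's match scores to [0, 1]."""
    return (s - s.min()) / (s.max() - s.min() + 1e-9)

def fuse(gait_scores, face_scores, w_gait=0.5):
    """Weighted-sum score-level fusion; the 0.5 weight is an arbitrary illustration."""
    return w_gait * minmax(gait_scores) + (1 - w_gait) * minmax(face_scores)

gait = np.random.rand(20)   # similarity of the probe to 20 gallery subjects (frontal gait)
face = np.random.rand(20)   # similarity of the probe to the same gallery (low-res face)
predicted_identity = int(np.argmax(fuse(gait, face)))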

15 pages, 3640 KiB  
Article
Scale-Adaptive KCF Mixed with Deep Feature for Pedestrian Tracking
by Yang Zhou, Wenzhu Yang and Yuan Shen
Electronics 2021, 10(5), 536; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10050536 - 25 Feb 2021
Cited by 9 | Viewed by 1942
Abstract
Pedestrian tracking is an important research topic in the field of computer vision. Tracking is achieved by predicting the position of a specific pedestrian in each frame of a video. Pedestrian tracking methods include neural network-based methods and traditional template matching-based methods, such as the SiamRPN (Siamese region proposal network), the DASiamRPN (distractor-aware SiamRPN), and the KCF (kernel correlation filter). The KCF algorithm has no scale-adaptive capability and cannot effectively handle occlusion, and because of the defects of the HOG (histogram of oriented gradient) feature it uses, the tracking target is easily lost. To address these defects of the KCF algorithm, an improved KCF model, the SKCFMDF (scale-adaptive KCF mixed with deep feature) algorithm, was designed. By introducing deep features extracted by a newly designed neural network and the YOLOv3 (you only look once version 3) object detection algorithm, which was also improved for more accurate detection, the model is able to achieve scale adaptation and to effectively overcome occlusion and the defects of the HOG feature. Compared with the original KCF, the success rate of pedestrian tracking under complex conditions was increased by 36%. Compared with the mainstream SiamRPN and DASiamRPN models, it still achieved a small improvement.
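
The scale-adaptive part of such a tracker can be pictured as evaluating the tracker's response at a few candidate scales around the previous target size and keeping the best one. The sketch below uses a placeholder response function in place of the correlation filter on deep features, so it conveys only the search loop, not the SKCFMDF model itself.

import torch
import torch.nn.functional as F

def best_scale(score_fn, frame, box, scales=(0.95, 1.0, 1.05)):
    """Try a few candidate scales and keep the one with the highest tracker response.
    score_fn(patch) -> float stands in for the correlation-filter response on deep features."""
    x, y, w, h = box
    best = max(scales, key=lambda s: score_fn(crop_and_resize(frame, (x, y, w * s, h * s))))
    return (x, y, w * best, h * best)

def crop_and_resize(frame, box, size=(64, 64)):
    x, y, w, h = (int(round(v)) for v in box)
    patch = frame[:, :, y:y + h, x:x + w]                      # frame: (1, C, H, W)
    return F.interpolate(patch, size=size, mode="bilinear")    # fixed template size

frame = torch.randn(1, 3, 240, 320)
score = lambda p: float(p.mean())                              # placeholder response function
print(best_scale(score, frame, (100, 80, 40, 60)))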

15 pages, 26018 KiB  
Article
Colorization of Logo Sketch Based on Conditional Generative Adversarial Networks
by Nannan Tian, Yuan Liu, Bo Wu and Xiaofeng Li
Electronics 2021, 10(4), 497; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10040497 - 20 Feb 2021
Cited by 5 | Viewed by 2577
Abstract
Logo design is a complex process for designers, and color plays a very important role in it. The automatic colorization of logo sketches is valuable and challenging. In this paper, we propose a new logo design method based on conditional generative adversarial networks, which can output multiple colorful logos from a single logo sketch. We improve the traditional U-Net structure by adding channel attention and spatial attention to the skip connections. In addition, the generator consists of parallel attention-based U-Net blocks, which can output multiple logo images. During model optimization, a style loss function is proposed to improve the color diversity of the logos. We evaluate our method on the self-built edges2logos dataset and the public edges2shoes dataset. Experimental results show that our method can generate more colorful and realistic logo images from simple sketches. Compared with classic networks, the logos generated by our network are also visually superior.
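
A channel-plus-spatial attention gate of the kind the abstract adds to the U-Net skip connections might look like the following CBAM-style sketch; the module layout is a generic illustration rather than the authors' exact design.

import torch
import torch.nn as nn

class SkipAttention(nn.Module):
    """Channel then spatial attention applied to U-Net skip features (generic illustration)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, skip):
        skip = skip * self.channel(skip)                                  # re-weight channels
        pooled = torch.cat([skip.mean(1, keepdim=True),
                            skip.max(1, keepdim=True).values], dim=1)     # avg & max maps
        return skip * self.spatial(pooled)                                # re-weight locations

refined = SkipAttention(64)(torch.randn(1, 64, 32, 32))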

9 pages, 2192 KiB  
Article
Efficient Depth Map Creation with a Lightweight Deep Neural Network
by Join Kang and Seong-Won Lee
Electronics 2021, 10(4), 479; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10040479 - 18 Feb 2021
Viewed by 2624
Abstract
Finding depth information with stereo matching using a deep learning algorithm for embedded systems has recently gained significant attention owing to emerging high-performance mobile graphics processing units (GPUs). Several researchers have proposed feasible small-scale CNNs that can run on a local GPU, but they still suffer from low accuracy and/or high computational requirements. In the method proposed in this study, pooling layers with padding and an asymmetric convolution filter are used to reduce computational costs while maintaining disparity accuracy. The patch size and number of layers are adjusted by analyzing the feature and activation maps. The proposed method forms a small-scale network suitable for a vision system at the edge and still exhibits high disparity accuracy and low computational load compared to existing stereo-matching networks.
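
The cost-saving trick the abstract mentions, replacing a square convolution with an asymmetric 1 x k followed by k x 1 pair plus padded pooling, can be sketched as follows; channel counts and the kernel size of 5 are arbitrary assumptions.

import torch
import torch.nn as nn

class AsymConvBlock(nn.Module):
    """Replace one k x k convolution with a 1 x k followed by a k x 1 convolution,
    which needs roughly 2/k of the multiplications for the same receptive field."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, k // 2)), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(k // 2, 0)), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1))   # padded pooling keeps odd spatial sizes usable

    def forward(self, x):
        return self.block(x)

out = AsymConvBlock(32, 64)(torch.randn(1, 32, 37, 45))   # spatial dims need not be even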

15 pages, 3342 KiB  
Article
Bi-Directional Pyramid Network for Edge Detection
by Kai Li, Yingjie Tian, Bo Wang, Zhiquan Qi and Qi Wang
Electronics 2021, 10(3), 329; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10030329 - 01 Feb 2021
Cited by 15 | Viewed by 2237
Abstract
Multi-scale representation plays a critical role in the field of edge detection. However, most of the existing research focuses on only one of two aspects: fast training or accurate testing. In this paper, we propose a novel multi-scale method to strike a balance between them. Specifically, following multi-stream structures and the image pyramid principle, we construct a down-sampling pyramid network and a lightweight up-sampling pyramid network to enrich the multi-scale representation of the encoder and decoder, respectively. These two pyramid networks and a backbone network constitute our overall architecture, the bi-directional pyramid network (BDP-Net). Extensive experiments show that, compared with the state-of-the-art model, our method roughly doubles the training speed while retaining similar test accuracy. In particular, under the single-scale test, our approach reaches human-level perception (an F1 score of 0.803) on the BSDS500 database.
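
The image-pyramid side of the idea can be pictured as running a shared encoder over several scales of the input and merging the resized feature maps, as in the toy sketch below; it illustrates only the multi-stream principle, not the BDP-Net architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsamplingPyramid(nn.Module):
    """Toy multi-stream encoder: the same lightweight stem is applied to an image pyramid
    and the per-scale features are resized and summed (general idea only, not BDP-Net)."""
    def __init__(self, channels=16, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.stem = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.edge_head = nn.Conv2d(channels, 1, 1)

    def forward(self, img):
        h, w = img.shape[2:]
        merged = 0
        for s in self.scales:
            x = F.interpolate(img, scale_factor=s, mode="bilinear") if s != 1.0 else img
            merged = merged + F.interpolate(self.stem(x), size=(h, w), mode="bilinear")
        return torch.sigmoid(self.edge_head(merged))      # per-pixel edge probability

edge_map = DownsamplingPyramid()(torch.randn(1, 3, 64, 64))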

19 pages, 2056 KiB  
Article
Scene Recognition Based on Recurrent Memorized Attention Network
by Xi Shao, Xuan Zhang, Guijin Tang and Bingkun Bao
Electronics 2020, 9(12), 2038; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics9122038 - 01 Dec 2020
Cited by 5 | Viewed by 1582
Abstract
We propose a new end-to-end scene recognition framework, called the Recurrent Memorized Attention Network (RMAN), which performs object-based scene classification by recurrently locating and memorizing objects in the image. Based on the proposed framework, we introduce a multi-task mechanism that successively attends to the different essential objects in a scene image and recurrently performs memory fusion of the features of the objects focused on by the attention model to improve scene recognition accuracy. The experimental results show that the RMAN model achieves better classification performance on the constructed dataset and two public scene datasets, surpassing state-of-the-art image scene recognition approaches.
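
A stripped-down version of the recurrent attention loop described here is sketched below: at each step the model softly attends to one region of a CNN feature map, feeds the attended feature into an LSTM memory, and classifies from the final state. All dimensions and the step count are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    """Toy recurrent attention: attend to one region per step, memorize it with an LSTM cell."""
    def __init__(self, feat_dim=256, hidden=128, steps=3, n_scenes=10):
        super().__init__()
        self.steps = steps
        self.score = nn.Linear(feat_dim + hidden, 1)       # attention score per spatial location
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, n_scenes)

    def forward(self, feats):                              # feats: (batch, locations, feat_dim)
        b, n, d = feats.shape
        h = c = feats.new_zeros(b, self.cell.hidden_size)
        for _ in range(self.steps):
            attn = torch.softmax(
                self.score(torch.cat([feats, h.unsqueeze(1).expand(b, n, -1)], -1)), dim=1)
            h, c = self.cell((attn * feats).sum(1), (h, c))   # memorize the attended object
        return self.head(h)

logits = RecurrentAttention()(torch.randn(2, 49, 256))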

20 pages, 2115 KiB  
Article
Ship Classification Based on Attention Mechanism and Multi-Scale Convolutional Neural Network for Visible and Infrared Images
by Yongmei Ren, Jie Yang, Zhiqiang Guo, Qingnian Zhang and Hui Cao
Electronics 2020, 9(12), 2022; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics9122022 - 30 Nov 2020
Cited by 12 | Viewed by 2757
Abstract
Visible image quality is very susceptible to changes in illumination, and there are limitations in ship classification using images acquired by a single sensor. This study proposes a ship classification method based on an attention mechanism and a multi-scale convolutional neural network (MSCNN) for visible and infrared images. First, the features of visible and infrared images are extracted by a two-stream symmetric multi-scale convolutional neural network module and then concatenated to make full use of the complementary features present in multi-modal images. After that, the attention mechanism is applied to the concatenated fusion features to emphasize local detail areas in the feature map, aiming to further improve the feature representation capability of the model. Lastly, the attention weights and the original concatenated fusion features are added element-wise and fed into fully connected layers and a softmax output layer for the final classification output. The effectiveness of the proposed method is verified on the visible and infrared spectra (VAIS) dataset, on which it achieves a classification accuracy of 93.81%. Compared with other state-of-the-art methods, the proposed method extracts features more effectively and has better overall classification performance.
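
The fusion stage described in the abstract, concatenating the two streams, computing attention weights, and adding them element-wise back onto the fused features, can be sketched as follows; the feature dimensions and class count are illustrative assumptions, not the published MSCNN.

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Illustrative visible/infrared fusion: per-stream features are concatenated, attention
    weights are computed, added element-wise onto the fused features, and then classified."""
    def __init__(self, feat_dim=256, n_classes=6):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * feat_dim, 2 * feat_dim), nn.Sigmoid())
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, vis_feat, ir_feat):
        fused = torch.cat([vis_feat, ir_feat], dim=1)      # complementary modalities
        fused = fused + self.attn(fused)                   # attention weights added element-wise
        return self.classifier(fused)

logits = TwoStreamFusion()(torch.randn(4, 256), torch.randn(4, 256))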

13 pages, 2926 KiB  
Article
Salient Object Detection Combining a Self-Attention Module and a Feature Pyramid Network
by Guangyu Ren, Tianhong Dai, Panagiotis Barmpoutis and Tania Stathaki
Electronics 2020, 9(10), 1702; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics9101702 - 16 Oct 2020
Cited by 14 | Viewed by 2644
Abstract
Salient object detection has achieved great improvements by using Fully Convolutional Networks (FCNs). However, the FCN-based U-shape architecture may dilute high-level semantic information during the up-sampling operations in the top-down pathway, which can weaken the ability to localize salient objects and produce degraded boundaries. To overcome this limitation, we propose a novel pyramid self-attention module (PSAM) and adopt an independent feature-complementing strategy. In PSAM, self-attention layers are placed after the multi-scale pyramid features to capture richer high-level features and bring larger receptive fields to the model. In addition, a channel-wise attention module is employed to reduce the redundant features of the FPN and provide refined results. Experimental analysis demonstrates that the proposed PSAM effectively contributes to the whole model, which outperforms state-of-the-art results on five challenging datasets. Finally, quantitative results show that PSAM generates accurate predictions and integral salient maps, which can provide further help to other computer vision tasks, such as object detection and semantic segmentation.
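
The core PSAM idea, letting every spatial position of a high-level pyramid feature attend to every other position, can be sketched with a standard multi-head self-attention layer as below; this is a generic stand-in, not the published module.

import torch
import torch.nn as nn

class PyramidSelfAttention(nn.Module):
    """Self-attention over the spatial positions of one pyramid level (illustrative stand-in)."""
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat):                        # feat: (batch, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)     # (batch, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)   # every location attends to every other
        return out.transpose(1, 2).reshape(b, c, h, w) + feat   # residual connection

refined = PyramidSelfAttention()(torch.randn(1, 256, 16, 16))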

14 pages, 2428 KiB  
Article
GC-YOLOv3: You Only Look Once with Global Context Block
by Yang Yang and Hongmin Deng
Electronics 2020, 9(8), 1235; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics9081235 - 31 Jul 2020
Cited by 33 | Viewed by 7786
Abstract
In order to make the classification and regression of single-stage detectors more accurate, an object detection algorithm named Global Context You-Only-Look-Once v3 (GC-YOLOv3) is proposed in this paper, based on You-Only-Look-Once (YOLO). Firstly, a better cascading model with learnable semantic fusion between a feature extraction network and a feature pyramid network is designed to improve detection accuracy using a global context block. Secondly, the information to be retained is screened by combining feature maps at three different scales. Finally, a global self-attention mechanism is used to highlight the useful information in the feature maps while suppressing irrelevant information. Experiments show that GC-YOLOv3 reaches a maximum object detection mean Average Precision (mAP)@0.5 of 55.5 on the Common Objects in Context (COCO) 2017 test-dev set and that its mAP is 5.1% higher than that of the YOLOv3 algorithm on the Pascal Visual Object Classes (PASCAL VOC) 2007 test set. The proposed GC-YOLOv3 model therefore exhibits strong performance on both the PASCAL VOC and COCO datasets.
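
A simplified global context (GC) block of the kind used by GC-YOLOv3, softmax-pooled global context followed by a bottleneck transform and a broadcast addition, can be sketched as follows; channel sizes are illustrative and the exact block in the paper may differ.

import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Simplified global context block: softmax-pooled context, bottleneck transform, broadcast add."""
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.context = nn.Conv2d(channels, 1, kernel_size=1)     # per-location attention logit
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LayerNorm([channels // reduction, 1, 1]), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):                                         # x: (batch, C, H, W)
        b, c, h, w = x.shape
        weights = torch.softmax(self.context(x).view(b, 1, h * w), dim=-1)   # (b, 1, HW)
        ctx = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))        # (b, C, 1) pooled context
        ctx = self.transform(ctx.view(b, c, 1, 1))
        return x + ctx                                            # broadcast the global context

out = GlobalContextBlock()(torch.randn(2, 256, 13, 13))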