Computer Vision and Pattern Recognition Based on Deep Learning

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 August 2023) | Viewed by 68000

Special Issue Editor

School of Control Science and Engineering, Shandong University, Jinan 250061, China
Interests: intelligent coding and processing of immersive media (point cloud, light field); computer vision; DASH-based streaming transmission control for video (especially 360° panoramic video)

Special Issue Information

Dear Colleagues,

This Special Issue is devoted to computer vision and pattern recognition methods based on deep learning (CVPR-DL).

In recent years, deep learning has driven great achievements in computer vision and pattern recognition. Nowadays, solutions to computer vision and pattern recognition tasks, such as intelligent communication control and management, face and fingerprint recognition, autonomous driving, multimedia communication, and smartphones, are becoming integral to people's daily lives. In our opinion, the participation of leading researchers is critically important to guide users and technical researchers toward an intelligent society.

Therefore, in this Special Issue, we invite submissions exploring deep learning research and recent advances in the fields of computer vision and pattern recognition. Both theoretical and experimental studies are welcome, as well as comprehensive review and survey papers.

Prof. Dr. Hui Yuan
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • object detection
  • segmentation
  • saliency analysis
  • image processing
  • computer vision
  • audio processing
  • image coding
  • video compression
  • point cloud analysis
  • light field processing
  • etc.

Published Papers (33 papers)


Research


12 pages, 8023 KiB  
Article
GP-Net: Image Manipulation Detection and Localization via Long-Range Modeling and Transformers
by Jin Peng, Chengming Liu, Haibo Pang, Xiaomeng Gao, Guozhen Cheng and Bing Hao
Appl. Sci. 2023, 13(21), 12053; https://doi.org/10.3390/app132112053 - 05 Nov 2023
Viewed by 953
Abstract
With the rise of image manipulation techniques, an increasing number of individuals find it easy to manipulate image content. Undoubtedly, this presents a significant challenge to the integrity of multimedia data, thereby fueling the advancement of image forgery detection research. A majority of current methods employ convolutional neural networks (CNNs) for image manipulation localization, yielding promising outcomes. Nevertheless, CNN-based approaches possess limitations in establishing explicit long-range relationships. Consequently, addressing the image manipulation localization task necessitates a solution that adeptly builds global context while preserving a robust grasp of low-level details. In this paper, we propose GPNet to address this challenge. GPNet combines a Transformer and a CNN in parallel, which builds global dependencies and captures low-level details efficiently. Additionally, we devise an effective fusion module, referred to as TcFusion, which proficiently amalgamates the feature maps generated by both branches. Extensive experiments conducted on diverse datasets show that our network outperforms prevailing state-of-the-art manipulation detection and localization approaches.
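
The parallel global-plus-local design described in this abstract can be illustrated with a small PyTorch sketch: a convolutional branch preserves low-level detail, a Transformer encoder branch models long-range dependencies over patch tokens, and a 1x1 convolution fuses the two feature maps. All layer sizes here are assumptions for illustration; this is not the authors' GPNet or TcFusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDualBranchNet(nn.Module):
    """Toy parallel CNN + Transformer network with a simple fusion head.
    Layer sizes are illustrative, not GP-Net's actual configuration."""
    def __init__(self, channels=64, patch=8):
        super().__init__()
        self.patch = patch
        # CNN branch: preserves low-level spatial detail.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Transformer branch: global context over patch tokens.
        self.embed = nn.Conv2d(3, channels, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion: concatenate branch outputs, mix with a 1x1 convolution.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.head = nn.Conv2d(channels, 1, 1)   # per-pixel manipulation logit

    def forward(self, x):
        b, _, h, w = x.shape
        local = self.cnn(x)                                # (B, C, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        glob = self.transformer(tokens).transpose(1, 2)
        glob = glob.reshape(b, -1, h // self.patch, w // self.patch)
        glob = F.interpolate(glob, size=(h, w), mode="bilinear",
                             align_corners=False)
        return self.head(self.fuse(torch.cat([local, glob], dim=1)))

logits = ToyDualBranchNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)   # torch.Size([1, 1, 64, 64])
```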

9 pages, 417 KiB  
Article
SSDLiteX: Enhancing SSDLite for Small Object Detection
by Hyeong-Ju Kang
Appl. Sci. 2023, 13(21), 12001; https://doi.org/10.3390/app132112001 - 03 Nov 2023
Viewed by 473
Abstract
Object detection in many real applications requires the capability of detecting small objects in a system with limited resources. Convolutional neural networks (CNNs) show high performance in object detection, but they are not well suited to resource-limited environments. The combination of MobileNet V2 and SSDLite is a common choice in such environments, but it has a problem in detecting small objects. This paper analyzes the structure of SSDLite and proposes variations leading to improved small object detection. The higher-resolution feature maps are utilized more, and the base CNN is modified to have more layers at high resolution. Experiments have been performed for various configurations, and the results show that the proposed CNN, SSDLiteX, improves the detection accuracy AP of small objects by 1.5 percentage points on the MS COCO dataset.

18 pages, 11482 KiB  
Article
A CNN-Based Approach for Driver Drowsiness Detection by Real-Time Eye State Identification
by Ruben Florez, Facundo Palomino-Quispe, Roger Jesus Coaquira-Castillo, Julio Cesar Herrera-Levano, Thuanne Paixão and Ana Beatriz Alvarez
Appl. Sci. 2023, 13(13), 7849; https://doi.org/10.3390/app13137849 - 04 Jul 2023
Cited by 5 | Viewed by 5958
Abstract
Drowsiness detection is an important task in road safety and other areas that require sustained attention. In this article, an approach to detect drowsiness in drivers is presented, focusing on the eye region, since eye fatigue is one of the first symptoms of drowsiness. The method used for the extraction of the eye region is Mediapipe, chosen for its high accuracy and robustness. Three deep neural networks based on InceptionV3, VGG16, and ResNet50V2 were analyzed. The database used is NITYMED, which contains videos of drivers with different levels of drowsiness. The three networks were evaluated in terms of accuracy, precision, and recall in detecting drowsiness in the eye region. The results show that all three convolutional neural networks achieve high accuracy in detecting drowsiness in the eye region. In particular, the ResNet50V2 network achieved the highest accuracy, with an average rate of 99.71%. For better visualization of the data, the Grad-CAM technique is used, which provides a better understanding of the performance of the algorithms in the classification process.
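
As a rough illustration of the transfer-learning recipe above (pretrained backbone, new two-class head for open/closed eyes), here is a minimal PyTorch sketch. The paper evaluates InceptionV3, VGG16, and ResNet50V2; torchvision's ResNet-50 is used here only as a stand-in, and the data tensors are placeholders for Mediapipe eye crops.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal transfer-learning sketch for eye-state classification
# (open vs. closed). torchvision's ResNet-50 stands in for the
# paper's ResNet50V2; dataset loading is omitted and hypothetical.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False                    # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)  # new 2-class head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

eye_crops = torch.randn(8, 3, 224, 224)   # stand-in for Mediapipe eye crops
labels = torch.randint(0, 2, (8,))        # 0 = open, 1 = closed
loss = criterion(model(eye_crops), labels)
loss.backward()
optimizer.step()
print(loss.item())
```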

16 pages, 766 KiB  
Article
A Comprehensive Framework for Industrial Sticker Information Recognition Using Advanced OCR and Object Detection Techniques
by Gabriella Monteiro, Leonardo Camelo, Gustavo Aquino, Rubens de A. Fernandes, Raimundo Gomes, André Printes, Israel Torné, Heitor Silva, Jozias Oliveira and Carlos Figueiredo
Appl. Sci. 2023, 13(12), 7320; https://doi.org/10.3390/app13127320 - 20 Jun 2023
Cited by 4 | Viewed by 1642
Abstract
Recent advancements in Artificial Intelligence (AI), deep learning (DL), and computer vision have revolutionized various industrial processes through image classification and object detection. State-of-the-art Optical Character Recognition (OCR) and object detection (OD) technologies, such as YOLO and PaddleOCR, have emerged as powerful solutions for addressing challenges in recognizing textual and non-textual information on printed stickers. However, a well-established framework integrating these cutting-edge technologies for industrial applications has yet to be established. In this paper, we propose an innovative framework that combines advanced OCR and OD techniques to automate visual inspection processes in an industrial context. Our primary contribution is a comprehensive framework adept at detecting and recognizing textual and non-textual information on printed stickers within a company, harnessing the latest AI tools and technologies for sticker information recognition. Our experiments reveal an overall macro accuracy of 0.88 for sticker OCR across three distinct patterns. Furthermore, the proposed system goes beyond traditional Printed Character Recognition (PCR) by extracting supplementary information, such as barcodes and QR codes present in the image, significantly streamlining industrial workflows and minimizing manual labor demands.

12 pages, 1493 KiB  
Article
Action Recognition Network Based on Local Spatiotemporal Features and Global Temporal Excitation
by Shukai Li, Xiaofang Wang, Dongri Shan and Peng Zhang
Appl. Sci. 2023, 13(11), 6811; https://doi.org/10.3390/app13116811 - 03 Jun 2023
Viewed by 773
Abstract
Temporal modeling is a key problem in action recognition, and it remains difficult to accurately model the temporal information of videos. In this paper, we present a local spatiotemporal extraction module (LSTE) and a channel time excitation module (CTE), which are specially designed to accurately model temporal information in video sequences. The LSTE module first obtains difference features by computing the pixel-wise differences between adjacent frames within each video segment and then obtains local motion features by stressing the effect of the feature channels sensitive to difference information. The local motion features are merged with the spatial features to represent the local spatiotemporal features of each segment. The CTE module adaptively excites time-sensitive channels by modeling the interdependencies of channels in terms of time to enhance the global temporal information. Further, the above two modules are embedded into existing 2D-CNN baseline methods to build an action recognition network based on local spatiotemporal features and global temporal excitation (LSCT). We conduct experiments on the temporally dependent Something-Something V1 and V2 datasets and compare the recognition results with those of current methods, which confirms the effectiveness of our approach.
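
The difference-then-excite idea in the LSTE description can be sketched in a few lines of PyTorch: compute pixel-wise differences between adjacent frames, gate the channels that respond to those differences, and merge the result with a spatial summary. Shapes and the gate design are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ToyTemporalDifference(nn.Module):
    """Sketch of a difference-then-excite block: adjacent-frame feature
    differences, gated per channel, merged with a spatial summary."""
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # squeeze spatial dims
            nn.Conv2d(channels, channels, 1),   # score each channel
            nn.Sigmoid(),
        )

    def forward(self, feats):                   # feats: (B, T, C, H, W)
        diff = feats[:, 1:] - feats[:, :-1]     # adjacent-frame differences
        b, t, c, h, w = diff.shape
        d = diff.reshape(b * t, c, h, w)
        motion = d * self.gate(d)               # stress difference-sensitive channels
        motion = motion.reshape(b, t, c, h, w).mean(dim=1)  # pool over time
        spatial = feats.mean(dim=1)             # per-segment spatial summary
        return spatial + motion                 # merged local spatiotemporal feature

out = ToyTemporalDifference()(torch.randn(2, 8, 64, 14, 14))
print(out.shape)   # torch.Size([2, 64, 14, 14])
```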

15 pages, 6933 KiB  
Article
Emergency Evacuation Simulation Study Based on Improved YOLOv5s and Anylogic
by Chuanxi Niu, Weihao Wang, Hebin Guo and Kexin Li
Appl. Sci. 2023, 13(9), 5812; https://doi.org/10.3390/app13095812 - 08 May 2023
Cited by 1 | Viewed by 1784
Abstract
With the development of the social economy and the continuous growth of the population, emergencies within field stations are becoming more frequent. To improve the efficiency of emergency evacuation of field stations and further protect people's lives, this paper proposes a method based on improved YOLOv5s target detection and Anylogic emergency evacuation simulation. This method applies the YOLOv5s target detection network to the emergency evacuation problem for the first time, using the strong detection capability of YOLOv5s to solve the problem of unstable data collection under unexpected conditions. This paper first uses YOLOv5s, which incorporates the SE attention mechanism, to detect pedestrians inside the site. Considering the height of the camera and the inability to capture a pedestrian's whole body when the site is crowded, this paper detects the pedestrian's head to determine their specific location inside the site. To ensure that the evacuation task is completed in the shortest possible time, Anylogic adopts the principle of closest-distance evacuation, so that each pedestrian leaves through the nearest exit. The experimental results show that the average accuracy of the YOLOv5s target detection model incorporating the SE attention mechanism reaches 94.01%; the constructed Anylogic emergency evacuation model can quickly provide an evacuation plan to guide pedestrians to leave from the nearest exit in an emergency, effectively verifying the feasibility of the method. The method can be extended and applied to research on the construction of emergency evacuation decision-support systems for field stations.
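
The closest-distance evacuation rule is simple to state concretely: given pedestrian positions (e.g., head-detection centers from the YOLOv5s model) and exit positions, assign each pedestrian to the exit with the minimum Euclidean distance. A toy NumPy sketch with hypothetical coordinates:

```python
import numpy as np

# Toy nearest-exit assignment illustrating the closest-distance
# evacuation rule. Coordinates are hypothetical head-detection centers.
pedestrians = np.array([[12.0, 3.5], [4.2, 9.1], [18.7, 7.3]])  # (x, y) metres
exits = np.array([[0.0, 10.0], [20.0, 0.0]])                    # exit positions

# Pairwise Euclidean distances: shape (n_pedestrians, n_exits).
dists = np.linalg.norm(pedestrians[:, None, :] - exits[None, :, :], axis=-1)
assignment = dists.argmin(axis=1)   # index of the closest exit per pedestrian

for i, e in enumerate(assignment):
    print(f"pedestrian {i} -> exit {e} ({dists[i, e]:.1f} m)")
```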

18 pages, 4610 KiB  
Article
Lightweight Tennis Ball Detection Algorithm Based on Robomaster EP
by Yuan Zhao, Ling Lu, Wu Yang, Qizheng Li and Xiujie Zhang
Appl. Sci. 2023, 13(6), 3461; https://doi.org/10.3390/app13063461 - 08 Mar 2023
Cited by 1 | Viewed by 2251
Abstract
To address the problems of poor recognition performance, low detection accuracy, large numbers of model parameters and computations, complex network structures, and poor portability to embedded devices in traditional tennis ball detection algorithms, this study proposes a lightweight tennis ball detection algorithm, YOLOv5s-Z, based on the YOLOv5s algorithm and the Robomaster EP. The main work is as follows: firstly, the lightweight G-Backbone and G-Neck network layers are constructed to reduce the number of parameters and the computation of the network structure. Secondly, convolutional coordinate attention is incorporated into the G-Backbone to embed location information into channel attention, which enables the network to obtain location information over a larger area through multiple convolutions and enhances the expressive ability of the learned features. In addition, the Concat module in the original feature fusion is modified into a weighted bi-directional feature pyramid, W-BiFPN, with settable learning weights to improve feature fusion capability and achieve efficient weighted feature fusion and bi-directional cross-scale connectivity. Finally, the EIOU loss function is introduced to split the influence factor of the aspect ratio and penalize the width and height of the target box and anchor box separately, combined with Focal-EIOU loss to address the imbalance between hard and easy samples. The Meta-ACON activation function is introduced to adaptively decide whether to activate neurons and to improve detection accuracy. The experimental results show that, compared with the YOLOv5s algorithm, the YOLOv5s-Z algorithm reduces the number of parameters and the computation by 42% and 44%, respectively, reduces the model size by 39%, and improves the mean accuracy by 2%, verifying the effectiveness of the improvements and the lightweight design of the model, which adapts to the Robomaster EP and meets the deployment requirements of embedded devices for tennis ball detection and identification.
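
For readers unfamiliar with EIOU: the loss splits the aspect-ratio penalty of CIoU into separate width and height terms, alongside an IoU term and a normalized center-distance term. The sketch below follows the published EIoU formulation with corner-format boxes; it is illustrative rather than the exact YOLOv5s-Z implementation.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """Sketch of the EIoU loss: IoU term plus separate penalties on center
    distance and on width/height differences, each normalized by the
    smallest enclosing box. Boxes are (x1, y1, x2, y2)."""
    # Intersection and union.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box.
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1

    # Center-distance penalty.
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    dist = ((pcx - tcx) ** 2 + (pcy - tcy) ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Separate width and height penalties (the "split" aspect-ratio factor).
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    w_term = (pw - tw) ** 2 / (cw ** 2 + eps)
    h_term = (ph - th) ** 2 / (ch ** 2 + eps)

    return (1 - iou + dist + w_term + h_term).mean()

print(eiou_loss(torch.tensor([[0., 0., 2., 2.]]), torch.tensor([[1., 1., 3., 3.]])))
```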

13 pages, 758 KiB  
Article
Neural Network-Based Reference Block Quality Enhancement for Motion Compensation Prediction
by Yanhan Chu, Hui Yuan, Shiqi Jiang and Congrui Fu
Appl. Sci. 2023, 13(5), 2795; https://doi.org/10.3390/app13052795 - 22 Feb 2023
Cited by 1 | Viewed by 1165
Abstract
Inter prediction is a crucial part of hybrid video coding frameworks; it is used to eliminate redundancy between adjacent frames and improve coding performance. During inter prediction, motion estimation is used to find the reference block that is most similar to the current block, and the following motion compensation shifts the reference block fractionally to obtain the prediction block. The closer the reference block is to the original block, the higher the coding efficiency. To improve the quality of reference blocks, a quality enhancement network (RBENN) dedicated to reference blocks is proposed. The main body of the network consists of 10 residual modules, with two convolution layers for preprocessing and feature extraction. Each residual module consists of two convolutional layers, one ReLU activation, and a shortcut. The network uses the luma reference block as input before motion compensation, and the enhanced reference block is then filtered by the default fractional interpolation. Moreover, the proposed method can be used for both conventional motion compensation and affine motion compensation. Experimental results show that RBENN achieves a −1.35% BD-rate on average under the low-delay P (LDP) configuration compared with the latest H.266/VVC.
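
The abstract specifies the block structure fairly completely (two 3x3 convolutions, one ReLU, and a shortcut; 10 such modules plus two preprocessing convolutions), so a faithful-in-spirit PyTorch sketch is easy to write. The channel width and the global residual below are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions, one ReLU, and a shortcut, as described above."""
    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

class ToyRBENN(nn.Module):
    """Sketch of a reference-block enhancement net in the spirit of RBENN:
    a conv stem, a stack of residual modules, and a global residual so the
    network learns a correction to the luma reference block."""
    def __init__(self, blocks=10, c=64):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, c, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(c, c, 3, padding=1))
        self.body = nn.Sequential(*[ResidualBlock(c) for _ in range(blocks)])
        self.tail = nn.Conv2d(c, 1, 3, padding=1)

    def forward(self, luma_block):            # (B, 1, H, W), before interpolation
        return luma_block + self.tail(self.body(self.head(luma_block)))

enhanced = ToyRBENN()(torch.randn(1, 1, 32, 32))
print(enhanced.shape)   # torch.Size([1, 1, 32, 32])
```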

25 pages, 14900 KiB  
Article
Automatic Ship Object Detection Model Based on YOLOv4 with Transformer Mechanism in Remote Sensing Images
by Bowen Sun, Xiaofeng Wang, Ammar Oad, Amjad Pervez and Feng Dong
Appl. Sci. 2023, 13(4), 2488; https://doi.org/10.3390/app13042488 - 15 Feb 2023
Cited by 5 | Viewed by 1964
Abstract
Despite significant advancements in object detection technology, most existing detection networks fail to investigate global aspects while extracting features from the inputs and cannot automatically adjust based on the characteristics of the inputs. The present study addresses this problem by proposing a detection network consisting of three stages: preattention, attention, and prediction. In the preattention stage, the network framework is automatically selected based on the features of the images' objects. In the attention stage, the transformer structure is introduced: taking into account the global features of the target, the self-attention module of the transformer is combined with convolution operations to integrate image features from global to local, improving ship detection accuracy. The prediction stage then produces the final detection results. The resulting model, built on the You Only Look Once version 4 (YOLOv4) framework, is named "Auto-T-YOLO". The model achieves the highest accuracy of 96.3% on the SAR Ship Detection Dataset (SSDD) compared with other state-of-the-art (SOTA) models. It achieves 98.33% and 91.78% accuracy in offshore and inshore scenes, respectively. The experimental results verify the practicality, validity, and robustness of the proposed model.
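
The attention-stage idea, combining self-attention's global view with convolution's local detail on the same feature map, can be sketched as a small hybrid block. Layer sizes and the additive fusion are illustrative assumptions, not the Auto-T-YOLO design.

```python
import torch
import torch.nn as nn

class ToyConvAttention(nn.Module):
    """Sketch of mixing convolution (local detail) with self-attention
    (global context) on one feature map. Sizes are illustrative only."""
    def __init__(self, c=64, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(c)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)        # global self-attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return local + glob                     # fuse local and global features

print(ToyConvAttention()(torch.randn(1, 64, 16, 16)).shape)
```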

14 pages, 3911 KiB  
Article
A Kitchen Standard Dress Detection Method Based on the YOLOv5s Embedded Model
by Ziyun Zhou, Chengjiang Zhou, Anning Pan, Fuqing Zhang, Chaoqun Dong, Xuedong Liu, Xiangshuai Zhai and Haitao Wang
Appl. Sci. 2023, 13(4), 2213; https://doi.org/10.3390/app13042213 - 09 Feb 2023
Cited by 2 | Viewed by 1338
Abstract
To quickly and accurately detect whether a chef is wearing a hat and mask, a kitchen standard dress detection method based on the YOLOv5s embedded model is proposed. Firstly, a complete kitchen scene dataset was constructed; including images of both mask and hat wearing effectively avoids the low-reliability problem caused by a single detection object. Secondly, an embedded detection system based on the Jetson Xavier NX was introduced into kitchen standard dress detection for the first time, accurately realizing real-time detection and early warning of non-standard dress. In particular, the combination of YOLOv5 and the DeepStream SDK effectively improved the accuracy and effectiveness of standard dress detection against the complex kitchen background. Multiple sets of experiments show that the detection system based on YOLOv5s achieves the highest average accuracy of 0.857 and the fastest speed of 31.42 FPS. The proposed detection method therefore provides strong technical support for kitchen hygiene and food safety.

16 pages, 2216 KiB  
Article
Cycle Generative Adversarial Network Based on Gradient Normalization for Infrared Image Generation
by Xing Yi, Hao Pan, Huaici Zhao, Pengfei Liu, Canyu Zhang, Junpeng Wang and Hao Wang
Appl. Sci. 2023, 13(1), 635; https://doi.org/10.3390/app13010635 - 03 Jan 2023
Cited by 3 | Viewed by 2356
Abstract
Image generation technology is currently one of the popular directions in computer vision research; infrared imaging in particular bears critical applications in the military field. Existing algorithms for generating infrared images from visible images are usually weak at perceiving the salient regions of images and cannot effectively render texture details in the generated infrared images, resulting in fewer texture details and poorer image quality. In this study, a cycle generative adversarial network method based on gradient normalization was proposed to address the current problems of poor infrared image generation: lack of texture detail and unstable models. First, to address the limited feature extraction capability of the U-Net generator network, which makes the generated IR images blurred and of low quality, a residual network with better feature extraction capability was employed in the generator to make the generated infrared images more sharply defined. Secondly, to address the severe lack of detailed information in the generated infrared images, channel attention and spatial attention mechanisms were introduced into the ResNet, with the attention mechanism used to weight the generated infrared image features in order to enhance feature perception of the prominent regions of the image and help generate image details. Finally, to tackle the instability of current adversarial generator training, which makes models prone to collapse, a gradient normalization module was introduced into the discriminator network to stabilize the model during training. Experimental results on several datasets showed that the proposed method obtained satisfactory results in terms of objective evaluation metrics. Compared with the cycle generative adversarial network method, the proposed method exhibited significant improvement in data validity on multiple datasets.
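
Gradient normalization rescales the discriminator output as f_hat(x) = f(x) / (||grad_x f(x)|| + |f(x)|), which constrains the discriminator toward 1-Lipschitz behavior and stabilizes training. A minimal PyTorch sketch, with a stand-in discriminator rather than the paper's network:

```python
import torch
import torch.nn as nn

def gradient_normalize(disc, x):
    """Sketch of gradient normalization for a GAN discriminator:
    divide the raw score by the input-gradient norm plus the score's
    magnitude, computed per sample with autograd."""
    x = x.requires_grad_(True)
    f = disc(x).flatten(1).sum(dim=1)                    # raw scores, shape (B,)
    grad, = torch.autograd.grad(f.sum(), x, create_graph=True)
    grad_norm = grad.flatten(1).norm(dim=1)              # ||grad_x f(x)|| per sample
    return f / (grad_norm + f.abs() + 1e-8)

disc = nn.Sequential(nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                     nn.Conv2d(16, 1, 4, stride=2, padding=1))
scores = gradient_normalize(disc, torch.randn(4, 1, 32, 32))
print(scores.shape)   # torch.Size([4])
```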

13 pages, 2616 KiB  
Article
High-Accuracy Insulator Defect Detection for Overhead Transmission Lines Based on Improved YOLOv5
by Yourui Huang, Lingya Jiang, Tao Han, Shanyong Xu, Yuwen Liu and Jiahao Fu
Appl. Sci. 2022, 12(24), 12682; https://doi.org/10.3390/app122412682 - 10 Dec 2022
Cited by 10 | Viewed by 1855
Abstract
As a key component of overhead cables, insulators play an important role. However, in the process of insulator inspection, detection is difficult and has low accuracy due to background interference, small fault areas, the limitations of manual detection, and other factors, and it is prone to missed and false detections. To detect insulator defects more accurately, an insulator defect detection algorithm based on You Only Look Once version 5 (YOLOv5) is proposed. A backbone network was built with lightweight modules to reduce the network's computing overhead. A small-scale detection layer was added to improve the network's accuracy on small targets. A receptive field module was designed to replace the original spatial pyramid pooling (SPP) module so that the network can obtain richer feature information and improve performance. Finally, experiments were carried out on an insulator image dataset. The experimental results show that the average accuracy of the algorithm is 97.4%, which is 7% higher than that of the original YOLOv5 network, and the detection speed is increased by 10 fps, improving both the accuracy and speed of insulator detection.

15 pages, 4667 KiB  
Article
Classification and Object Detection of 360° Omnidirectional Images Based on Continuity-Distortion Processing and Attention Mechanism
by Xin Zhang, Degang Yang, Tingting Song, Yichen Ye, Jie Zhou and Yingze Song
Appl. Sci. 2022, 12(23), 12398; https://doi.org/10.3390/app122312398 - 04 Dec 2022
Cited by 2 | Viewed by 2216
Abstract
360° omnidirectional images are widely used in areas where comprehensive visual information is required, due to their large field-of-view coverage. However, many existing convolutional neural networks based on 360° omnidirectional images have not performed well in computer vision tasks. This occurs because 360° omnidirectional images are processed into plane images by equirectangular projection, which generates discontinuities at the edges and can result in serious distortion. At present, most methods to alleviate these problems are based on multi-projection and resampling, which can incur huge computational overhead. Therefore, a novel edge continuity distortion-aware block (ECDAB) for 360° omnidirectional images is proposed here, which prevents edge discontinuity and distortion by recombining and segmenting features. To further improve the performance of the network, a novel convolutional row-column attention block (CRCAB) is also proposed. CRCAB captures row-to-row and column-to-column dependencies to aggregate global information, enabling a stronger representation of the extracted features. Moreover, to reduce the memory overhead of CRCAB, we propose an improved convolutional row-column attention block (ICRCAB), in which the number of vectors in the row-column direction can be adjusted. Finally, to verify the effectiveness of the proposed networks, we conducted experiments on both traditional images and 360° omnidirectional image datasets. The experimental results demonstrate that networks using ECDAB or CRCAB obtain better performance than the baseline model.
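
One simple way to see the edge-continuity problem: in an equirectangular projection, the left and right image borders are physically adjacent, so zero padding breaks real neighborhoods at the 0°/360° seam. The sketch below wraps the width dimension with circular padding before a convolution; it illustrates the continuity issue ECDAB targets, not the module itself.

```python
import torch
import torch.nn.functional as F

# Toy demonstration of seam-aware convolution on an equirectangular
# (ERP) feature map: wrap the width axis so features at the 0°/360°
# seam see their true neighbors, and pad height conventionally.
feat = torch.randn(1, 8, 16, 32)                  # (B, C, H, W), W spans 360°

wrapped = F.pad(feat, pad=(1, 1, 0, 0), mode="circular")     # width wraps around
wrapped = F.pad(wrapped, pad=(0, 0, 1, 1), mode="constant")  # height zero-padded

conv = torch.nn.Conv2d(8, 8, kernel_size=3)       # no built-in padding
out = conv(wrapped)
print(out.shape)                                  # torch.Size([1, 8, 16, 32])
```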

19 pages, 4627 KiB  
Article
An Improved Differentiable Binarization Network for Natural Scene Street Sign Text Detection
by Manhuai Lu, Yi Leng, Chin-Ling Chen and Qiting Tang
Appl. Sci. 2022, 12(23), 12120; https://doi.org/10.3390/app122312120 - 27 Nov 2022
Viewed by 1317
Abstract
Street sign text in natural scenes usually appears against a complex background and is affected by natural and artificial light. However, most current text detection algorithms do not effectively reduce the influence of light, do not make full use of the relationship between high-level semantic information and contextual semantic information in the feature extraction network, and are consequently ineffective at detecting text against complex backgrounds. To solve these problems, we first propose a multi-channel MSER (Maximally Stable Extremal Regions) method that fully considers color information in text detection; it separates the text area in the image from the complex background, effectively reducing the influence of the complex background and light on street sign text detection. We also propose an enhanced feature pyramid network text detection method, which includes a feature pyramid route enhancement (FPRE) module and a high-level feature enhancement (HLFE) module. The two modules make full use of the network's low-level and high-level semantic information to enhance the network's effectiveness in localizing text information and detecting text of different shapes, sizes, and inclinations. Experiments showed that the F-scores obtained by the proposed method on the ICDAR 2015 (International Conference on Document Analysis and Recognition 2015) dataset, the ICDAR2017-MLT (ICDAR 2017 Competition on Multi-lingual Scene Text Detection) dataset, and the Natural Scene Street Signs (NSSS) dataset constructed in this study are 89.5%, 84.5%, and 73.3%, respectively, confirming the performance advantage of the method in street sign text detection.
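
The multi-channel MSER idea can be sketched with OpenCV: run the detector on each color channel (plus a grayscale channel) and pool the candidate boxes, so text that contrasts in only one channel is still captured. The synthetic image, default detector parameters, and the crude de-duplication below are all hypothetical stand-ins for the paper's pipeline.

```python
import cv2
import numpy as np

# Synthetic "street sign" so the sketch is self-contained.
img = np.full((120, 360, 3), 255, np.uint8)
cv2.putText(img, "MAIN ST", (20, 70), cv2.FONT_HERSHEY_SIMPLEX, 2, (40, 40, 200), 4)

mser = cv2.MSER_create()
boxes = []
# Run MSER per channel: B, G, R, and grayscale.
channels = list(cv2.split(img)) + [cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)]
for ch in channels:
    regions, bboxes = mser.detectRegions(ch)
    if len(bboxes):
        boxes.extend(bboxes.tolist())          # (x, y, w, h) candidates

# Crude de-duplication: coarsely quantize boxes and keep one per cell.
unique = {(x // 8, y // 8, w // 8, h // 8): (x, y, w, h) for x, y, w, h in boxes}
print(f"{len(boxes)} raw candidates -> {len(unique)} merged boxes")
```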

21 pages, 1415 KiB  
Article
Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory
by Zohreh Khosrobeigi, Hadi Veisi, Ehsan Hoseinzade and Hanieh Shabanian
Appl. Sci. 2022, 12(22), 11760; https://doi.org/10.3390/app122211760 - 19 Nov 2022
Cited by 1 | Viewed by 2830
Abstract
Optical Character Recognition (OCR) is a system for converting images that include text into editable text, and it is applied to various languages such as English, Arabic, and Persian. While these languages have similarities, their fundamental differences can create unique challenges. In Persian, continuity between characters, the existence of semicircles, dots, and oblique strokes, and left-to-right elements such as English words embedded in the text are some of the most important challenges in designing Persian OCR systems. Our proposed framework, Bina, is designed to address the issue of continuity by utilizing a Convolutional Neural Network (CNN) and a deep bidirectional Long Short-Term Memory (BLSTM) network, a type of LSTM that has access to both past and future context. A huge and diverse dataset, including about 2M samples of both Persian and English text in various fonts and sizes, was also generated to train and test the performance of the proposed model. Various configurations were tested to find the optimal structure of the CNN and BLSTM. The results show that Bina successfully outperformed the state-of-the-art baseline algorithm, achieving about 96% accuracy on Persian text and about 88% accuracy on mixed Persian and English text.
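
The CNN + BLSTM combination described here is the classic CRNN recipe for line-level OCR, usually trained with a CTC loss that aligns per-timestep predictions to the label string. A minimal PyTorch sketch with an assumed alphabet size and layer sizes (not Bina's actual configuration):

```python
import torch
import torch.nn as nn

class ToyCRNN(nn.Module):
    """Sketch of the CNN + BLSTM + CTC recipe: the CNN turns an image strip
    into a feature sequence, the BLSTM reads it in both directions, and CTC
    aligns per-step predictions to the label string."""
    def __init__(self, num_classes=120):        # hypothetical alphabet size
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.blstm = nn.LSTM(input_size=64 * 8, hidden_size=128,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, num_classes)

    def forward(self, x):                       # x: (B, 1, 32, W)
        f = self.cnn(x)                         # (B, 64, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)    # (B, W/4, 64*8): a sequence
        seq, _ = self.blstm(f)                  # bidirectional context
        return self.fc(seq).log_softmax(-1)     # (B, T, classes) for CTC

model = ToyCRNN()
log_probs = model(torch.randn(2, 1, 32, 128)).permute(1, 0, 2)   # (T, B, C)
targets = torch.randint(1, 120, (2, 10))        # class 0 reserved for blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 32, dtype=torch.long),
                           target_lengths=torch.full((2,), 10, dtype=torch.long))
print(loss.item())
```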

17 pages, 5304 KiB  
Article
Supervised Learning-Based Image Classification for the Detection of Late Blight in Potato Crops
by Marco Javier Suarez Baron, Angie Lizeth Gomez and Jorge Enrique Espindola Diaz
Appl. Sci. 2022, 12(18), 9371; https://doi.org/10.3390/app12189371 - 19 Sep 2022
Cited by 9 | Viewed by 1519
Abstract
This article presents the application of supervised learning and image classification for the early detection of late blight disease in potato using a convolutional neural network (CNN) and a support vector machine (SVM). The study was conducted in the Boyacá department, Colombia. An initial dataset was created by acquiring a large number of images directly from the crops. These images were pre-processed to extract the main characteristics of late blight disease. A classification model was developed to identify potato plants as healthy or infected. Several performance, efficiency, and quality metrics were applied to the learning and classification tasks to determine the best machine learning algorithms. An additional dataset was then used for validation, image classification, and detection of late blight disease in potato crops in the department of Boyacá, Colombia. The results show that the CNN trained on the dataset obtained an AUC of 0.97, while the SVM analysis obtained an AUC of 0.87. Future work includes the development of a mobile application with advanced features as a precision agriculture tool that supports farmers in increasing agricultural productivity.

25 pages, 28437 KiB  
Article
Benchmarking Deep Learning Models for Instance Segmentation
by Sunguk Jung, Hyeonbeom Heo, Sangheon Park, Sung-Uk Jung and Kyungjae Lee
Appl. Sci. 2022, 12(17), 8856; https://doi.org/10.3390/app12178856 - 03 Sep 2022
Cited by 9 | Viewed by 4217
Abstract
Instance segmentation has gained attention in various computer vision fields, such as autonomous driving, drone control, and sports analysis. Recently, many successful models have been developed, which can be classified into two categories: accuracy-focused and speed-focused. Accuracy and inference time are both important for real-time applications of this task. However, existing works only report inference times measured on different hardware, which makes comparison difficult. This study is the first to evaluate and compare the performance of state-of-the-art instance segmentation models by focusing on their inference time in a fixed experimental environment. For a precise comparison, the test hardware and environment must be identical; hence, we present the accuracy and speed of the models in a fixed hardware environment for quantitative and qualitative analyses. Although speed-focused models run in real time on high-end GPUs, there is a trade-off between speed and accuracy when computing power is insufficient. The experimental results show that a feature pyramid network structure should be considered when designing a real-time model, and that a balance between speed and accuracy must be achieved for real-time applications.
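
Fair speed comparison of the kind described above hinges on a fixed measurement protocol: identical input, identical device, warmup iterations, and explicit GPU synchronization so asynchronous CUDA kernels are actually counted. A toy harness, with a stand-in model:

```python
import time
import torch

def measure_inference_ms(model, input_shape=(1, 3, 640, 640), warmup=10, iters=50):
    """Toy fixed-environment timing harness: same input, same device,
    warmup before timing, and GPU sync around the timed loop."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):                 # let clocks/caches settle
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()            # wait for queued kernels
    return (time.perf_counter() - start) / iters * 1000.0

# Example with a stand-in model (any instance-segmentation net would do):
net = torch.nn.Conv2d(3, 16, 3, padding=1)
print(f"{measure_inference_ms(net):.2f} ms/inference")
```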

12 pages, 814 KiB  
Article
Multi-Scale Convolutional Network for Space-Based ADS-B Signal Separation with Single Antenna
by Yan Bi and Chuankun Li
Appl. Sci. 2022, 12(17), 8816; https://doi.org/10.3390/app12178816 - 01 Sep 2022
Cited by 3 | Viewed by 1160
Abstract
Automatic Dependent Surveillance-Broadcast (ADS-B) signals are vital in air traffic control. However, space-based ADS-B signals easily overlap, and their messages then cannot be correctly received. It is challenging to separate overlapped signals, especially with a single antenna. Existing methods have low decoding accuracy when the power differences, carrier frequency differences, and relative time delays between overlapped signals are small. To solve these problems, we apply deep learning to single-antenna ADS-B signal separation. A multi-scale Conv-TasNet (MConv-TasNet) is proposed to capture long temporal information in the ADS-B signal. In MConv-TasNet, a multi-scale convolutional separation (MCS) network is proposed to fuse temporal features extracted at different scales from overlapping ADS-B signals and to generate an effective separation mask. Moreover, a large dataset was created from real ADS-B data, on which the proposed method was evaluated. The average decoding accuracy on the test set is 90.34%, achieving state-of-the-art results.

14 pages, 951 KiB  
Article
Adaptive Multi-Modal Ensemble Network for Video Memorability Prediction
by Jing Li, Xin Guo, Fumei Yue, Fanfu Xue and Jiande Sun
Appl. Sci. 2022, 12(17), 8599; https://doi.org/10.3390/app12178599 - 27 Aug 2022
Cited by 1 | Viewed by 1354
Abstract
Video memorability prediction aims to quantify how likely a video is to be remembered based on its content, which provides significant value in advertising design, social media recommendation, and other applications. However, the main attributes that affect memorability prediction have not been determined, which makes the design of prediction models more challenging. Therefore, in this study, we analyze and experimentally verify how to select the most influential factors for predicting video memorability. Furthermore, we design a new framework, the Adaptive Multi-modal Ensemble Network, based on the chosen impact factors to predict video memorability efficiently. Specifically, we first identify three main impact factors that affect video memorability, i.e., temporal 3D information, spatial information, and semantics, derived from the video, image, and caption, respectively. The Adaptive Multi-modal Ensemble Network then integrates three individual base learners (ResNet3D, Deep Random Forest, and Multi-Layer Perceptron) into a weighted ensemble framework to score video memorability. In addition, we design an adaptive learning strategy that updates the weights based on the importance of memorability as predicted by the base learners, rather than assigning weights manually. Finally, experiments on the public VideoMem dataset demonstrate that the proposed method provides competitive results and high efficiency for video memorability prediction.
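
A weighted ensemble of heterogeneous base learners can be illustrated in a few lines: derive each learner's weight from its validation error, normalize, and blend the test-time scores. The numbers below are made up, and this simple inverse-error rule is only a stand-in for the paper's adaptive learning strategy.

```python
import numpy as np

# Toy weighted ensemble: the three rows stand in for ResNet3D /
# Deep Random Forest / MLP memorability scores on a validation set.
val_true = np.array([0.81, 0.62, 0.90, 0.55])
val_preds = np.array([
    [0.78, 0.60, 0.92, 0.50],   # base learner 1
    [0.70, 0.65, 0.80, 0.60],   # base learner 2
    [0.85, 0.58, 0.88, 0.54],   # base learner 3
])

mse = ((val_preds - val_true) ** 2).mean(axis=1)   # per-learner error
weights = 1.0 / (mse + 1e-8)
weights /= weights.sum()                           # normalize to sum to 1

test_preds = np.array([[0.72], [0.66], [0.74]])    # scores for one new video
ensemble_score = (weights[:, None] * test_preds).sum(axis=0)
print(weights.round(3), ensemble_score.round(3))
```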

15 pages, 21556 KiB  
Article
Cloud Gaming Video Coding Optimization Based on Camera Motion-Guided Reference Frame Enhancement
by Yifan Wang, Hao Wang, Kaijie Wang and Wei Zhang
Appl. Sci. 2022, 12(17), 8504; https://doi.org/10.3390/app12178504 - 25 Aug 2022
Cited by 1 | Viewed by 1715
Abstract
Recent years have witnessed tremendous advances in cloud gaming. To alleviate the bandwidth pressure caused by transmitting high-quality cloud gaming videos, this paper optimizes existing video codecs with deep learning networks to reduce the bitrate consumption of cloud gaming videos. Specifically, a camera motion-guided network, CMGNet, is proposed for reference frame enhancement, leveraging the camera motion information of cloud gaming videos and the reconstructed frames in the reference frame list. The resulting high-quality reference frame is then added to the reference frame list to improve compression efficiency. The decoder performs the same operation to generate the reconstructed frames using the updated reference frame list. In the CMGNet, camera motion is used as guidance to estimate frame motion and weight masks, achieving more accurate frame alignment and fusion, respectively. As a result, the quality of the reference frame is significantly enhanced, making it a more suitable prediction candidate for the target frame. Experimental results demonstrate the effectiveness of the proposed algorithm, which achieves a 4.91% BD-rate reduction on average. Moreover, a cloud gaming video dataset with camera motion data was made available to promote research on game video compression.

22 pages, 1831 KiB  
Article
Adaptive Thresholding of CNN Features for Maize Leaf Disease Classification and Severity Estimation
by Harry Dzingai Mafukidze, Godliver Owomugisha, Daniel Otim, Action Nechibvute, Cloud Nyamhere and Felix Mazunga
Appl. Sci. 2022, 12(17), 8412; https://doi.org/10.3390/app12178412 - 23 Aug 2022
Cited by 9 | Viewed by 1805
Abstract
Convolutional neural networks (CNNs) are the gold standard in the machine learning (ML) community. As a result, most recent studies have relied on CNNs, which have achieved higher accuracies than traditional machine learning approaches. From prior research, we learned that multi-class image classification models can solve leaf disease identification problems, and multi-label image classification models can solve leaf disease quantification problems (severity analysis). Historically, maize leaf disease severity analysis or quantification has relied on domain knowledge: experts evaluate the images and train the CNN models based on their knowledge. Here, we propose a unique system that achieves the same objective while excluding input from specialists, which avoids bias and does not rely on a human-in-the-loop model for disease quantification. The advantages of the proposed system are many. Notably, the conventional approach to maize leaf disease quantification is labor-intensive, time-consuming, and prone to errors, since it lacks standardized diagnosis guidelines. In this work, we present an approach to quantifying maize leaf disease based on adaptive thresholding. The experimental work of our study has three parts. First, we train a wide variety of well-known deep learning models for maize leaf disease classification, compare their performance, and extract the class activation heatmaps from the prediction layers of the CNN models. Second, we develop an adaptive thresholding technique that automatically extracts the regions of interest from the class activation maps without any prior knowledge. Lastly, we use these regions of interest to estimate leaf disease severity. Experimental results show that transfer learning approaches can classify maize leaf diseases with up to 99% accuracy. With its high quantification accuracy, our proposed adaptive thresholding method for CNN class activation maps can be a valuable contribution to quantifying maize leaf diseases without relying on domain knowledge.
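
The core of the severity estimate can be sketched simply: binarize a class-activation heatmap with an adaptive (Otsu) threshold, then report the activated fraction of the leaf area. The helper below uses OpenCV and a synthetic heatmap; it illustrates the idea, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def severity_from_cam(cam, leaf_mask=None):
    """Sketch: normalize a class-activation map, binarize it with Otsu's
    adaptive threshold (no hand-tuned cutoff), and return the activated
    fraction of the leaf as a severity estimate in [0, 1]."""
    cam_u8 = cv2.normalize(cam, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, binary = cv2.threshold(cam_u8, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    if leaf_mask is None:
        leaf_mask = np.ones_like(binary)       # assume the whole image is leaf
    diseased = np.count_nonzero((binary > 0) & (leaf_mask > 0))
    total = max(np.count_nonzero(leaf_mask), 1)
    return diseased / total

# Hypothetical Grad-CAM heatmap with one "hot" diseased patch:
cam = np.zeros((224, 224), np.float32)
cam[60:120, 80:160] = 1.0
print(f"estimated severity: {severity_from_cam(cam):.2%}")
```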

13 pages, 5004 KiB  
Article
Variable Rate Point Cloud Attribute Compression with Non-Local Attention Optimization
by Xiao Huo, Saiping Zhang and Fuzheng Yang
Appl. Sci. 2022, 12(16), 8179; https://doi.org/10.3390/app12168179 - 16 Aug 2022
Cited by 2 | Viewed by 1295
Abstract
Point clouds are widely used as representations of 3D objects and scenes in a number of applications, including virtual and mixed reality, autonomous driving, and antique reconstruction. To reduce the cost of transmitting and storing such data, this paper proposes an end-to-end learning-based point cloud attribute compression (PCAC) approach. The proposed network adopts a sparse convolution-based variational autoencoder (VAE) structure to compress the color attributes of point clouds. Considering the difficulty of stacked convolution operations in capturing long-range dependencies, an attention mechanism is incorporated: a non-local attention module is developed to capture local and global correlations in both the spatial and channel dimensions. Toward practical application, an additional modulation network is provided to achieve variable-rate compression in a single network, avoiding the memory cost of storing multiple networks for multiple bitrates. Our proposed method achieves state-of-the-art compression performance compared with other existing learning-based methods and further reduces the gap with the latest MPEG G-PCC reference software, TMC13 version 14.

29 pages, 13354 KiB  
Article
A Computer Vision Model to Identify the Incorrect Use of Face Masks for COVID-19 Awareness
by Fabricio Crespo, Anthony Crespo, Luz Marina Sierra-Martínez, Diego Hernán Peluffo-Ordóñez and Manuel Eugenio Morocho-Cayamcela
Appl. Sci. 2022, 12(14), 6924; https://doi.org/10.3390/app12146924 - 08 Jul 2022
Cited by 7 | Viewed by 2465
Abstract
Face mask detection has become a great challenge in computer vision, demanding the combination of technology with COVID-19 awareness. Researchers have proposed deep learning models to detect the use of face masks. However, the incorrect use of a face mask can be as harmful as not wearing any protection at all. In this paper, we propose a compound convolutional neural network (CNN) architecture based on two computer vision tasks: object localization to discover faces in images/videos, followed by an image classification CNN to categorize the faces and show whether someone is using a face mask correctly, incorrectly, or not at all. The first CNN is built upon RetinaFace, a model to detect faces in images, whereas the second CNN uses a ResNet-18 architecture as its classification backbone. Our model enables accurate identification of people who are not correctly following the COVID-19 healthcare recommendations on face mask use. To enable further global use of our technology, we have released to the public both the dataset used to train the classification model and our proposed computer vision pipeline, which has been optimized for deployment on embedded systems.

11 pages, 3250 KiB  
Article
A Deep Attention Model for Environmental Sound Classification from Multi-Feature Data
by Jinming Guo, Chuankun Li, Zepeng Sun, Jian Li and Pan Wang
Appl. Sci. 2022, 12(12), 5988; https://doi.org/10.3390/app12125988 - 12 Jun 2022
Cited by 7 | Viewed by 2877
Abstract
Automated environmental sound recognition has clear engineering benefits; it allows audio to be sorted, curated, and searched. Unlike music and language, environmental sound is loaded with noise and lacks the rhythm and melody of music or the semantic sequence of language, making it difficult to find common features representative enough of various environmental sound signals. To improve the accuracy of environmental sound recognition, this paper proposes a recognition method based on multi-feature parameters and a time–frequency attention module. It begins with a preprocessing stage that extracts multi-feature parameters from the sound, which supplements the phase information lost by the Log-Mel spectrogram in current mainstream methods and enhances the expressive ability of the input features. A time–frequency attention module with multiple convolutions is designed to extract attention weights from the input feature spectrogram and reduce the interference from background noise and irrelevant frequency bands in the audio. Comparative experiments were conducted on three common datasets: the environmental sound classification datasets ESC-10 and ESC-50, and the UrbanSound8K dataset. The experiments demonstrate that the proposed method performs better than the compared methods.
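
A time–frequency attention module of the general kind described above can be sketched by learning one weight per time frame and one per frequency band, then rescaling the spectrogram with both so noisy frames and irrelevant bands are suppressed. The design below is an illustrative assumption, not the paper's module.

```python
import torch
import torch.nn as nn

class ToyTimeFreqAttention(nn.Module):
    """Sketch of time-frequency attention: small convolutions produce one
    weight per time step and one per frequency band, which rescale the
    input spectrogram."""
    def __init__(self, c=1):
        super().__init__()
        self.time_conv = nn.Conv2d(c, c, kernel_size=(1, 5), padding=(0, 2))
        self.freq_conv = nn.Conv2d(c, c, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, spec):                       # spec: (B, C, F, T)
        # Attention over time: pool out frequency, score each frame.
        t_w = torch.sigmoid(self.time_conv(spec.mean(dim=2, keepdim=True)))
        # Attention over frequency: pool out time, score each band.
        f_w = torch.sigmoid(self.freq_conv(spec.mean(dim=3, keepdim=True)))
        return spec * t_w * f_w                    # broadcast over F and T

log_mel = torch.randn(2, 1, 64, 128)               # (batch, 1, mel bins, frames)
print(ToyTimeFreqAttention()(log_mel).shape)       # torch.Size([2, 1, 64, 128])
```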

13 pages, 4130 KiB  
Article
A Fast Identification Method of Gunshot Types Based on Knowledge Distillation
by Jian Li, Jinming Guo, Xiushan Sun, Chuankun Li and Lingpeng Meng
Appl. Sci. 2022, 12(11), 5526; https://0-doi-org.brum.beds.ac.uk/10.3390/app12115526 - 29 May 2022
Cited by 4 | Viewed by 1664
Abstract
To reduce the large size of a gunshot recognition network model and to improve the insufficient real-time detection in urban combat, this paper proposes a fast gunshot type recognition method based on knowledge distillation. First, the muzzle blast and the shock wave generated [...] Read more.
To reduce the large size of a gunshot recognition network model and to improve the insufficient real-time detection in urban combat, this paper proposes a fast gunshot type recognition method based on knowledge distillation. First, the muzzle blast and the shock wave generated by the gunshot are preprocessed, and the quality of the gunshot recognition dataset is improved using Log-Mel spectrum corresponding to these two signals. Second, a teacher network is constructed using 10 two-dimensional residual modules, and a student network is designed using depth wise separable convolution. Third, the lightweight student network is made to learn the gunshot features under the guidance of the pre-trained large-scale teacher network. Finally, the network’s accuracy, model size, and recognition time are tested using the AudioSet dataset and the NIJ Grant 2016-DN-BX-0183 gunshot dataset. The findings demonstrate that the proposed algorithm achieved 95.6% and 83.5% accuracy on the two datasets, the speed was 0.5 s faster, and the model size was reduced to 2.5 MB. The proposed method is of good practical value in the field of gunshot recognition. Full article
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
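
The distillation step can be made concrete with the standard soft-label loss of Hinton et al.; the temperature and weighting below are placeholder values, and the paper's exact loss formulation may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    # Soft targets from the frozen teacher, smoothed by the temperature
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard-label cross-entropy keeps the student anchored to the ground truth
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

During training only the lightweight student is updated; the teacher's parameters stay frozen.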

22 pages, 4873 KiB  
Article
Editable Image Generation with Consistent Unsupervised Disentanglement Based on GAN
by Gaoming Yang, Yuanjin Qu and Xianjin Fang
Appl. Sci. 2022, 12(11), 5382; https://0-doi-org.brum.beds.ac.uk/10.3390/app12115382 - 26 May 2022
Viewed by 1445
Abstract
Generative adversarial networks (GANs) are often used to generate realistic images, and they are effective at fitting high-dimensional probability distributions. However, during training they often suffer from mode collapse, the inability of the generative model to map the input noise to the full real data distribution. In this work, we propose a model for disentanglement and for mitigating mode collapse, inspired by the relationship between the Hessian and Jacobian matrices. It is a concise framework that requires only a few modifications to the original model while facilitating disentanglement. Compared to the pre-improvement generative models, our approach modifies the original model architecture only marginally and does not change the training method. Our method shows consistent resistance to mode collapse on several image datasets, while outperforming the pre-improvement method in terms of disentanglement.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
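
One published realization of a Hessian-based disentanglement objective is the finite-difference Hessian penalty of Peebles et al. (2020); the sketch below follows that recipe as a stand-in, since the paper's own Hessian/Jacobian construction is not reproduced here.

```python
import torch

def hessian_penalty(G, z, k: int = 4, eps: float = 0.1):
    """Variance of directional second differences of a generator G at latent z."""
    second_diffs = []
    for _ in range(k):
        # Rademacher perturbation direction scaled by eps
        v = eps * (torch.randint(0, 2, z.shape, device=z.device).float() * 2 - 1)
        # central finite-difference estimate of the second directional derivative
        d2 = (G(z + v) - 2 * G(z) + G(z - v)) / eps ** 2
        second_diffs.append(d2)
    stacked = torch.stack(second_diffs)
    # large off-diagonal Hessian terms show up as variance across directions
    return stacked.var(dim=0, unbiased=True).max()
```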

18 pages, 7468 KiB  
Article
CloudRCNN: A Framework Based on Deep Neural Networks for Semantic Segmentation of Satellite Cloud Images
by Gonghe Shi and Baohe Zuo
Appl. Sci. 2022, 12(11), 5370; https://0-doi-org.brum.beds.ac.uk/10.3390/app12115370 - 26 May 2022
Cited by 3 | Viewed by 1465
Abstract
Shallow cumulus clouds are widely distributed globally. They carry critical information for analyzing environmental and climate change, and they shape the energy and water cycles of the global ecosystem at multiple scales by affecting solar radiation transfer and precipitation. Satellite images are an important source of cloud data, so the accurate detection and segmentation of clouds is of great significance for climate and environmental monitoring. In this paper, we propose CloudRCNN, an improved Mask R-CNN framework for the semantic segmentation of satellite cloud images, and explore two deep neural network architectures that use an auxiliary loss and a feature fusion function, respectively. We conduct comparative experiments on the "Understanding Clouds from Satellite Images" dataset from the Kaggle competition. Compared to the Mask R-CNN baseline, the mIoU of the CloudRCNN (auxiliary loss) model improves by 15.24%, and that of the CloudRCNN (feature fusion) model improves by 12.77%. More importantly, the two architectures proposed in this paper can be applied to a wide range of semantic segmentation models to improve the distinction between foreground and background.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
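
The auxiliary-loss variant can be summarized as a weighted sum of a main and an intermediate supervision term; the sketch below is a generic pattern with an assumed weight, not the CloudRCNN implementation.

```python
import torch.nn.functional as F

def segmentation_loss(main_logits, aux_logits, target, aux_weight: float = 0.4):
    # main_logits, aux_logits: (B, num_classes, H, W); target: (B, H, W) class ids
    main = F.cross_entropy(main_logits, target)
    aux = F.cross_entropy(aux_logits, target)  # supervises an intermediate feature map
    return main + aux_weight * aux             # the auxiliary head is dropped at inference
```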

20 pages, 5714 KiB  
Article
An Image Processing Protocol to Extract Variables Predictive of Human Embryo Fitness for Assisted Reproduction
by Dóris Spinosa Chéles, André Satoshi Ferreira, Isabela Sueitt de Jesus, Eleonora Inácio Fernandez, Gabriel Martins Pinheiro, Eloiza Adriane Dal Molin, Wallace Alves, Rebeca Colauto Milanezi de Souza, Lorena Bori, Marcos Meseguer, José Celso Rocha and Marcelo Fábio Gouveia Nogueira
Appl. Sci. 2022, 12(7), 3531; https://0-doi-org.brum.beds.ac.uk/10.3390/app12073531 - 30 Mar 2022
Cited by 2 | Viewed by 2058
Abstract
Despite the use of new embryo selection techniques and the availability of equipment such as EmbryoScope® and Geri®, which help in the evaluation of embryo quality, embryologists' classifications remain subjective and subject to inter- and intra-observer variability, which compromises successful implantation of the embryo. With images acquired through a time-lapse system, however, it is possible to process them digitally, providing a better analysis of the embryo and enabling the automatic analysis of a large volume of information. An image processing protocol was developed using well-established techniques to segment images of blastocysts and extract variables of interest. A total of 33 variables were generated automatically by digital image processing, each representing a different aspect of the embryo and describing a different characteristic of the blastocyst. These variables can be categorized into texture, gray-level average, gray-level standard deviation, modal value, relations, and light level. The automated and directed steps of the proposed protocol exclude spurious results, except when image quality (e.g., focus) prevents correct segmentation. The protocol can segment human blastocyst images and automatically extract 33 variables that describe quantitative aspects of the blastocyst's regions, with potential utility in embryo selection for assisted reproductive technology (ART).
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
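
A few of the listed variables are simple region statistics. A hypothetical extraction routine is sketched below, assuming a grayscale image and a boolean segmentation mask; neither the variable definitions nor the names are taken from the paper.

```python
import numpy as np

def gray_level_features(image: np.ndarray, mask: np.ndarray) -> dict:
    """image: 2-D uint8 grayscale array; mask: boolean array of the segmented region."""
    pixels = image[mask]
    values, counts = np.unique(pixels, return_counts=True)
    return {
        "gray_mean": float(pixels.mean()),             # gray-level average
        "gray_std": float(pixels.std()),               # gray-level standard deviation
        "modal_value": int(values[counts.argmax()]),   # most frequent gray level
        "region_area": int(mask.sum()),                # size of the segmented region
    }
```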

14 pages, 375 KiB  
Article
Variable Rate Independently Recurrent Neural Network (IndRNN) for Action Recognition
by Yanbo Gao, Chuankun Li, Shuai Li, Xun Cai, Mao Ye and Hui Yuan
Appl. Sci. 2022, 12(7), 3281; https://0-doi-org.brum.beds.ac.uk/10.3390/app12073281 - 23 Mar 2022
Cited by 2 | Viewed by 1651
Abstract
Recurrent neural networks (RNNs) have been widely used to solve sequence problems due to their capability of modeling temporal dependency. Despite the rich variety of RNN models proposed in the literature, the problem of different sampling rates or performing speeds in sequence tasks has not been explicitly considered in the network or in the corresponding training and testing processes. This paper addresses this problem for skeleton-based action recognition with RNNs. Specifically, the recently proposed independently recurrent neural network (IndRNN) is used because of its well-behaved and easily regulated gradient backpropagation through time. Samples are extracted at variable sampling rates, and thus have different lengths, and are then processed by the IndRNN with different numbers of time steps. To accommodate the differences in gradients introduced by backpropagation through time under variable time steps, a learning rate adjustment method is further proposed: different learning rate adjustment factors are obtained for different layers by analyzing the gradient behavior of the IndRNN. Experiments on skeleton-based action recognition verify its effectiveness, and the results show that the proposed variable rate IndRNN network significantly improves performance over RNN models trained with conventional strategies.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
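
The two core ingredients, the elementwise IndRNN recurrence and per-layer learning rates, map naturally onto optimizer parameter groups. The cell below and the adjustment factors are illustrative stand-ins, not the values derived in the paper's gradient analysis.

```python
import torch
import torch.nn as nn

class IndRNNCell(nn.Module):
    """h_t = relu(W x_t + u * h_{t-1}), with an elementwise recurrent weight u."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.w = nn.Linear(input_size, hidden_size)
        self.u = nn.Parameter(0.5 * torch.ones(hidden_size))

    def forward(self, x_t, h_prev):
        return torch.relu(self.w(x_t) + self.u * h_prev)

layers = nn.ModuleList([IndRNNCell(64, 128), IndRNNCell(128, 128)])
# per-layer learning-rate factors via parameter groups (placeholder factors)
optimizer = torch.optim.Adam([
    {"params": layers[0].parameters(), "lr": 2e-4 * 1.0},
    {"params": layers[1].parameters(), "lr": 2e-4 * 0.5},
])
```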

18 pages, 6402 KiB  
Article
Adaptive Deconvolution-Based Stereo Matching Net for Local Stereo Matching
by Xin Ma, Zhicheng Zhang, Danfeng Wang, Yu Luo and Hui Yuan
Appl. Sci. 2022, 12(4), 2086; https://0-doi-org.brum.beds.ac.uk/10.3390/app12042086 - 17 Feb 2022
Cited by 3 | Viewed by 1462
Abstract
In deep learning-based local stereo matching methods, larger image patches usually bring better matching accuracy. However, it is unrealistic to increase the patch size without restriction: arbitrarily extending it turns a local stereo matching method into a global one, and the matching accuracy saturates. We simplified the existing Siamese convolutional network by reducing the number of network parameters and propose an efficient CNN-based structure, the adaptive deconvolution-based stereo matching net (ADSM net), which adds deconvolution layers to learn how to enlarge the feature maps fed to the subsequent convolution layers. Experimental results on the KITTI 2012 and KITTI 2015 datasets demonstrate that the proposed method achieves a good trade-off between accuracy and complexity.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
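
The deconvolution idea can be sketched as a Siamese feature branch whose transposed convolution doubles the spatial resolution before matching; the channel counts and kernel sizes below are assumptions, not the ADSM net architecture.

```python
import torch
import torch.nn as nn

class FeatureBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # learned upsampling: doubles the feature map so later layers see a
        # larger effective patch without enlarging the input patch itself
        self.deconv = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)

    def forward(self, patch):
        return self.deconv(self.conv(patch))

branch = FeatureBranch()  # shared by both image patches in a Siamese setup
print(branch(torch.randn(4, 1, 9, 9)).shape)  # torch.Size([4, 32, 18, 18])
```

In a Siamese configuration the left and right patches pass through the same branch with shared weights, and the enlarged feature maps are then compared to compute matching costs.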

16 pages, 855 KiB  
Article
A Deep Attention Model for Action Recognition from Skeleton Data
by Yanbo Gao, Chuankun Li, Shuai Li, Xun Cai, Mao Ye and Hui Yuan
Appl. Sci. 2022, 12(4), 2006; https://0-doi-org.brum.beds.ac.uk/10.3390/app12042006 - 15 Feb 2022
Cited by 6 | Viewed by 1894
Abstract
This paper presents a new IndRNN-based deep attention model, termed DA-IndRNN, for skeleton-based action recognition, to effectively model the fact that different joints are usually of different degrees of importance to different action categories. The model consists of (a) a deep IndRNN as the main classification network, which overcomes the limitation of shallow RNN networks and obtains deeper and longer features, and (b) a deep attention network with multiple fully connected layers that estimates reliable attention weights. To train the DA-IndRNN, a new triplet loss function is proposed to guide the learning of attention across action categories. Specifically, this triplet loss enforces intra-class attention distances to be smaller than inter-class attention distances while still allowing multiple attention weight patterns to exist within the same class. The proposed DA-IndRNN can be trained end-to-end. Experiments on widely used datasets, including the NTU RGB+D dataset and the UOW Large-Scale Combined (LSC) dataset, demonstrate that the proposed method achieves better and more stable performance than state-of-the-art attention models.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
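
The attention triplet loss can be written almost directly from its verbal description: intra-class attention distances must undercut inter-class ones by a margin, which still leaves room for several attention patterns within one class. The margin value and the Euclidean distance below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_triplet_loss(anchor_att, positive_att, negative_att, margin: float = 0.2):
    # anchor and positive share an action class; the negative comes from another class
    d_intra = F.pairwise_distance(anchor_att, positive_att)
    d_inter = F.pairwise_distance(anchor_att, negative_att)
    # penalize only when the intra-class distance is not smaller by the margin
    return torch.clamp(d_intra - d_inter + margin, min=0).mean()
```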

Review


23 pages, 2927 KiB  
Review
Deep Learning for Video Application in Cooperative Vehicle-Infrastructure System: A Comprehensive Survey
by Beipo Su, Yongfeng Ju and Liang Dai
Appl. Sci. 2022, 12(12), 6283; https://0-doi-org.brum.beds.ac.uk/10.3390/app12126283 - 20 Jun 2022
Viewed by 1788
Abstract
Video applications are a research hotspot in cooperative vehicle-infrastructure systems (CVIS) and are closely related to traffic safety and the quality of user experience. Dealing with large datasets of feedback from complex environments is a challenge for traditional video application approaches, whereas the deep structure of deep learning can handle high-dimensional datasets and shows better performance on video application problems. The research value and significance of video applications over CVIS can therefore be better realized through deep learning. This survey first introduces the research status of traditional and deep learning-based video application methods over CVIS and classifies the existing deep learning-based methods according to generative and discriminative deep architectures. It then summarizes the main deep learning and deep reinforcement learning algorithms for video applications over CVIS and compares their performance. Finally, the challenges and development trends of deep learning in this field are explored and discussed.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)

Other


18 pages, 6215 KiB  
Study Protocol
Traffic Light Detection and Recognition Method Based on YOLOv5s and AlexNet
by Chuanxi Niu and Kexin Li
Appl. Sci. 2022, 12(21), 10808; https://0-doi-org.brum.beds.ac.uk/10.3390/app122110808 - 25 Oct 2022
Cited by 10 | Viewed by 3758
Abstract
Traffic light detection and recognition technology is of great importance for the development of driverless and vehicle-assisted driving systems. Since a single target detection algorithm suffers from lower detection accuracy and fewer detectable types, this paper adopts a detect-then-classify approach and proposes a method based on YOLOv5s target detection and AlexNet image classification to detect and identify traffic lights. The method first detects the traffic light area using YOLOv5s, then extracts the area and performs image processing operations on it, and finally feeds the processed image to AlexNet for recognition. In this way, the low recognition rate of a single-stage target detection algorithm on small targets can be avoided. Since the homemade dataset contains many low-light images, it is optimized using the Zero-DCE low-light enhancement algorithm; the network trained on the optimized dataset reaches 99.46% AP (average precision), 0.07% higher than before optimization, and the average accuracy on the traffic light recognition dataset reaches 87.75%. The experimental results show that the method has a high accuracy rate and can recognize many types of traffic lights, meeting the requirements of traffic light detection on actual roads.
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
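
The detect-then-classify pipeline has an almost one-to-one translation into code. The sketch below wires a pretrained YOLOv5s to an untrained AlexNet head and is purely illustrative: the file name, class count, and crop handling are assumptions, and the paper's trained weights are not reproduced here.

```python
import torch
from torchvision import models, transforms
from PIL import Image

detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
classifier = models.alexnet(num_classes=4)  # e.g. red / yellow / green / off
classifier.eval()

to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),          # AlexNet's expected input size
    transforms.ToTensor(),
])

image = Image.open("road_scene.jpg").convert("RGB")
with torch.no_grad():
    boxes = detector(image).xyxy[0]         # (n, 6): x1, y1, x2, y2, conf, cls
    for x1, y1, x2, y2, conf, cls in boxes.tolist():
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))  # detected light region
        logits = classifier(to_tensor(crop).unsqueeze(0))
        state = logits.argmax(dim=1).item() # predicted traffic light state
```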