
Image and Video Processing and Recognition Based on Artificial Intelligence-2nd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: closed (10 February 2023) | Viewed by 42012

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor
Division of Electronics and Electrical Engineering, Dongguk University, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
Interests: deep learning; biometrics; image processing


Special Issue Information

Dear Colleagues,

Recent developments have led to the widespread application of artificial intelligence (AI) techniques to image and video processing and recognition. Although state-of-the-art technology has matured, its performance is still affected by varying environmental conditions and heterogeneous databases. The purpose of this Special Issue is to invite high-quality, state-of-the-art academic papers on challenging issues in the field of AI-based image and video processing and recognition. We solicit original papers reporting unpublished, completed research that is not currently under review by any other conference, magazine, or journal. Topics of interest include, but are not limited to, the following:

  • AI-based image processing, understanding, recognition, compression, and reconstruction;
  • AI-based video processing, understanding, recognition, compression, and reconstruction;
  • Computer vision based on AI;
  • AI-based biometrics;
  • AI-based object detection and tracking;
  • Approaches that combine AI techniques and conventional methods for image and video processing and recognition;
  • Explainable AI (XAI) for image and video processing and recognition;
  • Generative adversarial network (GAN)-based image and video processing and recognition;
  • Approaches that combine AI techniques and blockchain methods for image and video processing and recognition.

Prof. Dr. Kang Ryoung Park
Prof. Dr. Sangyoun Lee
Prof. Dr. Euntai Kim
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image processing, understanding, recognition, compression, and reconstruction based on AI
  • video processing, understanding, recognition, compression, and reconstruction based on AI
  • computer vision based on AI
  • biometrics based on AI
  • fusion of AI and conventional methods
  • XAI and GAN
  • fusion of AI and blockchain methods


Published Papers (19 papers)


Research

15 pages, 2282 KiB  
Article
Unsupervised Video Summarization Based on Deep Reinforcement Learning with Interpolation
by Ui Nyoung Yoon, Myung Duk Hong and Geun-Sik Jo
Sensors 2023, 23(7), 3384; https://0-doi-org.brum.beds.ac.uk/10.3390/s23073384 - 23 Mar 2023
Cited by 1 | Viewed by 1661
Abstract
Individuals spend time on online video-sharing platforms searching for videos. Video summarization helps search through many videos efficiently and quickly. In this paper, we propose an unsupervised video summarization method based on deep reinforcement learning with an interpolation method. To train the video summarization network efficiently, we used the graph-level features and designed a reinforcement learning-based video summarization framework with a temporal consistency reward function and other reward functions. Our temporal consistency reward function helped to select keyframes uniformly. We present a lightweight video summarization network with transformer and CNN networks to capture the global and local contexts to efficiently predict the keyframe-level importance score of the video in a short length. The output importance score of the network was interpolated to fit the video length. Using the predicted importance score, we calculated the reward based on the reward functions, which helped select interesting keyframes efficiently and uniformly. We evaluated the proposed method on two datasets, SumMe and TVSum. The experimental results illustrate that the proposed method showed a state-of-the-art performance compared to the latest unsupervised video summarization methods, which we demonstrate and analyze experimentally. Full article
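
As an illustration of the interpolation step described in this abstract, the sketch below stretches keyframe-level importance scores predicted for a shortened sequence back to the full video length. It is a minimal NumPy example; the function and variable names are illustrative assumptions, not taken from the paper.

    import numpy as np

    def interpolate_scores(short_scores, video_length):
        """Linearly interpolate a short sequence of predicted importance
        scores so that every frame of the original video receives a score."""
        short_positions = np.linspace(0, video_length - 1, num=len(short_scores))
        frame_positions = np.arange(video_length)
        return np.interp(frame_positions, short_positions, short_scores)

    # Example: 8 predicted scores expanded to a 240-frame video.
    scores = np.array([0.1, 0.4, 0.9, 0.3, 0.2, 0.8, 0.6, 0.1])
    frame_scores = interpolate_scores(scores, 240)
    print(frame_scores.shape)  # (240,)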

24 pages, 14566 KiB  
Article
YOLO Series for Human Hand Action Detection and Classification from Egocentric Videos
by Hung-Cuong Nguyen, Thi-Hao Nguyen, Rafał Scherer and Van-Hung Le
Sensors 2023, 23(6), 3255; https://0-doi-org.brum.beds.ac.uk/10.3390/s23063255 - 20 Mar 2023
Cited by 5 | Viewed by 4520
Abstract
Hand detection and classification is a very important pre-processing step in building applications based on three-dimensional (3D) hand pose estimation and hand activity recognition. To automatically limit the hand data area on egocentric vision (EV) datasets, especially to see the development and performance of the “You Only Look Once” (YOLO) network over the past seven years, we propose a study comparing the efficiency of hand detection and classification based on the YOLO-family networks. This study is based on the following problems: (1) systematizing all architectures, advantages, and disadvantages of YOLO-family networks from version (v)1 to v7; (2) preparing ground-truth data for pre-trained models and evaluation models of hand detection and classification on EV datasets (FPHAB, HOI4D, RehabHand); (3) fine-tuning the hand detection and classification model based on the YOLO-family networks, hand detection, and classification evaluation on the EV datasets. Hand detection and classification results on the YOLOv7 network and its variations were the best across all three datasets. The results of the YOLOv7-w6 network are as follows: FPHAB is P = 97% with ThreshIOU = 0.5; HOI4D is P = 95% with ThreshIOU = 0.5; RehabHand is larger than 95% with ThreshIOU = 0.5; the processing speed of YOLOv7-w6 is 60 fps with a resolution of 1280 × 1280 pixels and that of YOLOv7 is 133 fps with a resolution of 640 × 640 pixels. Full article
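
The precision figures quoted at ThreshIOU = 0.5 follow the usual convention of counting a detection as correct when its intersection over union with a ground-truth box reaches the threshold. A minimal sketch of that check follows; the box format and function names are assumptions for illustration, not taken from the paper.

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def precision_at_iou(detections, ground_truths, thresh=0.5):
        """Fraction of detections that overlap some ground-truth box by >= thresh."""
        tp = sum(1 for d in detections
                 if any(iou(d, g) >= thresh for g in ground_truths))
        return tp / len(detections) if detections else 0.0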

14 pages, 7755 KiB  
Article
The Successive Next Network as Augmented Regularization for Deformable Brain MR Image Registration
by Meng Li, Shunbo Hu, Guoqiang Li, Fuchun Zhang, Jitao Li, Yue Yang, Lintao Zhang, Mingtao Liu, Yan Xu, Deqian Fu, Wenyin Zhang and Xing Wang
Sensors 2023, 23(6), 3208; https://0-doi-org.brum.beds.ac.uk/10.3390/s23063208 - 17 Mar 2023
Viewed by 1194
Abstract
Deep-learning-based registration methods can not only save time but also automatically extract deep features from images. In order to obtain better registration performance, many scholars use cascade networks to realize a coarse-to-fine registration progress. However, such cascade networks will increase network parameters by an n-times multiplication factor and entail long training and testing stages. In this paper, we only use a cascade network in the training stage. Unlike others, the role of the second network is to improve the registration performance of the first network and function as an augmented regularization term in the whole process. In the training stage, the mean squared error loss function between the dense deformation field (DDF) with which the second network has been trained and the zero field is added to constrain the learned DDF such that it tends to 0 at each position and to compel the first network to conceive of a better deformation field and improve the network’s registration performance. In the testing stage, only the first network is used to estimate a better DDF; the second network is not used again. The advantages of this kind of design are reflected in two aspects: (1) it retains the good registration performance of the cascade network; (2) it retains the time efficiency of the single network in the testing stage. The experimental results show that the proposed method effectively improves the network’s registration performance compared to other state-of-the-art methods. Full article
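
The augmented-regularization idea described here can be read as one combined training loss: an image-similarity term for the first network plus a mean-squared-error term that drives the second network's dense deformation field toward the zero field. The PyTorch-style sketch below is a schematic under that reading; the similarity term, names, and weight lambda are placeholders, not the authors' exact formulation.

    import torch

    def training_loss(warped_by_net1, fixed_image, ddf_net2, lam=1.0):
        """Training-only loss: similarity for the first network's warp plus an
        MSE term pushing the second network's dense deformation field (DDF)
        toward the zero field at every position."""
        similarity = torch.mean((warped_by_net1 - fixed_image) ** 2)  # placeholder similarity term
        ddf_regularizer = torch.mean(ddf_net2 ** 2)                   # MSE against the zero field
        return similarity + lam * ddf_regularizer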

19 pages, 7524 KiB  
Article
A Study on the Effectiveness of Deep Learning-Based Anomaly Detection Methods for Breast Ultrasonography
by Changhee Yun, Bomi Eom, Sungjun Park, Chanho Kim, Dohwan Kim, Farah Jabeen, Won Hwa Kim, Hye Jung Kim and Jaeil Kim
Sensors 2023, 23(5), 2864; https://0-doi-org.brum.beds.ac.uk/10.3390/s23052864 - 06 Mar 2023
Cited by 3 | Viewed by 1711
Abstract
In the medical field, it is difficult to anticipate good performance when using deep learning due to the lack of large-scale training data and class imbalance. In particular, it is difficult to make an accurate diagnosis with ultrasound, a key breast cancer diagnosis method, as the quality and interpretation of images can vary depending on the operator’s experience and proficiency. Therefore, computer-aided diagnosis technology can facilitate diagnosis by visualizing abnormal information such as tumors and masses in ultrasound images. In this study, we implemented deep learning-based anomaly detection methods for breast ultrasound images and validated their effectiveness in detecting abnormal regions. Herein, we specifically compared the sliced-Wasserstein autoencoder with two representative unsupervised learning models, the autoencoder and the variational autoencoder. The anomalous-region detection performance was estimated using the normal-region labels. Our experimental results showed that the sliced-Wasserstein autoencoder model outperformed the others in anomaly detection performance. However, anomaly detection using the reconstruction-based approach may not be effective because of the occurrence of numerous false-positive values. Reducing these false positives will be an important challenge in future studies. Full article
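
Reconstruction-based anomaly detection of the kind compared above generally scores each pixel by how poorly a model trained only on normal images reconstructs it. The sketch below illustrates that general idea; the autoencoder is left abstract and all names are illustrative, not the authors' implementation.

    import numpy as np

    def anomaly_map(image, autoencoder):
        """Pixel-wise anomaly score: absolute reconstruction error of a model
        trained only on normal (lesion-free) ultrasound images."""
        reconstruction = autoencoder(image)   # forward pass of any trained AE/VAE/SWAE
        return np.abs(image - reconstruction)

    def detect(image, autoencoder, threshold):
        """Binary anomaly mask; large errors suggest regions unlike the
        normal training data, e.g., masses or tumors."""
        return anomaly_map(image, autoencoder) > threshold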

15 pages, 2613 KiB  
Article
Facial Expression Recognition Robust to Occlusion and to Intra-Similarity Problem Using Relevant Subsampling
by Jieun Kim and Deokwoo Lee
Sensors 2023, 23(5), 2619; https://0-doi-org.brum.beds.ac.uk/10.3390/s23052619 - 27 Feb 2023
Cited by 1 | Viewed by 1722
Abstract
This paper proposes facial expression recognition (FER) with an in-the-wild data set. In particular, it chiefly deals with two issues, the occlusion and intra-similarity problems. The attention mechanism enables the most relevant areas of facial images to be used for specific expressions, and the triplet loss function addresses the intra-similarity problem, in which the same expression from different faces sometimes fails to be aggregated (and vice versa). The proposed approach to FER is robust to occlusion, and it uses a spatial transformer network (STN) with an attention mechanism to utilize the specific facial regions that dominantly contribute (or that are the most relevant) to particular facial expressions, e.g., anger, contempt, disgust, fear, joy, sadness, and surprise. In addition, the STN model is connected to the triplet loss function to improve the recognition rate, outperforming existing approaches that employ cross-entropy or rely only on deep neural networks or classical methods. The triplet loss module alleviates the intra-similarity problem, leading to further improvement in classification. Experimental results substantiate the proposed approach, which achieves higher recognition rates in more practical cases, e.g., under occlusion. Quantitatively, the proposed method achieves more than 2.09% higher accuracy than existing FER results on the CK+ data set and 0.48% higher accuracy than a modified ResNet model on the FER2013 data set. Full article
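
The triplet loss mentioned above is the standard margin formulation: it pulls an anchor expression embedding toward a positive sample of the same expression and pushes it away from a negative sample. A minimal NumPy sketch of that standard form follows (the embedding network itself is assumed; names are illustrative).

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """Standard triplet loss on embedding vectors: encourages
        d(anchor, positive) + margin <= d(anchor, negative)."""
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(0.0, d_pos - d_neg + margin)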

19 pages, 2438 KiB  
Article
A Feature-Trajectory-Smoothed High-Speed Model for Video Anomaly Detection
by Li Sun, Zhiguo Wang, Yujin Zhang and Guijin Wang
Sensors 2023, 23(3), 1612; https://0-doi-org.brum.beds.ac.uk/10.3390/s23031612 - 02 Feb 2023
Cited by 2 | Viewed by 1652
Abstract
High-speed detection of abnormal frames in surveillance videos is essential for security. This paper proposes a new video anomaly–detection model, namely, feature trajectory–smoothed long short-term memory (FTS-LSTM). This model trains an LSTM autoencoder network to generate future frames on normal video streams, and uses the FTS detector and generation error (GE) detector to detect anomalies on testing video streams. FTS loss is a new indicator in the anomaly–detection area. In the training stage, the model applies a feature trajectory smoothness (FTS) loss to constrain the LSTM layer. This loss enables the LSTM layer to learn the temporal regularity of video streams more precisely. In the detection stage, the model utilizes the FTS loss and the GE loss as two detectors to detect anomalies. By cascading the FTS detector and the GE detector to detect anomalies, the model achieves a high speed and competitive anomaly-detection performance on multiple datasets. Full article
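
One possible reading of the cascaded FTS/GE detection described above is a two-stage decision in which the inexpensive trajectory-smoothness score is checked first and the costlier generation-error score only when needed. The sketch below is a schematic under that assumption; the thresholds, score functions, and cascade policy are placeholders, not the paper's values.

    def is_anomalous(frame, fts_score_fn, ge_score_fn, fts_thresh, ge_thresh):
        """Cascaded anomaly decision: stage 1 uses the feature-trajectory
        smoothness (FTS) score; stage 2 uses the generation-error (GE) score
        only for frames the first stage does not already flag."""
        if fts_score_fn(frame) > fts_thresh:      # stage 1: FTS detector
            return True
        return ge_score_fn(frame) > ge_thresh     # stage 2: GE detector

    # Usage: is_anomalous(frame, fts_fn, ge_fn, fts_thresh=0.5, ge_thresh=0.5)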

11 pages, 22324 KiB  
Article
VERD: Emergence of Product-Based Video E-Commerce Retrieval Dataset from User’s Perspective
by Gwangjin Lee, Won Jo and Yukyung Choi
Sensors 2023, 23(1), 513; https://0-doi-org.brum.beds.ac.uk/10.3390/s23010513 - 03 Jan 2023
Cited by 1 | Viewed by 2395
Abstract
Customer demands for product search are growing as a result of the recent growth of the e-commerce market. According to this trend, studies on object-centric retrieval using product images have emerged, but it is difficult to respond to complex user-environment scenarios and a search requires a vast amount of data. In this paper, we propose the Video E-commerce Retrieval Dataset (VERD), which utilizes user-perspective videos. In addition, a benchmark and additional experiments are presented to demonstrate the need for independent research on product-centered video-based retrieval. VERD is publicly accessible for academic research and can be downloaded by contacting the author by email. Full article

16 pages, 5310 KiB  
Article
PRAGAN: Progressive Recurrent Attention GAN with Pretrained ViT Discriminator for Single-Image Deraining
by Bingcai Wei, Di Wang, Zhuang Wang and Liye Zhang
Sensors 2022, 22(24), 9587; https://0-doi-org.brum.beds.ac.uk/10.3390/s22249587 - 07 Dec 2022
Cited by 2 | Viewed by 1595
Abstract
Images captured in bad weather are not conducive to visual tasks. Rain streaks in rainy images will significantly affect the regular operation of imaging equipment; to solve this problem, using multiple neural networks is a trend. The ingenious integration of network structures allows for full use of the powerful representation and fitting abilities of deep learning to complete low-level visual tasks. In this study, we propose a generative adversarial network (GAN) with multiple attention mechanisms for image rain removal tasks. Firstly, to the best of our knowledge, we propose a pretrained vision transformer (ViT) as the discriminator in GAN for single-image rain removal for the first time. Secondly, we propose a neural network training method that can use a small amount of data for training while maintaining promising results and reliable visual quality. A large number of experiments prove the correctness and effectiveness of our method. Our proposed method achieves better results on synthetic and real image datasets than multiple state-of-the-art methods, even when using less training data. Full article

19 pages, 4200 KiB  
Article
A Novel Dynamic Bit Rate Analysis Technique for Adaptive Video Streaming over HTTP Support
by Ponnai Manogaran Ashok Kumar, Lakshmi Narayanan Arun Raj, B. Jyothi, Naglaa F. Soliman, Mohit Bajaj and Walid El-Shafai
Sensors 2022, 22(23), 9307; https://0-doi-org.brum.beds.ac.uk/10.3390/s22239307 - 29 Nov 2022
Cited by 1 | Viewed by 1667
Abstract
Recently, there has been an increase in research interest in the seamless streaming of video on top of Hypertext Transfer Protocol (HTTP) in cellular networks (3G/4G). The main challenges involved are the variation in available bit rates on the Internet caused by resource sharing and the dynamic nature of wireless communication channels. State-of-the-art techniques, such as Dynamic Adaptive Streaming over HTTP (DASH), support the streaming of stored video, but they suffer from the challenge of live video content due to fluctuating bit rate in the network. In this work, a novel dynamic bit rate analysis technique is proposed to model a client–server architecture using attention-based long short-term memory (A-LSTM) networks for solving the problem of smooth video streaming over HTTP networks. The proposed client system analyzes the bit rate dynamically, and a status report is sent to the server to adjust the ongoing session parameter. The server assesses the dynamics of the bit rate on the fly and calculates the status for each video sequence. The bit rate and buffer length are given as sequential inputs to the LSTM to produce feature vectors. These feature vectors are given different weights to produce updated feature vectors. These updated feature vectors are given to multi-layer feed-forward neural networks to predict six output class labels (144p, 240p, 360p, 480p, 720p, and 1080p). Finally, the proposed A-LSTM work is evaluated in real time using a code division multiple access evolution-data optimized network (CDMA2000 1xEV-DO Rev-A) with the help of an Internet dongle. Furthermore, the performance is analyzed with the full-reference quality metric of streaming video to validate our proposed work. Experimental results also show an average improvement of 37.53% in peak signal-to-noise ratio (PSNR) and 5.7% in structural similarity (SSIM) index over the commonly used buffer-filling technique during the live streaming of video. Full article
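
A compact PyTorch-style sketch of the attention-based LSTM classifier described above: bit-rate and buffer-length sequences go through an LSTM, the per-step outputs are re-weighted by learned attention, and a feed-forward head predicts one of the six resolution classes. Layer sizes and names are assumptions for illustration, not the authors' configuration.

    import torch
    import torch.nn as nn

    class ALSTMClassifier(nn.Module):
        """Illustrative attention-LSTM: sequences of (bit rate, buffer length)
        pairs are mapped to one of six resolution classes (144p ... 1080p)."""
        def __init__(self, input_size=2, hidden_size=64, num_classes=6):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
            self.attn = nn.Linear(hidden_size, 1)          # one attention score per time step
            self.head = nn.Sequential(nn.Linear(hidden_size, 64),
                                      nn.ReLU(),
                                      nn.Linear(64, num_classes))

        def forward(self, x):                              # x: (batch, time, 2)
            feats, _ = self.lstm(x)                        # (batch, time, hidden)
            weights = torch.softmax(self.attn(feats), dim=1)
            context = (weights * feats).sum(dim=1)         # attention-weighted features
            return self.head(context)                      # class logits

    model = ALSTMClassifier()
    logits = model(torch.randn(8, 20, 2))                  # 8 sessions, 20 time steps each
    print(logits.shape)                                    # torch.Size([8, 6])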

14 pages, 1359 KiB  
Article
A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning
by Xiaoyu Teng, Xiaolin Gui, Pan Xu, Jianglei Tong, Jian An, Yang Liu and Huilan Jiang
Sensors 2022, 22(21), 8275; https://0-doi-org.brum.beds.ac.uk/10.3390/s22218275 - 28 Oct 2022
Cited by 1 | Viewed by 1240
Abstract
Video summarization (VS) is a widely used technique for facilitating the effective reading, fast comprehension, and effective retrieval of video content. Certain properties of the new video data, such as a lack of prominent emphasis and a fuzzy theme development border, disturb the original thinking mode based on video feature information. Moreover, these properties introduce new challenges to the extraction of video depth and breadth features. In addition, the diversity of user requirements creates additional complications for more accurate keyframe screening. To overcome these challenges, this paper proposes a hierarchical spatial–temporal cross-attention scheme for video summarization based on contrastive learning. Graph attention networks (GAT) and the multi-head convolutional attention cell are used to extract local and depth features, while the GAT-adjusted bidirectional ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial–temporal cross-attention-based ConvLSTM is developed for merging hierarchical characteristics and achieving more accurate screening in clusters of similar keyframes. Verification experiments and comparative analysis demonstrate that our method outperforms state-of-the-art methods. Full article

21 pages, 3839 KiB  
Article
Cross-Sensor Fingerprint Enhancement Using Adversarial Learning and Edge Loss
by Ashwaq Alotaibi, Muhammad Hussain, Hatim AboAlSamh, Wadood Abdul and George Bebis
Sensors 2022, 22(18), 6973; https://0-doi-org.brum.beds.ac.uk/10.3390/s22186973 - 15 Sep 2022
Viewed by 1601
Abstract
A fingerprint sensor interoperability problem, or a cross-sensor matching problem, occurs when one type of sensor is used for enrolment and a different type for matching. Fingerprints captured for the same person using various sensor technologies have various types of noises and artifacts. This problem motivated us to develop an algorithm that can enhance fingerprints captured using different types of sensors and touch technologies. Inspired by the success of deep learning in various computer vision tasks, we formulate this problem as an image-to-image transformation designed using a deep encoder–decoder model. It is trained using two learning frameworks, i.e., conventional learning and adversarial learning based on a conditional Generative Adversarial Network (cGAN) framework. Since different types of edges form the ridge patterns in fingerprints, we employed edge loss to train the model for effective fingerprint enhancement. The designed method was evaluated on fingerprints from two benchmark cross-sensor fingerprint datasets, i.e., MOLF and FingerPass. To assess the quality of enhanced fingerprints, we employed two standard metrics commonly used: NBIS Fingerprint Image Quality (NFIQ) and Structural Similarity Index Metric (SSIM). In addition, we proposed a metric named Fingerprint Quality Enhancement Index (FQEI) for comprehensive evaluation of fingerprint enhancement algorithms. Effective fingerprint quality enhancement results were achieved regardless of the sensor type used, where this issue was not investigated in the related literature before. The results indicate that the proposed method outperforms the state-of-the-art methods. Full article
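
The edge loss mentioned above compares edge maps of the enhanced fingerprint and its ground truth so that ridge structure is preserved. The sketch below uses Sobel gradient magnitudes and an L1 distance as one plausible instantiation; the exact edge operator and norm used by the authors are not restated here, so treat the details as assumptions.

    import numpy as np
    from scipy.ndimage import sobel

    def edge_loss(enhanced, target):
        """L1 distance between gradient magnitudes of the enhanced and the
        ground-truth fingerprint images; penalizes loss of ridge edges."""
        def grad_mag(img):
            gx = sobel(img, axis=0)
            gy = sobel(img, axis=1)
            return np.hypot(gx, gy)
        return np.mean(np.abs(grad_mag(enhanced) - grad_mag(target)))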

20 pages, 1521 KiB  
Article
Learning Gait Representations with Noisy Multi-Task Learning
by Adrian Cosma and Emilian Radoi
Sensors 2022, 22(18), 6803; https://0-doi-org.brum.beds.ac.uk/10.3390/s22186803 - 08 Sep 2022
Cited by 3 | Viewed by 1563
Abstract
Gait analysis is proven to be a reliable way to perform person identification without relying on subject cooperation. Walking is a biometric that does not significantly change in short periods of time and can be regarded as unique to each person. So far, the study of gait analysis has focused mostly on identification and demographics estimation, without considering many of the pedestrian attributes that appearance-based methods rely on. In this work, alongside gait-based person identification, we explore pedestrian attribute identification solely from movement patterns. We propose DenseGait, the largest dataset for pretraining gait analysis systems, containing 217 K anonymized tracklets annotated automatically with 42 appearance attributes. DenseGait is constructed by automatically processing video streams and offers the full array of gait covariates present in the real world. We make the dataset available to the research community. Additionally, we propose GaitFormer, a transformer-based model that, after pretraining in a multi-task fashion on DenseGait, achieves 92.5% accuracy on CASIA-B and 85.33% on FVG, without utilizing any manually annotated data. This corresponds to a +14.2% and +9.67% accuracy increase compared to similar methods. Moreover, GaitFormer is able to accurately identify gender information and a multitude of appearance attributes utilizing only movement patterns. The code to reproduce the experiments is made publicly available. Full article

16 pages, 1007 KiB  
Article
Occluded Pedestrian-Attribute Recognition for Video Sensors Using Group Sparsity
by Geonu Lee, Kimin Yun and Jungchan Cho
Sensors 2022, 22(17), 6626; https://0-doi-org.brum.beds.ac.uk/10.3390/s22176626 - 01 Sep 2022
Cited by 1 | Viewed by 1368
Abstract
Pedestrians are often obstructed by other objects or people in real-world vision sensors. These obstacles make pedestrian-attribute recognition (PAR) difficult; hence, occlusion processing for visual sensing is a key issue in PAR. To address this problem, we first formulate the identification of non-occluded frames as temporal attention based on the sparsity of a crowded video. In other words, a model for PAR is guided to prevent paying attention to the occluded frame. However, we deduced that this approach cannot include a correlation between attributes when occlusion occurs. For example, “boots” and “shoe color” cannot be recognized simultaneously when the foot is invisible. To address the uncorrelated attention issue, we propose a novel temporal-attention module based on group sparsity. Group sparsity is applied across attention weights in correlated attributes. Accordingly, physically-adjacent pedestrian attributes are grouped, and the attention weights of a group are forced to focus on the same frames. Experimental results indicate that the proposed method achieved 1.18% and 6.21% higher F1-scores than the advanced baseline method on the occlusion samples in DukeMTMC-VideoReID and MARS video-based PAR datasets, respectively. Full article
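
The group-sparsity idea above amounts to an L2,1-style penalty over temporal attention weights: attributes in the same physical group share an L2 norm per frame, so their attention is encouraged to switch on and off for the same frames. A minimal NumPy sketch of that penalty follows; the grouping, shapes, and names are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    def group_sparsity_penalty(attn, groups):
        """L2,1-style penalty on temporal attention weights.
        attn: array of shape (num_attributes, num_frames).
        groups: list of attribute-index lists, e.g., [[0, 1], [2, 3]] for
        {boots, shoe color} and {hat, hairstyle}. For each group the L2 norm
        is taken across its attributes per frame and summed over frames, so
        grouped attributes are pushed to attend to the same frames."""
        penalty = 0.0
        for g in groups:
            per_frame = np.linalg.norm(attn[g, :], axis=0)  # L2 across the group, per frame
            penalty += per_frame.sum()
        return penalty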

17 pages, 9280 KiB  
Article
Content Swapping: A New Image Synthesis for Construction Sign Detection in Autonomous Vehicles
by Hongje Seong, Seunghyun Baik, Youngjo Lee, Suhyeon Lee and Euntai Kim
Sensors 2022, 22(9), 3494; https://0-doi-org.brum.beds.ac.uk/10.3390/s22093494 - 04 May 2022
Cited by 3 | Viewed by 1963
Abstract
Construction signs alert drivers to the dangers of abnormally blocked roads. In the case of autonomous vehicles, construction signs should be detected automatically to prevent accidents. One might think that we can accomplish the goal easily using the popular deep-learning-based detectors, but this is not the case. To train the deep learning detectors to detect construction signs, we need a large number of training images that contain construction signs. However, collecting training images including construction signs is very difficult in the real world because construction events do not occur frequently. To make matters worse, construction signs may come in dozens of different designs (i.e., contents). To address this problem, we propose a new method named content swapping. Our content swapping divides a construction sign into two parts: the board and the frame. Content swapping generates numerous synthetic construction signs by combining the board images (i.e., contents) taken from the in-domain images and the frames (i.e., geometric shapes) taken from the out-domain images. The generated synthetic construction signs are then added to the background road images via the cut-and-paste mechanism, increasing the number of training images. Furthermore, three fine-tuning methods regarding the region, size, and color of the construction signs are developed to make the generated training images look more realistic. To validate our approach, we applied our method to real-world images captured in South Korea. Finally, we achieve an average precision (AP50) score of 84.98%, which surpasses that of the off-the-shelf method by 9.15%. Full experimental results are available online as a supplemental video. The images used in the experiments are also released as a new dataset, CSS138, for the benefit of the autonomous driving community. Full article
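
The cut-and-paste step at the core of content swapping can be sketched with plain array operations: a board (content) image taken from an in-domain sign is placed inside the board region of a frame taken from an out-of-domain sign, and the composite is pasted onto a background road image. The sketch below is a simplified illustration with no region, size, or color fine-tuning; all names and the naive resizing are placeholders, not the authors' pipeline.

    import numpy as np

    def swap_content(frame_img, frame_board_box, board_img):
        """Insert a board (content) image into the board region of a frame image.
        frame_board_box: (top, left, height, width) of the board area in the frame."""
        t, l, h, w = frame_board_box
        composite = frame_img.copy()
        # naive nearest-neighbor resize by index sampling, to keep the sketch dependency-free
        ys = np.linspace(0, board_img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, board_img.shape[1] - 1, w).astype(int)
        composite[t:t + h, l:l + w] = board_img[ys][:, xs]
        return composite

    def paste_on_background(background, sign, top, left):
        """Cut-and-paste the synthetic sign onto a background road image."""
        out = background.copy()
        h, w = sign.shape[:2]
        out[top:top + h, left:left + w] = sign
        return out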

25 pages, 7124 KiB  
Article
Towards Enhancing Traffic Sign Recognition through Sliding Windows
by Muhammad Atif, Tommaso Zoppi, Mohamad Gharib and Andrea Bondavalli
Sensors 2022, 22(7), 2683; https://0-doi-org.brum.beds.ac.uk/10.3390/s22072683 - 31 Mar 2022
Cited by 10 | Viewed by 2829
Abstract
Automatic Traffic Sign Detection and Recognition (TSDR) provides drivers with critical information on traffic signs, and it constitutes an enabling condition for autonomous driving. Misclassifying even a single sign may constitute a severe hazard, which negatively impacts the environment, infrastructures, and human lives. Therefore, a reliable TSDR mechanism is essential to attain the safe circulation of road vehicles. Traffic Sign Recognition (TSR) techniques that use Machine Learning (ML) algorithms have been proposed, but no agreement on a preferred ML algorithm has been reached, nor has any existing solution always achieved perfect classification. Consequently, our study employs ML-based classifiers to build a TSR system that analyzes a sliding window of frames sampled by sensors on a vehicle. This TSR processes the most recent frame and past frames sampled by sensors through (i) Long Short-Term Memory (LSTM) networks and (ii) Stacking Meta-Learners, which allow for efficiently combining base-learning classification episodes into a unified and improved meta-level classification. Experimental results on publicly available datasets show that Stacking Meta-Learners dramatically reduce misclassifications of signs and achieve perfect classification on all three considered datasets. This shows the potential of our novel sliding-window-based approach to be used as an efficient solution for TSR. Full article
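
The stacking arrangement described above can be prototyped directly with scikit-learn: base learners classify the features gathered over the sliding window, and a meta-level learner combines their outputs. The sketch below is a minimal setup under that reading; the feature layout, window handling, and model choices are assumptions, not the paper's configuration.

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    # X: one row per sliding window, e.g., the concatenated per-frame features of
    # the most recent frame and its predecessors; y: the traffic-sign class label.
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                    ("knn", KNeighborsClassifier(n_neighbors=5))],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    # stack.fit(X_train, y_train)
    # y_pred = stack.predict(X_test)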

20 pages, 8074 KiB  
Article
Enhancing Detection Quality Rate with a Combined HOG and CNN for Real-Time Multiple Object Tracking across Non-Overlapping Multiple Cameras
by Lesole Kalake, Yanqiu Dong, Wanggen Wan and Li Hou
Sensors 2022, 22(6), 2123; https://0-doi-org.brum.beds.ac.uk/10.3390/s22062123 - 09 Mar 2022
Cited by 9 | Viewed by 2594
Abstract
Multi-object tracking in video surveillance is subject to illumination variation, blurring, motion, and similarity variations during the identification process in real-world practice. Previously proposed applications have difficulties in learning the appearances and differentiating the objects from sundry detections. They mostly rely heavily on local features and tend to lose vital global structured features such as contour features, which contributes to their inability to accurately detect, classify, or distinguish fooling images. In this paper, we propose a paradigm aimed at eliminating these tracking difficulties by enhancing the detection quality rate through the combination of a convolutional neural network (CNN) and a histogram of oriented gradients (HOG) descriptor. We trained the algorithm with input images of size 120 × 32, which were cleaned and converted into binary form to reduce the number of false positives. In testing, we eliminated the background from the frames and applied morphological operations and a Laplacian of Gaussian (LoG) mixture model after blob extraction. The images further underwent feature extraction and computation with the HOG descriptor to simplify the structural information of the objects in the captured video images. We stored the appearance features in an array and passed them into the network (CNN) for further processing. We applied and evaluated our algorithm for real-time multiple-object tracking on various city streets using the EPFL multi-camera pedestrian datasets. The experimental results illustrate that our proposed technique improves the detection rate and data associations, and our algorithm outperformed the online state-of-the-art approach by recording the highest precision and specificity rates. Full article

18 pages, 1972 KiB  
Article
An Efficient Approach Using Knowledge Distillation Methods to Stabilize Performance in a Lightweight Top-Down Posture Estimation Network
by Changhyun Park, Hean Sung Lee, Woo Jin Kim, Han Byeol Bae, Jaeho Lee and Sangyoun Lee
Sensors 2021, 21(22), 7640; https://0-doi-org.brum.beds.ac.uk/10.3390/s21227640 - 17 Nov 2021
Cited by 3 | Viewed by 2415
Abstract
Multi-person pose estimation has been gaining considerable interest due to its use in several real-world applications, such as activity recognition, motion capture, and augmented reality. Although the improvement of the accuracy and speed of multi-person pose estimation techniques has been recently studied, limitations still exist in balancing these two aspects. In this paper, a novel knowledge distilled lightweight top-down pose network (KDLPN) is proposed that balances computational complexity and accuracy. For the first time in multi-person pose estimation, a network that reduces computational complexity by applying a “Pelee” structure and shuffles pixels in the dense upsampling convolution layer to reduce the number of channels is presented. Furthermore, to prevent performance degradation because of the reduced computational complexity, knowledge distillation is applied to establish the pose estimation network as a teacher network. The method performance is evaluated on the MSCOCO dataset. Experimental results demonstrate that our KDLPN network significantly reduces 95% of the parameters required by state-of-the-art methods with minimal performance degradation. Moreover, our method is compared with other pose estimation methods to substantiate the importance of computational complexity reduction and its effectiveness. Full article
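
Knowledge distillation as used above typically trains the lightweight student against both the ground truth and the larger teacher's outputs. The PyTorch sketch below shows a common heatmap-level distillation loss for pose estimation; the weighting, names, and heatmap formulation are standard-practice assumptions, not the paper's exact recipe.

    import torch
    import torch.nn.functional as F

    def pose_distillation_loss(student_heatmaps, teacher_heatmaps, gt_heatmaps, alpha=0.5):
        """Weighted sum of the ordinary heatmap regression loss (against ground
        truth) and a distillation term that pulls the lightweight student's
        heatmaps toward those of the larger teacher network."""
        task_loss = F.mse_loss(student_heatmaps, gt_heatmaps)
        distill_loss = F.mse_loss(student_heatmaps, teacher_heatmaps)
        return alpha * task_loss + (1.0 - alpha) * distill_loss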

24 pages, 1925 KiB  
Article
Deep-Learning-Based Stress Recognition with Spatial-Temporal Facial Information
by Taejae Jeon, Han Byeol Bae, Yongju Lee, Sungjun Jang and Sangyoun Lee
Sensors 2021, 21(22), 7498; https://0-doi-org.brum.beds.ac.uk/10.3390/s21227498 - 11 Nov 2021
Cited by 5 | Viewed by 3022
Abstract
In recent times, as interest in stress control has increased, many studies on stress recognition have been conducted. Several studies have been based on physiological signals, but the disadvantage of this strategy is that it requires physiological-signal-acquisition devices. Another strategy employs facial-image-based stress-recognition methods, which do not require devices, but predominantly use handcrafted features. However, such features have low discriminating power. We propose a deep-learning-based stress-recognition method using facial images to address these challenges. Given that deep-learning methods require extensive data, we constructed a large-capacity image database for stress recognition. Furthermore, we used temporal attention, which assigns a high weight to frames that are highly related to stress, as well as spatial attention, which assigns a high weight to regions that are highly related to stress. By adding a network that inputs the facial landmark information closely related to stress, we supplemented the network that receives only facial images as the input. Experimental results on our newly constructed database indicated that the proposed method outperforms contemporary deep-learning-based recognition methods. Full article

33 pages, 11871 KiB  
Article
Restoration of Motion Blurred Image by Modified DeblurGAN for Enhancing the Accuracies of Finger-Vein Recognition
by Jiho Choi, Jin Seong Hong, Muhammad Owais, Seung Gu Kim and Kang Ryoung Park
Sensors 2021, 21(14), 4635; https://0-doi-org.brum.beds.ac.uk/10.3390/s21144635 - 06 Jul 2021
Cited by 13 | Viewed by 3018
Abstract
Among the many available biometric identification methods, finger-vein recognition has the advantages of being difficult to counterfeit, as finger veins are located under the skin, and of high user convenience, as a non-invasive image-capturing device is used for recognition. However, blurring can occur when acquiring finger-vein images, and such blur can be mainly categorized into three types: first, skin scattering blur due to light scattering in the skin layer; second, optical blur due to lens focus mismatching; and third, motion blur due to finger movements. Images degraded by these kinds of blur can significantly reduce finger-vein recognition performance. Therefore, restoration of blurred finger-vein images is necessary. Most previous studies have addressed the restoration of skin-scattering-blurred images, and some have addressed the restoration of optically blurred images. However, there has been no research on restoring the motion-blurred finger-vein images that can occur in actual environments. To address this problem, this study proposes a new method for improving finger-vein recognition performance by restoring motion-blurred finger-vein images using a modified deblur generative adversarial network (modified DeblurGAN). Based on an experiment conducted using two open databases, the Shandong University homologous multi-modal traits (SDUMLA-HMT) finger-vein database and the Hong Kong Polytechnic University finger-image database version 1, the proposed method demonstrates performance that is better than that obtained using state-of-the-art methods. Full article
