Deep Learning-Based Action Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (15 November 2021) | Viewed by 44300

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editor


Dr. Hyo Jong Lee
Guest Editor
Department of Computer Science & Engineering, Jeonbuk National University, Jeonju 54896, Korea
Interests: image processing; pattern recognition; artificial intelligence

Special Issue Information

Dear Colleagues,

Human action recognition (HAR) has gained popularity because it can be used for numerous applications, such as healthcare services, video surveillance, and human–computer interaction. The key to good human action recognition is robust human action modeling and feature representation. Conventional shallow learning algorithms, such as support vector machines and random forests, require the manual extraction of representative features from large sensory data. However, manual feature extraction requires prior knowledge and will inevitably lose implicit features.

Recently, deep learning has achieved great success in many challenging research areas, such as image classification and object detection. The greatest advantage of deep learning is its ability to automatically learn representative features from large-scale data. Now, more and more researchers have also been applying deep learning to human action recognition. However, many challenging research problems in terms of accuracy, device heterogeneity, scene changes, and others remain unsolved.

This Special Issue aims to present state-of-the-art methods in deep learning for human action recognition. We invite researchers to submit research papers to this Issue on Deep Learning-Based Action Recognition.

Dr. Hyo Jong Lee
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image/video-based HAR using deep learning
  • multimodal fusion for HAR using deep learning
  • fusion of shallow models with deep networks for HAR
  • device heterogeneity for device-based HAR
  • scene changes for device-free HAR
  • transfer learning for HAR
  • online learning for HAR
  • semi-supervised learning for HAR
  • surveys of deep learning-based HAR

Published Papers (13 papers)

Editorial

3 pages, 159 KiB  
Editorial
Special Issue on Deep Learning-Based Action Recognition
by Hyo Jong Lee
Appl. Sci. 2022, 12(15), 7834; https://0-doi-org.brum.beds.ac.uk/10.3390/app12157834 - 04 Aug 2022
Viewed by 1172
Abstract
Human action recognition (HAR) has gained popularity because of its various applications, such as human–object interaction [...]

Research

15 pages, 937 KiB  
Article
Action Recognition Algorithm of Spatio–Temporal Differential LSTM Based on Feature Enhancement
by Kai Hu, Fei Zheng, Liguo Weng, Yiwu Ding and Junlan Jin
Appl. Sci. 2021, 11(17), 7876; https://0-doi-org.brum.beds.ac.uk/10.3390/app11177876 - 26 Aug 2021
Cited by 15 | Viewed by 1902
Abstract
The Long Short-Term Memory (LSTM) network is a classic action recognition method because of its ability to extract time information. Researchers have proposed many hybrid algorithms based on LSTM for human action recognition. In this paper, an improved Spatio–Temporal Differential Long Short-Term Memory (ST-D LSTM) network is proposed, in which an enhanced input differential feature module and a spatial memory state differential module are added to the network. Furthermore, a transmission mode of ST-D LSTM is proposed; this mode enables ST-D LSTM units to transmit the spatial memory state horizontally. Finally, these improvements are added to the classical Long-term Recurrent Convolutional Network (LRCN) to test the new network’s performance. Experimental results show that ST-D LSTM can effectively improve the accuracy of LRCN.
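
As a rough illustration of the differential-feature idea (not the authors' ST-D LSTM, whose spatial memory modules are more involved), the sketch below concatenates per-frame CNN features with their first-order temporal differences before a single LSTM; the class name, feature dimension, and class count are assumptions.

```python
# Hypothetical sketch (not the authors' implementation): an LSTM that receives
# per-frame CNN features together with their first-order temporal differences,
# one simple way to realize a "differential input feature" stream.
import torch
import torch.nn as nn

class DifferentialInputLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        # input = original features concatenated with their temporal difference
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden_dim,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):                        # feats: (batch, time, feat_dim)
        diff = feats[:, 1:] - feats[:, :-1]          # frame-to-frame differences
        diff = torch.cat([torch.zeros_like(feats[:, :1]), diff], dim=1)
        x = torch.cat([feats, diff], dim=-1)         # (batch, time, 2 * feat_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                   # classify from the last time step

feats = torch.randn(4, 16, 512)                      # 4 clips, 16 frames each
logits = DifferentialInputLSTM()(feats)
print(logits.shape)                                  # torch.Size([4, 101])
```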

15 pages, 19560 KiB  
Article
Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition
by Jinsoo Kim and Jeongho Cho
Appl. Sci. 2021, 11(11), 4940; https://0-doi-org.brum.beds.ac.uk/10.3390/app11114940 - 27 May 2021
Cited by 6 | Viewed by 2381
Abstract
The field of research related to video data has difficulty in extracting not only spatial but also temporal features, and human action recognition (HAR) is a representative field of research that applies convolutional neural networks (CNNs) to video data. The performance of action recognition has improved, but owing to the complexity of the model, some limitations to real-time operation still persist. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that make up the video and uses the frame change rate of sequential images as time information. Spatial feature maps are weighted-averaged by frame change rate, transformed into spatiotemporal features, and input into multilayer perceptrons, which have a relatively lower complexity than other HAR models; thus, our method has high utility in a single embedded system connected to CCTV. The results of evaluating action recognition accuracy and data processing speed on the challenging action recognition benchmark UCF-101 showed higher action recognition accuracy than an HAR model using long short-term memory with a small number of video frames and confirmed the possibility of real-time operation through fast data processing speed. In addition, the performance of the proposed weighted-mean-based HAR model was verified by testing it on a Jetson Nano to confirm the possibility of using it in low-cost GPU-based embedded systems.
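
A minimal sketch of the weighted-average step described above, assuming pooled per-frame CNN features and a simple absolute frame-difference measure as the change rate; function names and shapes are illustrative, not taken from the paper.

```python
# Illustrative sketch only (assumed details): spatial feature maps from a 2D CNN
# are weighted-averaged using the per-frame change rate of the input images,
# giving a single spatiotemporal feature vector for an MLP classifier.
import torch

def frame_change_weights(frames):
    """frames: (time, H, W) grayscale clip -> normalized change-rate weights."""
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2))   # (time - 1,)
    diffs = torch.cat([diffs[:1], diffs])                       # pad the first frame
    return diffs / diffs.sum().clamp(min=1e-8)                  # weights sum to 1

def weighted_spatiotemporal_feature(feature_maps, weights):
    """feature_maps: (time, C) pooled CNN features; weights: (time,)."""
    return (feature_maps * weights[:, None]).sum(dim=0)         # (C,)

frames = torch.rand(16, 112, 112)          # 16-frame clip
feats = torch.rand(16, 256)                # pooled CNN feature per frame
w = frame_change_weights(frames)
clip_feature = weighted_spatiotemporal_feature(feats, w)
print(clip_feature.shape)                  # torch.Size([256])
```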

13 pages, 1760 KiB  
Article
3D Skeletal Joints-Based Hand Gesture Spotting and Classification
by Ngoc-Hoang Nguyen, Tran-Dac-Thinh Phan, Soo-Hyung Kim, Hyung-Jeong Yang and Guee-Sang Lee
Appl. Sci. 2021, 11(10), 4689; https://0-doi-org.brum.beds.ac.uk/10.3390/app11104689 - 20 May 2021
Cited by 5 | Viewed by 2711
Abstract
This paper presents a novel approach to continuous dynamic hand gesture recognition. Our approach contains two main modules: gesture spotting and gesture classification. Firstly, the gesture spotting module pre-segments the video sequence with continuous gestures into isolated gestures. Secondly, the gesture classification module identifies the segmented gestures. In the gesture spotting module, the motion of the hand palm and fingers is fed into a Bidirectional Long Short-Term Memory (Bi-LSTM) network for gesture spotting. In the gesture classification module, three residual 3D Convolutional Neural Networks based on ResNet architectures (3D_ResNet) and one Long Short-Term Memory (LSTM) network are combined to efficiently utilize multiple data channels such as RGB, optical flow, depth, and 3D positions of key joints. The promising performance of our approach is demonstrated through experiments conducted on three public datasets: the Chalearn LAP ConGD dataset, 20BN-Jester, and the NVIDIA Dynamic Hand Gesture dataset. Our approach outperforms the state-of-the-art methods on the Chalearn LAP ConGD dataset.
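
The spotting stage can be pictured as per-frame binary labeling followed by run extraction. The sketch below is a hypothetical approximation, assuming 3D hand-joint coordinates as input and PyTorch's nn.LSTM; it is not the authors' network.

```python
# A minimal, hypothetical sketch of the spotting idea: a bidirectional LSTM
# labels every frame as "gesture" or "no gesture" from hand-motion features,
# and contiguous positive runs are cut out as isolated gesture segments.
import torch
import torch.nn as nn

class GestureSpotter(nn.Module):
    def __init__(self, in_dim=63, hidden=128):        # e.g., 21 hand joints x 3 coords
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)           # gesture / non-gesture

    def forward(self, x):                               # x: (batch, time, in_dim)
        out, _ = self.bilstm(x)
        return self.head(out)                           # per-frame logits

def segments_from_labels(labels):
    """Turn a per-frame 0/1 list into (start, end) index pairs."""
    segs, start = [], None
    for t, v in enumerate(labels + [0]):                # sentinel closes the last run
        if v == 1 and start is None:
            start = t
        elif v == 0 and start is not None:
            segs.append((start, t - 1))
            start = None
    return segs

spotter = GestureSpotter()
logits = spotter(torch.randn(1, 30, 63))               # 30 frames of hand motion
print(logits.shape)                                     # torch.Size([1, 30, 2])
print(segments_from_labels([0, 1, 1, 1, 0, 0, 1, 1]))   # [(1, 3), (6, 7)]
```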

14 pages, 16344 KiB  
Article
A New Multi-Person Pose Estimation Method Using the Partitioned CenterPose Network
by Jiahua Wu and Hyo-Jong Lee
Appl. Sci. 2021, 11(9), 4241; https://0-doi-org.brum.beds.ac.uk/10.3390/app11094241 - 07 May 2021
Cited by 3 | Viewed by 2783
Abstract
In bottom-up multi-person pose estimation, grouping joint candidates into the appropriately structured corresponding instance of a person is challenging. In this paper, a new bottom-up method, the Partitioned CenterPose (PCP) Network, is proposed to better cluster the detected joints. To achieve this goal, we propose a novel approach called Partition Pose Representation (PPR) which integrates the instance of a person and its body joints based on joint offset. PPR leverages information about the center of the human body and the offsets between that center point and the positions of the body’s joints to encode human poses accurately. To enhance the relationships between body joints, we divide the human body into five parts, and then, we generate a sub-PPR for each part. Based on this PPR, the PCP Network can detect people and their body joints simultaneously, then group all body joints according to joint offset. Moreover, an improved l1 loss is designed to more accurately measure joint offset. Using the COCO keypoints and CrowdPose datasets for testing, it was found that the performance of the proposed method is on par with that of existing state-of-the-art bottom-up methods in terms of accuracy and speed.
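
The center-plus-offset encoding at the heart of PPR can be illustrated with a few lines of NumPy: joints are stored as displacements from a body center and recovered by adding the center back. The body center here is simply the mean of the joints, which is an assumption made purely for illustration.

```python
# Hedged sketch of a center-plus-offset pose encoding (names and shapes are
# assumptions, not the PCP Network itself): each joint is represented by its
# displacement from the person's center point, so detected joints can later be
# grouped back to that center.
import numpy as np

def encode_offsets(joints):
    """joints: (num_joints, 2) pixel coordinates -> center and per-joint offsets."""
    center = joints.mean(axis=0)                # body center as the mean of joints
    offsets = joints - center                   # offset of every joint from the center
    return center, offsets

def decode_joints(center, offsets):
    """Recover absolute joint positions from the center and offsets."""
    return center + offsets

joints = np.array([[120.0, 80.0], [130.0, 150.0], [110.0, 150.0], [125.0, 220.0]])
center, offsets = encode_offsets(joints)
assert np.allclose(decode_joints(center, offsets), joints)
print(center, offsets[0])
```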

31 pages, 40294 KiB  
Article
Translating Videos into Synthetic Training Data for Wearable Sensor-Based Activity Recognition Systems Using Residual Deep Convolutional Networks
by Vitor Fortes Rey, Kamalveer Kaur Garewal and Paul Lukowicz
Appl. Sci. 2021, 11(7), 3094; https://0-doi-org.brum.beds.ac.uk/10.3390/app11073094 - 31 Mar 2021
Cited by 9 | Viewed by 2905
Abstract
Human activity recognition (HAR) using wearable sensors has benefited much less from recent advances in Deep Learning than fields such as computer vision and natural language processing. This is, to a large extent, due to the lack of large scale (as compared to computer vision) repositories of labeled training data for sensor-based HAR tasks. Thus, for example, ImageNet has images for around 100,000 categories (based on WordNet) with on average 1000 images per category (therefore up to 100,000,000 samples). The Kinetics-700 video activity data set has 650,000 video clips covering 700 different human activities (in total over 1800 h). By contrast, the total length of all sensor-based HAR data sets in the popular UCI machine learning repository is less than 63 h, with around 38 of those consisting of simple mode of locomotion activities like walking, standing or cycling. In our research we aim to facilitate the use of online videos, which exist in ample quantities for most activities and are much easier to label than sensor data, to simulate labeled wearable motion sensor data. In previous work we already demonstrated some preliminary results in this direction, focusing on very simple, activity specific simulation models and a single sensor modality (acceleration norm). In this paper, we show how we can train a regression model on generic motions for both accelerometer and gyro signals and then apply it to videos of the target activities to generate synthetic Inertial Measurement Units (IMU) data (acceleration and gyro norms) that can be used to train and/or improve HAR models. We demonstrate that systems trained on simulated data generated by our regression model can come to within around 10% of the mean F1 score of a system trained on real sensor data. Furthermore, we show that by either including a small amount of real sensor data for model calibration or simply leveraging the fact that (in general) we can easily generate much more simulated data from video than we can collect its real version, the advantage of the latter can eventually be equalized.
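
A highly simplified sketch of the video-to-IMU idea, under assumptions that differ from the paper: a wrist trajectory from video pose estimation is converted to velocity and acceleration norms, and a plain ridge regressor stands in for the deep regression model that maps them to sensor norms.

```python
# Assumption-laden sketch of the general pipeline, not the authors' model:
# wrist trajectories extracted from video pose estimation are turned into
# motion features, and a regression model maps them to simulated IMU
# (acceleration) norms usable as synthetic training data. scikit-learn's
# Ridge is used here purely for illustration; the trajectory and targets are fake.
import numpy as np
from sklearn.linear_model import Ridge

def motion_features(traj, fps=30.0):
    """traj: (T, 3) wrist positions from video -> per-frame velocity/accel norms."""
    vel = np.gradient(traj, 1.0 / fps, axis=0)
    acc = np.gradient(vel, 1.0 / fps, axis=0)
    return np.stack([np.linalg.norm(vel, axis=1), np.linalg.norm(acc, axis=1)], axis=1)

rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=(300, 3)) * 0.01, axis=0)    # fake wrist trajectory
X = motion_features(traj)
y = X[:, 1] * 9.81 + rng.normal(scale=0.1, size=300)           # fake "real IMU" norms

model = Ridge(alpha=1.0).fit(X, y)                             # learn video -> IMU mapping
simulated_imu = model.predict(X)                               # synthetic sensor signal
print(simulated_imu[:5])
```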

24 pages, 5738 KiB  
Article
Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints
by Nusrat Tasnim, Mohammad Khairul Islam and Joong-Hwan Baek
Appl. Sci. 2021, 11(6), 2675; https://0-doi-org.brum.beds.ac.uk/10.3390/app11062675 - 17 Mar 2021
Cited by 34 | Viewed by 3967
Abstract
Human activity recognition has become a significant research trend in the fields of computer vision, image processing, and human–machine or human–object interaction due to cost-effectiveness, time management, rehabilitation, and the pandemic of diseases. Over the past years, several methods have been published for human action recognition using RGB (red, green, and blue), depth, and skeleton datasets. Most of the methods introduced for action classification using skeleton datasets are constrained in some perspectives, including feature representation, complexity, and performance. However, providing an effective and efficient method for human action discrimination using a 3D skeleton dataset remains a challenging problem. There is a lot of room to map the 3D skeleton joint coordinates into spatio-temporal formats to reduce the complexity of the system, to provide a more accurate system to recognize human behaviors, and to improve the overall performance. In this paper, we suggest a spatio-temporal image formation (STIF) technique of 3D skeleton joints by capturing spatial information and temporal changes for action discrimination. We conduct transfer learning (pretrained models MobileNetV2, DenseNet121, and ResNet18, trained with the ImageNet dataset) to extract discriminative features and evaluate the proposed method with several fusion techniques. We mainly investigate the effect of three fusion methods, namely element-wise average, multiplication, and maximization, on the performance variation in human action recognition. Our deep learning-based method outperforms prior works using UTD-MHAD (University of Texas at Dallas multi-modal human action dataset) and MSR-Action3D (Microsoft action 3D), publicly available benchmark 3D skeleton datasets, with STIF representation. We attain accuracies of approximately 98.93%, 99.65%, and 98.80% for UTD-MHAD and 96.00%, 98.75%, and 97.08% for MSR-Action3D skeleton datasets using MobileNetV2, DenseNet121, and ResNet18, respectively.
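
The three fusion schemes compared above reduce to element-wise operations over the class-probability outputs of the individual backbones. The sketch below stubs the backbones with random softmax scores; the class count and tensor shapes are arbitrary assumptions.

```python
# Minimal sketch of score fusion by element-wise average, multiplication, and
# maximization over per-model class probabilities; the backbones themselves
# are replaced with random tensors here.
import torch

def fuse(scores, mode="average"):
    """scores: list of (batch, num_classes) probability tensors."""
    stacked = torch.stack(scores, dim=0)
    if mode == "average":
        return stacked.mean(dim=0)
    if mode == "multiply":
        return stacked.prod(dim=0)
    if mode == "max":
        return stacked.max(dim=0).values
    raise ValueError(mode)

# e.g., softmax outputs of MobileNetV2-, DenseNet121-, and ResNet18-based branches
scores = [torch.softmax(torch.randn(8, 27), dim=1) for _ in range(3)]
for mode in ("average", "multiply", "max"):
    pred = fuse(scores, mode).argmax(dim=1)
    print(mode, pred.shape)                     # torch.Size([8]) for each scheme
```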

23 pages, 7340 KiB  
Article
Recognition of Hand Gesture Sequences by Accelerometers and Gyroscopes
by Yen-Cheng Chu, Yun-Jie Jhang, Tsung-Ming Tai and Wen-Jyi Hwang
Appl. Sci. 2020, 10(18), 6507; https://0-doi-org.brum.beds.ac.uk/10.3390/app10186507 - 18 Sep 2020
Cited by 16 | Viewed by 4100
Abstract
The objective of this study is to present novel neural network (NN) algorithms and systems for sensor-based hand gesture recognition. The algorithms are able to accurately classify a sequence of hand gestures from the sensory data produced by accelerometers and gyroscopes. They are extensions of the PairNet, which is a Convolutional Neural Network (CNN) capable of carrying out simple pairing operations with low computational complexity. Three different types of feedforward NNs, termed Residual PairNet, PairNet with Inception, and Residual PairNet with Inception, are proposed for the extension. They are the PairNet operating in conjunction with short-cut connections and/or inception modules for achieving high classification accuracy and low computational complexity. A prototype system based on smartphones for remote control of home appliances has been implemented for the performance evaluation. Experimental results reveal that the PairNet has superior classification accuracy over its basic CNN and Recurrent NN (RNN) counterparts. Furthermore, the Residual PairNet, PairNet with Inception, and Residual PairNet with Inception are able to further improve the classification hit rate and/or reduce recognition time for hand gesture recognition.

15 pages, 4270 KiB  
Article
Lightweight Stacked Hourglass Network for Human Pose Estimation
by Seung-Taek Kim and Hyo Jong Lee
Appl. Sci. 2020, 10(18), 6497; https://0-doi-org.brum.beds.ac.uk/10.3390/app10186497 - 17 Sep 2020
Cited by 20 | Viewed by 7639
Abstract
Human pose estimation is a problem that continues to be one of the greatest challenges in the field of computer vision. While the stacked structure of an hourglass network has enabled substantial progress in human pose estimation and key-point detection areas, it is largely used as a backbone network. However, it also requires a relatively large number of parameters and high computational capacity due to the characteristics of its stacked structure. Accordingly, the present work proposes a more lightweight version of the hourglass network, which also improves the human pose estimation performance. The new hourglass network architecture utilizes several additional skip connections, which improve performance with minimal modifications while still maintaining the number of parameters in the network. Additionally, the size of the convolutional receptive field has a decisive effect in learning to detect features of the full human body. Therefore, we propose a multidilated light residual block, which expands the convolutional receptive field while also reducing the computational load. The proposed residual block is also invariant in scale when using multiple dilations. The well-known MPII and LSP human pose datasets were used to evaluate the performance using the proposed method. A variety of experiments were conducted that confirm that our method is more efficient compared to current state-of-the-art hourglass weight-reduction methods.
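
As a hedged sketch of what a multi-dilated residual block can look like (the channel split, dilation rates, and projection layer are assumptions, not the paper's exact design), the parallel branches below use different dilation rates so the receptive field grows while spatial resolution and parameter count stay modest.

```python
# Illustrative multi-dilated residual block: parallel 3x3 branches with
# different dilation rates, concatenated, projected back, and added to the input.
import torch
import torch.nn as nn

class MultiDilatedResidualBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),   # padding = dilation keeps size
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.project = nn.Conv2d(branch_ch * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.project(out)            # residual connection

x = torch.randn(2, 64, 64, 64)
print(MultiDilatedResidualBlock()(x).shape)     # torch.Size([2, 64, 64, 64])
```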

28 pages, 2519 KiB  
Article
Robust Hand Shape Features for Dynamic Hand Gesture Recognition Using Multi-Level Feature LSTM
by Nhu-Tai Do, Soo-Hyung Kim, Hyung-Jeong Yang and Guee-Sang Lee
Appl. Sci. 2020, 10(18), 6293; https://0-doi-org.brum.beds.ac.uk/10.3390/app10186293 - 10 Sep 2020
Cited by 15 | Viewed by 4159
Abstract
This study builds robust hand shape features from the two modalities of depth and skeletal data for the dynamic hand gesture recognition problem. For the hand skeleton shape approach, we use the movement, the rotations of the hand joints with respect to their neighbors, and the skeletal point-cloud to learn the 3D geometric transformation. For the hand depth shape approach, we use the feature representation from the hand component segmentation model. Finally, we propose a multi-level feature LSTM with Conv1D, the Conv2D pyramid, and the LSTM block to deal with the diversity of hand features. Therefore, we propose a novel method by exploiting robust skeletal point-cloud features from skeletal data, as well as depth shape features from the hand component segmentation model in order for the multi-level feature LSTM model to benefit from both. Our proposed method achieves the best result on the Dynamic Hand Gesture Recognition (DHG) dataset with 14 and 28 classes for both depth and skeletal data with accuracies of 96.07% and 94.40%, respectively.

14 pages, 10317 KiB  
Article
Learning Class-Specific Features with Class Regularization for Videos
by Alexandros Stergiou, Ronald Poppe and Remco C. Veltkamp
Appl. Sci. 2020, 10(18), 6241; https://0-doi-org.brum.beds.ac.uk/10.3390/app10186241 - 08 Sep 2020
Cited by 1 | Viewed by 2336
Abstract
One of the main principles of Deep Convolutional Neural Networks (CNNs) is the extraction of useful features through a hierarchy of kernel operations. The kernels are not explicitly tailored to address specific target classes but are rather optimized as general feature extractors. Distinction between classes is typically left until the very last fully-connected layers. Consequently, variances between classes that are relatively similar are treated the same way as variations between classes that exhibit great dissimilarities. In order to directly address this problem, we introduce Class Regularization, a novel method that can regularize feature map activations based on the classes of the examples used. Essentially, we amplify or suppress activations based on an educated guess of the given class. We can apply this step to each minibatch of activation maps, at different depths in the network. We demonstrate that this improves feature search during training, leading to systematic performance gains on the Kinetics, UCF-101, and HMDB-51 datasets. Moreover, Class Regularization establishes an explicit correlation between features and class, which makes it a perfect tool to visualize class-specific features at various network depths.
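
One way to picture the amplify-or-suppress step is a class-conditioned per-channel scaling of the activation maps, driven by an intermediate class guess. The sketch below follows that reading of the description; it is an assumption-laden stand-in, not the exact Class Regularization formulation.

```python
# Hypothetical sketch of class-conditioned feature scaling: an intermediate
# class guess selects a per-channel weight vector that amplifies or suppresses
# channels of the activation maps at a given network depth.
import torch
import torch.nn as nn

class ClassConditionedScaling(nn.Module):
    def __init__(self, channels=256, num_classes=101):
        super().__init__()
        self.guess = nn.Linear(channels, num_classes)              # intermediate class guess
        self.class_weights = nn.Embedding(num_classes, channels)   # one scale vector per class

    def forward(self, feats):                                      # feats: (B, C, T, H, W)
        pooled = feats.mean(dim=(2, 3, 4))                         # global pooling -> (B, C)
        cls = self.guess(pooled).argmax(dim=1)                     # educated guess of the class
        scale = torch.sigmoid(self.class_weights(cls))             # (B, C) values in (0, 1)
        return feats * scale[:, :, None, None, None]               # amplify / suppress channels

feats = torch.randn(4, 256, 8, 14, 14)
print(ClassConditionedScaling()(feats).shape)                      # torch.Size([4, 256, 8, 14, 14])
```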

13 pages, 2641 KiB  
Article
Gesture Recognition Based on 3D Human Pose Estimation and Body Part Segmentation for RGB Data Input
by Ngoc-Hoang Nguyen, Tran-Dac-Thinh Phan, Guee-Sang Lee, Soo-Hyung Kim and Hyung-Jeong Yang
Appl. Sci. 2020, 10(18), 6188; https://0-doi-org.brum.beds.ac.uk/10.3390/app10186188 - 06 Sep 2020
Cited by 13 | Viewed by 3251
Abstract
This paper presents a novel approach for dynamic gesture recognition using multiple features extracted from RGB data input. Most of the challenges in gesture recognition revolve around the presence of multiple actors in the scene, occlusions, and viewpoint variations. In this paper, we develop a gesture recognition approach based on hybrid deep learning, where RGB frames, 3D skeleton joint information, and body part segmentation are used to overcome such problems. The multimodal input observations are extracted from the RGB images and combined by multi-modal stream networks suited to the different input modalities: residual 3D convolutional neural networks based on the ResNet architecture (3DCNN_ResNet) for the RGB image and color body part segmentation modalities, and a long short-term memory network (LSTM) for the 3D skeleton joint modality. We evaluated the proposed model on four public datasets: the UTD multimodal human action dataset, the gaming 3D dataset, the NTU RGB+D dataset, and the MSRDailyActivity3D dataset, and the experimental results on these datasets prove the effectiveness of our approach.

14 pages, 6618 KiB  
Article
Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features
by Jiuqing Dong, Yongbin Gao, Hyo Jong Lee, Heng Zhou, Yifan Yao, Zhijun Fang and Bo Huang
Appl. Sci. 2020, 10(4), 1482; https://0-doi-org.brum.beds.ac.uk/10.3390/app10041482 - 21 Feb 2020
Cited by 21 | Viewed by 3368
Abstract
Skeleton-based action recognition is a widely used task in action-related research because of its clear features and its invariance to human appearance and illumination. Furthermore, it can also effectively improve the robustness of action recognition. Graph convolutional networks have been implemented on such skeletal data to recognize actions. Recent studies have shown that graph convolutional neural networks work well in the action recognition task using spatial and temporal features of skeleton data. The prevalent methods to extract the spatial and temporal features purely rely on a deep network to learn from primitive 3D positions. In this paper, we propose a novel action recognition method applying high-order spatial and temporal features from skeleton data, such as velocity features, acceleration features, and the relative distance between 3D joints. Meanwhile, a multi-stream feature fusion method is adopted to fuse these proposed high-order features. Extensive experiments on two large and challenging datasets, NTU-RGBD and NTU-RGBD-120, indicate that our model achieves state-of-the-art performance.
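
The high-order features named above can be derived from the raw joint sequence with finite differences and pairwise norms, as in the short sketch below; the exact definitions and normalization used in the paper may differ.

```python
# Simple sketch of high-order skeleton features: velocity and acceleration via
# finite differences plus pairwise relative distances between joints per frame.
import numpy as np

def high_order_features(skel, fps=30.0):
    """skel: (T, J, 3) joint positions over time."""
    dt = 1.0 / fps
    velocity = np.gradient(skel, dt, axis=0)                 # (T, J, 3)
    acceleration = np.gradient(velocity, dt, axis=0)         # (T, J, 3)
    # pairwise relative distances between joints at every frame: (T, J, J)
    diff = skel[:, :, None, :] - skel[:, None, :, :]
    rel_dist = np.linalg.norm(diff, axis=-1)
    return velocity, acceleration, rel_dist

skel = np.random.rand(64, 25, 3)               # 64 frames, 25 joints (NTU-style skeleton)
v, a, d = high_order_features(skel)
print(v.shape, a.shape, d.shape)               # (64, 25, 3) (64, 25, 3) (64, 25, 25)
```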
