Article

Relation Selective Graph Convolutional Network for Skeleton-Based Action Recognition

1 Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610209, China
2 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Submission received: 13 October 2021 / Revised: 21 November 2021 / Accepted: 22 November 2021 / Published: 30 November 2021
(This article belongs to the Topic Dynamical Systems: Theory and Applications)

Abstract

Graph convolutional networks (GCNs) have made significant progress in the skeleton-based action recognition task. However, the graphs constructed by these methods are too densely connected, and the same graphs are reused across channels. Redundant connections blur the useful interdependencies between joints, and sharing identical graphs across channels cannot accommodate the changes in joint relations between different actions. In this work, we propose a novel relation selective graph convolutional network (RS-GCN). We design a trainable relation selection mechanism that encourages the model to keep only the most reliable edges and to build a stable, sparse joint topology. Channel-wise graph convolution and multiscale temporal convolution are proposed to strengthen the model’s representative power. Furthermore, we introduce a symmetrical module, the spatial-temporal attention module, for more stable context modeling. Combining these changes, our model achieves state-of-the-art performance on three public benchmarks, namely NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA.

1. Introduction

Action recognition is an essential computer vision topic with broad practical applications such as intelligent monitoring, human-computer interaction, video summarization, and healthcare. For example, ref. [1] design a rehabilitation system based on a customizable exergame protocol to prevent falls in the elderly population. In recent years, skeleton-based human action recognition has attracted increasing attention from researchers. Skeleton data consist of the coordinate sequences of the body joints as an action unfolds; they are small in size and robust to variations in viewpoint, appearance, and environment. Compared with raw video, a skeleton sequence is a compact, high-level description of an action.
In the literature, methods for skeleton-based action recognition can be classified into convolutional neural network (CNN)-based, recurrent neural network (RNN)-based, and graph convolutional network (GCN)-based approaches. The CNN-based and RNN-based methods tend to represent the skeleton data as pseudo-images or vector sequences. However, a skeleton sequence is not regular Euclidean-structured data: it inherently carries the structural information of the human body, which vectors or pseudo-images cannot fully express. To better model the topology information in skeleton sequences, graph convolutional networks were first introduced to this task by [2], which encodes the skeleton sequence as a spatial-temporal graph and builds the adjacency matrix of the joints on the basis of the physical human body structure, making impressive progress. However, the fixed adjacency matrix prevents the model from capturing critical correlations between joints that are not connected in the physical structure. Thus, 2s-AGCN (two-stream adaptive graph convolutional network) [3] proposes an adaptive adjacency matrix that allows edges to be generated between arbitrary joints. Since then, many methods have followed this idea and utilized adaptive adjacency matrices to extract topology information.
Owing to the effective use of human body structure information and the continuous enhancement of the models’ representative power, GCN-based methods have made significant progress. However, limitations remain. Firstly, dense adjacency matrices may generate many redundant connections; these connections do not help extract discriminative features and may even disturb the capture of detailed local information along the spatial dimension. Secondly, the correlations between joints differ across actions, yet those methods use the same adjacency matrices in every neuron of a layer, which seriously limits the model’s ability to capture the topology of skeleton sequences from different actions.
To address these issues, we propose a trainable relation selection mechanism, which helps the model choose the most informative connections of the graph in a trainable way. The adjacency matrices are thus sparsified, and each node can focus on its most important neighbors. We then propose the channel-wise graph convolution (CWG) and the multiscale temporal convolution (MTC) to strengthen the model’s representative power. The CWG assigns separate adjacency matrices to the channels, so that diverse relations are built in different channels. The MTC has several branches, each with its own kernel size, for assembling information from various temporal receptive fields. Furthermore, we introduce the spatial-temporal attention module (STAM) to enhance the model’s ability to capture context relations. Incorporating these improvements, we build a novel model called the relation selective graph convolutional network (RS-GCN). In the experiments, our proposed approach outperforms state-of-the-art methods on three large-scale skeleton action benchmarks: NTU-RGB+D [4], NTU-RGB+D 120 [5], and Northwestern-UCLA [6]. The main contributions of this work are summarized as follows:
  • We design a relation selection mechanism that helps the model choose the most helpful connections of the graph. It allows the model to generate sparse adjacency matrices and avoid redundant information transfer between nodes;
  • We propose the channel-wise graph convolution and the multiscale temporal convolution. Those two operations significantly enhance the model’s adaptability to different actions and different speeds of motion;
  • We introduce a spatial-temporal attention module that has a symmetrical structure, making our model more sensitive to complex context relations;
  • We integrate these components into a novel model named RS-GCN. In experiments, our model achieves state-of-the-art performance on three public skeleton-based action recognition datasets, which illustrates its superiority.

2. Related Works

In dealing with the skeletal action recognition problem, researchers have attempted many approaches. Early methods usually rely on hand-crafted features such as histograms of 3D joints [7], histograms of oriented displacements (HOD) [8], and relative 3D rotations between various body parts [9].
With the extensive application of deep learning technology, researchers began to address this task with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). The CNN-based methods usually encode the skeleton sequence into a pseudo-image and use a CNN as the classifier [10,11,12,13]. The RNN-based methods view the skeleton data as sequences of vectors, and various recurrent units are applied to model the temporal evolution [14,15,16]. However, the connections between the nodes of graph-structured data are complex and changeable. Neither CNNs nor RNNs can fully express the human body’s structural information because of their grid-structured feature representations.
In contrast, graph convolutional networks (GCNs) can effectively capture and represent this structural information through the adjacency matrix. The work of [2] first introduced graph convolution into this task and achieved an exciting improvement, attracting extensive attention from researchers. Many subsequent methods solve the skeleton-based action recognition problem using GCNs. Ref. [3] propose adaptive graph convolution, which can generate connections between arbitrary nodes to capture relations between joints not connected in the natural human skeleton. Ref. [17] propose a multi-stream framework in which each stream focuses only on joints not activated by the previous streams, yielding more discriminative features. Ref. [18] design a graph edge convolutional neural network to explore the beneficial information in bones for skeleton-based action recognition. Ref. [19] decompose graph convolution into feature learning components that evolve the features of each graph vertex to learn latent graph topologies. Ref. [20] propose a part-level graph convolutional network to capture the part-level information of skeletons. Ref. [21] design a split-transform-merge strategy in GCNs for skeleton sequence processing. Ref. [22] refine the pose before recognizing the action to reduce the effect of pose errors. Ref. [23] use two scales of graphs to explicitly capture relations among body joints and body parts. Ref. [24] guide GCNs to perceive significant variations across local movements with a tri-attention module. Ref. [25] attempt to fix the shortcoming of isolated temporal information in spatial-temporal graph convolutional networks with a two-stream network called RNXt-GCN. Ref. [26] propose a graph pooling method, named Tripool, for lower computational cost and a larger receptive field.
Despite the progress made by these GCN-based methods, there is still room for improvement. Many of them focus on modeling the relations between nodes by generating excessive graph edges. However, some of these edges are redundant; they do not help extract discriminative features and may even disturb the capture of detailed local information along the spatial dimension. Moreover, these GCN-based methods usually share the same adjacency matrices among all neurons of a layer, which is inefficient for capturing the topology of different actions. In this work, we deploy channel-wise adjacency matrices in the GCNs to strengthen the model’s ability to capture more abundant relations. Furthermore, we propose a trainable edge-selective mechanism. It keeps the connections genuinely needed for action representation and removes unnecessary ones. Therefore, our model can focus more attention on potentially discriminative parts of the skeleton.

3. Methods and Materials

3.1. Pipeline Overview

The pipeline of our framework is depicted in Figure 1; it consists of 8 GCN blocks, 2 STAMs, and a fully connected layer. Every GCN block contains a CWG and an MTC for the spatial and temporal dimensions, respectively. The CWG deploys different adjacency matrices for different channels so that each channel can develop its topology independently. The MTC integrates information from various temporal receptive fields, enhancing the model’s adaptability to the speed of actions. STAMs are embedded after the fifth and seventh GCN blocks and help our model collect the most informative features along the spatial and temporal dimensions.

3.2. Channel-Wise Graph Convolution

The channel-wise graph convolution (CWG) is proposed to build different correlations in different channels. The feature map of the network can be viewed as a $C \times T \times N$ tensor, where $N$ is the number of nodes, $T$ is the temporal length, and $C$ is the number of channels. Before introducing the CWG, we briefly review the definition of the channel-shared graph convolution in our baseline (2s-AGCN [3]), which can be formulated as:
$$f_{out} = \sum_{k} W_k f_{in} (A_k + B_k + C_k) \qquad (1)$$
where $f_{in} \in \mathbb{R}^{C_{in} \times T \times N}$ is the input feature; $f_{out} \in \mathbb{R}^{C_{out} \times T \times N}$ is the output feature; $W_k \in \mathbb{R}^{C_{out} \times C_{in}}$ is the weight matrix of the $1 \times 1$ convolution; $A_k \in \mathbb{R}^{N \times N}$ is the original normalized adjacency matrix defined by the skeleton’s natural structure; $B_k \in \mathbb{R}^{N \times N}$ is an adaptive matrix optimized together with the other parameters; and $C_k \in \mathbb{R}^{N \times N}$ is a self-attention matrix that models the relation between every pair of nodes.
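To make Equation (1) concrete, the following is a minimal PyTorch sketch of a channel-shared adaptive graph convolution in the spirit of the baseline. It keeps only the fixed graph $A_k$ and the learnable graph $B_k$ and omits the attention branch $C_k$; the class name, the number of subsets, and the shape conventions (batch, channels, frames, joints) are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelSharedGraphConv(nn.Module):
    """Sketch of Eq. (1) without the attention branch C_k."""
    def __init__(self, c_in, c_out, A):
        super().__init__()                       # A: (K, N, N) normalized skeleton graphs
        self.register_buffer('A', A.clone())     # fixed physical-structure graph A_k
        self.B = nn.Parameter(A.clone())         # learnable residual graph B_k
        self.convs = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, kernel_size=1) for _ in range(A.size(0))]
        )

    def forward(self, x):                        # x: (batch, C_in, T, N)
        out = 0
        for k, conv in enumerate(self.convs):
            adj = self.A[k] + self.B[k]          # the same graph is shared by every channel
            x_agg = torch.einsum('bctv,vw->bctw', x, adj)   # aggregate neighbouring joints
            out = out + conv(x_agg)              # 1x1 conv realizes W_k
        return out
```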
In this work, we use only $B_k$, initialized by $A_k$, as the adjacency matrix of the CWG. This choice is inspired by [19], who argue that $B_k$ has more freedom to learn and represent the skeleton’s topology, improving network performance. Moreover, we extend it along the channel dimension to obtain our channel-wise adjacency matrix $G_k \in \mathbb{R}^{C \times N \times N}$; Equation (1) can then be rewritten as:
$$f_{out} = \sum_{k} W_k \left( \left[ f_{in}^{1} G_k^{1},\, f_{in}^{2} G_k^{2},\, \ldots,\, f_{in}^{C} G_k^{C} \right] \right) \qquad (2)$$
where $[\,\cdot\,]$ denotes the concatenation of tensors.
The channel-shared and channel-wise adjacency matrices are illustrated in Figure 2. The left half of the figure shows the channel-shared graph convolution used in previous skeleton-based action recognition approaches, where the same adjacency matrices are applied to all channels. However, the feature map in each channel is different, and so are its topological relations; it is not reasonable to model dissimilar topological relations with the same adjacency matrix. The right half of the figure shows the channel-wise graph convolution (CWG). For each channel, we assign an independent adjacency matrix to establish the topology of its own feature map, greatly improving the model’s ability to model spatial correlations. Since channel-shared graph convolution also broadcasts its adjacency matrix to all channels during computation, our method introduces no additional computational cost at inference.
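Below is a sketch, under the same assumptions as above, of how Equation (2) can be implemented: each input channel carries its own $N \times N$ adjacency matrix, and the per-channel graphs are contracted with the feature map in a single einsum. The thresholding of Section 3.3 is not yet applied here.

```python
import torch
import torch.nn as nn

class ChannelWiseGraphConv(nn.Module):
    """Sketch of Eq. (2): one learnable N x N graph per input channel and subset."""
    def __init__(self, c_in, c_out, A):
        super().__init__()                       # A: (K, N, N) initial skeleton graphs
        # replicate the initial graph along the channel axis -> (K, C_in, N, N)
        self.G = nn.Parameter(A.unsqueeze(1).repeat(1, c_in, 1, 1))
        self.convs = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, kernel_size=1) for _ in range(A.size(0))]
        )

    def forward(self, x):                        # x: (batch, C_in, T, N)
        out = 0
        for k, conv in enumerate(self.convs):
            # channel c aggregates its joints with its own graph G[k, c]
            x_agg = torch.einsum('bctv,cvw->bctw', x, self.G[k])
            out = out + conv(x_agg)
        return out
```

Since the channel-shared version effectively broadcasts one graph over the channel axis, the per-channel contraction above performs the same number of multiply-adds at inference.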

3.3. Relation Selection Mechanism

As mentioned in Section 3.2, we deploy learnable adjacency matrices in our model, which can add connections between any two nodes of the graph. This allows edges to be generated between highly correlated joints that are not connected by the physical structure of the human skeleton. However, the model’s ability to remove edges is much weaker than its ability to add them: an edge is eliminated only when its corresponding element in the adjacency matrix is exactly 0. Eventually, the adjacency matrix produces very dense connections, creating complex and confusing relations among the nodes, which is detrimental to the extraction of discriminative features. Therefore, we propose a relation selection mechanism to pick out the meaningful connections in the graph.
Specifically, we set a small threshold $\delta$. Only the edges whose corresponding elements in the adjacency matrix exceed the threshold take effect in the graph convolution operation. This can be formulated as:
$$\hat{G}(c,i,j) = \begin{cases} G(c,i,j), & G(c,i,j) \geq \delta \\ 0, & G(c,i,j) < \delta \end{cases} \qquad (3)$$
where $c$, $i$, and $j$ index the elements of the adjacency matrix along its three axes. We can then rewrite our CWG as:
$$f_{out} = \sum_{k} W_k \left( \left[ f_{in}^{1} \hat{G}_k^{1},\, f_{in}^{2} \hat{G}_k^{2},\, \ldots,\, f_{in}^{C} \hat{G}_k^{C} \right] \right) \qquad (4)$$
Unfortunately, after thresholding the adjacency matrix, we face two new problems. The first concerns initialization. Since we initialize with a pre-defined adjacency matrix of the human skeleton, thresholding may prevent the generation of new edges: new edges are initialized to a small value below the threshold, so the threshold filters them out, their gradients are always zero, and the model cannot generate new edges. The second concerns training. Edges deleted by thresholding are permanently deactivated; if a deletion is incorrect, the model cannot reactivate the edge later in training. To address these issues, we design a reactivation loss:
$$L_R = \alpha \sum_{k,c,i,j} \min \left( G_k(c,i,j) - \delta,\; 0 \right) \qquad (5)$$
where $\alpha$ is a coefficient controlling the speed of reactivation. Below the threshold, the deactivated edges slowly grow until they reach the threshold; the beneficial edges then have a chance to keep growing, while the others are pushed back below the threshold by the gradient. The reactivation loss $L_R$ is added to our loss function $L_{total}$ throughout training:
$$L_{total} = L_{cross\_entropy} + L_R \qquad (6)$$
In this manner, we construct sparse adjacency matrices that focus on the most valuable relations. Redundant edges that impair model performance are removed in a learnable way, and the necessity of each edge is validated repeatedly throughout the training process.
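The thresholding of Equation (3) and the reactivation term of Equation (5) can be sketched as follows. The function names and the hinge form of the loss are our own; in particular, the loss is written here so that gradient descent pushes a deactivated edge upward toward the threshold, which is one plausible reading of the behavior described above (it equals $-\alpha \sum \min(G - \delta, 0)$).

```python
import torch

def select_relations(G, delta=0.1):
    """Eq. (3): keep an edge only if its weight reaches the threshold, else zero it."""
    return torch.where(G >= delta, G, torch.zeros_like(G))

def reactivation_loss(G, delta=0.1, alpha=2.2e-5):
    """Reactivation term: a hinge penalty on edges below the threshold, so that
    their gradient slowly pulls them back up toward delta during training."""
    return alpha * torch.clamp(delta - G, min=0.0).sum()

# During training (Eq. (6)), assuming `model.G` holds the channel-wise graphs:
# loss = cross_entropy(logits, labels) + reactivation_loss(model.G)
```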

3.4. Multiscale Temporal Convolution

People perform actions at different speeds. Moreover, the starting frames of the skeleton sequences are not aligned, and the discriminative clues may span long, and even different, time ranges. An ideal skeleton-based action recognition model needs to be adaptable to these variations in speed and temporal position.
Therefore, the MTC is proposed. The main idea is to divide the channels into several groups, each corresponding to one scale, and then assemble the information from the different scales. As shown in Figure 3, we split the input feature into four groups and feed them into four parallel, structurally symmetric branches. Each branch applies a 2D convolution with a kernel size different from the others, ranging from $3 \times 1$ to $9 \times 1$. The outputs of all branches are concatenated and then combined by a $1 \times 1$ convolution. The operation of the MTC can be represented as:
$$F_{out} = W \left[ \mathrm{Conv2D}(F_{in}^{1}, k_1),\, \mathrm{Conv2D}(F_{in}^{2}, k_2),\, \ldots,\, \mathrm{Conv2D}(F_{in}^{B}, k_B) \right] \qquad (7)$$
where $W \in \mathbb{R}^{C_{out} \times C_{in}}$ is the weight matrix of the $1 \times 1$ convolution; $[\,\cdot\,]$ denotes the concatenation of tensors; $\mathrm{Conv2D}(\cdot)$ denotes the conventional convolution operation; $k_i$ is the kernel size of branch $i$; and $B$ is the number of branches.
Compared with the fixed $9 \times 1$ temporal convolution kernel deployed in the baseline (2s-AGCN [3]), the proposed MTC has fewer parameters. It also captures multiscale temporal representations and enables our model to adapt to different action speeds.
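A minimal sketch of Equation (7) is given below: the channels are split into four groups, each group is filtered with its own temporal kernel (from 3x1 to 9x1, as in the text), and the concatenated outputs are fused by a 1x1 convolution. The padding scheme and the even channel split are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MultiscaleTemporalConv(nn.Module):
    """Sketch of the MTC (Eq. (7)) with four temporal scales."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        group = channels // len(kernel_sizes)
        self.branches = nn.ModuleList([
            # temporal-only kernel (k x 1); padding keeps the sequence length T unchanged
            nn.Conv2d(group, group, kernel_size=(k, 1), padding=(k // 2, 0))
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)   # the 1x1 fusion W

    def forward(self, x):                          # x: (batch, C, T, N)
        chunks = torch.chunk(x, len(self.branches), dim=1)
        out = torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
        return self.fuse(out)
```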

3.5. Spatial-Temporal Attention Module

Since the adjacency matrices are learnable, the graph convolution in the spatial dimension is already global. Nonetheless, it generates fixed correlations, which may be unreliable because they cannot adjust to the data. In contrast, the self-attention operation builds correlations from the input, giving it more universality. In the temporal dimension, the convolution operation still processes only a local neighborhood, so the global dependencies between frames are neglected. Therefore, we propose the symmetric spatial-temporal attention module to model the context relations in both space and time. In this module, the spatial attention part and the temporal attention part share a symmetric structure.
We can represent the input and output features as a series of vectors: $F_{in} = \{x_1, x_2, \ldots, x_V\}$ and $F_{out} = \{y_1, y_2, \ldots, y_V\}$, where $x_i \in \mathbb{R}^{C \times T}$ and $y_j \in \mathbb{R}^{C \times T}$. The spatial attention unit can be formulated as:
$$y_i = \sum_{j} \mathrm{softmax} \left( f \left( \theta(x_i)^{T}, \phi(x_j) \right) \right) g(x_j) \qquad (8)$$
Here, $\theta(x_i) = W_\theta(x_i)$, $\phi(x_j) = W_\phi(x_j)$, and $g(x_j) = W_g(x_j)$ are three linear embeddings, and $f$ is a pairwise function that produces a similarity scalar between $i$ and $j$. We define the angle-based similarity function $f$ as:
$$f(u, v) = 1 \Big/ \arccos \left( \frac{u \cdot v}{|u| \cdot |v|} \right) \qquad (9)$$
The smaller the angle between $u$ and $v$, the more correlated the two vectors are. Therefore, we use the reciprocal of the angle computed by the arccos function to measure the similarity between vectors, clamping the input of the arccos function to $[-1, 1]$.
Likewise, if we unfold the input and output features as $F_{in} = \{x_1, x_2, \ldots, x_T\}$ and $F_{out} = \{y_1, y_2, \ldots, y_T\}$, with $x_i \in \mathbb{R}^{C \times V}$ and $y_j \in \mathbb{R}^{C \times V}$, the temporal attention unit can also be formulated by Equation (8); it is therefore symmetric with the spatial attention unit. The temporal attention unit captures the temporal context, while the spatial attention unit models the spatial context. We first deploy the two units separately to form the temporal attention module (TAM) and the spatial attention module (SAM). Combining the two units, we then design two STAM structures, namely STAM-A and STAM-B: as shown in Figure 4, the temporal attention unit and the spatial attention unit are connected in series or in parallel, respectively. Residual connections are added to preserve local features, and a ReLU activation filters the output. The combination of the temporal and spatial attention units establishes temporal and spatial interdependencies in a data-driven manner, which is helpful for skeletal action recognition. The structures of the four attention modules mentioned above are shown in Figure 4, and we compare their performance in Section 4.
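To illustrate Equations (8) and (9), the sketch below implements a spatial attention unit in which every joint attends to every other joint through the angle-based similarity. The embedding width, the clamping epsilon, and the residual connection follow the description above, but their exact values are assumptions of this sketch; the temporal unit would be obtained by exchanging the joint and frame axes.

```python
import torch
import torch.nn as nn

class SpatialAttentionUnit(nn.Module):
    """Sketch of Eqs. (8)-(9): joint-to-joint attention with 1/arccos similarity."""
    def __init__(self, channels, embed=None, eps=1e-4):
        super().__init__()
        embed = embed or max(channels // 4, 1)
        self.theta = nn.Conv2d(channels, embed, kernel_size=1)     # theta(x) = W_theta(x)
        self.phi = nn.Conv2d(channels, embed, kernel_size=1)       # phi(x)   = W_phi(x)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)      # g(x)     = W_g(x)
        self.eps = eps

    def forward(self, x):                           # x: (batch, C, T, N)
        b, c, t, n = x.shape
        q = self.theta(x).flatten(1, 2).transpose(1, 2)   # (b, N, C'*T): one vector per joint
        k = self.phi(x).flatten(1, 2).transpose(1, 2)     # (b, N, C'*T)
        v = self.g(x).flatten(1, 2).transpose(1, 2)       # (b, N, C*T)
        cos = torch.einsum('bid,bjd->bij', q, k) / (
            q.norm(dim=-1, keepdim=True) * k.norm(dim=-1).unsqueeze(1) + self.eps
        )
        angle = torch.arccos(cos.clamp(-1 + 1e-6, 1 - 1e-6))       # clamp input to [-1, 1]
        attn = (1.0 / (angle + self.eps)).softmax(dim=-1)          # Eq. (9), softmax over j
        out = torch.einsum('bij,bjd->bid', attn, v)                # aggregate g(x_j), Eq. (8)
        return out.transpose(1, 2).reshape(b, c, t, n) + x         # residual connection
```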

3.6. Datasets

We use three public action recognition datasets in our experiments, namely NTU-RGB+D [4], NTU-RGB+D 120 [5], and Northwestern-UCLA [6].
NTU-RGB+D is a widely used indoor action recognition dataset. It contains 56,880 action clips in 60 classes and provides the 3D coordinates of 25 body joints captured by Kinect depth sensors. Each action is performed by 40 subjects, and three cameras capture the actions at horizontal angles of −45°, 0°, and 45°. The dataset has two benchmark protocols: (1) Cross-subject (CS): samples are split by subject into a training set of 40,320 samples and a test set of 16,560 samples. (2) Cross-view (CV): samples are split by camera angle into a training set of 37,920 samples and a test set of 18,960 samples. Following the two protocols, we report the top-1 accuracy on both benchmarks.
NTU-RGB+D 120 is an extended version of NTU-RGB+D and is currently the largest indoor action recognition dataset. It contains 114,480 videos in 120 classes, performed by 106 subjects and captured from 155 viewpoints. Similarly, the dataset has two benchmark protocols: (1) Cross-subject (CS): samples are split by subject into a training set of 63,026 samples and a test set of 50,922 samples. (2) Cross-setting (CE): samples are split by camera setting into a training set of 54,471 samples and a test set of 59,477 samples. Following the two protocols, we report the top-1 accuracy on both benchmarks.
Northwestern-UCLA is a multi-modality action recognition dataset that provides RGB and depth video data. Each action is captured from three different viewpoints by three Kinect cameras. The dataset contains 1494 video clips in 10 action classes, and each action is performed by 10 actors. The training set contains the samples from the first two cameras, and the test set contains the samples from the remaining camera. In this work, we use only the skeleton data for action recognition and report the top-1 accuracy.

4. Experimental Results

In this section, we present the details and results of our experiments on the datasets introduced in Section 3.6. To verify the effectiveness of the components of the RS-GCN, we first perform exhaustive ablation studies on NTU-RGB+D. Then, we compare the proposed model with other state-of-the-art methods on all three datasets.

4.1. Implementation Details

We conduct all experiments with the PyTorch deep learning framework. We train our model for 60 epochs in total, using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a batch size of 32. The initial learning rate is set to 0.1 and decays by a factor of 0.1 at the 40th and 50th epochs. Cross-entropy is used as the objective function, and the weight decay is set to 0.0002. For NTU-RGB+D and NTU-RGB+D 120, the relation selection threshold $\delta$ is set to 0.1 and the coefficient $\alpha$ to $2.2 \times 10^{-5}$; for Northwestern-UCLA, $\delta$ is set to 0.01 and $\alpha$ to $2.1 \times 10^{-5}$. We adopt the multi-stream fusion strategy of [27] and compute the weighted average of each stream’s output as the final prediction.
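The optimizer and schedule described above map directly onto standard PyTorch components; a sketch is shown below. `model`, `train_loader`, and `reactivation_loss` are placeholders for the network, the data pipeline, and the loss term of Equation (5), not code released by the authors.

```python
import torch
import torch.nn as nn

def train_rs_gcn(model, train_loader, reactivation_loss, epochs=60):
    """Training loop following the schedule above; all arguments are placeholders."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                weight_decay=2e-4)           # batch size 32 set in the loader
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 50], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for skeletons, labels in train_loader:
            logits = model(skeletons)
            loss = criterion(logits, labels) + reactivation_loss(model)   # Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                      # decay by 0.1 at epochs 40 and 50
```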

4.2. Ablation Study

This section investigates the contributions of the different components of the proposed RS-GCN, the effect of the relation selection mechanism, and the necessity of multi-stream inputs. We conduct the experiments on the NTU-RGB+D dataset under the CV benchmark, using only the joint stream as the input.

4.2.1. Network Architectures

We first verify the necessity of the proposed components. We manually remove each proposed component from RS-GCN to form variants, and compare their performance with the original RS-GCN on the NTU-RGB+D dataset; the results are shown in Table 1. The table shows that the STAM, CWG, MTC, and relation selection mechanism all benefit action recognition, and deleting any one of them harms performance. The variant without the relation selection mechanism gets the lowest accuracy, dropping by 0.6% compared with the original RS-GCN, which verifies the effectiveness of the mechanism. Since the components retained in each variant still improve performance, the variants are only marginally less accurate than RS-GCN. By combining all of these components, RS-GCN achieves the best accuracy and outperforms the baseline [3] by 1.2%.

4.2.2. Comparison of Attention Modules

Here we compare the performance of the proposed model with different attention modules in Table 2. As the table shows, the TAM and the SAM alone both achieve lower accuracy. Owing to the complementary relationship between temporal and spatial attention, STAM-A and STAM-B show a clear improvement over the SAM and TAM. The accuracy of STAM-A is slightly higher than that of STAM-B, probably because the series connection is more beneficial than the parallel one for integrating spatial-temporal information.

4.2.3. Necessity of Multi-Stream Inputs

We adopt the multi-stream fusion strategy of [27], which includes four streams. The joint stream’s input is the original skeleton coordinates. The bone stream’s input is the difference between the coordinates of adjacent joints in the skeleton. The joint motion and bone motion streams’ inputs are the temporal differentials of the joint and bone streams’ inputs, respectively. The final output is obtained as the weighted average of each stream’s prediction. We test the performance of the four individual streams and two combinations of them. Here, J, B, JM, and BM denote the joint, bone, joint motion, and bone motion streams; 2s denotes the combination of the joint and bone streams, and 4s denotes the combination of all four streams. The results, shown in Table 3, indicate that the multi-stream combinations outperform the single-stream models.
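The four inputs and the score fusion can be sketched as follows. The `bone_pairs` list (the parent of each joint) and the fusion weights are illustrative assumptions; the text only states that a weighted average of the stream outputs is used.

```python
import torch

def make_streams(joints, bone_pairs):
    """joints: (batch, C, T, N) coordinates; bone_pairs: list of (child, parent) indices."""
    bones = torch.zeros_like(joints)
    for child, parent in bone_pairs:
        bones[..., child] = joints[..., child] - joints[..., parent]   # bone vectors
    joint_motion = joints[:, :, 1:] - joints[:, :, :-1]                # temporal differential
    bone_motion = bones[:, :, 1:] - bones[:, :, :-1]
    return joints, bones, joint_motion, bone_motion

def fuse_scores(scores, weights=(1.0, 1.0, 0.5, 0.5)):
    """Weighted average of the per-stream class scores (J, B, JM, BM)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```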

4.2.4. Visualization of Adjacency Matrices

We visualize the adjacency matrices of the baseline and of RS-GCN; the results are shown in Figure 5. The adjacency matrices of the baseline are excessively dense, which may create redundant and chaotic connections between nodes and prevents the model from selecting information effectively. The adjacency matrices of RS-GCN are sparse, and their non-zero values are distributed over only a few crucial connections. The sparse adjacency matrices allow each node to select the most solid relations and focus on the most informative adjacent nodes. Moreover, the adjacency matrices of different channels show clear distinctions, from which we can infer that every channel evaluates and selects the connections best suited to its own features. This provides enough adaptability for the model’s aggregation of features.

4.3. Comparison with the State-of-the-Art

In this section, we compare the proposed model with state-of-the-art methods on the NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA datasets. The methods used for comparison fall into three categories: CNN-based [11,28,29,30,31], RNN-based [4,14,15,32,33], and GCN-based [2,3,17,27,34,35]. The results are shown in Table 4, Table 5 and Table 6, respectively. Our proposed model outperforms the baseline by a large margin and achieves state-of-the-art or highly competitive performance on all three datasets, which verifies its superiority.

5. Discussion and Conclusions

In this work, we propose a trainable relation selection mechanism. It helps our model choose the most informative connections of the graph and sparsify the adjacency matrices, so that the nodes can focus on their most important neighbors. In addition, we propose the channel-wise graph convolution (CWG) and the multiscale temporal convolution (MTC) to strengthen the model’s representative power. Furthermore, we introduce the spatial-temporal attention module (STAM) to enhance the model’s ability to capture context relations. Incorporating these improvements, we build a novel model called the relation selective graph convolutional network (RS-GCN). Comprehensive experiments on three public datasets show our model’s strong performance compared with state-of-the-art approaches, proving its effectiveness. However, some issues still need further investigation. The CWG and MTC bring some extra computational cost, and when there is much noise in the skeleton sequence, the selection mechanism is prone to removing some desired connections. Therefore, in future work, we will look for ways to enhance the model’s representation ability with less computational consumption and to improve our relation selection mechanism so that it is more robust to noise.

Author Contributions

Methodology, W.Y.; software, W.Y.; writing—original draft, W.Y.; writing—review and editing, W.Y. and J.Z.; supervision, J.Z., J.C. and Z.X.; funding acquisition, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Code of RS-GCN: https://github.com/kraus-yang/RS-GCN (accessed on 20 November 2021). Datasets: NTU-RGB+D/NTU-RGB+D 120: https://github.com/shahroudy/NTURGB-D (accessed on 17 April 2021); Northwestern-UCLA: https://wangjiangb.github.io/my_data.html (accessed on 30 April 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Palestra, G.; Rebiai, M.; Courtial, E.; Koutsouris, D. Evaluation of a Rehabilitation System for the Elderly in a Day Care Center. Information 2019, 10, 3. [Google Scholar] [CrossRef] [Green Version]
  2. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  3. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 12018–12027. [Google Scholar] [CrossRef] [Green Version]
  4. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1010–1019. [Google Scholar] [CrossRef] [Green Version]
  5. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-View Action Modeling, Learning, and Recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 2649–2656. [Google Scholar] [CrossRef] [Green Version]
  7. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar] [CrossRef]
  8. Gowayyed, M.A.; Torki, M.; Hussein, M.E.; El-Saban, M. Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; p. 8. [Google Scholar]
  9. Vemulapalli, R.; Chellappa, R. Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 4471–4479. [Google Scholar] [CrossRef]
  10. Du, Y.; Fu, Y.; Wang, L. Skeleton based action recognition with convolutional neural network. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; IEEE: Kuala Lumpur, Malaysia, 2015; pp. 579–583. [Google Scholar] [CrossRef]
  11. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604. [Google Scholar] [CrossRef]
  12. Rahmani, H.; Bennamoun, M. Learning Action Recognition Model from Depth and Skeleton Videos. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Venice, Italy, 2017; pp. 5833–5842. [Google Scholar] [CrossRef] [Green Version]
  13. Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton Image Representation for 3D Action Recognition Based on Tree Structure and Reference Joints. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–30 October 2019; pp. 16–23. [Google Scholar] [CrossRef] [Green Version]
  14. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Venice, Italy, 2017; pp. 1012–1020. [Google Scholar] [CrossRef]
  15. Liu, J.; Wang, G.; Hu, P.; Duan, L.Y.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3671–3680. [Google Scholar] [CrossRef]
  16. Liu, H.; Tu, J.; Liu, M.; Ding, R. Learning Explicit Shape and Motion Evolution Maps for Skeleton-Based Human Action Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Calgary, AB, Canada, 2018; pp. 1333–1337. [Google Scholar] [CrossRef]
  17. Song, Y.F.; Zhang, Z.; Wang, L. Richly Activated Graph Convolutional Network for Action Recognition with Incomplete Skeletons. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1–5. [Google Scholar] [CrossRef] [Green Version]
  18. Zhang, X.; Xu, C.; Tian, X.; Tao, D. Graph Edge Convolutional Neural Networks for Skeleton-Based Action Recognition. IEEE Trans. Neural Networks Learn. Syst. 2020, 31, 3047–3060. [Google Scholar] [CrossRef] [PubMed]
  19. Zhu, G.; Zhang, L.; Li, H.; Shen, P.; Shah, S.A.A.; Bennamoun, M. Topology-learnable graph convolution for skeleton-based action recognition. Pattern Recognit. Lett. 2020, 135, 286–292. [Google Scholar] [CrossRef]
  20. Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-Level Graph Convolutional Network for Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11045–11052. [Google Scholar] [CrossRef]
  21. Huang, Z.; Shen, X.; Tian, X.; Li, H.; Huang, J.; Hua, X.S. Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 26–28 May 2020; ACM: Seattle, WA, USA, 2020; pp. 2122–2130. [Google Scholar] [CrossRef]
  22. Li, S.; Yi, J.; Farha, Y.A.; Gall, J. Pose Refinement Graph Convolutional Network for Skeleton-based Action Recognition. arXiv 2020, arXiv:2010.07367. [Google Scholar] [CrossRef]
  23. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic Graph Neural Networks for 3D Skeleton-based Human Action Recognition and Motion Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021. [Google Scholar] [CrossRef] [PubMed]
  24. Li, X.; Zhai, W.; Cao, Y. A tri-attention enhanced graph convolutional network for skeleton-based action recognition. IET Comput. Vis. 2021, 15, 110–121. [Google Scholar] [CrossRef]
  25. Liu, S.; Bai, X.; Fang, M.; Li, L.; Hung, C.C. Mixed graph convolution and residual transformation network for skeleton-based action recognition. Appl. Intell. 2021, 1–12. [Google Scholar] [CrossRef]
  26. Peng, W.; Shi, J.; Zhao, G. Spatial Temporal Graph Deconvolutional Network for Skeleton-Based Human Action Recognition. IEEE Signal Process. Lett. 2021, 28, 244–248. [Google Scholar] [CrossRef]
  27. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Directed Graph Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 7904–7913. [Google Scholar] [CrossRef]
  28. Kim, T.S.; Reiter, A. Interpretable 3D Human Action Analysis with Temporal Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 1623–1631. [Google Scholar] [CrossRef] [Green Version]
  29. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A New Representation of Skeleton Sequences for 3D Action Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 4570–4579. [Google Scholar] [CrossRef] [Green Version]
  30. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
  31. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 786–792. [Google Scholar] [CrossRef] [Green Version]
  32. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 3007–3021. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Wu, C.; Wu, X.J.; Kittler, J. Spatial Residual Layer and Dense Connection Block Enhanced Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27 October–2 November 2019; IEEE: Seoul, Korea, 2019; pp. 1740–1748. [Google Scholar] [CrossRef]
  35. Yang, H.; Yan, D.; Zhang, L.; Li, D.; Sun, Y.; You, S.; Maybank, S.J. Feedback Graph Convolutional Network for Skeleton-based Action Recognition. arXiv 2020, arXiv:2003.07564. [Google Scholar] [CrossRef] [PubMed]
  36. Hu, G.; Cui, B.; Yu, S. Skeleton-Based Action Recognition with Synchronous Local and Non-Local Spatio-Temporal Learning and Frequency Attention. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1216–1221. [Google Scholar] [CrossRef] [Green Version]
  37. Peng, W.; Hong, X.; Zhao, G. Tripool: Graph triplet pooling for 3D skeleton-based action recognition. Pattern Recognit. 2021, 115, 107921. [Google Scholar] [CrossRef]
  38. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Boston, MA, USA, 2015; pp. 1110–1118. [Google Scholar] [CrossRef] [Green Version]
Figure 1. The pipeline of proposed RS-GCN (relation selective graph convolutional network).
Figure 2. Diagram of the channel-shared graph convolution and the channel-wise graph convolution.
Figure 3. Diagram of the multiscale temporal convolution.
Figure 4. Four different attention modules.
Figure 5. Visualization of adjacency matrices for the baseline (AGCN) and our proposed RS-GCN.
Table 1. Performance comparison of RS-GCN and its variants. w/o X means the variants deleting the X module.
Method                                  Accuracy (%)
baseline                                93.72
w/o STAM                                94.63
w/o CWG                                 94.46
w/o MTC                                 94.64
w/o relation selection mechanism        94.33
RS-GCN                                  94.93
Table 2. Comparison of proposed model with different attention modules. w/X means the model using the X module.
Method          Accuracy (%)
w TAM           94.61
w SAM           94.44
w STAM-A        94.93
w STAM-B        94.69
Table 3. Comparison of the proposed model with different inputs.
Method          Accuracy (%)
J-RS-GCN        94.93
B-RS-GCN        94.79
JM-RS-GCN       92.97
BM-RS-GCN       92.95
2s-RS-GCN       95.92
4s-RS-GCN       96.45
Table 4. Comparisons with the state-of-the-art methods on the NTU-RGB+D dataset.
Method                            CS (%)    CV (%)
Deep LSTM [4]                     60.7      67.3
ST-LSTM [32]                      69.2      77.7
Ensemble TS-LSTM [14]             74.6      81.3
VA-LSTM [33]                      79.2      87.7
GCA-LSTM [15]                     77.1      85.1
TCN [28]                          74.3      83.1
Clips + CNN + MTLN [29]           79.6      84.8
Synthesized CNN [30]              80.0      87.2
3scale ResNet 152 [11]            84.6      90.9
HCN [31]                          86.5      91.1
SLnL-rFA [36]                     89.1      94.9
ST-GCN [2]                        81.5      88.3
RA-GCN [17]                       85.9      93.5
2s-AGCN (baseline) [3]            88.5      95.1
2s-SDGCN [34]                     89.6      95.7
MS-AAGCN [27]                     90.0      96.2
BPLHM [18]                        85.4      91.1
Pose-refinement GCN [22]          85.2      91.7
2s-FGCN [35]                      90.2      96.3
TA-GCN [24]                       89.9      96.3
sym-GCN [23]                      90.1      96.4
RNXt-GCN [25]                     91.4      95.8
4s-RS-GCN (ours)                  91.3      96.5
Table 5. Comparisons with the state-of-the-art methods on the NTU-RGB+D 120 dataset.
Method              CS (%)    CE (%)
ST-LSTM [32]        25.5      26.3
GCA-LSTM [15]       61.2      63.3
ST-GCN [2]          72.4      71.3
RA-GCN [17]         81.1      82.7
2s-FGCN [35]        85.4      87.4
RNXt-GCN [25]       83.9      87.6
Tripool [37]        80.1      82.8
4s-RS-GCN (ours)    87.2      88.6
Table 6. Comparisons with the state-of-the-art methods on the Northwestern-UCLA dataset.
Method                          Top-1 (%)
HBRNN-L [38]                    78.5
Synthesized-pre-trained [30]    86.1
Ensemble-TS-LSTM [14]           89.2
2s-AGC-LSTM                     93.3
RNXt-GCN [25]                   76.9
4s-RS-GCN (ours)                94.8
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
