Article

Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences

1 Computer Science & Engineering, Parul Institute of Technology, Parul University, Vadodara 391760, India
2 Symbiosis Centre for Applied Artificial Intelligence and Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune 412115, India
3 Sankalchand Patel College of Engineering, Sankalchand Patel University, Visnagar 384315, India
4 Computer Science Department, Faculty of Technology, Linnaeus University, P G Vejdes väg, 351 95 Växjö, Sweden
5 Center for Intelligent Medical Electronics, Department of Electronic Engineering, School of Information Science and Technology, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Submission received: 3 November 2020 / Revised: 10 December 2020 / Accepted: 12 December 2020 / Published: 18 December 2020
(This article belongs to the Special Issue Smart Assisted Living)

Abstract

Human Action Recognition (HAR) is the classification of an action performed by a human. The goal of this study was to recognize human actions in action video sequences. We present a novel feature descriptor for HAR that combines multiple features through a fusion technique. The major focus of the feature descriptor is to exploit the dissimilarities between actions. The key contribution of the proposed approach is a robust feature descriptor that works across the underlying video sequences and various classification models. To achieve this objective, HAR is performed in the following manner. First, the moving object is detected and segmented from the background. Features are then calculated using the histogram of oriented gradients (HOG) from the segmented moving object. To reduce the size of the feature descriptor, we average the HOG features across non-overlapping video frames. For frequency-domain information, we calculate regional features from the Fourier HOG. Moreover, we also include the velocity and displacement of the moving object. Finally, we use a fusion technique to combine these features. Once the feature descriptor is prepared, it is provided to the classifier. Here, we use well-known classifiers such as artificial neural networks (ANNs), the support vector machine (SVM), multiple kernel learning (MKL), the Meta-cognitive Neural Network (McNN), and late fusion methods. The main objective of the proposed approach is to prepare a robust feature descriptor and to show its diversity. Although we use five different classifiers, our feature descriptor performs relatively well across all of them. The proposed approach is evaluated and compared with state-of-the-art methods for action recognition on two publicly available benchmark datasets (KTH and Weizmann) and with cross-validation on the UCF11, HMDB51, and UCF101 datasets. Results of control experiments, such as a change in the SVM classifier and the effect of a second hidden layer in the ANN, are also reported. The results demonstrate that the proposed method performs reasonably well compared with the majority of existing state-of-the-art methods, including convolutional neural network-based feature extractors.

1. Introduction

In machine vision, automatic understanding of video data (e.g., action recognition) remains a difficult but important challenge. The method of recognizing human actions that occur in a video sequence is defined as human action recognition (HAR). In video understanding, it is difficult to differentiate routine life actions, such as running, jogging, and walking, using an executable script. There has been an increasing interest in HAR over the past decade, and it is still an open field for many researchers. The domain of HAR has developed considerably with significant application in human motion analysis [1,2], identification of familiar people and gender [3], motion capture and animation [4], video editing [5], unusual activity detection [6], video search and indexing (useful for TV production, entertainment, social studies, security) [7], video2text (auto-scripting) [8], video annotation, and video mining [9].
Human action recognition is a challenging multi-class classification problem due to the high intra-class variability within a given class. To overcome this variability, we propose a scheme for designing a feature descriptor that is highly invariant to the fluctuations present within the classes. In other words, the proposed feature descriptor fuses various diverse features. In addition, this paper addresses various challenges in HAR, such as variation in the background (outdoor or indoor), the gender of the action performer, variation in the clothes worn, and scale variation. We deal with constrained video sequences that involve a moving background and multiple actions in a single video sequence.
Our contributions in this paper can be summarized as follows. First, for moving object detection, we use a novel technique that incorporates the human visual attention model [10], making it background-independent. Therefore, its computational complexity is much lower than that of algorithms that update the background at regular intervals to detect moving objects in the video. Second, we propose a feature description preparation layer, which uses HOG features with a non-overlapping windowing concept; averaging the features reduces the size of the feature descriptor. In addition to the HOG, we also use the object displacement, which is crucial for differentiating actions performed at a single location, i.e., with zero displacement (such as boxing, hand waving, and clapping), from actions performed at various locations, i.e., with non-zero displacement (such as walking and running). Furthermore, a velocity feature is used at this stage to further distinguish overlapping actions with non-zero displacement (such as walking and running); this is based on the observation that speed variation exists among such actions, so incorporating a velocity feature can aid classification. To consider the spatial context in terms of boundaries and smooth shapes of the human body, regional features from the Fourier HOG are employed. Finally, we propose six different models for classification to demonstrate the effectiveness of the proposed feature descriptor across different classifier families.
The rest of the paper is organized as follows. Section 2 discusses the existing literature on HAR. Section 3 describes the proposed approach for HAR: it outlines the motivation for feature fusion; briefly describes the HOG, support vector machines (SVMs), artificial neural networks (ANNs), multiple kernel learning (MKL), and the Meta-cognitive Neural Network (McNN); and presents the proposed techniques for fusing features. Section 4 presents and discusses the experimental results. Finally, we conclude the paper in Section 5.

2. Existing Methods

In the last two decades, most research on human action recognition has concentrated on two levels: (1) feature extraction and (2) feature classification. One feature extraction method is the dense trajectories approach [11], which extracts features at multiple scales; these features are sampled in each frame, and actions are classified based on displacement information from the dense optical flow field. In [12], an extension of dense trajectories was proposed by replacing the Scale-Invariant Feature Transform (SIFT) feature with the Speeded Up Robust Features (SURF) feature to estimate camera motion.
The advantage of these trajectory representations is that they are robust to fast irregular motions and to the boundaries of human actions. However, they cannot handle the local motion within an action that involves important movements of the hands, arms, and legs, and therefore do not provide enough information for action discrimination. This problem is addressed by exploiting important motion parts in the Motion Part Regularization Framework (MPRF) [13], which uses spatio-temporal grouping of densely extracted trajectories generated for each motion part; an objective function for the sparse selection of these trajectory groups is optimized, and the learned motion parts are represented by a Fisher vector. Lan et al. [14] again point out that the local motion of body parts results in small intensity changes and hence low-frequency action information. If low-frequency action information is not included in the feature preparation layer, the resulting feature descriptors cannot capture enough detail for action classification. To address this problem, the Multi-skIp Feature Stacking (MIFS) approach was proposed, which stacks features extracted using differential operators at various scales, making action recognition invariant to the speed and range of motion of the human subject. Due to the consideration of various scales in the feature building stage, the computational complexity of this approach is increased.
Traditionally, distinct features are derived to represent a human action. Liu et al. [15], however, proposed a human action recognition system that extracts spatio-temporal and motion features automatically using an evolutionary algorithm, namely genetic programming. These features are scale- and shift-invariant and also extract color information from optical flow sequences. Classification is finally performed using an SVM, but the automatic learning requires a time-consuming training process. The approach in [16] defined a Fisher vector model based on spatio-temporal local features. Conventional dictionary learning approaches are not appropriate for Fisher vectors extracted from such features; therefore, the authors of [16] proposed Multiple Instance Discriminative Dictionary Learning (MIDDL) for human action recognition. Recently, a frequency-domain representation of multi-scale trajectories was proposed [17]: critical points are extracted from the optical flow field of each frame, multi-scale trajectories are then generated from these points and transformed into the frequency domain, and this frequency information is finally combined with other information such as motion orientation and shape. The computational complexity of this method is high due to the use of optical flow. The authors of [18] represented skeleton information as a directed acyclic graph that captures the kinematic dependencies between the joints and bones of the human body.
The recently proposed Deep Convolutional Generative Adversarial Network (DCGAN) [19] bridges the gap between supervised and unsupervised learning. The authors proposed a semi-supervised framework for action recognition that uses the trained discriminator from the GAN model; however, the method evaluates features based on the appearance of the human and does not account for motion in the feature building stage. In [20], actions are represented in terms of distinct action sketches: sketches are formed using fast edge detection, the person in each frame is detected by an R-CNN, and ranking and pooling are deployed to design the distinct action sketch; improved dense trajectories and pooled fused features are then provided to an SVM classifier for action recognition. VideoLSTM, a new recurrent neural network architecture, was proposed in [21]. This architecture can adaptively fit the requirements of a given video by exploiting a new spatial layout, motion-based attention for relevant spatio-temporal locations, and action localization derived from VideoLSTM. In addition, several other methods have been proposed over the decades [22].

3. Proposed Framework

The proposed HAR framework is shown in Figure 1 and involves three parts: moving object detection, feature extraction, and action classification.

3.1. Moving Object Detection

Moving object detection plays a crucial role in many computer vision applications. The process classifies the pixels of each frame of a video stream as background or foreground pixels, and a model representing the background is generated. The background is then removed from each frame to enable moving object detection; this process is referred to as background subtraction. Popular background subtraction techniques include frame differencing [23,24], shadow removal [25], the Gaussian mixture model (GMM) [26], and CNN-based background removal [27]. However, an algorithm for moving object detection without any background modeling was presented in [28,29,30], and the detailed procedure is given below.
First, an average filter is applied to the video sequence $I(m,n,t)$ of size $X \times Y$ at a particular time $t$:
$I_{avg} = I(m,n,t) \otimes A(X,Y),$
where $A$ represents the averaging filter with mask size $X \times Y$, and $\otimes$ represents the convolution between two images. Next, a Gaussian filter is applied to the image:
$I_{gaussian} = I(m,n,t) \otimes G(h,\sigma),$
where $G$ is the Gaussian low-pass filter. The saliency value calculated at each pixel $(m,n)$ is given as
$dist[I_{gaussian}(m,n), I_{avg}(m,n)] = \lVert I_{gaussian}(m,n) - I_{avg}(m,n) \rVert,$
$S(m,n) = dist[I_{gaussian}(m,n), I_{avg}(m,n)],$
where $dist$ denotes the distance between the respective images and $S(m,n)$ contains the moving object obtained from the given video. In the proposed approach, moving object detection is then performed as
$FG(m,n) = [\,|I(m,n,t) - I(m,n,t-1)| > Threshold\,],$
where $FG(m,n)$ defines the moving object from the video sequence $I$. The moving object detection is therefore fast and computationally efficient, as the method is background-independent; in other words, the time-consuming process of updating the background at regular intervals is not needed.
The moving object detection performance of the method is depicted in Figure 2 and Figure 3 for two different video sequences: the first column shows a snapshot from each video, the second column shows the saliency map, the third column shows the silhouette created using morphological operations, and the fourth column shows the detected moving objects.
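A minimal sketch of this detection step is given below, assuming OpenCV and NumPy are available; the filter sizes, the Gaussian parameters, and the difference threshold are illustrative assumptions rather than the exact settings used in the paper.

import cv2
import numpy as np

def detect_moving_object(frame, prev_frame, avg_ksize=(9, 9), gauss_ksize=(5, 5),
                         sigma=1.5, diff_threshold=25):
    """Sketch of the saliency-based moving object detection described above.

    frame, prev_frame: grayscale frames I(m, n, t) and I(m, n, t-1).
    Returns the saliency map S(m, n) and the foreground mask FG(m, n).
    """
    gray = frame.astype(np.float32)

    # I_avg: average-filtered frame; I_gaussian: Gaussian low-pass filtered frame
    i_avg = cv2.blur(gray, avg_ksize)
    i_gauss = cv2.GaussianBlur(gray, gauss_ksize, sigma)

    # Saliency value at each pixel: distance between the two filtered images
    saliency = np.abs(i_gauss - i_avg)

    # Frame differencing: FG(m, n) = [|I(t) - I(t-1)| > Threshold]
    diff = np.abs(gray - prev_frame.astype(np.float32))
    fg_mask = (diff > diff_threshold).astype(np.uint8) * 255

    # Morphological cleanup to obtain a silhouette (cf. Figures 2 and 3)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_CLOSE, kernel)

    return saliency, fg_mask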

3.2. Feature Extraction

The procedure for extracting feature descriptors from a segmented object is shown in Figure 4, which represents action in a compact three-dimensional space associated with an object, background scene, and variation that appears in the object over time. After detecting and segmenting moving objects from each video sequence, compact features are extracted. In the proposed approach, we calculate the following features.
  • HOG over 10 non-overlapping frames (HOGAVG10):
    Here, we use HOG, which was proposed by Dalal and Triggs [31] in 2005 and is still a highly effective feature for human detection. The segmented object is resized to a fixed size (e.g., 128 × 64). The HOG features extracted from the resized segmented object (per frame) have a dimensionality of 3780, as explained in Figure 4. Each video has 120 frames; therefore, the final descriptor for each video containing one action is 3780 × 120. Such feature descriptors contain redundant data, and the computational cost for learning and testing is excessive. In the proposed approach, we therefore average the HOG features over windows of 10 non-overlapping frames (HOGAVG10), because the object does not change considerably over these frames, as shown in Figure 5. Thus, there is a considerable reduction in redundant data (an illustrative sketch of this feature computation is given at the end of this subsection).
  • Displacement in Object Position (OBJ_DISP):
    To evaluate the displacement of an object, the centroid (or center of mass) of the silhouette corresponding to the object is calculated by taking the (arithmetic) mean of its pixels:
    $\mu(x_i, y_j, t) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} C(x_i, y_j)$
    Suppose that the centroid of the present frame is $C(x_t, y_t, t)$ and that of the past frame is $C(x_{t-1}, y_{t-1}, t-1)$. Then, the displacement (OBJ_DISP) $D(x_t, y_t, t)$ can be approximated using
    $OBJ\_DISP(x_t, y_t, t) = \sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}$
  • Velocity of Object (OBJ_VELO):
    Similar to the displacement feature, the extraction of the velocity feature also requires the centroid of the detected moving object. The displacement and velocity features are used to estimate the motion of the moving object; they increase the inter-class distance, which subsequently increases the accuracy of the overall proposed framework.
    The velocity OBJ_VELO$(x_t, y_t, t)$ of the object is estimated using
    $OBJ\_VELO(x_t, y_t, t) = \frac{OBJ\_DISP(x_t, y_t, t)}{\Delta t},$
    where $\Delta t = t_{i+10} - t_i$ (i.e., $\Delta t = 10$ frames in our proposed approach) and OBJ_DISP refers to the displacement.
  • Regional Features from Fourier HOG [32] (R_FHOG):
    In this work, we extend the regional features from Fourier HOG proposed in [32] to action recognition. In the Cartesian coordinate system, a two-dimensional function is represented by $f(x,y) \in \mathbb{R}^2$. The polar representation of the same function is $[r, \theta]$, where $r$ is the radius and $\theta$ is the angle. The relation between polar and Cartesian coordinates is
    $r = \lVert f \rVert = \sqrt{x^2 + y^2},$
    and
    $\theta = \arctan(y, x) \in [0, 2\pi).$
    In the polar coordinate system, the Fourier transform is a combination of radial and angular parts. The basis function $B$ for the Fourier transform in polar coordinates is defined as
    $B_{k,m}(r, \theta) = k\, J_m(kr)\, \Phi_m(\theta),$
    where $k$ is a non-negative value that also defines the scale of the pattern, $J_m(kr)$ is an $m$th-order Bessel function, and $\Phi_m(\theta) = \frac{1}{\sqrt{2\pi}} e^{im\theta}$. The value $k$ can be continuous or discrete, depending on whether the region is infinite or finite. When the transform is taken over a finite region $r \le a$, the basis function reduces to
    $B_{n,m}(r, \theta) = k\, R_{nm}(r)\, \Phi_m(\theta),$
    where
    $R_{nm}(r) = \frac{1}{N_n(m)} J_m(k_{nm}\, r).$
The basis function $B_{n,m}(r,\theta)$ is orthogonal and orthonormal in nature. For $B_{n,m}(r, \theta)$, $m$ is the number of cycles in the angular direction and $n-1$ is the number of zero crossings in the radial direction.
As the values of m and n increase, finer details can be extracted from the image. Generally, the evaluation of HOG features involves three steps namely gradient orientation binning, spatial aggregation, and magnitude normalization, which are followed in the Fourier domain as well.
Step 1: 
Gradient Orientation Binning:
The gradient of the image $I(x,y) \in \mathbb{R}^2$ is defined as $G(x,y) = [G_x, G_y]$, and its polar representation is defined as
$F_m(r, \theta) = \lVert G \rVert\, e^{i \angle G},$
where $\lVert G \rVert = \sqrt{G_x^2 + G_y^2}$ and $\angle G = \arctan(G_y, G_x) \in [0, 2\pi)$. Gradient orientations are stored in histogram bins using a distribution function $h$ at each pixel. Suppose that the gradient of an image is represented as $G = [G_x\ G_y] \in \mathbb{R}^2$. The angular part of $G$ is $\Phi(G)$, and the distribution function $h$ for each pixel is a Dirac function weighted by $\lVert G \rVert$:
$h(\theta) = \lVert G \rVert\, \delta(\theta - \Phi(G)).$
In this work, the Fourier basis representation is obtained through the Fourier coefficients $\hat{f}_m$:
$\hat{f}_m = \langle h, e^{im\phi} \rangle = \lVert G \rVert\, e^{im\Phi(G)}.$
In HOG, the magnitude contribution of each gradient vector is split among the three closest bins; therefore, it can be considered a triangular interpolation. In Fourier space, a 1D triangular kernel can be employed to implement the gradient orientation binning when building a HOG feature. However, executing this particular step does not affect the results; therefore, it has not been considered in the proposed work.
Step 2: 
Spatial Aggregation:
To achieve spatial aggregation, a convolution is performed between a Gaussian (or isotropic) kernel and the Fourier coefficients obtained above.
Step 3: 
Local Normalization:
An isotropic kernel is convolved with the Fourier coefficients to normalize the gradient magnitude. Steps 2 and 3 are performed using two kernels: the first kernel, $K_1: \mathbb{R}^2 \to \mathbb{R}$, is used for spatial aggregation, and the second kernel, $K_2: \mathbb{R}^2 \to \mathbb{R}$, is used for local normalization. Finally, the Fourier HOG is computed as
$\tilde{F}_m = \dfrac{F_m \ast K_1}{\sqrt{\lVert G \rVert^2 \ast K_2}}.$
  • Regional descriptor using Fourier HOG:
To obtain the regional descriptor, a convolution is performed with the Fourier basis function (in polar representation) $B_{n,m}(r, \theta)$:
$R_{n,m} = B_{n,m}(r, \theta) \ast \tilde{F}_m.$
A graphical illustration of the calculation of the R_FHOG features is provided in Figure 6, and Figure 7 shows R_FHOG (i.e., $R_{n,m}$) for the segmented object. To speed up the process, we consider only non-redundant data; therefore, we select the region features that give the maximum response on the human region. The final template is formed from region features with scale $\{1\}$, order $\{-1, 1\}$, and degree $\{1, 2\}$; the template is shown in Figure 8.
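The following sketch illustrates how the HOGAVG10, OBJ_DISP, and OBJ_VELO features described above could be computed for one video, assuming scikit-image and NumPy; the helper names are hypothetical, the HOG cell and block settings reproduce the standard 3780-dimensional descriptor, and the R_FHOG computation (which follows [32]) is omitted.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize

WINDOW = 10  # non-overlapping window of 10 frames

def hog_avg10(silhouettes):
    """HOGAVG10: average HOG descriptors over non-overlapping 10-frame windows.
    silhouettes: list of grayscale segmented-object images, one per frame."""
    descriptors = []
    for frame in silhouettes:
        patch = resize(frame, (128, 64))  # fixed detection-window size
        descriptors.append(hog(patch, orientations=9, pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2)))  # 3780-D per frame
    descriptors = np.asarray(descriptors)
    return np.asarray([descriptors[i:i + WINDOW].mean(axis=0)
                       for i in range(0, len(descriptors) - WINDOW + 1, WINDOW)])

def centroid(mask):
    """Centroid (center of mass) of a binary silhouette."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def displacement_and_velocity(masks, dt=WINDOW):
    """OBJ_DISP and OBJ_VELO between silhouettes spaced dt frames apart."""
    cents = [centroid(m) for m in masks[::dt]]
    disp = [np.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(cents[:-1], cents[1:])]
    return np.asarray(disp), np.asarray(disp) / dt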

3.3. Fusion of Features

The motivation behind fusing features is to increase diversity within classes and thus improve classification.
  • HOGAVG10 + OBJ_DISP:
    Here, we fuse HOGAVG10 with OBJ_DISP. The importance of this parameter is to differentiate between actions performed at a static location (e.g., boxing, hand waving, and hand clapping) and actions performed at a dynamic location (e.g., walking, jogging, and running). Therefore, we gain inter-class discriminative power by combining these two features.
    The position of an object does not change drastically; thus, we propose to employ the window concept to investigate the object motion over that period. In addition, we take the average of the positions to reduce the feature set. This feature is important as it provides the inter-frame offset corresponding to the object position. The displacement values for all classes are shown in Table 1.
  • HOGAVG10 + OBJ_VELO:
    Actions with smaller interclass distances such as walking, jogging and running can be distinguished using velocity features. Therefore, we propose to fuse HOGAVG10 with OBJ_VELO.
  • HOGAVG10 + OBJ_DISP + OBJ_VELO:
    The HOGAVG10 + OBJ_DISP feature combination can differentiate actions performed at static/dynamic locations, whereas the HOGAVG10 + OBJ_VELO feature combination can effectively differentiate classes with similar actions. Therefore, we propose to combine HOGAVG10 + OBJ_DISP + OBJ_VELO to effectively classify similar actions performed at static/dynamic locations present in KTH and Weizmann datasets. The velocity values of persons performing actions are reported in Table 2.
  • R_FHOG + HOGAVG10 + OBJ_DISP+ OBJ_VELO:
    The R_FHOG feature is effective at splitting the frequency gradient into bands, subsequently emphasizing the human action region. In other words, R_FHOG represents crucial information regarding boundaries and smoothed shapes. R_FHOG also provides information regarding the spatial context of a human subject.

3.4. Formal Description

This section presents the proposed fusion techniques in detail. Fusion techniques are performed at both feature and classifier level, referred to as early and late fusion techniques, respectively.

3.4.1. Early Fusion

Feature fusion is performed using a basic technique: the features are concatenated one after another, as shown in Figure 9.
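As a concrete illustration, early fusion reduces to concatenating the per-video feature vectors into a single descriptor that is then fed to the chosen classifier; a minimal sketch (the variable names are hypothetical) is given below.

import numpy as np

def early_fusion(hog_avg10, obj_disp, obj_velo, r_fhog):
    """Concatenate the individual feature vectors into one descriptor per video."""
    return np.concatenate([np.ravel(hog_avg10), np.ravel(obj_disp),
                           np.ravel(obj_velo), np.ravel(r_fhog)])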

3.4.2. Late Fusion

Late fusion is utilized in this work to accomplish fusion at the classifier level. The two distinct late fusion approaches utilized in the current investigation are the Decision Combination Neural Network (DCNN) and Sugeno's fuzzy integral.
  • Decision Combination Neural Network (DCNN) 
The Decision Combination Neural Network (DCNN) [33] is a neural network architecture with no hidden layers. Accordingly, the DCNN characterizes a linear relation between the input and output nodes. The highest response of a particular output layer node is taken as the decision, or class label, for action recognition. Details of the DCNN follow.
As shown in Figure 10, this neural network contains two layers: an input layer (S) and an output layer (Z). The outputs of the M classifiers are fed to the input layer, and there are N input nodes associated with each class. The units (nodes) of the input layer and output layer are interconnected by weights w. Each input node receives a score $s_{ik}$, where $i$ denotes the $i$th classifier and $k$ denotes the $k$th class. If the input $s_{ik}$ is connected to output node $j$, the weight of this connection is denoted $w_{ijk}$. The maximum response at the output layer nodes is taken as the action recognition decision.
A sigmoid activation function is used in each node, so the response of this proposed late fusion approach is defined as
$h_j(S_1, \dots, S_M) = \sum_{i=1}^{M} \sum_{k=1}^{N} w_{ijk}\, s_{ik},$
$DCNN_j(S_1, \dots, S_M) = \frac{1}{1 + e^{-h_j(S_1, \dots, S_M)}}.$
  • Sugeno’s Fuzzy Integral 
The assumption behind a simple weighted-average scheme is that the classifiers are mutually independent; in practice, however, the classifiers are correlated. To remove the need for such an assumption, the idea of the fuzzy integral was introduced by the authors of [34,35]; it is a nonlinear mapping function characterized by a fuzzy measure. A fuzzy integral is a fuzzy average of the classifier scores. Definitions of the fuzzy measure and the fuzzy integral are given below.
Definition 1.
Let X be a finite set defined as $\{x_1, x_2, \dots, x_n\}$. A fuzzy measure $\mu$ defined on X is a set function $\mu: 2^X \to [0, 1]$ satisfying
1. $\mu(\emptyset) = 0$, $\mu(X) = 1$;
2. $A \subseteq B \Rightarrow \mu(A) \le \mu(B)$.
The fuzzy integral we adopt in this work is the Sugeno integral.
Definition 2.
Let $\mu$ be a fuzzy measure on X. The discrete Sugeno integral of a function $f: X \to [0, 1]$ with respect to $\mu$ is defined as
$S_\mu(f(x_1), f(x_2), \dots, f(x_n)) = \max_{i=1}^{n} \min\left(f(x_{(i)}), \mu(A_{(i)})\right),$
where $(i)$ indicates that the indices have been permuted so that $0 \le f(x_{(1)}) \le f(x_{(2)}) \le \dots \le f(x_{(n)}) \le 1$. Moreover, $A_{(i)} := \{x_{(i)}, \dots, x_{(n)}\}$ and $f(x_{(0)}) = 0$.
The fuzzy measure $\mu$ is a $\mu_\lambda$-fuzzy measure and is calculated using Sugeno's $\lambda$-measure. The value of $\mu(A_{(i)})$ is calculated recursively as
$\mu(A_{(1)}) = \mu(\{x_{(1)}\}) = \mu_1,$
$\mu(A_{(i)}) = \mu_i + \mu(A_{(i-1)}) + \lambda\, \mu_i\, \mu(A_{(i-1)}) \quad \text{for } 1 < i \le n,$
and the value of $\lambda$ is obtained by solving the equation
$\lambda + 1 = \prod_{i=1}^{n} (1 + \lambda \mu_i),$
where $\lambda \in (-1, +\infty)$ and $\lambda \neq 0$. This can easily be computed by evaluating an $(n-1)$st-degree polynomial and determining the unique root greater than −1. The fuzzy integral is used in the proposed work as a late fusion method for combining the classifier scores. Assume that $C = \{c_1, c_2, \dots, c_n\}$ is the set of action classes of interest, let $X = \{x_1, x_2, \dots, x_n\}$ be the set of classifiers, and let A be an input pattern considered for action recognition. Let $f_k: X \to [0, 1]$ be the evidence for the object A belonging to class $c_k$; that is, $f_k(x_i)$ indicates the certainty of the classification of the input pattern A into class $c_k$ using classifier $x_i$. A value of 1 for $f_k(x_i)$ denotes absolute certainty that the input pattern A belongs to class $c_k$, and 0 denotes absolute certainty that it does not.
Knowledge of the density function is needed to compute the fuzzy integral; the $i$th density $\mu_i$ is interpreted as the degree of importance of the source $x_i$ towards the final decision. The fuzzy integral can be viewed as the maximal grade of agreement between the evidence and the expectation. In the proposed approach, the density function $\mu$ is approximated from the training data provided to the classifier. Algorithm 1 defines the late fusion (decision fusion) approach; an illustrative sketch follows the algorithm.
Algorithm 1 Late fusion (decision fusion) using fuzzy integral.
procedure FuzzyIntegral
    Calculate λ ;
    for each action class c k do
        for each classifier x i do
           Compute f k ( x i )
           Determine μ k ( { x i } )
        end for
        Calculate fuzzy integral for the action class
    end for
    Find out the action class label
end procedure
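A minimal sketch of Algorithm 1 is given below, assuming that the per-classifier class scores f_k(x_i) and the fuzzy densities mu_i (e.g., derived from validation accuracies) are already available; the density values in the usage comment are purely illustrative.

import numpy as np

def solve_lambda(densities, tol=1e-9):
    """Root lambda > -1 (lambda != 0) of  lambda + 1 = prod(1 + lambda * mu_i)."""
    g = lambda lam: np.prod(1.0 + lam * np.asarray(densities)) - lam - 1.0
    # The useful root lies in (-1, 0) when sum(mu_i) > 1 and in (0, inf) otherwise.
    lo, hi = (-1.0 + 1e-12, -1e-12) if sum(densities) > 1 else (1e-12, 1e9)
    for _ in range(200):  # simple bisection
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def sugeno_integral(scores, densities, lam):
    """Discrete Sugeno integral of the classifier scores f_k(x_i) for one class."""
    order = np.argsort(scores)          # so that f(x_(1)) <= ... <= f(x_(n))
    f = np.asarray(scores)[order]
    mu = np.asarray(densities)[order]
    # mu(A_(i)) for the tail sets A_(i) = {x_(i), ..., x_(n)}, built recursively
    mu_A = np.zeros(len(f))
    mu_A[-1] = mu[-1]
    for i in range(len(f) - 2, -1, -1):
        mu_A[i] = mu[i] + mu_A[i + 1] + lam * mu[i] * mu_A[i + 1]
    return float(np.max(np.minimum(f, mu_A)))

def fuzzy_integral_fusion(score_matrix, densities):
    """score_matrix[k][i]: score of classifier i for class k. Returns the class label."""
    lam = solve_lambda(densities)
    fused = [sugeno_integral(row, densities, lam) for row in score_matrix]
    return int(np.argmax(fused))

# Example (illustrative numbers): fuzzy_integral_fusion([[0.7, 0.9], [0.2, 0.4]], [0.6, 0.5])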

3.5. Classifier

Various classifiers have been used to evaluate the performance of the proposed approach. The parameters and their respective values are summarized in Table 3. For the SVM, we considered the kernel function with its degree (d), the kernel parameter gamma ($\gamma$), and the regularization parameter (c); polynomial and radial basis kernel functions were used.
The parameters of the ANN are the number of hidden layer neurons (n), the learning rate (lr), the momentum constant (mc), and the number of epochs (ep). To determine the values of these parameters efficiently, ten levels of n, nine levels of mc, and ten levels of ep were evaluated in parameter setting experiments, with the value of lr initially fixed at 0.1. The values of these parameters and their respective levels are given in Table 4.
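As an illustration of such a parameter search (the level values below are placeholders, not the actual levels of Table 3 and Table 4, and X_train/y_train denote the training descriptors and labels), the sweep could be set up with scikit-learn as follows.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM: kernel with degree (d), gamma, and regularization parameter (c)
svm_grid = GridSearchCV(
    SVC(),
    {"kernel": ["poly", "rbf"], "degree": [2, 3, 4],
     "gamma": [1e-3, 1e-2, 1e-1], "C": [1, 10, 100]},
    cv=10)

# ANN: hidden neurons (n), momentum constant (mc), epochs (ep); lr fixed at 0.1
ann_grid = GridSearchCV(
    MLPClassifier(solver="sgd", learning_rate_init=0.1),
    {"hidden_layer_sizes": [(n,) for n in range(10, 110, 10)],
     "momentum": [round(0.1 * m, 1) for m in range(1, 10)],
     "max_iter": list(range(100, 1100, 100))},
    cv=10)

# svm_grid.fit(X_train, y_train); ann_grid.fit(X_train, y_train)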

3.5.1. Meta-Cognitive Neural Network (McNN) Classifier

A neural network provides a self-learning mechanism, whereas the meta-cognitive phenomenon comprises self-regulated learning. Self-regulation makes the learning process more effective; therefore, there is a need to move from single, simple learning to collaborative learning. Collaborative learning can be achieved using a cognitive component, which interprets knowledge, and a meta-cognitive component, which represents the dynamic model of the cognitive component.
Self-regulated learning is a key factor of meta-cognition. It is a threefold mechanism: it plans, monitors, and manages feedback. According to Flavell [37], meta-cognition is the awareness and knowledge of one's mental processes, used to monitor, regulate, and direct them towards a desired goal. We adopt here Nelson and Narens's meta-cognitive model [38]. The cognitive component and the meta-cognitive component are the prime entities of the McNN. A detailed architecture of the meta-cognitive neural network is shown in Figure 11.

3.5.2. Cognitive Component

The cognitive component is a three-layered feedforward radial basis function network comprising an input layer, a hidden layer, and an output layer. The activation function of the hidden neurons is Gaussian, whereas the output neurons use a linear activation function. Hidden layer neurons are built by the meta-cognitive algorithm. The predicted output $\hat{y}$ of the McNN classifier with K Gaussian neurons, learned from the first $i-1$ training samples, is
$\hat{y}_j^i = \alpha_{j0} + \sum_{k=1}^{K} \alpha_{jk}\, \phi_k(x^i), \quad j = 1, 2, \dots, n,$
where $\alpha_{j0}$ is the bias of the $j$th output neuron, $\alpha_{jk}$ is the weight connecting the $k$th hidden neuron to the $j$th output neuron, and $\phi_k(x^i)$, the response of the $k$th Gaussian neuron to the input $x^i$, is given by
$\phi_k(x^i) = \exp\left(-\dfrac{\lVert x^i - \mu_k^l \rVert^2}{(\sigma_k^l)^2}\right),$
where $\mu_k^l$ is the mean, $\sigma_k^l$ is the width (variation around the mean) of the $k$th hidden neuron, and $l$ represents the class of the hidden neuron.
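A minimal NumPy sketch of this forward pass of the cognitive component (with hypothetical variable names) is given below.

import numpy as np

def mcnn_predict(x, centers, widths, alpha, alpha0):
    """Forward pass of the cognitive (RBF) component.

    x:       input feature vector (d,)
    centers: (K, d) Gaussian neuron means mu_k
    widths:  (K,)   Gaussian neuron widths sigma_k
    alpha:   (n, K) output weights alpha_jk
    alpha0:  (n,)   output biases alpha_j0
    Returns the predicted outputs y_hat_j, j = 1..n.
    """
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1) / widths ** 2)  # Gaussian responses
    return alpha0 + alpha @ phi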

3.5.3. Meta-Cognitive Component

  • Measures: 
The meta-cognitive component of the McNN uses four measures to regulate learning:
  • Estimated class label:
    The estimated class label $\hat{c}$ can be calculated from the predicted output $\hat{y}^i$ as
    $\hat{c} = \arg\max_{j \in \{1, 2, \dots, n\}} \hat{y}_j^i$.
  • Maximum hinge error:
    The hinge error estimates the posterior probability more precisely than the mean square error function; the error between the predicted output $\hat{y}^i$ and the actual output $y^i$ under the hinge loss is defined as
    $e_j = \begin{cases} 0 & \text{if } \hat{y}_j^i\, y_j^i > 1 \\ \hat{y}_j^i - y_j^i & \text{otherwise} \end{cases}, \quad j = 1, 2, \dots, n.$
    The maximum absolute hinge error $E$ is
    $E = \max_{j \in \{1, \dots, n\}} |e_j|$.
  • Confidence of classifier:
    The classifier confidence is given as
    $\hat{p}(c \mid x^i) = \dfrac{\min(1, \max(-1, \hat{y}_j^i)) + 1}{2}$.
  • Class-wise significance:
    The input feature is mapped to a higher-dimensional space S using the Gaussian activation functions of the hidden layer neurons; the mapped sample can therefore be considered to lie on a hyper-dimensional sphere. The feature space S is described by the mean $\mu$ and the width $\sigma$ of the Gaussian neurons. The steps for calculating the spherical potential $\psi$ are given in [39]; it is expressed as
    $\psi = \frac{2}{K} \sum_{k=1}^{K} \phi(x^i, \mu_k^l)$.
    In a classification problem, the distribution of each class is crucial and significantly affects the accuracy of the classifier. Therefore, the spherical potential of new training data $x$ belonging to class $c$ is measured with respect to the neurons belonging to the same class, i.e., $l = c$. The class-wise significance $\psi_c$ is calculated as
    $\psi_c = \frac{1}{K_c} \sum_{k=1}^{K_c} \phi(x^i, \mu_k^c)$,
    where $K_c$ is the number of neurons associated with class $c$. Whether the sample contains relevant information depends on $\psi_c$; a low value indicates that the sample is novel.
  • Learning Strategy: 
Based on these measures, the meta-cognitive component has different learning strategies, which implement the basic rules of self-regulated learning. These strategies manage the sequential learning process by applying one of them to each new training sample.
  • Sample Delete Strategy:
    This strategy reduces the computational time consumed by the learning process. It reduces redundancy in the training samples, i.e., it prevents similar samples from being learnt by the cognitive component. The measures used for this strategy are the predicted class label and the confidence level. When the actual and predicted class labels of the new training sample are equal and the confidence score is greater than the expected value, the new training sample is considered redundant.
  • Neuron growth strategy:
    This strategy decides whether a new hidden neuron should be added to the cognitive component. When a new training sample contains substantial information and the estimated class label differs from the actual class label, a new hidden neuron is added to capture this knowledge.
  • Parameter update strategy:
    In this strategy, the parameters of the cognitive component are updated using the new training sample. The parameter values change when the actual class label is the same as the predicted class of the sample and the maximum hinge loss error is greater than a threshold set for adaptive parameter updating.
  • Sample reserve strategy:
    New training samples that contain some information, but little that is relevant, are used later to fine-tune the parameters of the cognitive component.
The parameters of the McNN are updated when the desired class is equal to the actual class. The thresholds on the maximum hinge error E for the neuron growth and parameter update strategies lie between 1.2 and 1.5 and between 0.3 and 0.8, respectively. For the parameter update strategy, a value close to 1 prevents the system from using any sample, whereas a value close to 0 causes all samples to be used for updating. In the neuron addition strategy, a threshold of 1 on E leads to misclassification of all samples, while a value of 2 causes few neurons to be added. The other parameters are selected accordingly, and the ranges of the parameter values are shown in Table 5. An illustrative sketch of these meta-cognitive measures and strategy checks is given below.
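A minimal sketch of the four measures and the resulting strategy selection, reusing the mcnn_predict sketch above, is given below; the thresholds and the exact combination of conditions are illustrative assumptions rather than the full sequential learning algorithm of [39].

import numpy as np

def metacognitive_decision(x, y_coded, model, th):
    """Compute the McNN measures for one sample and pick a learning strategy.

    y_coded: actual output coded as +1 for the true class and -1 otherwise.
    model:   dict with 'centers' (K,d), 'widths' (K,), 'alpha' (n,K), 'alpha0' (n,)
             and 'neuron_class' (K,) giving the class of each hidden neuron.
    th:      dict of illustrative thresholds: 'delete', 'grow', 'update', 'novelty'.
    """
    y_hat = mcnn_predict(x, model["centers"], model["widths"],
                         model["alpha"], model["alpha0"])
    c_true, c_hat = int(np.argmax(y_coded)), int(np.argmax(y_hat))   # estimated class label

    e = np.where(y_hat * y_coded > 1, 0.0, y_hat - y_coded)
    E = float(np.max(np.abs(e)))                                     # maximum hinge error

    p_hat = (min(1.0, max(-1.0, y_hat[c_true])) + 1.0) / 2.0         # classifier confidence

    same = np.flatnonzero(np.asarray(model["neuron_class"]) == c_true)
    if same.size:                                                    # class-wise significance
        phi = np.exp(-np.sum((x - model["centers"][same]) ** 2, axis=1)
                     / model["widths"][same] ** 2)
        psi_c = float(phi.mean())
    else:
        psi_c = 0.0

    if c_hat == c_true and p_hat >= th["delete"]:
        return "delete"   # sample is redundant
    if c_hat != c_true and (E >= th["grow"] or psi_c <= th["novelty"]):
        return "grow"     # add a hidden neuron to capture the novel knowledge
    if c_hat == c_true and E >= th["update"]:
        return "update"   # adapt the cognitive component's parameters
    return "reserve"      # keep the sample for later fine-tuning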

4. Performance Evaluation

The performance of the proposed work was evaluated using a comprehensive set of performance measures through extensive experiments on standard datasets, as described below.

4.1. Database Used

The proposed approach was applied to two datasets: the KTH [40] and Weizmann [41] datasets. These datasets are popular benchmarks for action recognition in constrained video sequences; they incorporate only one action per frame with a static background.

4.1.1. KTH Dataset

The KTH dataset contains action clips with variations in the background, object, and scale, and was thus useful for determining the accuracy of our proposed method. The video sequences contain six different types of human actions (i.e., walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation (zooming), outdoors with different clothes (appearance), and indoors, as illustrated below. Static and homogeneous backgrounds are considered in all sequences, where the frame rate is 25 frames per second. The resolution of these videos is 160 × 120 pixels, and the duration of the videos is four seconds on average. There are 25 videos for each action in the four different categories. Certain snapshots of video sequences from the KTH dataset are shown in Figure 12.

4.1.2. Weizmann Dataset

The Weizmann database [41] is a collection of 90 low-resolution (180 × 144, de-interlaced 50 frames per second) video sequences. The dataset contains nine different humans, each one performing ten natural actions: run, walk, skip, jumping-jack (or shortly jack), jump forward on two legs (or jump), jump in place on two legs (or pjump), gallop sideways (or side), wave two hands (or wave2), wave one hand (or wave1), or bend. Snapshots of the Weizmann dataset are shown in Figure 13.

4.1.3. UCF11 Dataset

The UCF11 dataset [42] contains 11 human actions in 1600 videos. These are YouTube videos of real human actions. The actions are performed by 25 different human subjects under challenging conditions such as large changes in viewpoint, object scale, object appearance and pose, camera motion, cluttered backgrounds, and illumination variation. The 11 action categories in UCF11 are basketball shooting (Shoot), biking/cycling (Bike), diving (Dive), golf swinging (Golf), horse back riding (Ride), soccer juggling (Juggle), swinging (Swing), tennis swinging (Tennis), trampoline jumping (Jump), volleyball spiking (Spike), and walking with a dog (Dog).

4.1.4. HMDB51 Dataset

The HMDB51 dataset [43] is built from videos taken from YouTube, movies, and various other sources to cover an unconstrained environment. The dataset comprises 6849 video clips and 51 action categories, with at least 101 clips per class.

4.1.5. UCF101 Dataset

UCF101 [44] is a dataset of 13,320 videos covering 101 different action classes. It reflects a large diversity in terms of the appearance of the human performing the action, the scale and viewpoint of the object, background clutter, and illumination variation, making it one of the most challenging datasets and a bridge towards real-world action recognition.

4.2. The Testing Strategy

The KTH dataset contains 600 video samples of 6 types of human actions. The dataset is divided into two parts of 80% and 20%: we use a 10-fold leave-one-out cross-validation scheme on the 80% part and hold out the remaining 20% for testing. In this experiment, nine splits are used for training and the remaining split is used as the validation set, which optimizes the parameters of each classifier. The same testing strategy is used for the Weizmann dataset. Leave-one-group-out cross-validation is used for the UCF11 dataset. For the HMDB51 dataset, the same cross-validation strategy as in [43] is used: the whole dataset is divided into three splits, each including 70 training and 30 testing video clips per class. The training strategy used for UCF101 is the standard three-split evaluation for training and testing. An illustrative sketch of this split setup is given below.
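An illustrative sketch of this split setup with scikit-learn is given below; the arrays X, y, and groups stand for the video descriptors, action labels, and subject/group identifiers and are filled with placeholder data here.

import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 128))            # placeholder descriptors
y = rng.integers(0, 6, size=100)           # placeholder action labels
groups = rng.integers(0, 25, size=100)     # placeholder subject/group ids (UCF11-style)

# 80/20 split; the 20% portion is held out for final testing
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold cross-validation on the 80% part: nine splits train, one split validates
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X_trval):
    X_tr, X_val = X_trval[train_idx], X_trval[val_idx]
    y_tr, y_val = y_trval[train_idx], y_trval[val_idx]
    # fit the classifier on (X_tr, y_tr) and tune its parameters on (X_val, y_val)

# Leave-one-group-out cross-validation over the subject groups (as used for UCF11)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    pass  # train on all groups but one, test on the held-out group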

4.3. Experimental Setup

Experiments were performed on an Intel(R) Core(TM) i5 (2nd Gen) 2430M CPU @ 2.5 GHz with 6 GB of RAM and a 64-bit operating system. The names of the parameters and the values used in this proposed work are listed in Table 3 and Table 4, respectively. In this section, we examine the performance of our proposed approach and compare it with the state-of-the-art methods. We also compare the performance of different classifiers combined with our feature extraction technique within the proposed framework. All confusion matrices report the average accuracy across all features for the SVM classifier with different kernel functions, as well as for the ANN with different numbers of hidden layers.
In this experiment, different types of fusion techniques, i.e., early and late fusion, were also considered. We employed five different fusion strategies in the proposed work. Figure 14, Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19 present the various early and late fusion models used in our experiments. In Figure 14, early fusion is applied to the features, which are then fed to an ANN classifier, while the early fusion of features fed to an SVM classifier is shown in Figure 15. Features provided to MKL with an ANN base learner and to MKL with an SVM base learner are shown in Figure 16 and Figure 17, respectively. Figure 18 shows the combination of classifier scores using late fusion techniques, where an SVM classifier is used. The meta-cognitive neural network is used with all proposed features, as shown in Figure 19.

4.4. Empirical Analysis

The confusion matrices for different combinations of feature extraction and classifier techniques on the KTH dataset are shown in Table 6 and Table 7. Table 8, Table 9 and Table 10 show the results for the Weizmann dataset. We considered linear, polynomial, and radial basis kernel functions for the SVM classification. The results demonstrate that we obtain a good result (97%) with the radial basis function SVM and the best result (99.98%) with late fusion using the fuzzy integral approach, compared with the other proposed approaches. Ambiguity arises among classes such as boxing, hand waving, and hand clapping. Furthermore, running, walking, and jogging are misclassified by all classifiers.
The confusion matrix obtained with the radial basis function SVM (RBF SVM) for the UCF11 dataset is shown in Table 11. We achieved 77.05% accuracy on the UCF11 dataset with the previously mentioned SVM classifier parameters given in Table 4. Table 12, Table 13 and Table 14 show the confusion matrices for the KTH, Weizmann, and UCF11 datasets using the McNN. The UCF11 dataset has unconstrained environments and its video sequences contain various challenges, so the proposed feature extraction technique is not adequate for describing the actions performed by the human subjects; therefore, many actions are misclassified as other actions, e.g., Shoot is misclassified as Swing.
The accuracy obtained for the KTH dataset using late fusion with the DCNN is 99.19%, while late fusion using the fuzzy integral achieves 99.98%; i.e., the fuzzy integral technique is more effective than the DCNN technique for late fusion, as shown in Table 15.
Moreover, the performance over five broad groups is evaluated in this work using the particular model shown in Figure 20. The recognition rate has been calculated for all group categories. A large portion of the performance gain comes from the sports category, although all the other categories also perform impressively.
Table 15 and Table 16 compare our results with the state-of-the-art methods. Table 15 compares our proposed approach with 21 other approaches on the KTH dataset; our approach obtained an accuracy of 100%, outperforming the state-of-the-art methods. The comparison of the proposed approach with the state-of-the-art methods on the Weizmann dataset is shown in Table 16, and the results show that our method outperforms the other methods. These comparisons demonstrate that the proposed approach is effective and superior in classifying actions.
Table 17, Table 18 and Table 19 show the state-of-the-art comparisons for the UCF11, HMDB51, and UCF101 datasets, respectively. Our results achieve a very good classification rate compared with other approaches, although they remain more modest than the state-of-the-art results. Compared with the early and intermediate fusion techniques, the late fusion techniques are superior; among the late fusion techniques, the fuzzy integral performs better than the DCNN for the UCF11 dataset.
In Table 20, we compare our approach with various convolutional neural network architectures. For this comparison, the average accuracy is calculated over three splits, as in the original setting. For the UCF101 dataset, we find that our McNN with the proposed features performs well compared with state-of-the-art methods, with a 1% improvement in classification accuracy. Our result for the HMDB51 dataset is not the best, but the improvement in accuracy is considerable.

5. Conclusions

In this paper, we have employed a novel feature-fusion-based approach to HAR. HOG, R_FHOG, displacement, and velocity features are combined to prepare the feature descriptor. The classifiers used to classify human actions are an ANN, an SVM, MKL, the late fusion approaches, and the McNN. The experimental results demonstrate that the proposed approach can easily recognize actions such as running, walking, and jumping, and that the McNN outperforms the other classifiers. The proposed approach performs reasonably well compared with the majority of existing state-of-the-art methods. For the KTH dataset, our proposed approach outperforms existing methods, and for the Weizmann dataset our approach performs similarly to the standard available methods. We have also evaluated the system on the unconstrained UCF11, HMDB51, and UCF101 datasets, where its performance approaches that of the state-of-the-art methods.
In the future, an overlapping window can be utilized in the feature extraction technique to increase the accuracy of the proposed method. The proposed work focuses only on constrained video; however, the proposed feature set could also be used for unconstrained video, where more than one subject performs the same action or multiple actions are performed in the same video. The traditional neural network can be replaced by a convolutional neural network for further enhancement. We conclude that the fusion of features is a vital idea for enhancing classifier performance when a large, complex set of features is available. Late fusion was found to be better than early fusion, as the features are used by multiple classifiers whose competing decisions are then combined.

Author Contributions

Conception and design, C.I.P., D.L., S.P.; collection and assembly of data, C.I.P., D.L., S.P., K.M., H.G., M.A.; data analysis and interpretation, C.I.P., D.L., S.P., K.M., H.G., and M.A.; manuscript writing, C.I.P., D.L., S.P., K.M., H.G., and M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank the reviewers for their valuable suggestions which helped in improving the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hu, F.; Hao, Q.; Sun, Q.; Cao, X.; Ma, R.; Zhang, T.; Patil, Y.; Lu, J. Cyber-physical System With Virtual Reality for Intelligent Motion Recognition and Training. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 347–363. [Google Scholar]
  2. Wang, L.; Hu, W.; Tan, T. Recent developments in human motion analysis. Pattern Recognit. 2003, 36, 585–601. [Google Scholar] [CrossRef]
  3. Vallacher, R.R.; Wegner, D.M. What do people think they’re doing? Action identification and human behavior. Psychol. Rev. 1987, 94, 3–15. [Google Scholar] [CrossRef]
  4. Pullen, K.; Bregler, C. Motion capture assisted animation: Texturing and synthesis. ACM Trans. Graph. 2002, 21, 501–508. [Google Scholar] [CrossRef]
  5. Mackay, W.E.; Davenport, G. Virtual video editing in interactive multimedia applications. Commun. ACM 1989, 32, 802–810. [Google Scholar] [CrossRef]
  6. Zhong, H.; Shi, J.; Visontai, M. Detecting unusual activity in video. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, Washington, DC, USA, 27 June–2 July 2004; Volume 2. [Google Scholar]
  7. Fan, C.T.; Wang, Y.K.; Huang, C.R. Heterogeneous information fusion and visualization for a large-scale intelligent video surveillance system. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 593–604. [Google Scholar] [CrossRef]
  8. Filippova, K.; Hall, K.B. Improved video categorization from text meta-data and user comments. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 25–29 July 2011; pp. 835–842. [Google Scholar]
  9. Moxley, E.; Mei, T.; Manjunath, B.S. Video annotation through search and graph reinforcement mining. IEEE Trans. Multimed. 2010, 12, 184–193. [Google Scholar] [CrossRef] [Green Version]
  10. Peng, Q.; Cheung, Y.M.; You, X.; Tang, Y.Y. A Hybrid of Local and Global Saliencies for Detecting Image Salient Region and Appearance. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 86–97. [Google Scholar] [CrossRef]
  11. Wang, H.; Kläser, A.; Schmid, C.; Liu, C.L. Action recognition by dense trajectories. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 3169–3176. [Google Scholar]
  12. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
  13. Ni, B.; Moulin, P.; Yang, X.; Yan, S. Motion part regularization: Improving action recognition via trajectory selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3698–3706. [Google Scholar]
  14. Lan, Z.; Lin, M.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 204–212. [Google Scholar]
  15. Liu, L.; Shao, L.; Li, X.; Lu, K. Learning spatio-temporal representations for action recognition: A genetic programming approach. IEEE Trans. Cybern. 2016, 46, 158–170. [Google Scholar] [CrossRef] [Green Version]
  16. Li, H.; Chen, J.; Xu, Z.; Chen, H.; Hu, R. Multiple instance discriminative dictionary learning for action recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2014–2018. [Google Scholar]
  17. Beaudry, C.; Péteri, R.; Mascarilla, L. An efficient and sparse approach for large scale human action recognition in videos. Mach. Vis. Appl. 2016, 27, 529–543. [Google Scholar] [CrossRef]
  18. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar]
  19. Ahsan, U.; Sun, C.; Essa, I. DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks. arXiv 2018, arXiv:1801.07230. [Google Scholar]
  20. Zheng, Y.; Yao, H.; Sun, X.; Zhao, S.; Porikli, F. Distinctive action sketch for human action recognition. Signal Process. 2018, 144, 323–332. [Google Scholar] [CrossRef] [Green Version]
  21. Li, Z.; Gavrilyuk, K.; Gavves, E.; Jain, M.; Snoek, C.G.M. VideoLSTM convolves, attends and flows for action recognition. Comput. Vis. Image Underst. 2018, 166, 41–50. [Google Scholar] [CrossRef] [Green Version]
  22. Zhang, H.-B.; Zhang, Y.-X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.-X.; Chen, D.-S. A comprehensive survey of vision-based human action recognition methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Patel, C.I.; Garg, S. Comparative analysis of traditional methods for moving object detection in video sequence. Int. J. Comput. Sci. Commun. 2015, 6, 309–315. [Google Scholar]
  24. Patel, C.I.; Patel, R. Illumination invariant moving object detection. Int. J. Comput. Electr. Eng. 2013, 5, 73. [Google Scholar] [CrossRef]
  25. Spagnolo, P.; D’Orazio, T.; Leo, M.; Distante, A. Advances in background updating and shadow removing for motion detection algorithms. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Versailles, France, 5–8 September 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 398–406. [Google Scholar]
  26. Patel, C.I.; Patel, R. Gaussian mixture model based moving object detection from video sequence. In Proceedings of the International Conference & Workshop on Emerging Trends in Technology, Maharashtra, India, 25–26 February 2011; pp. 698–702. [Google Scholar]
  27. Mondéjar-Guerra, M.V.; Rouco, J.; Novo, J.; Ortega, M. An end-to-end deep learning approach for simultaneous background modeling and subtraction. In Proceedings of the BMVC, Cardiff, UK, 9–12 September 2019; p. 266. [Google Scholar]
  28. Patel, C.I.; Garg, S.; Zaveri, T.; Banerjee, A. Top-Down and bottom-up cues based moving object detection for varied background video sequences. Adv. Multimed. 2014, 2014, 879070. [Google Scholar] [CrossRef] [Green Version]
  29. Patel, C.I.; Garg, S. Robust face detection using fusion of haar and daubechies orthogonal wavelet template. Int. J. Comput. Appl. 2012, 46, 38–44. [Google Scholar]
  30. Ukani, V.; Garg, S.; Patel, C.; Tank, H. Efficient vehicle detection and classification for traffic surveillance system. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Ghaziabad, India, 11–12 November 2016; Springer: Singapore, 2016; pp. 495–503. [Google Scholar]
  31. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  32. Liu, K.; Skibbe, H.; Schmidt, T.; Blein, T.; Palme, K.; Brox, T.; Ronneberger, O. Rotation-invariant HOG descriptors using Fourier analysis in polar and spherical coordinates. Int. J. Comput. Vis. 2014, 106, 342–364. [Google Scholar] [CrossRef]
  33. Lee, D.S.; Srihari, S.N. A theory of classifier combination: The neural network approach. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 42–45. [Google Scholar]
  34. Sugeno, M. Theory of Fuzzy Integrals and Its Applications. Ph.D. Thesis, Tokyo Institute of Technology, Tokyo, Japan, 1975. [Google Scholar]
  35. Cho, S.B.; Kim, J.H. Combining multiple neural networks by fuzzy integral for robust classification. IEEE Trans. Syst. Man Cybern. 1995, 25, 380–384. [Google Scholar]
  36. Patel, J.; Shah, S.; Thakkar, P.; Kotecha, K. Predicting stock market index using fusion of machine learning techniques. Expert Syst. Appl. 2015, 42, 2162–2172. [Google Scholar] [CrossRef]
  37. Flavell, J.H. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. Am. Psychol. 1979, 34, 906. [Google Scholar] [CrossRef]
  38. Nelson, T.O. Metamemory: A theoretical framework and new findings. Psychol. Learn. Mot. 1990, 26, 125–173. [Google Scholar]
  39. Babu, G.S.; Suresh, S. Meta-cognitive neural network for classification problems in a sequential learning framework. Neurocomputing 2012, 81, 86–96. [Google Scholar] [CrossRef]
  40. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing Human Actions: A Local SVM Approach. In Proceedings of the 17th International Conference on Pattern Recognition, (ICPR’04), Cambridge, UK, 26 August 2004; Volume 3. [Google Scholar]
  41. Gorelick, L.; Blank, M.; Shechtman, E.; Irani, M.; Basri, R. Actions as Space-Time Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 2247–2253. [Google Scholar] [CrossRef] [Green Version]
  42. Liu, J.; Luo, J.; Shah, M. Recognizing realistic actions from videos “in the wild”. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1996–2003. [Google Scholar]
  43. Kuehne, H.; Jhuang, H.; Stiefelhagen, R.; Serre, T. HMDB51: A large video database for human motion recognition. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  44. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Action Classes from Videos in the Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  45. Dollár, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; pp. 65–72. [Google Scholar]
  46. Jiang, H.; Drew, M.S.; Li, Z.N. Successive convex matching for action detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1646–1653. [Google Scholar]
  47. Niebles, J.C.; Wang, H.; Fei-Fei, L. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Int. J. Comput. Vision 2008, 79, 299–318. [Google Scholar] [CrossRef] [Green Version]
  48. Yeo, C.; Ahammad, P.; Ramchandran, K.; Sastry, S.S. Compressed Domain Real-time Action Recognition. In Proceedings of the 2006 IEEE 8th Workshop on Multimedia Signal Processing, Victoria, BC, Canada, 3–6 October 2006; pp. 33–36. [Google Scholar]
  49. Ke, Y.; Sukthankar, R.; Hebert, M. Spatio-temporal shape and flow correlation for action recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007. [Google Scholar]
  50. Kim, T.K.; Wong, S.F.; Cipolla, R. Tensor canonical correlation analysis for action classification. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007. [Google Scholar]
  51. Jhuang, H.; Serre, T.; Wolf, L.; Poggio, T. A biologically inspired system for action recognition. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
  52. Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008. [Google Scholar]
  53. Rapantzikos, K.; Avrithis, Y.; Kollias, S. Dense saliency-based spatio-temporal feature points for action recognition. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1454–1461. [Google Scholar]
  54. Bregonzio, M.; Gong, S.; Xiang, T. Recognizing action as clouds of space-time interest points. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1948–1955. [Google Scholar]
  55. Klaser, A.; Marszałek, M.; Schmid, C. A spatio-temporal descriptor based on 3d-gradients. In Proceedings of the BMVC 2008—19th British Machine Vision Conference, Leeds, UK, 7–10 September 2008. [Google Scholar]
  56. Fathi, A.; Mori, G. Action recognition by learning mid-level motion features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008. [Google Scholar]
  57. Le, Q.V.; Zou, W.Y.; Yeung, S.Y.; Ng, A.Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 3361–3368. [Google Scholar]
  58. Kovashka, A.; Grauman, K. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2046–2053. [Google Scholar]
  59. Yeffet, L.; Wolf, L. Local trinary patterns for human action recognition. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 492–497. [Google Scholar]
  60. Wang, H.; Klaser, A.; Schmid, C.; Liu, C.L. Dense trajectories and motion boundary descriptors for action Recognition. Int. J. Comput. Vis. 2013, 103, 60–79. [Google Scholar] [CrossRef] [Green Version]
  61. Grundmann, M.; Meier, F.; Essa, I. 3D shape context and distance transform for action recognition. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–4. [Google Scholar]
  62. Weinland, D.; Boyer, E. Action recognition using exemplar-based embedding. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
  63. Hoai, M.; Lan, Z.Z.; De la Torre, F. Joint segmentation and classification of human actions in video. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 3265–3272. [Google Scholar]
  64. Ballan, L.; Bertini, M.; Del Bimbo, A.; Seidenari, L.; Serra, G. Recognizing human actions by fusing spatio-temporal appearance and motion descriptors. In Proceedings of the International Conference on Image Processing, Cairo, Egypt, 7–10 November 2009; pp. 3569–3572. [Google Scholar]
  65. Wang, Y.; Mori, G. Learning a discriminative hidden part model for human action recognition. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–11 December 2009; pp. 1721–1728. [Google Scholar]
  66. Chen, C.C.; Aggarwal, J.K. Recognizing human action from a far field of view. In Proceedings of the 2009 Workshop on Motion and Video Computing (WMVC), Snowbird, UT, USA, 8–9 December 2009; pp. 1–7. [Google Scholar]
  67. Vezzani, R.; Baltieri, D.; Cucchiara, R. HMM based action recognition with projection histogram features. In Proceedings of the Recognizing Patterns in Signals, Speech, Images and Videos, Istanbul, Turkey, 23–26 August 2010; pp. 286–293. [Google Scholar]
  68. Dhillon, P.S.; Nowozin, S.; Lampert, C.H. Combining appearance and motion for human action classification in videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Miami, FL, USA, 20–25 June 2009; pp. 22–29. [Google Scholar]
  69. Lin, Z.; Jiang, Z.; Davis, L.S. Recognizing actions by shape-motion prototype trees. In Proceedings of the International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 444–451. [Google Scholar]
  70. Natarajan, P.; Singh, V.K.; Nevatia, R. Learning 3d action models from a few 2d videos for view invariant action recognition. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2006–2013. [Google Scholar]
  71. Yang, M.; Lv, F.; Xu, W.; Yu, K.; Gong, Y. Human action detection by boosting efficient motion features. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan, 27 September–4 October 2009; pp. 522–529. [Google Scholar]
  72. Liu, J.; Shah, M. Learning human actions via information maximization. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  73. Ikizler-Cinbis, N.; Sclaroff, S. Object, scene and actions: Combining multiple features for human action Recognition. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 494–507. [Google Scholar]
  74. Mota, V.F.; Perez, E.D.A.; Maciel, L.M.; Vieira, M.B.; Gosselin, P.H. A tensor motion descriptor based on histograms of gradients and optical flow. Pattern Recognit. Lett. 2014, 39, 85–91. [Google Scholar] [CrossRef] [Green Version]
  75. Sad, D.; Mota, V.F.; Maciel, L.M.; Vieira, M.B.; De Araujo, A.A. A tensor motion descriptor based on multiple gradient estimators. In Proceedings of the Conference on Graphics, Patterns and Images, Arequipa, Peru, 5–8 August 2013; pp. 70–74. [Google Scholar]
  76. Figueiredo, A.M.; Maia, H.A.; Oliveira, F.L.; Mota, V.F.; Vieira, M.B. A video tensor self-descriptor based on block matching. In Proceedings of the International Conference on Computational Science and Its Applications, Guimarães, Portugal, 30 June–3 July 2014; pp. 401–414. [Google Scholar]
  77. Hasan, M.; Roy-Chowdhury, A.K. Incremental activity modeling and recognition in streaming videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 796–803. [Google Scholar]
  78. Kihl, O.; Picard, D.; Gosselin, P.H. Local polynomial space-time descriptors for action classification. Mach. Vis. Appl. 2016, 27, 351–361. [Google Scholar] [CrossRef] [Green Version]
  79. Maia, H.A.; Figueiredo, A.M.D.O.; De Oliveira, F.L.M.; Mota, V.F.; Vieira, M.B. A video tensor self-descriptor based on variable size block matching. J. Mob. Multimed. 2015, 11, 90–102. [Google Scholar]
  80. Patel, C.I.; Garg, S.; Zaveri, T.; Banerjee, A.; Patel, R. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electr. Eng. 2018, 70, 284–301. [Google Scholar] [CrossRef]
  81. Kliper-Gross, O.; Gurovich, Y.; Hassner, T.; Wolf, L. Motion Interchange Patterns for Action Recognition in Unconstrained Videos. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 256–269. [Google Scholar]
  82. Can, E.F.; Manmatha, R. Formulating action recognition as a ranking problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 251–256. [Google Scholar]
  83. Liu, W.; Zha, Z.J.; Wang, Y.; Lu, K.; Tao, D. p-Laplacian regularized sparse coding for human activity recognition. IEEE Trans. Ind. Electron. 2016, 63, 5120–5129. [Google Scholar] [CrossRef]
  84. Lan, Z.; Yi, Z.; Alexander, G.H.; Shawn, N. Deep local video feature for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1–7. [Google Scholar]
  85. Zhu, J.; Zhu, Z.; Zou, W. End-to-end video-level representation learning for action recognition. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 645–650. [Google Scholar]
  86. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
  87. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Li, F.-F. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
  88. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  89. Sun, L.; Jia, K.; Yeung, D.-Y.; Shi, B.E. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4597–4605. [Google Scholar]
  90. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
  91. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726. [Google Scholar]
  92. Cherian, A.; Fernando, B.; Harandi, M.; Gould, S. Generalized rank pooling for activity recognition. arXiv 2017, arXiv:1704.02112. [Google Scholar]
  93. Seo, J.-J.; Kim, H.-I.; de Neve, W.; Ro, Y.M. Effective and efficient human action recognition using dynamic frame skipping and trajectory rejection. Image Vis. Comput. 2017, 58, 76–85. [Google Scholar] [CrossRef]
  94. Shi, Y.; Tian, Y.; Wang, Y.; Huang, T. Sequential deep trajectory descriptor for action recognition with three-stream cnn. IEEE Trans. Multimed. 2017, 19, 1510–1520. [Google Scholar] [CrossRef] [Green Version]
  95. Wang, J.; Cherian, A.; Porikli, F. Ordered pooling of optical flow sequences for action recognition. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 168–176. [Google Scholar]
  96. Zhu, Y.; Lan, Z.; Newsam, S.; Hauptmann, A. Hidden two-stream convolutional networks for action recognition. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 363–378. [Google Scholar]
  97. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4724–4733. [Google Scholar]
Figure 1. Proposed framework.
Figure 2. Moving Object Detection: (a) Video Sequence. (b) Saliency Map. (c) Silhouette Creation. (d) Segmented Object Image.
Figure 3. Moving Object Detection: (a) Video Sequence. (b) Saliency Map. (c) Silhouette Creation. (d) Segmented Object Image.
Figure 4. Proposed feature extraction technique: (a) Original video sequence. (b) Detected moving object. (c) Detected moving object resized to 128 × 64. (d) Histogram of oriented gradients (HOG) feature extraction.
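For readers who want to reproduce the per-frame descriptor sketched in Figure 4, the snippet below is a minimal sketch using OpenCV and scikit-image; the cell size, block size, and averaging scheme are illustrative assumptions rather than the paper's exact configuration.

```python
import cv2
import numpy as np
from skimage.feature import hog

def hog_descriptor(segmented_object: np.ndarray) -> np.ndarray:
    """Resize a segmented moving-object patch to 128 x 64 and compute its HOG vector."""
    patch = cv2.resize(segmented_object, (64, 128))   # cv2 takes (width, height): 128-row x 64-col patch
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    # Dalal-Triggs style settings; illustrative, not necessarily the paper's exact layout.
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def averaged_hog(frames) -> np.ndarray:
    """Average per-frame HOG vectors over non-overlapping frames to shrink the descriptor."""
    return np.mean([hog_descriptor(f) for f in frames], axis=0)
```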
Figure 5. Proposed feature calculation scheme.
Figure 6. The generation process of the Region Feature Descriptor.
Figure 7. The generation process of the Region Feature Descriptor for the segmented moving object: the values below each descriptor image give the scale (k), order (m), and degree (n) of the basis function $B_{n,m}(r, \theta)$.
Figure 8. R_FHOG template.
Figure 9. Proposed early fusion technique using concatenation method.
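A minimal sketch of the concatenation-based early fusion shown in Figure 9; the argument names and their ordering are illustrative placeholders, not the paper's exact descriptor layout.

```python
import numpy as np

def early_fusion(avg_hog, fourier_hog_regions, velocity, displacement) -> np.ndarray:
    """Concatenate the individual descriptors into one early-fusion feature vector.

    Placeholders for the averaged HOG, regional Fourier-HOG, velocity, and
    displacement features produced by the preceding stages.
    """
    return np.concatenate([
        np.ravel(avg_hog),
        np.ravel(fourier_hog_regions),
        np.atleast_1d(velocity).astype(float),
        np.atleast_1d(displacement).astype(float),
    ])
```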
Figure 10. Proposed Late Fusion Technique using Decision Combination Neural Network (DCNN).
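A sketch of the decision-combination idea behind the DCNN in Figure 10, using scikit-learn's MLPClassifier as a stand-in; the layer size and the exact form of the per-classifier decision vectors are assumptions made only for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_dcnn(decision_vectors, labels):
    """Train a small decision-combination network on stacked per-classifier scores.

    decision_vectors: list of (n_samples, n_classes) score arrays, one per base
    classifier; layer size and iteration budget are illustrative.
    """
    fused = np.hstack(decision_vectors)          # (n_samples, n_classifiers * n_classes)
    dcnn = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000)
    dcnn.fit(fused, labels)
    return dcnn
```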
Figure 11. McNN architecture.
Figure 12. Video sequences from the KTH dataset.
Figure 13. Video sequences from the Weizmann dataset.
Figure 14. Early fusion with ANN.
Figure 15. Early fusion with SVM.
Figure 16. MKL with ANN.
Figure 17. MKL with SVM.
Figure 18. Late fusion.
Figure 19. Meta-cognitive neural network.
Figure 20. Accuracy comparison of different models on UCF101 dataset action categories.
Table 1. Displacement for all classes.

                Non-Zero Displacement Action Type                      Zero Displacement Action Type
Action          Walking   Jogging   Running   Side   Skip   Jump       Boxing   Handclapping   Handwaving   Bend   Jack   Pjump
Displacement    15        21.33     64        34     42     51         0        0              0            0      0      0
Table 2. Velocity for all classes.

                Non-Zero Velocity Action Type                          Zero Velocity Action Type
Action          Walking   Jogging   Running   Side   Skip   Jump       Boxing   Handclapping   Handwaving   Bend   Jack   Pjump
Velocity        37.5      53.325    160       85     105    127.5      0        0              0            0      0      0
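Tables 1 and 2 separate translatory actions from in-place actions. One plausible way to obtain such measurements, shown only as an illustrative sketch (the exact formulation is not restated here), is to track the centroid of the segmented silhouette across the clip:

```python
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    """Centroid (x, y) of a binary silhouette mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def displacement_and_velocity(masks, fps: float = 25.0):
    """Net horizontal displacement of the object centroid over a clip and the
    corresponding average speed in pixels per second.

    In-place actions (boxing, handclapping, handwaving, bend, jack, pjump)
    yield values close to zero, matching the zero columns of Tables 1 and 2.
    """
    start, end = centroid(masks[0]), centroid(masks[-1])
    disp = abs(end[0] - start[0])                    # pixels
    vel = disp * fps / max(len(masks) - 1, 1)        # pixels per second
    return disp, vel
```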
Table 3. Parameter settings for SVM and their respective levels evaluated in experimentation [36].

Parameters                          Levels (Polynomial Kernel)    Levels (Radial Basis)
Degree of Kernel Function (d)       1; 2; 3; 4                    -
Gamma in Kernel Function (γ)        -                             0.5, 1.0, 1.5, ⋯, 5.0, 10.0
Regularization Parameter (c)        0.5, 1, 5, 10, 100            0.5, 1, 5, 10
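The levels in Table 3 map directly onto a standard parameter sweep; the sketch below expresses them with scikit-learn's GridSearchCV as one possible way to run the sweep (the cross-validated grid search itself is an assumption for illustration).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Parameter levels taken from Table 3; grid search with cross-validation is an
# assumed search strategy, shown only for illustration.
param_grid = [
    {"kernel": ["poly"], "degree": [1, 2, 3, 4], "C": [0.5, 1, 5, 10, 100]},
    {"kernel": ["rbf"],
     "gamma": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 10.0],
     "C": [0.5, 1, 5, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train: fused descriptors, y_train: action labels
```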
Table 4. Parameter settings for the neural network and their respective levels evaluated in experimentation [36].

Parameters                        Level(s)
Hidden Layer Neurons (n)          10, 20, ⋯, 100
Number of Epochs (ep)             1000, 2000, ⋯, 10,000
Momentum Constant (mc)            0.1, 0.2, ⋯, 0.9
Learning Rate (lr)                0.1
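One configuration drawn from the levels in Table 4, expressed with scikit-learn's MLPClassifier as a stand-in for the back-propagation network; the specific values chosen here are illustrative, since the experiments sweep combinations of these levels rather than a single setting.

```python
from sklearn.neural_network import MLPClassifier

# One configuration drawn from the levels in Table 4 (illustrative choice).
ann = MLPClassifier(hidden_layer_sizes=(50,),    # hidden neurons: 10, 20, ..., 100
                    solver="sgd",
                    momentum=0.5,                # momentum constant: 0.1 ... 0.9
                    learning_rate_init=0.1,      # fixed learning rate from Table 4
                    max_iter=5000)               # epochs: 1000 ... 10,000
# ann.fit(X_train, y_train)
```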
Table 5. Parameter settings for the McNN classifier (strategy-wise thresholds for the sample delete, neuron growth, parameter update, and sample reserve strategies).

Parameter                               Threshold Range(s)
Estimated class label ĉ                 -
Maximum hinge error E                   [1.2–1.5]; [0.3–0.8]
Confidence of classifier p̂(c|x_i)       [0.85–0.95]
Class-wise significance ψ_c             [0.4–0.8]
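The thresholds in Table 5 drive McNN's self-regulated choice among its four learning strategies. The sketch below gives a simplified view of that decision logic, following the general scheme of [39]; the exact conditions, and the mapping of each threshold range to a strategy, are assumptions made here for illustration.

```python
def mcnn_strategy(pred_label, true_label, confidence, hinge_error,
                  delete_conf=0.90, grow_err=1.35, update_err=0.55):
    """Pick a learning strategy for one training sample (simplified sketch).

    Threshold values are drawn from the ranges in Table 5; McNN [39] applies
    additional conditions (e.g., class-wise significance) that are omitted here.
    """
    if pred_label == true_label and confidence >= delete_conf:
        return "delete sample"        # sample adds no new knowledge
    if pred_label != true_label and hinge_error >= grow_err:
        return "grow neuron"          # novel sample: add a hidden neuron
    if hinge_error >= update_err:
        return "update parameters"    # refine existing network parameters
    return "reserve sample"           # keep the sample for a later pass
```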
Table 6. Confusion matrix for SVM classifier with different kernel functions for KTH dataset (rows: actual class; columns: predicted class).

Linear SVM
         Box     Jog     Run     Walk    Wave    Clap
box      91.50   0.0     0.0     0.0     5.25    3.25
jog      0.0     78.67   15.53   5.8     0.0     0.0
run      0.0     14.82   81.48   3.7     0.0     0.0
walk     0.0     7.63    10.0    82.37   0.0     0.0
wave     4.82    0.0     0.0     0.0     89.67   5.51
clap     5.47    0.0     0.0     0.0     1.36    93.17

Polynomial SVM
         Box     Jog     Run     Walk    Wave    Clap
box      96.46   0.0     0.0     0.0     2.20    1.34
jog      0.0     95.71   3.25    1.04    0.0     0.0
run      0.0     2.48    96.04   1.48    0.0     0.0
walk     0.0     2.28    2.73    94.99   0.0     0.0
wave     2.73    0.0     0.0     0.0     95.23   2.04
clap     3.27    0.0     0.0     0.0     1.35    95.38

Radial Basis Function SVM
         Box     Jog     Run     Walk    Wave    Clap
box      97.46   0.0     0.0     0.0     1.68    0.86
jog      0.0     97.18   2.24    0.58    0.0     0.0
run      0.0     2.16    97.28   0.56    0.0     0.0
walk     0.0     1.60    1.40    97      0.0     0.0
wave     0.58    0.0     0.0     0.0     97.94   1.48
clap     1.29    0.0     0.0     0.0     1.68    97.03
Table 7. Confusion matrix for the neural network with one and two hidden layers for KTH dataset (rows: actual class; columns: predicted class).

Neural Network with 1 Hidden Layer
         Box     Jog     Run     Walk    Wave    Clap
box      97.10   0.0     0.0     0.0     1.46    1.44
jog      0.0     85.63   9.09    5.28    0.0     0.0
run      0.0     6.49    88.67   4.34    0.0     0.0
walk     0.0     6.68    7.35    85.97   0.0     0.0
wave     5.82    0.0     0.0     0.0     88.37   5.81
clap     8.64    0.0     0.0     0.0     5.26    86.10

Neural Network with 2 Hidden Layers
         Box     Jog     Run     Walk    Wave    Clap
box      92.27   0.0     0.0     0.0     4.98    2.75
jog      0.0     83.40   11.65   4.95    0.0     0.0
run      0.0     10.84   84.93   4.23    0.0     0.0
walk     0.0     8.41    9.36    82.23   0.0     0.0
wave     5.71    0.0     0.0     0.0     86.71   7.58
clap     8.23    0.0     0.0     0.0     4.59    87.18
Table 8. Confusion matrix for SVM classifier with different kernel functions for Weizmann dataset (rows: actual class; columns: predicted class).

Linear SVM
         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     91.64   5.46    2.5     0.4     0       0       0       0       0       0
jack     0       86.23   3.21    3.54    0       0       0       0       0       0
jump     0       0       88.58   3.73    0       0       2.604   0       0       0
sjump    0       4.62    11.23   84.15   0       0       0       0       0       0
run      0       0       0       0       89.20   2.13    0       3.67    0       0
side     0       0       0       0       0       91.97   1.69    6.34    0       0
skip     0       0       1.32    2.79    1.96    0       93.43   0       0       0
walk     0       0       0       2.9     7.35    0       0       87.23   0       0
wave1    0       0.25    0       0.75    0       0       0       0       94.76   4.24
wave2    0       0       0       0       0       0       0       0       3.75    96.25

Polynomial SVM
         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     93.75   2.45    0       3.1     0.7     0       0       0       0       0
jack     0       90.37   7.25    2.38    0       0       0       0       0       0
jump     0       3.36    91.04   5.1     0       0       0       0       0       0
sjump    0       0       9.53    87.34   3.03    0       0       0       0       0
run      0       0       0       3.79    91.39   0       0       4.32    0       0
side     0       0       0       0       0       93.42   1.93    4.6     0       0
skip     0       0       0       1.72    1.63    2.30    94.30   0       0       0
walk     0       0       0       2.21    7.59    0       0       90.32   0       0
wave1    0       0       0       0.64    0       0       0       0       95.38   3.93
wave2    0       0       0       0       0       0       0       0       2.36    97.14
Table 9. Confusion matrix for SVM classifier with different kernel functions for Weizmann dataset (rows: actual class; columns: predicted class).

Radial Basis Function SVM
         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     100     0       0       0       0       0       0       0       0       0
jack     0       99.24   0.76    0       0       0       0       0       0       0
jump     0       0       98.37   1.63    0       0       0       0       0       0
sjump    0       0       1.33    98.67   0       0       0       0       0       0
run      0       0       0       1.68    98.32   0       0       0       0       0
side     0       0       0       0       0       99.54   0       0.46    0       0
skip     0       0       0       0       0       0       100     0       0       0
walk     0       0       0       0       0.72    0       0       99.28   0       0
wave1    0       0       0       0       0       0       0       0       100     0
wave2    0       0       0       0       0       0       0       0       0       100
Table 10. Confusion matrix for the neural network with one and two hidden layers for Weizmann dataset (rows: actual class; columns: predicted class).

Neural Network with 1 Hidden Layer
         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     91.42   5.23    3.35    0       0       0       0       0       0       0
jack     0       93.21   2.37    4.42    0       0       0       0       0       0
jump     0       2.79    91.08   6.13    0       0       0       0       0       0
sjump    0       0       3.71    96.29   0       0       0       0       0       0
run      0       0       0       0       89.68   0.54    0       9.78    0       0
side     0       0       0       0       0       85.38   8.45    6.17    0       0
skip     0       0       0       0       0       9.37    84.65   5.98    0       0
walk     0       0       0       0       0       1.72    8.26    90.02   0       0
wave1    0       0       0       0       0       0       0       0       86.73   13.27
wave2    0       0       0       0       0       0       0       0       8.66    91.34

Neural Network with 2 Hidden Layers
         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     93.73   4.16    2.11    0       0       0       0       0       0       0
jack     0       94.34   4.51    1.15    0       0       0       0       0       0
jump     0       1.68    92.47   5.85    0       0       0       0       0       0
sjump    0       0       2.17    97.23   0       0       0       0       0       0
run      0       0       0       0       90.35   1.01    0       8.64    0       0
side     0       0       0       0       0       89.78   7.36    2.86    0       0
skip     0       0       0       0       0       9.84    86.47   3.67    0       0
walk     0       0       0       0       0       1.03    7.38    91.59   0       0
wave1    0       0       0       0       0       0       0       0       89.58   10.42
wave2    0       0       0       0       0       0       0       0       6.11    93.89
Table 11. Confusion matrix for RBF SVM for UCF11 dataset (rows: actual class; columns: predicted class).

          Shoot   Bike    Dive    Golf    Ride    Juggle  Swing   Tennis  Jump    Spike   Dog
Shoot     81.3    0       0       0       0       0       13.5    0       0       5.2     0
Bike      0       79.7    0       0       16.7    0       0       0       0       0       3.6
Dive      13.6    0       83      0       0       0       0       0       0       3.4     0
Golf      3.2     0       8.6     85.6    0       1.4     1.2     0       0       0       0
Ride      0       15.9    0       0       78.2    0       0       0       0       1.8     4.1
Juggle    0       0       0       4.8     0       75.8    0       19.4    0       0       0
Swing     0       7.9     0       0       0       0       68.3    0       23.8    0       0
Tennis    0       0       0       14.6    0       9.8     0       71.9    0       3.7     0
Jump      0       0       0       0       5.7     9.9     0       0       84.4    0       0
Spike     0       0       18.5    0       0       0       0       10.9    0       70.6    0
Dog       9.3     0       10.8    0       0       11.2    0       0       0       0       68.7
Table 12. Confusion matrix of McNN for KTH dataset (rows: actual class; columns: predicted class).

         Box     Jog     Run     Walk    Wave    Clap
box      100     0       0       0       0       0
jog      0       100     0       0       0       0
run      0       0       100     0       0       0
walk     0       0       0       100     0       0
wave     0       0       0       0       100     0
clap     0       0       0       0       0       100
Table 13. Confusion matrix of McNN for Weizmann dataset (rows: actual class; columns: predicted class).

         Bend    Jack    Jump    Sjump   Run     Side    Skip    Walk    Wave1   Wave2
bend     100     0       0       0       0       0       0       0       0       0
jack     0       100     0       0       0       0       0       0       0       0
jump     0       0       100     0       0       0       0       0       0       0
sjump    0       0       0       100     0       0       0       0       0       0
run      0       0       0       0       100     0       0       0       0       0
side     0       0       0       0       0       100     0       0       0       0
skip     0       0       0       0       0       0       100     0       0       0
walk     0       0       0       0       0       0       0       100     0       0
wave1    0       0       0       0       0       0       0       0       100     0
wave2    0       0       0       0       0       0       0       0       0       100
Table 14. Confusion matrix of McNN for UCF11 dataset (rows: actual class; columns: predicted class).

          Shoot   Bike    Dive    Golf    Ride    Juggle  Swing   Tennis  Jump    Spike   Dog
Shoot     81.3    0       0       0       0       0       13.5    0       0       5.2     0
Bike      0       79.7    0       0       16.7    0       0       0       0       0       3.6
Dive      13.6    0       83      0       0       0       0       0       0       3.4     0
Golf      3.2     0       8.6     85.6    0       1.4     1.2     0       0       0       0
Ride      0       15.9    0       0       78.2    0       0       0       0       1.8     4.1
Juggle    0       0       0       4.8     0       75.8    0       19.4    0       0       0
Swing     0       7.9     0       0       0       0       68.3    0       23.8    0       0
Tennis    0       0       0       14.6    0       9.8     0       71.9    0       3.7     0
Jump      0       0       0       0       5.7     9.9     0       0       84.4    0       0
Spike     0       0       18.5    0       0       0       0       10.9    0       70.6    0
Dog       9.3     0       10.8    0       0       11.2    0       0       0       0       68.7
Table 15. State-of-the-Art Comparison of Accuracy of Proposed Approaches for KTH dataset.

Method                                   Accuracy
Heng et al. 2011 [11]                    94.2
Liu et al. 2016 [15]                     95.3
Beaudry et al. 2016 [17]                 95
Zheng et al. 2018 [20]                   94.58
Schuldt et al. 2004 [40]                 71.72
Jingen et al. 2009 [42]                  91.8
Dollar et al. 2005 [45]                  80
Jiang et al. 2006 [46]                   84.44
Juan et al. 2008 [47]                    83.33
Chuohao et al. 2006 [48]                 86
Ke et al. 2007 [49]                      81
Kim et al. 2007 [50]                     95.33
Jhuang et al. 2007 [51]                  91.6
Laptev et al. 2008 [52]                  91.83
Rapantzikos et al. 2009 [53]             88.30
Bregonzio et al. 2009 [54]               93.17
Klaser et al. 2008 [55]                  91.4
Fathi et al. 2008 [56]                   90.50
Le et al. 2011 [57]                      93.9
Kovashka et al. 2010 [58]                94.53
Yeffet et al. 2009 [59]                  90.1
Wang et al. 2013 [60]                    95.3
Early Fusion using ANN                   84.12
Early Fusion using SVM                   92.32
MKL with ANN                             93.03
MKL with SVM                             95.85
Late Fusion using DCNN                   96.19
Late Fusion using Fuzzy Integral         98.98
McNN                                     100
Table 16. State-of-the-Art Comparison of Accuracy of Proposed Approaches for Weizmann dataset.

Method                                   Accuracy
Liu et al. 2016 [15]                     100
Lena et al. 2009 [41]                    88.2
Bregonzio et al. 2009 [54]               96.66
Klaser et al. 2008 [55]                  84.3
Grundmann et al. 2008 [61]               96.39
Weinland et al. 2008 [62]                93.33
Nguyen et al. 2011 [63]                  87.7
Ballan et al. 2009 [64]                  92.41
Yang et al. 2009 [65]                    97.2
Chen et al. 2009 [66]                    100
Vezzani et al. 2010 [67]                 86.7
Dhillon et al. 2009 [68]                 88.5
Lin et al. 2009 [69]                     100
Natarajan et al. 2010 [70]               99.5
Yan et al. 2009 [71]                     99.4
Early Fusion using ANN                   91.943
Early Fusion using SVM                   94.34
MKL with ANN                             92.09
MKL with SVM                             93.89
Late Fusion using DCNN                   95.25
Late Fusion using Fuzzy Integral         97.97
McNN                                     100
Table 17. State-of-the-Art Comparison of Accuracy of Proposed Approaches for UCF11 dataset.

Method                                   Accuracy
Wang et al. 2011 [11]                    84.2
Liu et al. 2009 [72]                     71.2
Ikizler et al. 2010 [73]                 75.2
Mota et al. 2013 [74]                    72.7
Sad et al. 2013 [75]                     72.6
Wang et al. 2013 [60]                    89.9
Figueiredo et al. 2014 [76]              59.5
Hasan et al. 2014 [77]                   54.5
Kihl et al. 2014 [78]                    86.0
Maia et al. 2015 [79]                    64.0
Patel et al. 2016 [80]                   89.43
Early Fusion using ANN                   69.96
Early Fusion using SVM                   74.05
MKL with ANN                             75.07
MKL with SVM                             78.38
Late Fusion using DCNN                   79.88
Late Fusion using Fuzzy Integral         82.12
McNN                                     89.93
Table 18. State-of-the-Art Comparison of Accuracy of Proposed Approaches for HMDB-51 dataset.

Method                                   Accuracy
Liu et al. 2009 [72]                     71.2
Kuehne et al. 2011 [43]                  23.0
Kliper et al. 2012 [81]                  29.2
Wang et al. 2013 [60]                    46.6
Wang et al. 2013 [12]                    57.2
Can et al. 2013 [82]                     39.0
Ni et al. 2015 [13]                      66.7
Lan et al. 2015 [14]                     65.1
Liu et al. 2016 [15]                     48.4
Hongyang et al. 2016 [16]                60.3
Beaudry et al. 2016 [17]                 49.6
Liu et al. 2016 [83]                     58.1
Ahsan et al. 2018 [19]                   28.5
Lin et al. 2018 [21]                     63.0
Lan et al. 2017 [84]                     75
Zhu et al. 2018 [85]                     74.8
Zhu et al. 2018 [96]                     78.7
Carreira et al. 2018 [97]                80.2
Early Fusion using ANN                   44.68
Early Fusion using SVM                   49.32
MKL with ANN                             52.43
MKL with SVM                             54.19
Late Fusion using DCNN                   55.02
Late Fusion using Fuzzy Integral         55.89
McNN                                     67.03
Table 19. State-of-the-Art Comparison of Accuracy of Proposed Approaches for UCF101 dataset.

Method                                   Accuracy
Wang et al. 2013 [12]                    86
Simonyan et al. 2014 [86]                88
Karpathy et al. 2014 [87]                65.4
Donahue et al. 2015 [88]                 82.66
Sun et al. 2015 [89]                     88.1
Lan et al. 2015 [14]                     89.1
Feichtenhofer et al. 2016 [90]           92.5
Zhang et al. 2016 [91]                   86.4
Cherian et al. 2017 [92]                 94.6
Seo et al. 2017 [93]                     85.74
Shi et al. 2017 [94]                     92.2
Wang et al. 2017 [95]                    91.32
Zheng et al. 2018 [20]                   95.1
Ahsan et al. 2018 [19]                   67.1
Lin et al. 2018 [21]                     91.5
Lan et al. 2017 [84]                     95.3
Zhu et al. 2018 [85]                     95.8
Zhu et al. 2018 [96]                     97.1
Carreira et al. 2018 [97]                97.9
Early Fusion using ANN                   64.23
Early Fusion using SVM                   79.87
MKL with ANN                             81.93
MKL with SVM                             89.32
Late Fusion using DCNN                   91.87
Late Fusion using Fuzzy Integral         93.15
McNN                                     94.59
Table 20. Classification accuracy against the state-of-the-art on HMDB51 and UCF101 datasets averaged over three splits with CNN architectures.

Method                                                          UCF101    HMDB51
Two Stream CNN [86]                                             88        59.4
Slow Fusion CNN [87]                                            65.4      -
EMV+RGB-CNN [91]                                                86.4      -
Spatio-temporal CNN [89]                                        88.1      59.1
Very Deep Two Stream Fusion [90]                                93.5      69.2
Generalized Rank Pooling [92]                                   93.5      72.0
Frame Skipping + Trajectories Rejection [93]                    85.74     58.91
Three-stream sDTD [94]                                          92.2      65.2
Order Pooling (Dyn. Flow+RGB+(S)Op.Flow+IDT-FV) [95]            91.32     67.35
Deep Feature [84]                                               95.3      75
End-to-End video [85]                                           95.8      74.8
Two Stream CNN [96]                                             97.1      78.7
Kinetics [97]                                                   97.9      80.2
Proposed Approach (McNN)                                        94.59     67.03
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
