Article

Real-Time Facial Emotion Recognition Framework for Employees of Organizations Using Raspberry-Pi

1 School of Electronics and Electrical Engineering, Lovely Professional University, Jalandhar 144001, India
2 College of Computing and Informatics, Saudi Electronic University, Dammam 15515, Saudi Arabia
3 Department of Computer Engineering, Faculty of Science and Technology, Vishwakarma University, Pune 411048, India
4 Department of Computer Engineering, College of Computer and Information Technology, Taif University, PO Box 11099, Taif 21994, Saudi Arabia
5 Department of Information Technology, College of Computer and Information Technology, Taif University, PO Box 11099, Taif 21944, Saudi Arabia
* Author to whom correspondence should be addressed.
Submission received: 19 September 2021 / Revised: 2 November 2021 / Accepted: 3 November 2021 / Published: 9 November 2021
(This article belongs to the Special Issue Research on Facial Expression Recognition)

Abstract

There is significant interest in facial emotion recognition in the fields of human–computer interaction and the social sciences. With the advancements in artificial intelligence (AI), the field of human behavioral prediction and analysis, especially of human emotion, has evolved significantly. The most common emotion recognition methods currently deploy models on remote servers. We believe that reducing the distance between the input device and the model can improve efficiency and effectiveness in real-life applications. For this purpose, computational methodologies such as edge computing can be beneficial and can enable time-critical applications in sensitive fields. In this study, we propose a Raspberry-Pi-based standalone edge device that can detect facial emotions in real time. Although this edge device can be used in a variety of applications where human facial emotions play an important role, this article mainly targets employees working in organizations. The Raspberry-Pi-based standalone edge device has been implemented using the Mini-Xception deep network because of its computational efficiency compared to other networks. The device achieved 100% accuracy for detecting and recognizing faces in real time and 68% accuracy for emotion recognition, i.e., higher than the accuracy reported in the state-of-the-art on the FER 2013 dataset. Future work will implement the deep network on a Raspberry-Pi with an Intel Movidius neural compute stick to reduce the processing time and achieve quicker real-time facial emotion recognition.

1. Introduction

In today’s context, video cameras are easily accessible to everyone, whether mobile cameras, static surveillance cameras, smartphone cameras, Raspberry-Pi cameras, or laptop cameras. With the help of these cameras, it is easy to capture human faces at any location. This kind of freedom has enabled the research community to implement smart systems for understanding human behavior in real time. To understand the behavior of a human being, expression plays the most important role. Various surveys have already been conducted to understand the different components that play major roles in understanding human emotions; they concluded that non-verbal components, particularly facial expressions, play the most important role during interpersonal communication [1]. Research in the field of emotive facial recognition has gathered attention over the last couple of decades, as the applications are not limited to computer science but extend to affective computing, computer animation, cognitive science, and the perceptual sciences [2]. Facial expressions are the major mode of exchanging feelings and emotions in daily life; small facial gestures are strong enough to convey one’s feelings to another person. Facial emotion is often more telling than verbal communication, as emotions are an actual, spontaneous reflection of a person’s feelings. A lot of research is being carried out to develop robots that are capable of understanding the facial emotions and moods of human beings [3]. Various techniques have been used to automate facial emotion recognition, and such systems have been applied during interviews [4], in surveillance systems [5], and for the detection of aggression [6]. In computer vision and artificial intelligence, facial emotion recognition is one of the most important topics. Various sensors can be used to detect facial emotions, but facial images are the most important because they carry enough information to understand interpersonal communication. Over the last couple of years, deep-learning-based FER approaches along with detailed algorithms have been proposed. In addition, there are various hybrid and deep-learning approaches based on convolutional neural networks that combine the spatial and temporal features of frames [7].
Facial emotion recognition systems have gained popularity over the decades because of their diverse applications, the majority of which are applicable to real-world activities like smart supervision for suspicious activities, marketing, and group emotion analysis. In this field, a cost-effective system has been proposed by Muhammad Sajjad et al. to help implement a smart security system for law enforcement. The system is built on a Raspberry-Pi with a Pi-Cam, which also makes it cost-effective and compact [8]. With the advancement of technology and the availability of compact devices like the Raspberry-Pi, it becomes easy to equip police and security officers with compact systems that can detect facial images in real time. In addition, with the development of cloud-based technology, the captured images can be sent to the cloud for further action. Such a cloud-assisted facial recognition framework has also been proposed by Muhammad Sajjad et al. to help police and security personnel identify criminals quickly and easily [9].
A completely automated smart home based on the Raspberry-Pi has also been proposed: Chowdhury et al. proposed this system to automate everyday tasks and provide access via the web [10]. An energy-efficient and smart water management system has been proposed as a cost-effective alternative to existing irrigation systems; Agrawal et al. proposed a system that could initially water around 50 pots kept in a garden and could be extended to larger fields [11]. Another application based on face recognition has been proposed by [12]. This system grants access only to those who are identified by the system, making it more secure. The proposed system is also based on a Raspberry-Pi and Pi-Camera, which again is cost-effective and user-friendly hardware to work with. Various challenges, such as pose mismatching [13] and the use of strong descriptors [14], have been mentioned in the literature, and several alternatives have been provided to handle such issues. Various classification techniques are available in the literature, and a few of them, like HFR (hybrid face regions) [15], are remarkably effective at extracting important information from facial images. Implementation of deep networks on smartphones has enabled face detection along with gender classification, and a pixel-pattern-based gender classification has also been proposed on the FERRET dataset, with an accuracy of 90% on frontal faces [16]. Various architectures have been proposed in the literature to implement emotion recognition in real time with multiple faces in videos [17]. It is a challenge to detect faces and emotions in critical situations; to manage such situations, a fast and accurate system based on ORFs has been proposed that provides results by working on multiple components like backgrounds, pose estimation, and face patches [18]. The contributions of this study are as follows:
  • The implementation of a hardware prototype for real-time facial emotion detection with the Raspberry-Pi.
  • A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of emotive facial images.
  • Support vector machine (SVM) classification is implemented on the Raspberry-Pi hardware for classifying persons.
The organization of the study is as follows: Section 2 presents an overview of AI and CNNs; Section 3 presents the proposed methodology for real-time facial emotion detection; Section 4 presents the hardware setup, experimental results, and a comparison of the proposed hardware with previous studies. Finally, the study is concluded in the final section.

2. Background

Artificial intelligence (AI) refers to the simulation of human intelligence in machines designed to think and recreate human actions [19]. The concept can also be extended to any system that demonstrates characteristics consistent with a human mind, such as learning and problem-solving. AI is interdisciplinary; however, advancements in machine learning and deep learning are triggering a shift in perspective for nearly all sectors [20]. Computer vision is an area of artificial intelligence that concentrates on image-related problems. Convolutional neural networks (CNNs) combined with computer vision can perform complex operations ranging from image recognition to resolving scientific problems [21]. CNNs are well known for their capability in image recognition and classification. In general, a basic convolutional neural network consists of neurons connected via multiple layers, which collect the input images and process them stage by stage. A simple CNN consists of three types of layers: the convolutional layer, the max-pooling layer, and the fully connected layer. The first two are responsible for feature extraction, introducing non-linearity, and reducing the feature maps to limit overfitting. The last layer, the fully connected layer, performs classification based on the features extracted in the previous layers and contains the majority of the parameters. The number of parameters has been further reduced in architectures like Inception V3 [22], in which a Global Average Pooling operation is added as the last layer; this layer reduces each feature map to a scalar value by taking its average. To reduce the parameters further, modern CNN architectures use residual modules [23] and depthwise-separable convolutions [5]. Depthwise-separable convolutions split the work of the convolutional layer into a per-channel filtering step followed by a pointwise combination, which further reduces the parameters. We therefore use the Mini-Xception CNN proposed by [24], which reduces the parameters by using depthwise-separable convolution layers instead of standard convolution layers and eliminates fully connected layers.
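As an illustration of these building blocks, the following is a minimal Keras sketch of a residual depthwise-separable block of the kind Mini-Xception stacks; the filter sizes and layer counts are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a residual depthwise-separable block (Mini-Xception style).
# Layer sizes are illustrative, not the paper's exact configuration.
from tensorflow.keras import layers, Model

def separable_residual_block(x, filters):
    # Shortcut branch: 1x1 strided convolution so shapes match the main branch
    shortcut = layers.Conv2D(filters, 1, strides=2, padding='same', use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main branch: two depthwise-separable convolutions instead of full convolutions
    y = layers.SeparableConv2D(filters, 3, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.SeparableConv2D(filters, 3, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling2D(3, strides=2, padding='same')(y)

    return layers.add([y, shortcut])               # residual connection

inputs = layers.Input(shape=(48, 48, 1))           # FER 2013 images are 48x48 grayscale
x = layers.Conv2D(8, 3, use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)
for f in (16, 32, 64, 128):                        # four blocks of increasing width
    x = separable_residual_block(x, f)
x = layers.Conv2D(7, 3, padding='same')(x)         # 7 emotion classes
x = layers.GlobalAveragePooling2D()(x)             # replaces the fully connected layer
outputs = layers.Activation('softmax')(x)
model = Model(inputs, outputs)
```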

3. Proposed Methodology

This section describes the proposed framework; each step is elaborated in the following subsections. The entire process is divided into three tightly coupled tasks. The first task is to train the pre-trained deep network after dividing the dataset into training, validation, and testing sets. The entire dataset of emotive facial images is divided in an 8:1:1 ratio: a dataset Ժ with N images I is split into Ժ_TR for training, Ժ_VD for cross-validation, and Ժ_T for testing, i.e., Ժ_TR = {I_1^TR, I_2^TR, I_3^TR, ..., I_{4N/5}^TR} as training images, Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD} as validation images, and Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T} as testing images. A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of the emotive facial images. Training, validation, and testing are performed on Google Colab with a 12 GB NVIDIA Tesla K80 GPU using the FER 2013 dataset.
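As a concrete illustration of this 8:1:1 partition, here is a minimal NumPy sketch; it assumes the images and labels are already loaded as arrays and uses an arbitrary shuffling seed.

```python
import numpy as np

def split_8_1_1(images, labels, seed=0):
    """Shuffle, then split N images 8:1:1 into train/validation/test parts,
    mirroring the index ranges [1, 4N/5], (4N/5, 9N/10], (9N/10, N] above."""
    n = len(images)
    order = np.random.default_rng(seed).permutation(n)
    images, labels = images[order], labels[order]
    i_tr, i_vd = (4 * n) // 5, (9 * n) // 10
    return ((images[:i_tr], labels[:i_tr]),          # training set
            (images[i_tr:i_vd], labels[i_tr:i_vd]),  # validation set
            (images[i_vd:], labels[i_vd:]))          # testing set

# Usage: (x_tr, y_tr), (x_vd, y_vd), (x_te, y_te) = split_8_1_1(X, y)
```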
The entire architecture is divided into two tightly coupled tasks, i.e., face recognition and facial emotion recognition in real time. For face recognition, a pre-trained deep network known as OpenFace is used. To begin with, a real-time image Í_1^r is captured from the video. The total set of images captured in real time is δ_R = {Í_1^r, Í_2^r, ..., Í_N^r}. To train the deep network, 6 images of each subject have been used, and the network is trained with the single-triplet method for 20 different people with N images, denoted as δ_TR. Once the training is complete, the very first task for the real-time captured image Í_1^r is to find the face inside the captured image and discard the unwanted information. The description of the parameters used in the proposed architecture is given in Table 1.
A well-known method, the Histogram of Oriented Gradients (HOG), is used to find the faces. After detection of the face, the facial image Í_1^r is cropped, and the cropped image Í_cr^r is further preprocessed to remove the effects of bad lighting, a tilted face, skewness, etc. The cropped image is preprocessed with the face landmark estimation algorithm, which locates 68 landmarks on the cropped image Í_cr^r; with the help of simple affine transformations (rotation, shear, and scale), a preprocessed image Í_pr^r is produced in which the eyes and mouth of the cropped image Í_cr^r are centered as well as possible. The preprocessed image Í_pr^r is fed to the pre-trained network, which extracts features from Í_pr^r and generates a feature vector of 128 embeddings, Φ_n^embd, that are measurements of the face. The last step is to classify the image by finding the closest match of the 128 embeddings Φ_n^embd against the database images. The feature vector Φ_n^embd is passed through a simple SVM classifier Š to recognize the face. The entire architecture with its detailed framework is represented in Figure 1.
The second task is to transmit the output of the preprocessing stage, Í_pr^r, to the cloud, where a pre-trained emotive facial recognition network is already available. This image Í_pr^r is passed through all the layers of the Mini-Xception deep network, a fast, depthwise-separable convolutional neural network, to recognize the emotion captured in the image Í_1^r. The deep network on the cloud is trained on seven basic emotions, labelled {ε^c = ε_1, ε_2, ..., ε_7}, where c represents the seven basic emotion classes.
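To make the two coupled tasks concrete, a condensed sketch of the real-time loop is given below. The helper names detect_and_align, face_embedding, svm_classifier, and emotion_cnn are hypothetical placeholders for the stages detailed in Sections 3.1, 3.2, 3.3, 3.4, and 3.6; this is not the authors' code.

```python
# Hypothetical glue code for the two coupled tasks described above.
# detect_and_align(), face_embedding(), svm_classifier and emotion_cnn are
# placeholders for the stages detailed in Sections 3.1-3.6, not the authors' API.
import cv2

EMOTIONS = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

cap = cv2.VideoCapture(0)                              # Pi-Cam exposed as a video device
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for aligned_face in detect_and_align(frame):       # HOG + 68-landmark alignment
        embedding = face_embedding(aligned_face)       # 128-D OpenFace embedding
        name = svm_classifier.predict([embedding])[0]  # task 1: who is it?
        gray = cv2.cvtColor(aligned_face, cv2.COLOR_BGR2GRAY)
        gray = cv2.resize(gray, (48, 48)) / 255.0      # Mini-Xception input size
        probs = emotion_cnn.predict(gray.reshape(1, 48, 48, 1))[0]
        print(name, EMOTIONS[int(probs.argmax())])     # task 2: what are they feeling?
```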

3.1. Face Detection

After capturing the facial images in real time using the Pi-Cam, the first and foremost task is to separate the faces. A variety of methods are available for removing unwanted and redundant information, such as the background, from the facial images. The best-known method is the Viola-Jones algorithm, invented in the early 2000s; we use another method, the Histogram of Oriented Gradients (HOG), to detect the facial images. The Raspberry-Pi captures the emotive facial images in real time via the Pi-Cam from real-time video frames. The real-time captured input image Í^r is converted to grayscale, and HOG features are extracted to locate the facial part of the input image Í^r. Finally, the facial image Í_cr^r is cropped and fed to a feature extraction unit to recognize faces in real time. Figure 2a shows the basic steps of face detection using HOG. As shown in Figure 2, the gradients are calculated for the entire grayscale image, 16 × 16 pixels at a time.
This calculation is repeated across the entire grayscale image, and we end up with an image of gradients. The next step is to find the strongest gradient in each 16 × 16 window of pixels and replace the gradients in that window with the strongest one. This results in a simplified image that captures the basic structure of the face. To locate the face in the real-time captured input image or in the real-time video, we find the part of the image that most closely resembles a known HOG face pattern and crop that part of the input image Í^r; as a result, we obtain the cropped facial image Í_cr^r.
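A minimal sketch of this detection step is shown below, using dlib's built-in HOG-based frontal face detector as one common implementation of the approach described above; the upsampling parameter is an assumption.

```python
# Sketch of the HOG face-detection step with dlib's HOG + linear-SVM detector.
import cv2
import dlib

hog_detector = dlib.get_frontal_face_detector()

def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # HOG is computed on grayscale
    rects = hog_detector(gray, 1)                    # 1 = upsample once for small faces
    crops = []
    for r in rects:
        # Clamp the detected rectangle to the frame boundaries, then crop the face
        top, bottom = max(r.top(), 0), min(r.bottom(), gray.shape[0])
        left, right = max(r.left(), 0), min(r.right(), gray.shape[1])
        crops.append(frame[top:bottom, left:right])
    return crops
```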

3.2. Face Alignment

As the face is captured in real time, the captured image can contain faces turned in different directions. To deal with such situations, we warp each picture so that our system can locate the eyes and lips in the same place. To perform this operation, we use the face landmark estimation algorithm proposed by [23]. The main task of this algorithm is to locate 68 specific points, known as facial landmarks, on every face, as shown in Figure 2b.
These landmarks locate the eyes, nose, chin, lips, eyebrows, etc., on any face. As explained by [19], S = (x_1^T, x_2^T, ..., x_p^T)^T ∈ R^{2p} is the vector that represents the p facial landmarks in image I. The main aim is to estimate S as closely as possible to the true shape, with the estimate denoted by Ŝ^(t). This is done with the help of a cascade of regressors, where each regressor keeps predicting and continuously updating the vector so that the estimation becomes more accurate. The update rule Ŝ^(t+1) = Ŝ^(t) + r_t(I, Ŝ^(t)) describes how each regressor r_t(·,·) is used in the cascade to predict and update the vector Ŝ^(t+1).
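A minimal sketch of this alignment step is shown below. It assumes dlib's publicly available 68-point shape predictor model file is present locally, and that centering the eyes and mouth at fixed fractions of the output image is an acceptable target geometry.

```python
# Sketch of face alignment: dlib's 68-point ensemble-of-regression-trees shape
# predictor locates the landmarks, and an affine transform centres eyes and mouth.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')  # local path assumed

def align_face(image, size=96):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)
    if not rects:
        return None
    shape = predictor(gray, rects[0])
    pts = np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)
    left_eye = pts[36:42].mean(axis=0)        # landmarks 36-41: left eye
    right_eye = pts[42:48].mean(axis=0)       # landmarks 42-47: right eye
    mouth = pts[48:68].mean(axis=0)           # landmarks 48-67: mouth
    src = np.float32([left_eye, right_eye, mouth])
    dst = np.float32([[0.3 * size, 0.35 * size],   # where the eyes/mouth should land
                      [0.7 * size, 0.35 * size],
                      [0.5 * size, 0.75 * size]])
    M = cv2.getAffineTransform(src, dst)           # rotation, shear and scale
    return cv2.warpAffine(image, M, (size, size))
```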

3.3. Face Encoding

The next important step is to extract features from the exactly centered image. The best way to obtain unique features of any facial image is to measure the face, and the dimensions of each face are different. The main challenge is knowing which measurements play a vital role in recognizing the captured image; this is difficult to achieve with traditional feature extraction methods. To achieve accuracy and raise the speed, a deep network is trained, as machines have proven to be better than humans at this kind of prediction. Training a deep network requires a lot of computation and power, so we used a pre-trained network provided by OpenFace [20]. We simply give the input, and the deep network produces 128 measurements for each face. Rather than a single face, the network is trained on 3 facial images at a time, as shown in Figure 3. This is achieved by training on a first image of a person (the anchor) together with a second image of the same person (the positive) and a completely different image of another person (the negative), as shown in Figure 4. The main purpose is to have the anchor image closer to the positive image than to any other image, called the negative image. The selection of a triplet for producing the 128 measurements is important. Machine learning practitioners call these per-face measurements an “embedding”. Training the face embedding on a large set of images (a dataset) improves the accuracy and eventually decreases the error rate; this process requires huge CPU power and a lot of time. To understand the triplet loss, consider a function f(y) ∈ R^s that maps an image y into an s-dimensional Euclidean space. We constrain this embedding to live on the s-dimensional hypersphere, i.e., ‖f(y)‖_2 = 1. As shown in Figure 3a, the main aim is to achieve a smaller distance between the anchor y_j^m of a specific person and all the other images y_j^p (positives) of the same person than between the anchor and the image of any other person y_i^n (negative).
So, we want to have the following:
‖y_j^m − y_j^p‖_2^2 + β < ‖y_j^m − y_j^n‖_2^2   ∀ (y_j^m, y_j^n, y_j^p) ∈ ζ        (1)
where β is the enforced margin between the negative and positive pairs of images and ζ is the set of all possible triplets, whose cardinality is P.
The corresponding triplet loss minimized over the P triplets is:
L = Σ_{j}^{P} [ ‖f(y_j^m) − f(y_j^p)‖_2^2 − ‖f(y_j^m) − f(y_j^n)‖_2^2 + β ]_+        (2)
Generation of multiple triplets will help to overcome the issue faced in Equation (1) and selection of suitable and complex triplets will result in the improvement of the deep learning model.
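For clarity, here is a small NumPy sketch of the loss in Equation (2), under the assumption that the 128-dimensional embeddings have already been computed and L2-normalized; the margin value is illustrative.

```python
# Minimal NumPy sketch of the triplet loss in Equation (2).
# f_anchor, f_positive, f_negative: arrays of 128-D embeddings f(y), unit-normalized.
import numpy as np

def triplet_loss(f_anchor, f_positive, f_negative, beta=0.2):
    """Hinge-style triplet loss: penalize triplets where the anchor-positive
    distance is not smaller than the anchor-negative distance by at least beta."""
    pos_dist = np.sum((f_anchor - f_positive) ** 2, axis=-1)
    neg_dist = np.sum((f_anchor - f_negative) ** 2, axis=-1)
    return np.sum(np.maximum(pos_dist - neg_dist + beta, 0.0))
```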

3.4. SVM Based Classification

The last and most important step is finding the names of persons from the encodings. Different techniques have been presented for evaluating various classifiers. A variety of machine learning classification algorithms can be used to classify the faces, but the simplest and most efficient one, the support vector machine (SVM), is used here. We keep it simple because we only want the output to be the face with the name of the person. Moreover, since we are implementing this on a Raspberry-Pi, we want our system to be fast and accurate. Running this classifier on the hardware takes milliseconds, which is what we want, and the result of the classifier is the name of the person.
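A minimal scikit-learn sketch of this classification stage is shown below; `embeddings` (an array of 128-dimensional vectors), `names` (the corresponding person labels), and `new_embedding` are assumed to come from the encoding step described above.

```python
# Sketch of the final classification stage: a linear SVM over 128-D embeddings.
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(names)              # person names -> integer labels
classifier = SVC(kernel='linear', probability=True)
classifier.fit(embeddings, y)                       # embeddings: shape (n_samples, 128)

# Prediction on a new 128-D embedding takes only milliseconds on the Raspberry-Pi
pred = classifier.predict([new_embedding])
print(label_encoder.inverse_transform(pred)[0])     # the recognized person's name
```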

3.5. Dataset

The training of the network is illustrated in Figure 4: unique images are mapped from a single network into triplets, and the gradient of the triplet loss is back-propagated to the unique images through this mapping. The dataset we used consists of 35,887 images of facial emotions in seven categories (0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral). The FER 2013 dataset consists of 48 × 48 pixel grayscale images (https://www.kaggle.com/msambare/fer2013 (accessed on 11 May 2021)). The dataset we used is in csv format, consisting of only two columns, i.e., “emotions” and “pixels”, and is kept on Google Drive. The entire dataset is divided in an 8:1:1 ratio for training Ժ_TR, validation Ժ_VD, and testing Ժ_T.
The images in the dataset have been categorized by expression; the total number of images in each category is shown in Table 2, and a graphical representation of the dataset is shown in Figure 5. The FER 2013 dataset is not a uniform dataset: it does not contain a uniform number of images in each category. Figure 6 shows sample images from the FER 2013 dataset. A large number of datasets are available for detecting facial emotions.
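A short sketch of loading the csv is given below. It assumes the column names of the public Kaggle release ('emotion' and 'pixels'); adjust the label column if your copy names it 'emotions' as described above, and the file path is an assumption about the local setup.

```python
# Sketch of loading FER 2013: each row holds an integer emotion label and a
# space-separated string of 48*48 = 2304 pixel values.
import numpy as np
import pandas as pd

data = pd.read_csv('fer2013.csv')                            # columns: 'emotion', 'pixels'
pixels = data['pixels'].apply(lambda s: np.array(s.split(), dtype='float32'))
X = np.stack(pixels.to_list()).reshape(-1, 48, 48, 1)        # one 48x48 image per row
X = X / 255.0                                                # scale pixels to [0, 1]
y = data['emotion'].values                                   # integer labels 0..6
# One-hot encode y before training with categorical cross-entropy.
```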

3.6. Training CNN Model: Mini Xception

The dataset is kept on Google Drive, and the training was done on Google Colab with a 12 GB NVIDIA Tesla K80 GPU. The CNN was trained with 80% of the FER dataset, and 10% of the dataset was kept for validation. The architecture of Mini-Xception proposed by [24] is shown in Figure 7. Testing was done on the remaining 10% of the data, as shown in Figure 9, and on the input given by the Raspberry-Pi after detecting the face and converting the cropped and pre-processed images to 48 × 48 pixels. This architecture is trained on the FER 2013 dataset because we want the response to be quick, and Mini-Xception has proven to be quick and light due to its unique architecture, which replaces convolutional layers with depthwise-separable convolutional layers, reducing the number of parameters and making it practical for real-time emotion recognition. The training results are shown in Figure 10 and the training loss is represented in Figure 11. As our system is based on the Raspberry-Pi, which has certain constraints in terms of memory and processing capability, a smaller number of parameters will be helpful in future extensions of this system. We achieved an accuracy of 66% on the FER 2013 dataset without augmentation, as reported in the state-of-the-art, and 68% after data augmentation. The main reason for this accuracy level is the variation in the dataset and the non-uniform number of images in each category.
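A sketch of a typical Keras training setup for this stage is shown below. It assumes `model` is the Mini-Xception network and that the training/validation arrays (with one-hot labels) have been prepared; the optimizer, batch size, epoch count, and callbacks are illustrative, not the authors' exact settings.

```python
# Illustrative training setup on Google Colab (assumes `model`, `x_train`,
# `y_train`, `x_val`, `y_val` with one-hot labels over the 7 classes).
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    EarlyStopping(monitor='val_loss', patience=20),                 # stop when validation stalls
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10), # shrink the learning rate
]

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=100,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)
```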

4. Experimental Results

The detailed results are presented in this section. We divided the entire architecture into two main tasks. The first task, face recognition, achieved an accuracy of 100%: all the faces that were trained were recognized correctly, as OpenFace provides near-human accuracy [20] on the LFW benchmark. We therefore used and implemented it on real-time video with the help of the Raspberry-Pi 3 B+ model. We used Python with OpenCV and, with the help of the Pi-Cam, acquired live video and recognized the faces in it. The proposed framework can locate multiple faces in the frames but recognizes only those that have already been trained and are present in the dataset. Face recognition gave correct results even with different objects such as spectacles. We installed the setup alongside the biometric attendance area so that, at the time of punching in and punching out, the expression on the faces of employees can be recorded; after collecting data for a fortnight, the recorded faces with names and recognized expressions can be analysed. This analysis can be useful for recognizing consistent behavior of employees in private organizations. For example, an employee with a constantly sad or disgusted expression can be identified and referred to a happiness cell or psychological support cell, so that such employees can be helped to feel good and comfortable in the workplace. The hardware setup of the proposed system is shown in Figure 8.
The comparison with various models is shown in Table 3, the specifications of the system are given in Table 4, and the detailed procedures are explained in Algorithms 1 and 2. After detecting the face in real time, the cropped and pre-processed image is given to the pre-trained deep network, which is trained on the FER 2013 dataset using Python and Keras. The results after classification are shown in the confusion matrix in Table 5. Several misclassifications were found; for example, “disgust” is misclassified as “angry”. From the dataset, one can easily see that the count of disgusted faces is the lowest, at 547. This simply indicates that the network has been trained on fewer examples from which to learn features of a disgusted face compared with the other classes; hence, the misclassification took place.
The total number of parameters for which the network has been trained is 2,134,407. When tested on real-time video, 110 out of 120 images with expressions were recognized correctly. Figure 10 is the graphical representation of model accuracy and Figure 11 is the graphical representation of model loss. Figure 12 shows the real-time face recognition result. Table 3 shows the comparison of the available models.
Algorithm 1. Face Detection in Real-Time.
Input: Real-time video of subjects Í
I. Capture the real-time image Í^r from the real-time video frames.
II. Facial dataset creation:
For k = 1 to size of (δ_TR) = ℕ(δ_R = {Í_1^r, Í_2^r, ..., Í_n^r}), where ℕ is the number of samples per subject and n is the total number of subjects captured:
    a. Select an image Í_1^r from δ_R.
    b. Convert the image Í_1^r to grayscale: Í_G^r ← GrayScale(Í_1^r).
    c. Detect the face region using the Histogram of Oriented Gradients (HOG): Í_H^r ← HOG(Í_G^r).
    d. Crop the facial region: Í_CR^r ← Cropped(Í_H^r).
    e. Preprocess the cropped image Í_CR^r by applying facial landmarks and an affine transform: Í_pr^r ← Preprocessing(Í_CR^r).
    f. Repeat sub-steps a to e of step II for n × Ď times.
End
III. Label all n·Ď(Í_pr^r) images of the dataset with the names of the subjects: (δ_TR)^l ← Labeled(n·Ď(Í_pr^r)).
IV. Training and feature extraction:
For k = 1 to size of (δ_TR)^l, where l represents the labelled data:
    a. Select the anchor image y_j^m = (1.Í_1^r) of the first subject.
    b. Select the positive image y_j^p = (2.Í_1^r) of the same subject.
    c. Select the negative image y_i^n = (1.Í_2^r) of the second subject.
    d. Feed the images y_j^m, y_j^p, y_i^n to the pre-trained deep network.
    e. Repeat the training to achieve ‖y_j^m − y_j^p‖_2^2 + β < ‖y_j^m − y_j^n‖_2^2 for all (y_j^m, y_j^n, y_j^p) ∈ ζ, where β is the enforced margin between negative and positive pairs of images and ζ is the set of all possible triplets, with cardinality M.
    f. Generate multiple triplets, summing over j ∈ M, to improve the deep learning model.
    g. Generate the feature vector Φ_n^embd, where n is the number of embeddings generated by the deep network.
End
V. Cross-validate by capturing an image Í^r in real time:
    i. Repeat sub-steps a to e of step II (facial dataset creation) for Í^r.
    ii. Repeat sub-steps a to f of step IV (training and feature extraction).
VI. Feed the image to the classifier to classify the image Í^r in real time by passing all the feature vectors Φ_n^embd (embeddings) generated in sub-steps a to g of step IV.
Output: Prediction of the face of the subject with name in real time.
Algorithm 2. Emotion Detection in Real-Time.
Input: Emotive facial dataset Ժ
I. Divide the dataset into training Ժ_TR, validation Ժ_VD, and testing Ժ_T.
II. Training and feature extraction:
For j = 1 to size of (Ժ_TR = {I_1^TR, ..., I_{4N/5}^TR}), where N is the total number of images in Ժ:
    a. Select the image I_1^TR from Ժ_TR.
    b. Feed the input I_1^TR to the deep network.
    c. Train the network by passing the images with their labels and let the network extract all the parameters.
End
III. Cross-validation:
For j = 1 to size of (Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD}):
    a. Select the image I_{(4N/5)+1}^VD from Ժ_VD.
    b. Repeat sub-steps b and c of step II (training and feature extraction) for the images in Ժ_VD.
    c. Use the validation images Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD} for the reduction of overfitting.
End
IV. Testing:
For j = 1 to size of (Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T}):
    i. Select the image I_{(9N/10)+1}^T from Ժ_T.
    ii. Repeat sub-steps b and c of step II (training and feature extraction) for the images in Ժ_T.
    iii. Use the testing images Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T} to test the trained network for efficiency.
End
V. Real-time testing:
Take the input from sub-step e of step II of Algorithm 1.
For k = 1 to size of (δ_TR) = ℕ(δ_R = {Í_1^r, Í_2^r, ..., Í_n^r}), where ℕ is the number of samples per subject and n is the total number of subjects captured:
    i. Resize the preprocessed image Í_pr^r ← Preprocessing(Í_CR^r): Í_RS^r ← Resize(Í_pr^r).
    ii. Repeat sub-steps b and c of step II (training and feature extraction) for the images Í_RS^r.
End
VI. Predict the facial expression {ε^c = ε_1, ε_2, ..., ε_7}, where the seven basic emotion classes are represented as c = [Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral].
Output: Prediction of the facial emotion of the subject in real time.
To evaluate the performance and effectiveness of the proposed edge device, we compared it with previous studies on facial emotion detection. Hardware implementations of facial emotion detection and recognition are relatively rare, and the few studies that have implemented such hardware reported lower accuracies, e.g., 51.28% and 47.44%, compared with the proposed model's 68%.
As discussed in the section above, the FER 2013 dataset is not a balanced dataset. A total of 35,887 images of 7 classes are present in this dataset. The unbalanced dataset gave results that were very low, specifically for the disgust, fear, and sad emotions, so a data balancing technique was used. The Keras API helps to enlarge the dataset by applying various transformations through the ImageDataGenerator function; this mainly includes five operations, i.e., rotation at a certain angle, shearing, zooming, rescaling, and horizontal flipping. Before data augmentation, a total of 35,887 images were used, of which only 547 were of disgusted expressions. After applying data augmentation, a total of 41,904 images were used, of which 6564 were of disgusted faces. The confusion matrix after data augmentation is shown in Table 6.
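A sketch of this augmentation step with Keras' ImageDataGenerator is shown below; the parameter values and the oversampling of the disgust class (the `x_disgust`, `y_disgust` arrays) are illustrative assumptions, not the authors' exact settings.

```python
# Sketch of the five augmentation operations named above, using Keras.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15,      # rotation at a certain angle
                               shear_range=0.1,        # shearing
                               zoom_range=0.1,         # zooming
                               rescale=1.0 / 255,      # rescale pixel values
                               horizontal_flip=True)   # horizontal flip

# Generate augmented batches, e.g. to oversample the under-represented "disgust"
# class; x_disgust/y_disgust are assumed to hold that class's images and labels.
flow = augmenter.flow(x_disgust, y_disgust, batch_size=32)
```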
As Table 6 shows, the predictions for disgust and fear have improved after augmentation. The overall efficiency of the system rose by 2% after data augmentation, reaching 68%. Table 7 shows the comparison of the proposed edge device with previous studies.

5. Conclusions

Real-time detection of any kind of suspicious activity is difficult without actual interaction with the subject or suspect, and reading the face of a person in real time is a challenging task. With the help of compact and portable devices, it becomes easier for most organizations to understand the behavior of their employees and resolve minor and major issues at an early stage. To achieve this, a framework has been proposed and tested that can be implemented in any organization to understand employee behavior. The proposed framework is a cost-effective and compact alternative to the heavy and bulky systems that are difficult to deploy in real time. The system has been tested on 20 different people with all 7 emotions, and out of a total of 120 images, 110 were identified with the correct emotion in real time. The proposed framework has been implemented using the Mini-Xception deep network because of its computational efficiency compared to other networks.
Facial expression representation plays an important role in facial expression recognition. It can be viewed as generating good features for describing the appearance, structure, and motion of facial expressions. More specifically, facial expression features attempt to effectively describe facial muscle activity or facial motion in static or dynamic facial images. Numerous works have addressed this and, although different proposed methods for facial expression recognition have achieved good results, several problems remain to be addressed by the research community. The most important one is face variability for a single person: many factors can cause two pictures of the same person to look totally different, such as lighting, facial expression, or occlusion. Another problem to take into account is the environment. Except in controlled scenarios, face pictures have very different backgrounds, which can make the problem of face recognition more difficult; to address this, many of the most successful systems focus on the face alone, discarding all the surroundings. Smart meetings, video conferencing, and visual surveillance are some of the real-world applications that require a facial expression recognition system that works adequately on low-resolution images, yet very few existing methods provide adequate results on such images. More research effort is required for recognizing facial expressions more complex than the six classical ones, such as fatigue and pain, and mental states such as agreeing, disagreeing, lying, frustration, and thinking, as they have numerous application areas. Other problems include expression intensity estimation, spontaneous expression recognition, micro-expression recognition (a brief, involuntary facial expression lasting only 1/25 to 1/15 of a second), misalignment, illumination, and face pose variation. Moreover, studies have shown that visual captures of facial expressions alone are not sufficient to identify exact human emotions. This research can be carried further by combining FER systems with various physiological sensors to identify the exact mental state of a person.

Author Contributions

N.R., A.G. and R.S. made contributions to conception and manuscript writing; A.S.A. and S.S.A. examined and supervised this research and outcomes; M.R. and Z.K. revised and polished the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Taif University Research Supporting Project number (TURSP-2020/311), Taif University, Taif, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this research will be made available on request to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wiener, M.; Mehrabian, A. Language within Language: Immediacy, a Channel in Verbal Communication; Ardent Media: Lake Geneva, WI, USA, 1968; ISBN 0891972684.
  2. Kaulard, K.; Cunningham, D.W.; Bülthoff, H.H.; Wallraven, C. The MPI facial expression database—A validated database of emotional and conversational facial expressions. PLoS ONE 2012, 7, e32321.
  3. Zeng, J.; Shan, S.; Chen, X. Facial expression recognition with inconsistently annotated datasets. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 222–237.
  4. Edwards, J.; Jackson, H.J.; Pattison, P.E. Emotion recognition via facial expression and affective prosody in schizophrenia: A methodological review. Clin. Psychol. Rev. 2002, 22, 789–832.
  5. Amos, B.; Ludwiczuk, B.; Satyanarayanan, M. OpenFace: A general-purpose face recognition library with mobile applications. CMU Sch. Comput. Sci. 2016, 6, 1–20.
  6. Ashraf, A.B.; Lucey, S.; Cohn, J.F.; Chen, T.; Ambadar, Z.; Prkachin, K.M.; Solomon, P.E. The painful face–pain expression recognition using active appearance models. Image Vis. Comput. 2009, 27, 1788–1796.
  7. Ko, B.C. A brief review of facial emotion recognition based on visual information. Sensors 2018, 18, 401.
  8. Sajjad, M.; Nasir, M.; Ullah, F.U.M.; Muhammad, K.; Sangaiah, A.K.; Baik, S.W. Raspberry Pi assisted facial expression recognition framework for smart security in law-enforcement services. Inf. Sci. 2019, 479, 416–431.
  9. Sajjad, M.; Nasir, M.; Muhammad, K.; Khan, S.; Jan, Z.; Sangaiah, A.K.; Elhoseny, M.; Baik, S.W. Raspberry Pi assisted face recognition framework for enhanced law-enforcement services in smart cities. Futur. Gener. Comput. Syst. 2020, 108, 995–1007.
  10. Chowdhury, M.N.; Nooman, M.S.; Sarker, S. Access Control of Door and Home Security by Raspberry Pi Through Internet. Int. J. Sci. Eng. Res. 2013, 4, 550–558.
  11. Agrawal, N.; Singhal, S. Smart drip irrigation system using Raspberry Pi and Arduino. In Proceedings of the International Conference on Computing, Communication & Automation, IEEE, Greater Noida, India, 15–16 May 2015; pp. 928–932.
  12. Chowhan, R.S.; Tanwar, R. Password-Less Authentication: Methods for User Verification and Identification to Login Securely Over Remote Sites. In Machine Learning and Cognitive Science Applications in Cyber Security; IGI Global: Hershey, PA, USA, 2019; pp. 190–212.
  13. Srikote, G.; Meesomboon, A. Face Recognition Performance Improvement using a Similarity Score of Feature Vectors based on Probabilistic Histograms. Adv. Electr. Comput. Eng. 2016, 16, 107–113.
  14. Bashar, F.; Khan, A.; Ahmed, F.; Kabir, H. Face recognition using similarity pattern of image directional edge response. Adv. Electr. Comput. Eng. 2014, 14, 69–77.
  15. Lajevardi, S.M.; Hussain, Z.M. Feature extraction for facial expression recognition based on hybrid face regions. Adv. Electr. Comput. Eng. 2009, 9, 63–67.
  16. Haider, K.Z.; Malik, K.R.; Khalid, S.; Nawaz, T.; Jabbar, S. Deepgender: Real-time gender classification using deep learning for smartphones. J. Real-Time Image Process. 2019, 16, 15–29.
  17. Lu, H.; Huang, Y.; Chen, Y.; Yang, D. Automatic gender recognition based on pixel-pattern-based texture feature. J. Real-Time Image Process. 2008, 3, 109–116.
  18. Greche, L.; Akil, M.; Kachouri, R.; Es-Sbai, N. A new pipeline for the recognition of universal expressions of multiple faces in a video sequence. J. Real-Time Image Process. 2020, 17, 1389–1402.
  19. Yoon, J.; Kim, D. An accurate and real-time multi-view face detector using ORFs and doubly domain-partitioning classifier. J. Real-Time Image Process. 2019, 16, 2425–2440.
  20. Lu, Y. Artificial intelligence: A survey on evolution, models, applications and future trends. J. Manag. Anal. 2019, 6, 1–29.
  21. Pannu, A. Artificial intelligence and its application in different areas. Artif. Intell. 2015, 4, 79–84.
  22. Lemley, J.; Bazrafkan, S.; Corcoran, P. Deep Learning for Consumer Devices and Services: Pushing the limits for machine learning, artificial intelligence, and computer vision. IEEE Consum. Electron. Mag. 2017, 6, 48–56.
  23. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1867–1874.
  24. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  25. Sambare, M. FER-2013: Learn Facial Expressions from an Image. Available online: https://www.kaggle.com/msambare/fer2013 (accessed on 11 May 2021).
  26. Bența, K.-I.; Vaida, M.-F. Towards real-life facial expression recognition systems. AECE 2015, 15, 93–102.
  27. Jain, N.; Kumar, S.; Kumar, A.; Shamsolmoali, P.; Zareapoor, M. Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 2018, 115, 101–106.
  28. Zhang, H.; Jolfaei, A.; Alazab, M. A face emotion recognition method using convolutional neural network and image edge computing. IEEE Access 2019, 7, 159081–159089.
  29. Riaz, M.N.; Shen, Y.; Sohail, M.; Guo, M. ExNet: An efficient approach for emotion recognition in the wild. Sensors 2020, 20, 1087.
Figure 1. Architecture for face recognition and facial emotion recognition in real time using Raspberry-Pi (Dataset from [25]).
Figure 2. (a) Face detection using Histogram of Oriented Gradients (HOG) and (b) Face is aligned in real time using ensemble of regression trees algorithm.
Figure 3. Generation of 128-dimensional data from triplet.
Figure 4. Network training flow for M unique images [Adapted from 5].
Figure 5. Graphical representation of class distribution.
Figure 6. Sample data from FER 2013 dataset.
Figure 7. Mini-Xception Architecture.
Figure 8. (a) Hardware setup (b,c) Hardware setup in Employees Cubical.
Figure 9. Graphical representation of data segregated for (a) Training (b) Validation and (c) Testing.
Figure 10. Graphical representation of training.
Figure 11. Graphical representation of loss while training.
Figure 12. Face recognition result in real-time (a) without spectacles (b) with spectacles.
Table 1. Description of parameters of proposed architecture.
Ժ: emotive facial dataset
Ժ_TR: training images
Ժ_VD: validation images
Ժ_T: testing images
δ_R: real-time captured images
δ_TR: training dataset of facial images
I: image representation
I_1^TR: first training image
I_{4N/5}^TR: 80% training images
I_{9N/10}^VD: 10% validation images
I_N^T: N testing images
Í: real-time image representation
Í_1^r: first real-time facial image
Í_cr^r: cropped real-time facial image
Í_pr^r: preprocessed real-time facial image
Í_n^r: n real-time facial images
Φ_n^embd: feature vector of 128 face embeddings
Š: SVM classifier
ε^c: classifier labels with c = 0 to 6 classes
Ď: face detection database images
Table 2. Data set classes.
Class | Emotion | Number | After Augmentation
0 | Angry | 4953 | 4953
1 | Disgust | 547 | 6564
2 | Fear | 5121 | 5121
3 | Happy | 8989 | 8989
4 | Sad | 6077 | 6077
5 | Surprise | 4002 | 4002
6 | Neutral | 6198 | 6198
Table 3. Comparison with various Models.
Model | Accuracy | Learning Rate | Test Accuracy | Optimizer | Regularization | Activation Function
Mini_Xception | 73% | 0.005 | 68% | Adam, SGD | L1 | ReLU
Densenet161 | 59% | 0.001, 0.001, 0.005 | 43% | Adam, SGD | L2 | Sigmoid
Resnet38 | 68% | 0.0001 | 60% | SGD, AdaGrad | L1 | Sigmoid
Mobilenet_V2 | 72.5% | 0.0001, 0.001 | 64% | AdaGrad, Adam | L2 | ReLU
Table 4. System configuration.
Name | Configuration
Imaging Libraries | OpenCV 2.4.11, imutils, dlib v18.16, Scikit-Learn, Scikit-Image, OpenFace
Libraries | Matplotlib, RPI.GPIO, Numpy, SciPy, PyLab
Programming Languages | Python 2.7
Operating System | NOOBS
Table 5. Normalized confusion matrix of the testing dataset without augmentation (rows: true label; columns: predicted label).
True \ Predicted | Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral
Angry | 0.68 | 0.01 | 0.05 | 0.04 | 0.08 | 0.03 | 0.10
Disgust | 0.47 | 0.44 | 0.02 | 0.02 | 0.00 | 0.04 | 0.02
Fear | 0.20 | 0.00 | 0.37 | 0.03 | 0.15 | 0.12 | 0.12
Happy | 0.03 | 0.00 | 0.01 | 0.89 | 0.01 | 0.03 | 0.03
Sad | 0.14 | 0.00 | 0.09 | 0.06 | 0.45 | 0.02 | 0.24
Surprise | 0.04 | 0.00 | 0.07 | 0.05 | 0.01 | 0.81 | 0.02
Neutral | 0.08 | 0.00 | 0.03 | 0.06 | 0.08 | 0.02 | 0.73
Table 6. Confusion matrix of the testing dataset after data augmentation (rows: true label; columns: predicted label).
True \ Predicted | Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral
Angry | 0.68 | 0.01 | 0.05 | 0.04 | 0.08 | 0.03 | 0.10
Disgust | 0.47 | 0.54 | 0.02 | 0.02 | 0.00 | 0.04 | 0.02
Fear | 0.20 | 0.00 | 0.50 | 0.03 | 0.15 | 0.12 | 0.12
Happy | 0.03 | 0.00 | 0.01 | 0.89 | 0.01 | 0.03 | 0.03
Sad | 0.14 | 0.07 | 0.05 | 0.06 | 0.45 | 0.02 | 0.24
Surprise | 0.04 | 0.01 | 0.07 | 0.03 | 0.01 | 0.81 | 0.02
Neutral | 0.08 | 0.01 | 0.03 | 0.06 | 0.08 | 0.02 | 0.73
Table 7. Comparison of proposed edge device with previous studies.
Research | Objective | Hardware-Based Device | Cloud Server | Algorithm | Accuracy
[26] | Face emotion recognition | No | No | Hybrid CNN-RNN | 94.91
[27] | Facial expression recognition | No | No | CNN | NA
[28] | Emotion recognition in the wild | Yes | No | CNN | NA
[29] | Facial expression emotion detection | Atlys Spartan-6 FPGA development board | No | SVR (Support Vector Regression) | MATLAB Simulink: 51.28%; Xilinx simulation: 47.44%
Proposed | Facial emotion detection | Raspberry-Pi-based standalone edge device | Yes | CNN + SVM | Raspberry-Pi-based edge device: 68%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

