Article

Real-Time Facial Emotion Recognition Framework for Employees of Organizations Using Raspberry-Pi

1 School of Electronics and Electrical Engineering, Lovely Professional University, Jalandhar 144001, India
2 College of Computing and Informatics, Saudi Electronic University, Dammam 15515, Saudi Arabia
3 Department of Computer Engineering, Faculty of Science and Technology, Vishwakarma University, Pune 411048, India
4 Department of Computer Engineering, College of Computer and Information Technology, Taif University, PO Box 11099, Taif 21994, Saudi Arabia
5 Department of Information Technology, College of Computer and Information Technology, Taif University, PO Box 11099, Taif 21944, Saudi Arabia
* Author to whom correspondence should be addressed.
Submission received: 19 September 2021 / Revised: 2 November 2021 / Accepted: 3 November 2021 / Published: 9 November 2021
(This article belongs to the Special Issue Research on Facial Expression Recognition)

Abstract

There is significant interest in facial emotion recognition in the fields of human–computer interaction and the social sciences. With the advancements in artificial intelligence (AI), the field of human behavioral prediction and analysis, especially of human emotion, has evolved significantly. The most common emotion recognition methods currently deploy models on remote servers. We believe that reducing the distance between the input device and the model can improve efficiency and effectiveness in real-life applications. For this purpose, computational methodologies such as edge computing can be beneficial and can enable time-critical applications in sensitive fields. In this study, we propose a Raspberry-Pi-based standalone edge device that can detect facial emotions in real time. Although this edge device can be used in a variety of applications where human facial emotions play an important role, this article mainly targets employees working in organizations. The Raspberry-Pi-based standalone edge device has been implemented using the Mini-Xception deep network because of its computational efficiency compared to other networks. The device achieved 100% accuracy for detecting and recognizing faces in real time and 68% accuracy for emotion recognition, i.e., higher than the accuracy reported in the state-of-the-art on the FER 2013 dataset. Future work will implement the deep network on a Raspberry-Pi with an Intel Movidius neural compute stick to reduce the processing time and achieve quicker real-time facial emotion recognition.

1. Introduction

In today’s context, video cameras are easily accessible to everyone, whether mobile cameras, static surveillance cameras, smartphone cameras, Raspberry-Pi cameras, or laptop cameras. With the help of these cameras, it is easy to capture human faces at any location. This kind of freedom has enabled the research community to implement smart systems for understanding human behavior in real time. To understand the behavior of a human being, expression plays the most important role. Various surveys have already been conducted to understand the different components that play major roles in understanding human emotions; they concluded that non-verbal components, particularly facial expressions, play the most important role during interpersonal communication [1]. Research in the field of emotive facial recognition has gathered attention over the last couple of decades, as the applications are not limited to computer science but extend to affective computing, computer animation, cognitive science, and the perceptual sciences [2]. Facial expressions are the major mode of exchanging feelings and emotions in daily life; small facial gestures are strong enough to convey one’s feelings to another person. Facial emotion is often more telling than verbal communication, as emotions are an actual, spontaneous reflection of a person’s feelings. A lot of research is being carried out to develop robots that are capable of understanding the facial emotions and moods of human beings [3]. Various techniques have been used to automate facial emotion recognition, and such systems have been applied during interviews [4], in surveillance systems [5], and for the detection of aggression [6]. In computer vision and artificial intelligence, facial emotion recognition is one of the most important topics. Various sensors can be used to detect facial emotions, but facial images are the most important because they carry enough information to understand interpersonal communication. Over the last couple of years, deep-learning-based FER approaches along with detailed algorithms have been proposed. In addition, there are various hybrid and deep-learning approaches based on convolutional neural networks that combine the spatial and temporal features of frames [7].
Facial emotion recognition systems have gained popularity over the decades because of their diverse applications, the majority of which are applicable to real-world activities like smart supervision for suspicious activities, marketing, and group emotion analysis. In this field, a cost-effective system has been proposed by Muhammad Sajjad et al. to help implement a smart security system for law enforcement. The system is built on a Raspberry-Pi with a Pi-Cam, which also makes it cost-effective and compact [8]. With the advancement of technology and the availability of compact devices like the Raspberry-Pi, it becomes easy to equip police and security officers with compact systems that can detect facial images in real time. In addition, with the development of cloud-based technology, the captured images can be sent to the cloud for further action. Such a cloud-assisted facial recognition framework has also been proposed by Muhammad Sajjad et al. to help police and security personnel identify criminals quickly and easily [9].
A completely automated smart home based on the Raspberry-Pi has also been proposed: Chowdhury et al. proposed this system to automate everyday tasks and provide access via the web [10]. An energy-efficient and smart water management system has been proposed as a cost-effective alternative to existing irrigation systems; Agrawal et al. proposed a system that could initially water around 50 pots kept in a garden and could be extended to larger fields [11]. Another application based on face recognition has been proposed by [12]. This system grants access only to those who are identified by the system, making it more secure. The proposed system is also based on a Raspberry-Pi and Pi-Camera, which again is cost-effective and user-friendly hardware to work with. Various challenges, such as pose mismatching [13] and the use of strong descriptors [14], have been mentioned in the literature, and several alternatives have been provided to handle such issues. Various classification techniques are available in the literature, and a few of them, like HFR (hybrid face regions) [15], are remarkably effective at extracting important information from facial images. Implementation of deep networks on smartphones has enabled face detection along with gender classification, and a pixel-pattern-based gender classification has also been proposed on the FERRET dataset, with an accuracy of 90% on frontal faces [16]. Various architectures have been proposed in the literature to implement emotion recognition in real time with multiple faces in videos [17]. It is a challenge to detect faces and emotions in critical situations; to manage such situations, a fast and accurate system based on ORFs has been proposed that provides results by working on multiple components like backgrounds, pose estimation, and face patches [18]. The contributions of this study are as follows:
  • The implementation of a hardware prototype for real-time facial emotion detection with the Raspberry-Pi.
  • A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of emotive facial images.
  • Support vector machine (SVM) classification is implemented on the Raspberry-Pi hardware for classifying persons.
The organization of the study is as follows: Section 2 presents an overview of AI and CNNs; Section 3 presents the proposed methodology for real-time facial emotion detection; Section 4 presents the hardware setup, experimental results, and a comparison of the proposed hardware with previous studies. Finally, the study is concluded in the final section.

2. Background

Artificial intelligence (AI) refers to the simulation of human intelligence in machines designed to think and recreate human actions [19]. The concept can also be extended to any system that demonstrates characteristics consistent with a human mind, such as learning and problem-solving. AI is interdisciplinary; however, advancements in machine learning and deep learning are triggering a shift in perspective for nearly all sectors [20]. Computer vision is an area of artificial intelligence that concentrates on image-related problems. Convolutional neural networks (CNNs) combined with computer vision can perform complex operations ranging from image recognition to resolving scientific problems [21]. CNNs are well known for their capability in image recognition and classification. In general, a basic convolutional neural network consists of neurons connected via multiple layers, which collect the input images and process them stage by stage. A simple CNN consists of three types of layers: the convolutional layer, the max-pooling layer, and the fully connected layer. The first two are responsible for feature extraction, introducing non-linearity, and reducing the feature maps to limit overfitting. The last layer, the fully connected layer, performs classification based on the features extracted in the previous layers and contains the majority of the parameters. The number of parameters has been further reduced in architectures like Inception V3 [22], in which a Global Average Pooling operation is added as the last layer; this layer reduces each feature map to a scalar value by taking its average. To reduce the parameters further, modern CNN architectures use residual modules [23] and depthwise-separable convolutions [5]. Depthwise-separable convolutions split the work of the convolutional layer into a per-channel filtering step followed by a pointwise combination, which further reduces the parameters. We therefore use the Mini-Xception CNN proposed by [24], which reduces the parameters by using depthwise-separable convolution layers instead of standard convolution layers and eliminates fully connected layers.
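As an illustration of these building blocks, the following is a minimal Keras sketch of a residual depthwise-separable block of the kind Mini-Xception stacks; the filter sizes and layer counts are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a residual depthwise-separable block (Mini-Xception style).
# Layer sizes are illustrative, not the paper's exact configuration.
from tensorflow.keras import layers, Model

def separable_residual_block(x, filters):
    # Shortcut branch: 1x1 strided convolution so shapes match the main branch
    shortcut = layers.Conv2D(filters, 1, strides=2, padding='same', use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main branch: two depthwise-separable convolutions instead of full convolutions
    y = layers.SeparableConv2D(filters, 3, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.SeparableConv2D(filters, 3, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling2D(3, strides=2, padding='same')(y)

    return layers.add([y, shortcut])               # residual connection

inputs = layers.Input(shape=(48, 48, 1))           # FER 2013 images are 48x48 grayscale
x = layers.Conv2D(8, 3, use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)
for f in (16, 32, 64, 128):                        # four blocks of increasing width
    x = separable_residual_block(x, f)
x = layers.Conv2D(7, 3, padding='same')(x)         # 7 emotion classes
x = layers.GlobalAveragePooling2D()(x)             # replaces the fully connected layer
outputs = layers.Activation('softmax')(x)
model = Model(inputs, outputs)
```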

3. Proposed Methodology

This section describes the proposed framework; each step is elaborated in the following subsections. The entire process is divided into three tightly coupled tasks. The first task is to train the pre-trained deep network after dividing the dataset into training, validation, and testing sets. The entire dataset of emotive facial images is divided in an 8:1:1 ratio: a dataset Ժ with N images I is split into Ժ_TR for training, Ժ_VD for cross-validation, and Ժ_T for testing, i.e., Ժ_TR = {I_1^TR, I_2^TR, I_3^TR, ..., I_{4N/5}^TR} as training images, Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD} as validation images, and Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T} as testing images. A deep convolutional neural network known as Mini-Xception is used for training, validation, and testing of the emotive facial images. Training, validation, and testing are performed on Google Colab with a 12 GB NVIDIA Tesla K80 GPU using the FER 2013 dataset.
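As a concrete illustration of this 8:1:1 partition, here is a minimal NumPy sketch; it assumes the images and labels are already loaded as arrays and uses an arbitrary shuffling seed.

```python
import numpy as np

def split_8_1_1(images, labels, seed=0):
    """Shuffle, then split N images 8:1:1 into train/validation/test parts,
    mirroring the index ranges [1, 4N/5], (4N/5, 9N/10], (9N/10, N] above."""
    n = len(images)
    order = np.random.default_rng(seed).permutation(n)
    images, labels = images[order], labels[order]
    i_tr, i_vd = (4 * n) // 5, (9 * n) // 10
    return ((images[:i_tr], labels[:i_tr]),          # training set
            (images[i_tr:i_vd], labels[i_tr:i_vd]),  # validation set
            (images[i_vd:], labels[i_vd:]))          # testing set

# Usage: (x_tr, y_tr), (x_vd, y_vd), (x_te, y_te) = split_8_1_1(X, y)
```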
The entire architecture is divided into two tightly coupled tasks, i.e., face recognition and facial emotion recognition in real time. For face recognition, a pre-trained deep network known as OpenFace is used. To begin with, a real-time image Í_1^r is captured from the video. The total set of images captured in real time is δ_R = {Í_1^r, Í_2^r, ..., Í_N^r}. To train the deep network, 6 images of each subject have been used, and the network is trained with the single-triplet method for 20 different people with N images, denoted as δ_TR. Once the training is complete, the very first task for the real-time captured image Í_1^r is to find the face inside the captured image and discard the unwanted information. The description of the parameters used in the proposed architecture is given in Table 1.
A well-known method, the Histogram of Oriented Gradients (HOG), is used to find the faces. After detection of the face, the facial image Í_1^r is cropped, and the cropped image Í_cr^r is further preprocessed to remove the effects of bad lighting, a tilted face, skewness, etc. The cropped image is preprocessed with the face landmark estimation algorithm, which locates 68 landmarks on the cropped image Í_cr^r; with the help of simple affine transformations (rotation, shear, and scale), a preprocessed image Í_pr^r is produced in which the eyes and mouth of the cropped image Í_cr^r are centered as well as possible. The preprocessed image Í_pr^r is fed to the pre-trained network, which extracts features from Í_pr^r and generates a feature vector of 128 embeddings, Φ_n^embd, that are measurements of the face. The last step is to classify the image by finding the closest match of the 128 embeddings Φ_n^embd against the database images. The feature vector Φ_n^embd is passed through a simple SVM classifier Š to recognize the face. The entire architecture with its detailed framework is represented in Figure 1.
The second task is to transmit the output of the preprocessing stage, Í_pr^r, to the cloud, where a pre-trained emotive facial recognition network is already available. This image Í_pr^r is passed through all the layers of the Mini-Xception deep network, a fast, depthwise-separable convolutional neural network, to recognize the emotion captured in the image Í_1^r. The deep network on the cloud is trained on seven basic emotions, labelled {ε^c = ε_1, ε_2, ..., ε_7}, where c represents the seven basic emotion classes.
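To make the two coupled tasks concrete, a condensed sketch of the real-time loop is given below. The helper names detect_and_align, face_embedding, svm_classifier, and emotion_cnn are hypothetical placeholders for the stages detailed in Sections 3.1, 3.2, 3.3, 3.4, and 3.6; this is not the authors' code.

```python
# Hypothetical glue code for the two coupled tasks described above.
# detect_and_align(), face_embedding(), svm_classifier and emotion_cnn are
# placeholders for the stages detailed in Sections 3.1-3.6, not the authors' API.
import cv2

EMOTIONS = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

cap = cv2.VideoCapture(0)                              # Pi-Cam exposed as a video device
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for aligned_face in detect_and_align(frame):       # HOG + 68-landmark alignment
        embedding = face_embedding(aligned_face)       # 128-D OpenFace embedding
        name = svm_classifier.predict([embedding])[0]  # task 1: who is it?
        gray = cv2.cvtColor(aligned_face, cv2.COLOR_BGR2GRAY)
        gray = cv2.resize(gray, (48, 48)) / 255.0      # Mini-Xception input size
        probs = emotion_cnn.predict(gray.reshape(1, 48, 48, 1))[0]
        print(name, EMOTIONS[int(probs.argmax())])     # task 2: what are they feeling?
```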

3.1. Face Detection

After capturing the facial images in real time using the Pi-Cam, the first and foremost task is to separate the faces. A variety of methods are available for removing unwanted and redundant information, such as the background, from the facial images. The best-known method is the Viola-Jones algorithm, invented in the early 2000s; we use another method, the Histogram of Oriented Gradients (HOG), to detect the facial images. The Raspberry-Pi captures the emotive facial images in real time via the Pi-Cam from real-time video frames. The real-time captured input image Í^r is converted to grayscale, and HOG features are extracted to locate the facial part of the input image Í^r. Finally, the facial image Í_cr^r is cropped and fed to a feature extraction unit to recognize faces in real time. Figure 2a shows the basic steps of face detection using HOG. As shown in Figure 2, the gradients are calculated for the entire grayscale image, 16 × 16 pixels at a time.
This calculation is repeated across the entire grayscale image, and we end up with an image of gradients. The next step is to find the strongest gradient in each 16 × 16 window of pixels and replace the gradients in that window with the strongest one. This results in a simplified image that captures the basic structure of the face. To locate the face in the real-time captured input image or in the real-time video, we find the part of the image that most closely resembles a known HOG face pattern and crop that part of the input image Í^r; as a result, we obtain the cropped facial image Í_cr^r.
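A minimal sketch of this detection step is shown below, using dlib's built-in HOG-based frontal face detector as one common implementation of the approach described above; the upsampling parameter is an assumption.

```python
# Sketch of the HOG face-detection step with dlib's HOG + linear-SVM detector.
import cv2
import dlib

hog_detector = dlib.get_frontal_face_detector()

def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # HOG is computed on grayscale
    rects = hog_detector(gray, 1)                    # 1 = upsample once for small faces
    crops = []
    for r in rects:
        # Clamp the detected rectangle to the frame boundaries, then crop the face
        top, bottom = max(r.top(), 0), min(r.bottom(), gray.shape[0])
        left, right = max(r.left(), 0), min(r.right(), gray.shape[1])
        crops.append(frame[top:bottom, left:right])
    return crops
```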

3.2. Face Alignment

As the face is captured in real time, the captured image can contain faces turned in different directions. To deal with such situations, we warp each picture so that our system can locate the eyes and lips in the same place. To perform this operation, we use the face landmark estimation algorithm proposed by [23]. The main task of this algorithm is to locate 68 specific points, known as facial landmarks, on every face, as shown in Figure 2b.
These landmarks locate the eyes, nose, chin, lips, eyebrows, etc., on any face. As explained by [19], S = (x_1^T, x_2^T, ..., x_p^T)^T ∈ R^{2p} is the vector that represents the p facial landmarks in image I. The main aim is to estimate S as closely as possible to the true shape, with the estimate denoted by Ŝ^(t). This is done with the help of a cascade of regressors, where each regressor keeps predicting and continuously updating the vector so that the estimation becomes more accurate. The update rule Ŝ^(t+1) = Ŝ^(t) + r_t(I, Ŝ^(t)) describes how each regressor r_t(·,·) is used in the cascade to predict and update the vector Ŝ^(t+1).
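A minimal sketch of this alignment step is shown below. It assumes dlib's publicly available 68-point shape predictor model file is present locally, and that centering the eyes and mouth at fixed fractions of the output image is an acceptable target geometry.

```python
# Sketch of face alignment: dlib's 68-point ensemble-of-regression-trees shape
# predictor locates the landmarks, and an affine transform centres eyes and mouth.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')  # local path assumed

def align_face(image, size=96):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)
    if not rects:
        return None
    shape = predictor(gray, rects[0])
    pts = np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)
    left_eye = pts[36:42].mean(axis=0)        # landmarks 36-41: left eye
    right_eye = pts[42:48].mean(axis=0)       # landmarks 42-47: right eye
    mouth = pts[48:68].mean(axis=0)           # landmarks 48-67: mouth
    src = np.float32([left_eye, right_eye, mouth])
    dst = np.float32([[0.3 * size, 0.35 * size],   # where the eyes/mouth should land
                      [0.7 * size, 0.35 * size],
                      [0.5 * size, 0.75 * size]])
    M = cv2.getAffineTransform(src, dst)           # rotation, shear and scale
    return cv2.warpAffine(image, M, (size, size))
```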

3.3. Face Encoding

The next important step is to extract features from the exactly centered image. The best way to obtain unique features of any facial image is to measure the face, and the dimensions of each face are different. The main challenge is knowing which measurements play a vital role in recognizing the captured image; this is difficult to achieve with traditional feature extraction methods. To achieve accuracy and raise the speed, a deep network is trained, as machines have proven to be better than humans at this kind of prediction. Training a deep network requires a lot of computation and power, so we used a pre-trained network provided by OpenFace [20]. We simply give the input, and the deep network produces 128 measurements for each face. Rather than a single face, the network is trained on 3 facial images at a time, as shown in Figure 3. This is achieved by training on a first image of a person (the anchor) together with a second image of the same person (the positive) and a completely different image of another person (the negative), as shown in Figure 4. The main purpose is to have the anchor image closer to the positive image than to any other image, called the negative image. The selection of a triplet for producing the 128 measurements is important. Machine learning practitioners call these per-face measurements an “embedding”. Training the face embedding on a large set of images (a dataset) improves the accuracy and eventually decreases the error rate; this process requires huge CPU power and a lot of time. To understand the triplet loss, consider a function f(y) ∈ R^s that maps an image y into an s-dimensional Euclidean space. We constrain this embedding to live on the s-dimensional hypersphere, i.e., ‖f(y)‖_2 = 1. As shown in Figure 3a, the main aim is to achieve a smaller distance between the anchor y_j^m of a specific person and all the other images y_j^p (positives) of the same person than between the anchor and the image of any other person y_i^n (negative).
So, we want to have the following:
‖y_j^m − y_j^p‖_2^2 + β < ‖y_j^m − y_j^n‖_2^2   ∀ (y_j^m, y_j^n, y_j^p) ∈ ζ        (1)
where β is the enforced margin between the negative and positive pairs of images and ζ is the set of all possible triplets, whose cardinality is P.
The corresponding triplet loss minimized over the P triplets is:
L = Σ_{j}^{P} [ ‖f(y_j^m) − f(y_j^p)‖_2^2 − ‖f(y_j^m) − f(y_j^n)‖_2^2 + β ]_+        (2)
Generation of multiple triplets will help to overcome the issue faced in Equation (1) and selection of suitable and complex triplets will result in the improvement of the deep learning model.
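For clarity, here is a small NumPy sketch of the loss in Equation (2), under the assumption that the 128-dimensional embeddings have already been computed and L2-normalized; the margin value is illustrative.

```python
# Minimal NumPy sketch of the triplet loss in Equation (2).
# f_anchor, f_positive, f_negative: arrays of 128-D embeddings f(y), unit-normalized.
import numpy as np

def triplet_loss(f_anchor, f_positive, f_negative, beta=0.2):
    """Hinge-style triplet loss: penalize triplets where the anchor-positive
    distance is not smaller than the anchor-negative distance by at least beta."""
    pos_dist = np.sum((f_anchor - f_positive) ** 2, axis=-1)
    neg_dist = np.sum((f_anchor - f_negative) ** 2, axis=-1)
    return np.sum(np.maximum(pos_dist - neg_dist + beta, 0.0))
```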

3.4. SVM Based Classification

The last and most important step is finding the names of persons from the encodings. Different techniques have been presented for evaluating various classifiers. A variety of machine learning classification algorithms can be used to classify the faces, but the simplest and most efficient one, the support vector machine (SVM), is used here. We keep it simple because we only want the output to be the face with the name of the person. Moreover, since we are implementing this on a Raspberry-Pi, we want our system to be fast and accurate. Running this classifier on the hardware takes milliseconds, which is what we want, and the result of the classifier is the name of the person.
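A minimal scikit-learn sketch of this classification stage is shown below; `embeddings` (an array of 128-dimensional vectors), `names` (the corresponding person labels), and `new_embedding` are assumed to come from the encoding step described above.

```python
# Sketch of the final classification stage: a linear SVM over 128-D embeddings.
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(names)              # person names -> integer labels
classifier = SVC(kernel='linear', probability=True)
classifier.fit(embeddings, y)                       # embeddings: shape (n_samples, 128)

# Prediction on a new 128-D embedding takes only milliseconds on the Raspberry-Pi
pred = classifier.predict([new_embedding])
print(label_encoder.inverse_transform(pred)[0])     # the recognized person's name
```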

3.5. Dataset

The training of the network is illustrated in Figure 4: unique images are mapped from a single network into triplets, and the gradient of the triplet loss is back-propagated to the unique images through this mapping. The dataset we used consists of 35,887 images of facial emotions in seven categories (0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral). The FER 2013 dataset consists of 48 × 48 pixel grayscale images (https://www.kaggle.com/msambare/fer2013 (accessed on 11 May 2021)). The dataset we used is in csv format, consisting of only two columns, i.e., “emotions” and “pixels”, and is kept on Google Drive. The entire dataset is divided in an 8:1:1 ratio for training Ժ_TR, validation Ժ_VD, and testing Ժ_T.
The images in the dataset have been categorized by expression; the total number of images in each category is shown in Table 2, and a graphical representation of the dataset is shown in Figure 5. The FER 2013 dataset is not a uniform dataset: it does not contain a uniform number of images in each category. Figure 6 shows sample images from the FER 2013 dataset. A large number of datasets are available for detecting facial emotions.
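A short sketch of loading the csv is given below. It assumes the column names of the public Kaggle release ('emotion' and 'pixels'); adjust the label column if your copy names it 'emotions' as described above, and the file path is an assumption about the local setup.

```python
# Sketch of loading FER 2013: each row holds an integer emotion label and a
# space-separated string of 48*48 = 2304 pixel values.
import numpy as np
import pandas as pd

data = pd.read_csv('fer2013.csv')                            # columns: 'emotion', 'pixels'
pixels = data['pixels'].apply(lambda s: np.array(s.split(), dtype='float32'))
X = np.stack(pixels.to_list()).reshape(-1, 48, 48, 1)        # one 48x48 image per row
X = X / 255.0                                                # scale pixels to [0, 1]
y = data['emotion'].values                                   # integer labels 0..6
# One-hot encode y before training with categorical cross-entropy.
```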

3.6. Training CNN Model: Mini Xception

The dataset is kept on Google Drive, and the training was done on Google Colab with a 12 GB NVIDIA Tesla K80 GPU. The CNN was trained with 80% of the FER dataset, and 10% of the dataset was kept for validation. The architecture of Mini-Xception proposed by [24] is shown in Figure 7. Testing was done on the remaining 10% of the data, as shown in Figure 9, and on the input given by the Raspberry-Pi after detecting the face and converting the cropped and pre-processed images to 48 × 48 pixels. This architecture is trained on the FER 2013 dataset because we want the response to be quick, and Mini-Xception has proven to be quick and light due to its unique architecture, which replaces convolutional layers with depthwise-separable convolutional layers, reducing the number of parameters and making it practical for real-time emotion recognition. The training results are shown in Figure 10 and the training loss is represented in Figure 11. As our system is based on the Raspberry-Pi, which has certain constraints in terms of memory and processing capability, a smaller number of parameters will be helpful in future extensions of this system. We achieved an accuracy of 66% on the FER 2013 dataset without augmentation, as reported in the state-of-the-art, and 68% after data augmentation. The main reason for this accuracy level is the variation in the dataset and the non-uniform number of images in each category.
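A sketch of a typical Keras training setup for this stage is shown below. It assumes `model` is the Mini-Xception network and that the training/validation arrays (with one-hot labels) have been prepared; the optimizer, batch size, epoch count, and callbacks are illustrative, not the authors' exact settings.

```python
# Illustrative training setup on Google Colab (assumes `model`, `x_train`,
# `y_train`, `x_val`, `y_val` with one-hot labels over the 7 classes).
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    EarlyStopping(monitor='val_loss', patience=20),                 # stop when validation stalls
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10), # shrink the learning rate
]

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=100,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)
```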

4. Experimental Results

The detailed results are presented in this section. We divided the entire architecture into two main tasks. The first task, face recognition, achieved an accuracy of 100%: all the faces that were trained were recognized correctly, as OpenFace provides near-human accuracy [20] on the LFW benchmark. We therefore used and implemented it on real-time video with the help of the Raspberry-Pi 3 B+ model. We used Python with OpenCV and, with the help of the Pi-Cam, acquired live video and recognized the faces in it. The proposed framework can locate multiple faces in the frames but recognizes only those that have already been trained and are present in the dataset. Face recognition gave correct results even with different objects such as spectacles. We installed the setup alongside the biometric attendance area so that, at the time of punching in and punching out, the expression on the faces of employees can be recorded; after collecting data for a fortnight, the recorded faces with names and recognized expressions can be analysed. This analysis can be useful for recognizing consistent behavior of employees in private organizations. For example, an employee with a constantly sad or disgusted expression can be identified and referred to a happiness cell or psychological support cell, so that such employees can be helped to feel good and comfortable in the workplace. The hardware setup of the proposed system is shown in Figure 8.
The comparison with various models is shown in Table 3, the specifications of the system are given in Table 4, and the detailed procedures are explained in Algorithms 1 and 2. After detecting the face in real time, the cropped and pre-processed image is given to the pre-trained deep network, which is trained on the FER 2013 dataset using Python and Keras. The results after classification are shown in the confusion matrix in Table 5. Several misclassifications were found; for example, “disgust” is misclassified as “angry”. From the dataset, one can easily see that the count of disgusted faces is the lowest, at 547. This simply indicates that the network has been trained on fewer examples from which to learn features of a disgusted face compared with the other classes; hence, the misclassification took place.
The total number of parameters for which the network has been trained is 2,134,407. When tested on real-time video, 110 out of 120 images with expressions were recognized correctly. Figure 10 is the graphical representation of model accuracy and Figure 11 is the graphical representation of model loss. Figure 12 shows the real-time face recognition result. Table 3 shows the comparison of the available models.
Algorithm 1. Face Detection in Real-Time.
Input: Real-time video of subjects Í
I. Capture the real-time image Í^r from the real-time video frames.
II. Facial dataset creation:
For k = 1 to size of (δ_TR) = ℕ(δ_R = {Í_1^r, Í_2^r, ..., Í_n^r}), where ℕ is the number of samples per subject and n is the total number of subjects captured:
    a. Select an image Í_1^r from δ_R.
    b. Convert the image Í_1^r to grayscale: Í_G^r ← GrayScale(Í_1^r).
    c. Detect the face region using the Histogram of Oriented Gradients (HOG): Í_H^r ← HOG(Í_G^r).
    d. Crop the facial region: Í_CR^r ← Cropped(Í_H^r).
    e. Preprocess the cropped image Í_CR^r by applying facial landmarks and an affine transform: Í_pr^r ← Preprocessing(Í_CR^r).
    f. Repeat sub-steps a to e of step II for n × Ď times.
End
III. Label all n·Ď(Í_pr^r) images of the dataset with the names of the subjects: (δ_TR)^l ← Labeled(n·Ď(Í_pr^r)).
IV. Training and feature extraction:
For k = 1 to size of (δ_TR)^l, where l represents the labelled data:
    a. Select the anchor image y_j^m = (1.Í_1^r) of the first subject.
    b. Select the positive image y_j^p = (2.Í_1^r) of the same subject.
    c. Select the negative image y_i^n = (1.Í_2^r) of the second subject.
    d. Feed the images y_j^m, y_j^p, y_i^n to the pre-trained deep network.
    e. Repeat the training to achieve ‖y_j^m − y_j^p‖_2^2 + β < ‖y_j^m − y_j^n‖_2^2 for all (y_j^m, y_j^n, y_j^p) ∈ ζ, where β is the enforced margin between negative and positive pairs of images and ζ is the set of all possible triplets, with cardinality M.
    f. Generate multiple triplets, summing over j ∈ M, to improve the deep learning model.
    g. Generate the feature vector Φ_n^embd, where n is the number of embeddings generated by the deep network.
End
V. Cross-validate by capturing an image Í^r in real time:
    i. Repeat sub-steps a to e of step II (facial dataset creation) for Í^r.
    ii. Repeat sub-steps a to f of step IV (training and feature extraction).
VI. Feed the image to the classifier to classify the image Í^r in real time by passing all the feature vectors Φ_n^embd (embeddings) generated in sub-steps a to g of step IV.
Output: Prediction of the face of the subject with name in real time.
Algorithm 2. Emotion Detection in Real-Time.
Input: Emotive facial dataset Ժ
I. Divide the dataset into training Ժ_TR, validation Ժ_VD, and testing Ժ_T.
II. Training and feature extraction:
For j = 1 to size of (Ժ_TR = {I_1^TR, ..., I_{4N/5}^TR}), where N is the total number of images in Ժ:
    a. Select the image I_1^TR from Ժ_TR.
    b. Feed the input I_1^TR to the deep network.
    c. Train the network by passing the images with their labels and let the network extract all the parameters.
End
III. Cross-validation:
For j = 1 to size of (Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD}):
    a. Select the image I_{(4N/5)+1}^VD from Ժ_VD.
    b. Repeat sub-steps b and c of step II (training and feature extraction) for the images in Ժ_VD.
    c. Use the validation images Ժ_VD = {I_{(4N/5)+1}^VD, ..., I_{9N/10}^VD} for the reduction of overfitting.
End
IV. Testing:
For j = 1 to size of (Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T}):
    i. Select the image I_{(9N/10)+1}^T from Ժ_T.
    ii. Repeat sub-steps b and c of step II (training and feature extraction) for the images in Ժ_T.
    iii. Use the testing images Ժ_T = {I_{(9N/10)+1}^T, ..., I_N^T} to test the trained network for efficiency.
End
V. Real-time testing:
Take the input from sub-step e of step II of Algorithm 1.
For k = 1 to size of (δ_TR) = ℕ(δ_R = {Í_1^r, Í_2^r, ..., Í_n^r}), where ℕ is the number of samples per subject and n is the total number of subjects captured:
    i. Resize the preprocessed image Í_pr^r ← Preprocessing(Í_CR^r): Í_RS^r ← Resize(Í_pr^r).
    ii. Repeat sub-steps b and c of step II (training and feature extraction) for the images Í_RS^r.
End
VI. Predict the facial expression {ε^c = ε_1, ε_2, ..., ε_7}, where the seven basic emotion classes are represented as c = [Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral].
Output: Prediction of the facial emotion of the subject in real time.
To evaluate the performance and effectiveness of the proposed edge device, we compared it with previous studies on facial emotion detection. Hardware implementations of facial emotion detection and recognition are relatively rare, and the few studies that have implemented such hardware reported lower accuracies, e.g., 51.28% and 47.44%, compared with the proposed model's 68%.
As discussed in the section above, the FER 2013 dataset is not a balanced dataset. A total of 35,887 images of 7 classes are present in this dataset. The unbalanced dataset gave results that were very low, specifically for the disgust, fear, and sad emotions, so a data balancing technique was used. The Keras API helps to enlarge the dataset by applying various transformations through the ImageDataGenerator function; this mainly includes five operations, i.e., rotation at a certain angle, shearing, zooming, rescaling, and horizontal flipping. Before data augmentation, a total of 35,887 images were used, of which only 547 were of disgusted expressions. After applying data augmentation, a total of 41,904 images were used, of which 6564 were of disgusted faces. The confusion matrix after data augmentation is shown in Table 6.
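A sketch of this augmentation step with Keras' ImageDataGenerator is shown below; the parameter values and the oversampling of the disgust class (the `x_disgust`, `y_disgust` arrays) are illustrative assumptions, not the authors' exact settings.

```python
# Sketch of the five augmentation operations named above, using Keras.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15,      # rotation at a certain angle
                               shear_range=0.1,        # shearing
                               zoom_range=0.1,         # zooming
                               rescale=1.0 / 255,      # rescale pixel values
                               horizontal_flip=True)   # horizontal flip

# Generate augmented batches, e.g. to oversample the under-represented "disgust"
# class; x_disgust/y_disgust are assumed to hold that class's images and labels.
flow = augmenter.flow(x_disgust, y_disgust, batch_size=32)
```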
As Table 6 shows, the predictions for disgust and fear have improved after augmentation. The overall efficiency of the system rose by 2% after data augmentation, reaching 68%. Table 7 shows the comparison of the proposed edge device with previous studies.

5. Conclusions

Real-time detection of any kind of suspicious activity is difficult without actual interaction with the subject or suspect, and reading the face of a person in real time is a challenging task. With the help of compact and portable devices, it becomes easier for most organizations to understand the behavior of their employees and resolve minor and major issues at an early stage. To achieve this, a framework has been proposed and tested that can be implemented in any organization to understand employee behavior. The proposed framework is a cost-effective and compact alternative to the heavy and bulky systems that are difficult to deploy in real time. The system has been tested on 20 different people with all 7 emotions, and out of a total of 120 images, 110 were identified with the correct emotion in real time. The proposed framework has been implemented using the Mini-Xception deep network because of its computational efficiency compared to other networks.
Facial expression representation plays an important role in facial expression recognition. It can be viewed as generating good features for describing the appearance, structure, and motion of facial expressions. More specifically, facial expression features attempt to effectively describe facial muscle activity or facial motion in static or dynamic facial images. Numerous works have addressed this and, although different proposed methods for facial expression recognition have achieved good results, several problems remain to be addressed by the research community. The most important one is face variability for a single person: many factors can cause two pictures of the same person to look totally different, such as lighting, facial expression, or occlusion. Another problem to take into account is the environment. Except in controlled scenarios, face pictures have very different backgrounds, which can make the problem of face recognition more difficult; to address this, many of the most successful systems focus on the face alone, discarding all the surroundings. Smart meetings, video conferencing, and visual surveillance are some of the real-world applications that require a facial expression recognition system that works adequately on low-resolution images, yet very few existing methods provide adequate results on such images. More research effort is required for recognizing facial expressions more complex than the six classical ones, such as fatigue and pain, and mental states such as agreeing, disagreeing, lying, frustration, and thinking, as they have numerous application areas. Other problems include expression intensity estimation, spontaneous expression recognition, micro-expression recognition (a brief, involuntary facial expression lasting only 1/25 to 1/15 of a second), misalignment, illumination, and face pose variation. Moreover, studies have shown that visual captures of facial expressions alone are not sufficient to identify exact human emotions. This research can be carried further by combining FER systems with various physiological sensors to identify the exact mental state of a person.

Author Contributions

N.R., A.G. and R.S. made contributions to conception and manuscript writing; A.S.A. and S.S.A. examined and supervised this research and outcomes; M.R. and Z.K. revised and polished the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Taif University Research Supporting Project number (TURSP-2020/311), Taif University, Taif, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this research will be made available on request to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wiener, M.; Mehrabian, A. Language within Language: Immediacy, a Channel in Verbal Communication; Ardent Media: Lake Geneva, WI, USA, 1968; ISBN 0891972684.
  2. Kaulard, K.; Cunningham, D.W.; Bülthoff, H.H.; Wallraven, C. The MPI facial expression database—A validated database of emotional and conversational facial expressions. PLoS ONE 2012, 7, e32321.
  3. Zeng, J.; Shan, S.; Chen, X. Facial expression recognition with inconsistently annotated datasets. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 222–237.
  4. Edwards, J.; Jackson, H.J.; Pattison, P.E. Emotion recognition via facial expression and affective prosody in schizophrenia: A methodological review. Clin. Psychol. Rev. 2002, 22, 789–832.
  5. Amos, B.; Ludwiczuk, B.; Satyanarayanan, M. OpenFace: A general-purpose face recognition library with mobile applications. CMU Sch. Comput. Sci. 2016, 6, 1–20.
  6. Ashraf, A.B.; Lucey, S.; Cohn, J.F.; Chen, T.; Ambadar, Z.; Prkachin, K.M.; Solomon, P.E. The painful face–pain expression recognition using active appearance models. Image Vis. Comput. 2009, 27, 1788–1796.
  7. Ko, B.C. A brief review of facial emotion recognition based on visual information. Sensors 2018, 18, 401.
  8. Sajjad, M.; Nasir, M.; Ullah, F.U.M.; Muhammad, K.; Sangaiah, A.K.; Baik, S.W. Raspberry Pi assisted facial expression recognition framework for smart security in law-enforcement services. Inf. Sci. 2019, 479, 416–431.
  9. Sajjad, M.; Nasir, M.; Muhammad, K.; Khan, S.; Jan, Z.; Sangaiah, A.K.; Elhoseny, M.; Baik, S.W. Raspberry Pi assisted face recognition framework for enhanced law-enforcement services in smart cities. Futur. Gener. Comput. Syst. 2020, 108, 995–1007.
  10. Chowdhury, M.N.; Nooman, M.S.; Sarker, S. Access Control of Door and Home Security by Raspberry Pi Through Internet. Int. J. Sci. Eng. Res. 2013, 4, 550–558.
  11. Agrawal, N.; Singhal, S. Smart drip irrigation system using Raspberry Pi and Arduino. In Proceedings of the International Conference on Computing, Communication & Automation, IEEE, Greater Noida, India, 15–16 May 2015; pp. 928–932.
  12. Chowhan, R.S.; Tanwar, R. Password-Less Authentication: Methods for User Verification and Identification to Login Securely Over Remote Sites. In Machine Learning and Cognitive Science Applications in Cyber Security; IGI Global: Hershey, PA, USA, 2019; pp. 190–212.
  13. Srikote, G.; Meesomboon, A. Face Recognition Performance Improvement using a Similarity Score of Feature Vectors based on Probabilistic Histograms. Adv. Electr. Comput. Eng. 2016, 16, 107–113.
  14. Bashar, F.; Khan, A.; Ahmed, F.; Kabir, H. Face recognition using similarity pattern of image directional edge response. Adv. Electr. Comput. Eng. 2014, 14, 69–77.
  15. Lajevardi, S.M.; Hussain, Z.M. Feature extraction for facial expression recognition based on hybrid face regions. Adv. Electr. Comput. Eng. 2009, 9, 63–67.
  16. Haider, K.Z.; Malik, K.R.; Khalid, S.; Nawaz, T.; Jabbar, S. Deepgender: Real-time gender classification using deep learning for smartphones. J. Real-Time Image Process. 2019, 16, 15–29.
  17. Lu, H.; Huang, Y.; Chen, Y.; Yang, D. Automatic gender recognition based on pixel-pattern-based texture feature. J. Real-Time Image Process. 2008, 3, 109–116.
  18. Greche, L.; Akil, M.; Kachouri, R.; Es-Sbai, N. A new pipeline for the recognition of universal expressions of multiple faces in a video sequence. J. Real-Time Image Process. 2020, 17, 1389–1402.
  19. Yoon, J.; Kim, D. An accurate and real-time multi-view face detector using ORFs and doubly domain-partitioning classifier. J. Real-Time Image Process. 2019, 16, 2425–2440.
  20. Lu, Y. Artificial intelligence: A survey on evolution, models, applications and future trends. J. Manag. Anal. 2019, 6, 1–29.
  21. Pannu, A. Artificial intelligence and its application in different areas. Artif. Intell. 2015, 4, 79–84.
  22. Lemley, J.; Bazrafkan, S.; Corcoran, P. Deep Learning for Consumer Devices and Services: Pushing the limits for machine learning, artificial intelligence, and computer vision. IEEE Consum. Electron. Mag. 2017, 6, 48–56.
  23. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1867–1874.
  24. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  25. Sambare, M. FER-2013: Learn Facial Expressions from an Image. Available online: https://www.kaggle.com/msambare/fer2013 (accessed on 11 May 2021).
  26. Bența, K.-I.; Vaida, M.-F. Towards real-life facial expression recognition systems. AECE 2015, 15, 93–102.
  27. Jain, N.; Kumar, S.; Kumar, A.; Shamsolmoali, P.; Zareapoor, M. Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 2018, 115, 101–106.
  28. Zhang, H.; Jolfaei, A.; Alazab, M. A face emotion recognition method using convolutional neural network and image edge computing. IEEE Access 2019, 7, 159081–159089.
  29. Riaz, M.N.; Shen, Y.; Sohail, M.; Guo, M. ExNet: An efficient approach for emotion recognition in the wild. Sensors 2020, 20, 1087.
Figure 1. Architecture for face recognition and facial emotion recognition in real time using Raspberry-Pi (Dataset from [25]).
Figure 2. (a) Face detection using Histogram of Oriented Gradients (HOG) and (b) Face is aligned in real time using ensemble of regression trees algorithm.
Figure 3. Generation of 128-dimensional data from triplet.
Figure 4. Network training flow for M unique images [Adapted from 5].
Figure 5. Graphical representation of class distribution.
Figure 6. Sample data from FER 2013 dataset.
Figure 7. Mini-Xception Architecture.
Figure 8. (a) Hardware setup (b,c) Hardware setup in Employees Cubical.
Figure 9. Graphical representation of data segregated for (a) Training (b) Validation and (c) Testing.
Figure 10. Graphical representation of training.
Figure 11. Graphical representation of loss while training.
Figure 12. Face recognition result in real-time (a) without spectacles (b) with spectacles.
Table 1. Description of parameters of proposed architecture.
Ժ: emotive facial dataset
Ժ_TR: training images
Ժ_VD: validation images
Ժ_T: testing images
δ_R: real-time captured images
δ_TR: training dataset of facial images
I: image representation
I_1^TR: first training image
I_{4N/5}^TR: 80% training images
I_{9N/10}^VD: 10% validation images
I_N^T: N testing images
Í: real-time image representation
Í_1^r: first real-time facial image
Í_cr^r: cropped real-time facial image
Í_pr^r: preprocessed real-time facial image
Í_n^r: n real-time facial images
Φ_n^embd: feature vector of 128 face embeddings
Š: SVM classifier
ε^c: classifier labels with c = 0 to 6 classes
Ď: face detection database images
Table 2. Data set classes.
Class | Emotion | Number | After Augmentation
0 | Angry | 4953 | 4953
1 | Disgust | 547 | 6564
2 | Fear | 5121 | 5121
3 | Happy | 8989 | 8989
4 | Sad | 6077 | 6077
5 | Surprise | 4002 | 4002
6 | Neutral | 6198 | 6198
Table 3. Comparison with various Models.
Model | Accuracy | Learning Rate | Test Accuracy | Optimizer | Regularization | Activation Function
Mini_Xception | 73% | 0.005 | 68% | Adam, SGD | L1 | ReLU
Densenet161 | 59% | 0.001, 0.001, 0.005 | 43% | Adam, SGD | L2 | Sigmoid
Resnet38 | 68% | 0.0001 | 60% | SGD, AdaGrad | L1 | Sigmoid
Mobilenet_V2 | 72.5% | 0.0001, 0.001 | 64% | AdaGrad, Adam | L2 | ReLU
Table 4. System configuration.
Name | Configuration
Imaging Libraries | OpenCV 2.4.11, imutils, dlib v18.16, Scikit-Learn, Scikit-Image, OpenFace
Libraries | Matplotlib, RPI.GPIO, Numpy, SciPy, PyLab
Programming Languages | Python 2.7
Operating System | NOOBS
Table 5. Normalized confusion matrix of the testing dataset without augmentation (rows: true label; columns: predicted label).
True \ Predicted | Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral
Angry | 0.68 | 0.01 | 0.05 | 0.04 | 0.08 | 0.03 | 0.10
Disgust | 0.47 | 0.44 | 0.02 | 0.02 | 0.00 | 0.04 | 0.02
Fear | 0.20 | 0.00 | 0.37 | 0.03 | 0.15 | 0.12 | 0.12
Happy | 0.03 | 0.00 | 0.01 | 0.89 | 0.01 | 0.03 | 0.03
Sad | 0.14 | 0.00 | 0.09 | 0.06 | 0.45 | 0.02 | 0.24
Surprise | 0.04 | 0.00 | 0.07 | 0.05 | 0.01 | 0.81 | 0.02
Neutral | 0.08 | 0.00 | 0.03 | 0.06 | 0.08 | 0.02 | 0.73
Table 6. Confusion matrix of the testing dataset after data augmentation (rows: true label; columns: predicted label).
True \ Predicted | Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral
Angry | 0.68 | 0.01 | 0.05 | 0.04 | 0.08 | 0.03 | 0.10
Disgust | 0.47 | 0.54 | 0.02 | 0.02 | 0.00 | 0.04 | 0.02
Fear | 0.20 | 0.00 | 0.50 | 0.03 | 0.15 | 0.12 | 0.12
Happy | 0.03 | 0.00 | 0.01 | 0.89 | 0.01 | 0.03 | 0.03
Sad | 0.14 | 0.07 | 0.05 | 0.06 | 0.45 | 0.02 | 0.24
Surprise | 0.04 | 0.01 | 0.07 | 0.03 | 0.01 | 0.81 | 0.02
Neutral | 0.08 | 0.01 | 0.03 | 0.06 | 0.08 | 0.02 | 0.73
Table 7. Comparison of proposed edge device with previous studies.
Research | Objective | Hardware-Based Device | Cloud Server | Algorithm | Accuracy
[26] | Face emotion recognition | No | No | Hybrid CNN-RNN | 94.91
[27] | Facial expression recognition | No | No | CNN | NA
[28] | Emotion recognition in the wild | Yes | No | CNN | NA
[29] | Facial expression emotion detection | Atlys Spartan-6 FPGA development board | No | SVR (Support Vector Regression) | MATLAB Simulink: 51.28%; Xilinx simulation: 47.44%
Proposed | Facial emotion detection | Raspberry-Pi-based standalone edge device | Yes | CNN + SVM | Raspberry-Pi-based edge device: 68%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

