Pattern Recognition in Multimedia Signal Analysis

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 April 2022) | Viewed by 30728

Special Issue Editors


Dr. Theodoros Giannakopoulos
Guest Editor
Institute of Informatics and Telecommunications, National Center for Scientific Research, Athens, Greece
Interests: audio/speech analysis; multimodal information retrieval

Prof. Dr. Evaggelos Spyrou
Guest Editor
Department of Informatics and Telecommunications, University of Thessaly, Thessaly, Greece
Interests: multimedia analysis; computer vision; human activity recognition; emotion recognition; deep learning

Special Issue Information

Dear Colleagues,

Huge amounts of multimedia data have been generated in recent years, either by professional "content providers" (TV, movies, internet TV, and music videos) or as user-generated content (vlogs, social media multimodal content, and multisensor data). The need for automatic indexing, classification, content visualization, and recommendation through multimodal pattern recognition is therefore evident across a wide range of applications. In addition, multimedia data exhibit much richer structures and representations than simpler forms of data, and the related pattern recognition approaches must take this into consideration.

This Special Issue focuses on novel approaches for analyzing multimodal content using pattern recognition and signal analysis algorithms. Application areas include but are not limited to video summarization, content-based multimedia indexing and retrieval, content-based recommender systems, multimodal behavior and emotion recognition, patient/elderly home monitoring based on multimodal sensors, mental health monitoring, and autonomous driving.

In this Special Issue, we invite submissions that report on cutting-edge research in the broad spectrum of pattern recognition in multimedia analysis, related to the aforementioned areas. Survey papers and reviews in a specific research and/or application area are also welcome. All submitted papers will undergo our standard peer-review procedure. Accepted papers will be published in open-access format in Applied Sciences and collected together on this Special Issue website.

Dr. Theodoros Giannakopoulos
Prof. Dr. Evaggelos Spyrou
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Multimodal signal processing
  • Multimedia pattern recognition
  • Audio-visual fusion

Published Papers (7 papers)


Research


31 pages, 1145 KiB  
Article
Multimodal Classification of Safety-Report Observations
by Georgios Paraskevopoulos, Petros Pistofidis, Georgios Banoutsos, Efthymios Georgiou and Vassilis Katsouros
Appl. Sci. 2022, 12(12), 5781; https://0-doi-org.brum.beds.ac.uk/10.3390/app12125781 - 07 Jun 2022
Cited by 2 | Viewed by 1743
Abstract
Modern businesses are obligated to conform to regulations to prevent physical injuries and ill health for anyone present on a site under their responsibility, such as customers, employees, and visitors. Safety officers (SOs) are engineers who perform site audits of businesses, record observations regarding possible safety issues, and make appropriate recommendations. In this work, we develop a multimodal machine-learning architecture for the analysis and categorization of safety observations, given textual descriptions and images taken from the location sites. For this, we utilize a new multimodal dataset, Safety4All, which contains 5344 safety-related observations created by 86 SOs in 486 sites. An observation consists of a short issue description written by the SOs, accompanied by images showing the issue, relevant metadata, and a priority score. Our proposed architecture is based on the joint fine-tuning of large pretrained language and image neural network models. Specifically, we propose the use of a joint task and contrastive loss, which aligns the text and vision representations in a joint multimodal space. The contrastive loss ensures that inter-modality representation distances are maintained, so that vision and language representations of similar samples are close in the shared multimodal space. We evaluate the proposed model on three tasks, namely, priority classification of input observations, observation assessment, and observation categorization. Our experiments show that inspection scene images and textual descriptions provide complementary information, signifying the importance of both modalities. Furthermore, the use of the joint contrastive loss produces strong multimodal representations and outperforms a simple baseline model on the fusion tasks. In addition, we train and release a large transformer-based language model for the Greek language based on the ELECTRA architecture.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
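
As a rough illustration of the kind of joint text-image alignment described above, the following sketch shows a symmetric contrastive loss over a shared embedding space; all names and the temperature value are illustrative assumptions and do not reproduce the paper's implementation.

```python
# Hypothetical sketch of a symmetric text-image contrastive loss that aligns
# projections in a shared multimodal space (names are illustrative).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """text_emb, image_emb: (batch, dim) projections into the shared space."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching text/image pairs sit on the diagonal; pull them together and
    # push non-matching pairs apart, symmetrically in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Combined with a task loss (e.g. priority classification) as something like:
# loss = task_loss + lambda_contrastive * contrastive_alignment_loss(t, v)
```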

9 pages, 1445 KiB  
Article
Multi-Modal Emotion Recognition Using Speech Features and Text-Embedding
by Sung-Woo Byun, Ju-Hee Kim and Seok-Pil Lee
Appl. Sci. 2021, 11(17), 7967; https://0-doi-org.brum.beds.ac.uk/10.3390/app11177967 - 28 Aug 2021
Cited by 6 | Viewed by 2720
Abstract
Recently, intelligent personal assistants, chat-bots, and AI speakers have been utilized more broadly as communication interfaces, and the demand for more natural interaction has increased as well. Humans can express emotions in various ways, such as through voice tone or facial expressions; therefore, multimodal approaches to recognizing human emotions have been studied. In this paper, we propose an emotion recognition method that delivers higher accuracy by using both speech and text data, exploiting the strengths of each. We extracted 43 feature vectors, such as spectral features, harmonic features, and MFCCs, from speech datasets. In addition, 256 embedding vectors were extracted from the transcripts using a pre-trained Tacotron encoder. The acoustic feature vectors and embedding vectors were each fed into a deep learning model, which produced probabilities for the predicted output classes. The results show that the proposed model performed more accurately than previous research.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
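
To make the fusion idea concrete, here is a minimal late-fusion sketch in which the per-class probabilities of a speech-based model and a text-embedding-based model are combined by a weighted average; the label set, weight, and function names are assumptions for illustration only, not the authors' pipeline.

```python
# Illustrative late-fusion sketch: two separate models score the same utterance,
# one from acoustic features (e.g. MFCCs), one from a text embedding of the
# transcript, and their class probabilities are combined.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]            # assumed label set

def fuse_predictions(p_speech: np.ndarray,
                     p_text: np.ndarray,
                     w_speech: float = 0.5) -> str:
    """p_speech, p_text: per-class probabilities from the two unimodal models."""
    p_fused = w_speech * p_speech + (1.0 - w_speech) * p_text
    return EMOTIONS[int(np.argmax(p_fused))]

# Example with dummy probabilities from the two models:
p_speech = np.array([0.10, 0.55, 0.25, 0.10])
p_text = np.array([0.05, 0.70, 0.15, 0.10])
print(fuse_predictions(p_speech, p_text))                  # -> "happy"
```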

19 pages, 669 KiB  
Article
Emotion Identification in Movies through Facial Expression Recognition
by João Almeida, Luís Vilaça, Inês N. Teixeira and Paula Viana
Appl. Sci. 2021, 11(15), 6827; https://0-doi-org.brum.beds.ac.uk/10.3390/app11156827 - 25 Jul 2021
Cited by 8 | Viewed by 4765
Abstract
Understanding how acting bridges the emotional bond between spectators and films is essential to depicting how humans interact with this rapidly growing digital medium. In recent decades, the research community has made promising progress in developing facial expression recognition (FER) methods. However, little emphasis has been placed on cinematographic content, which is complex by nature due to the visual techniques used to convey the desired emotions. Our work represents a step towards emotion identification in cinema through the analysis of facial expressions. We present a comprehensive overview of the most relevant datasets used for FER, highlighting problems caused by their heterogeneity and by the lack of a universal model of emotions. Building upon this understanding, we evaluate these datasets with standard image classification models to analyze the feasibility of using facial expressions to determine the emotional charge of a film. To cope with the lack of datasets for the scope under analysis, we demonstrate the feasibility of using a generic dataset for the training process and propose a new way of looking at emotions by creating clusters of emotions based on the evidence obtained in the experiments.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
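
The clustering idea can be illustrated with a simple label-remapping step, in which fine-grained emotion labels are collapsed into broader groups before training or evaluation; the grouping below is hypothetical and is not the set of clusters reported in the paper.

```python
# Sketch of label remapping: individual emotion classes from a generic FER
# dataset are collapsed into broader clusters (illustrative grouping).
EMOTION_CLUSTERS = {
    "positive": {"happiness", "surprise"},
    "negative": {"anger", "disgust", "fear", "sadness"},
    "neutral": {"neutral"},
}

def to_cluster(label: str) -> str:
    """Map a fine-grained emotion label to its cluster."""
    for cluster, members in EMOTION_CLUSTERS.items():
        if label in members:
            return cluster
    raise ValueError(f"unknown emotion label: {label}")

print(to_cluster("fear"))      # -> "negative"
```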

17 pages, 1182 KiB  
Article
Multimodal Summarization of User-Generated Videos
by Theodoros Psallidas, Panagiotis Koromilas, Theodoros Giannakopoulos and Evaggelos Spyrou
Appl. Sci. 2021, 11(11), 5260; https://0-doi-org.brum.beds.ac.uk/10.3390/app11115260 - 05 Jun 2021
Cited by 8 | Viewed by 2800
Abstract
The exponential growth of user-generated content has increased the need for efficient video summarization schemes. However, most approaches underestimate the power of aural features and are designed to work mainly on commercial/professional videos. In this work, we present an approach that uses both aural and visual features in order to create video summaries from user-generated videos. Our approach produces dynamic video summaries, that is, summaries comprising the most "important" parts of the original video, arranged so as to preserve their temporal order. We use supervised knowledge from both of the aforementioned modalities and train a binary classifier, which learns to recognize the important parts of videos. Moreover, we present a novel user-generated dataset which contains videos from several categories. Every 1 s segment of each video in our dataset has been annotated by more than three annotators as being important or not. We evaluate our approach using several classification strategies based on audio, video, and fused features. Our experimental results illustrate the potential of our approach.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
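
A minimal sketch of the selection step follows: given per-second importance probabilities from a binary classifier, segments above a threshold are kept in their original temporal order; the threshold and all names are illustrative, not the authors' configuration.

```python
# Sketch: keep above-threshold 1 s segments as (start_s, end_s) spans,
# preserving temporal order, to form a dynamic summary.
import numpy as np

def select_summary_segments(importance: np.ndarray,
                            threshold: float = 0.5) -> list[tuple[int, int]]:
    """importance: per-second importance probabilities from the classifier."""
    keep = importance >= threshold
    spans, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t                      # a new important span begins
        elif not flag and start is not None:
            spans.append((start, t))       # close the span at this second
            start = None
    if start is not None:
        spans.append((start, len(keep)))
    return spans

scores = np.array([0.1, 0.7, 0.8, 0.2, 0.9, 0.9, 0.3])
print(select_summary_segments(scores))     # -> [(1, 3), (4, 6)]
```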

15 pages, 624 KiB  
Article
Face Morphing, a Modern Threat to Border Security: Recent Advances and Open Challenges
by Erion-Vasilis Pikoulis, Zafeiria-Marina Ioannou, Mersini Paschou and Evangelos Sakkopoulos
Appl. Sci. 2021, 11(7), 3207; https://0-doi-org.brum.beds.ac.uk/10.3390/app11073207 - 02 Apr 2021
Cited by 8 | Viewed by 6686
Abstract
Face morphing poses a serious threat to Automatic Border Control (ABC) and Face Recognition Systems (FRS) in general. The aim of this paper is to present a qualitative assessment of the morphing attack issue and the challenges it entails, highlighting both the technological and human aspects of the problem. After the face morphing attack scenario is presented, the paper provides an overview of the relevant bibliography and recent advances along two central directions. First, the morphing of face images is outlined, with a particular focus on the three main steps involved in the process, namely, landmark detection, face alignment, and blending. Second, the detection of morphing attacks is presented through the prism of the so-called on-line and off-line detection scenarios and according to whether the proposed techniques employ handcrafted features, using classical methods, or automatically generated features, using deep-learning-based methods. The paper then presents the evaluation metrics employed in the corresponding bibliography and concludes with a discussion of the open challenges that need to be addressed to further advance the automatic detection of morphing attacks. Despite the progress being made, the general consensus of the research community is that significant effort and resources will be needed in the near future to mitigate the issue, especially towards the creation of datasets capturing the full extent of the problem at hand and the availability of reference evaluation procedures for comparing novel automatic attack detection algorithms.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
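
Of the three morphing steps outlined above, the blending step is the easiest to illustrate: once the two face images have been warped to a common landmark geometry, the morph is a per-pixel weighted average, as in the toy sketch below (landmark detection and alignment are omitted, and all names are illustrative).

```python
# Toy sketch of the blending step of a landmark-based morph: a weighted
# average of two already-aligned face images.
import numpy as np

def blend_aligned_faces(face_a: np.ndarray,
                        face_b: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """face_a, face_b: aligned images of identical shape, values in [0, 255]."""
    morph = alpha * face_a.astype(np.float32) + (1.0 - alpha) * face_b.astype(np.float32)
    return np.clip(morph, 0, 255).astype(np.uint8)
```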

22 pages, 2895 KiB  
Article
Distracted and Drowsy Driving Modeling Using Deep Physiological Representations and Multitask Learning
by Michalis Papakostas, Kapotaksha Das, Mohamed Abouelenien, Rada Mihalcea and Mihai Burzo
Appl. Sci. 2021, 11(1), 88; https://0-doi-org.brum.beds.ac.uk/10.3390/app11010088 - 24 Dec 2020
Cited by 10 | Viewed by 3337
Abstract
In this paper, we investigate the ability of various physiological indicators to identify distracted and drowsy driving. In particular, four physiological signals are tested: blood volume pulse (BVP), respiration, skin conductance, and skin temperature. Data were collected from 45 participants under a simulated driving scenario, at different times of the day and during their engagement with a variety of physical and cognitive distractors. We explore several statistical features extracted from those signals and their efficiency in discriminating the presence or absence of each of the two conditions. To that end, we evaluate three traditional classifiers (Random Forests, KNN, and SVM), which have been extensively applied in the related literature, and compare their performance against a deep CNN-LSTM network that learns spatio-temporal physiological representations. In addition, we explore the potential of learning multiple conditions in parallel using a single machine learning model, and we discuss how such a problem could be formulated and what the benefits and disadvantages of the different approaches are. Overall, our findings indicate that information related to the BVP data, especially features that describe patterns with respect to the inter-beat intervals (IBI), is highly associated with both targeted conditions. In addition, features related to the respiratory behavior of the driver can be indicative of drowsiness, while being less associated with distraction. Moreover, spatio-temporal deep methods seem to have a clear advantage over traditional classifiers in detecting both driver conditions. Our experiments show that, even though learning both conditions jointly cannot compete directly with individual, task-specific CNN-LSTM models, deep multitask learning approaches have great potential towards that end, as they offer the second-best performance on both tasks against all other evaluated alternatives in terms of sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC).
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
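
A hypothetical sketch of a CNN-LSTM multitask model in this spirit is shown below: a shared 1-D convolutional encoder and an LSTM summarize a window of the four physiological channels, and two heads predict distraction and drowsiness jointly; all layer sizes are illustrative and do not reflect the paper's configuration.

```python
# Hypothetical CNN-LSTM multitask sketch over physiological signals
# (BVP, respiration, skin conductance, skin temperature).
import torch
import torch.nn as nn

class MultitaskCnnLstm(nn.Module):
    def __init__(self, n_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(                          # local signal patterns
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)  # temporal dynamics
        self.head_distraction = nn.Linear(hidden, 2)
        self.head_drowsiness = nn.Linear(hidden, 2)

    def forward(self, x: torch.Tensor):
        """x: (batch, n_channels, time) window of physiological signals."""
        feats = self.cnn(x).transpose(1, 2)                # -> (batch, time/2, 32)
        _, (h_n, _) = self.lstm(feats)
        last = h_n[-1]                                     # final hidden state
        return self.head_distraction(last), self.head_drowsiness(last)

model = MultitaskCnnLstm()
logits_distraction, logits_drowsiness = model(torch.randn(8, 4, 128))
```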

Review


21 pages, 429 KiB  
Review
Deep Multimodal Emotion Recognition on Human Speech: A Review
by Panagiotis Koromilas and Theodoros Giannakopoulos
Appl. Sci. 2021, 11(17), 7962; https://0-doi-org.brum.beds.ac.uk/10.3390/app11177962 - 28 Aug 2021
Cited by 23 | Viewed by 6121
Abstract
This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text, and visual information. We provide a new, descriptive categorization of methods based on the way they handle inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in either unimodal or multimodal interactions; (ii) pseudo-temporal architectures (PTA), which also assume an oversimplification of the temporal dimension, although only in one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results for the reported methodologies. Finally, we conclude this work with an in-depth analysis of the future challenges related to validation procedures, representation learning, and method robustness.
(This article belongs to the Special Issue Pattern Recognition in Multimedia Signal Analysis)
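
As a rough illustration of the temporal-architecture (TA) category, the sketch below keeps a recurrent encoder per modality and runs a second recurrent layer over the time-aligned concatenation of the unimodal states, so both intra- and inter-modality temporal dynamics are modeled; it assumes pre-aligned sequences and is not drawn from any specific reviewed method.

```python
# Illustrative "temporal architecture" sketch: per-modality LSTMs plus a
# fusion LSTM over their time-aligned states, ending in emotion logits.
import torch
import torch.nn as nn

class TemporalFusionSketch(nn.Module):
    def __init__(self, d_audio: int = 40, d_text: int = 300,
                 hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.audio_rnn = nn.LSTM(d_audio, hidden, batch_first=True)
        self.text_rnn = nn.LSTM(d_text, hidden, batch_first=True)
        self.fusion_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """audio: (batch, T, d_audio); text: (batch, T, d_text), time-aligned."""
        a, _ = self.audio_rnn(audio)                                 # intra-modality dynamics
        t, _ = self.text_rnn(text)
        fused, _ = self.fusion_rnn(torch.cat([a, t], dim=-1))        # cross-modal dynamics
        return self.classifier(fused[:, -1])                         # last fused state -> logits
```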
