Novel Methods and Applications in Natural Language Processing

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Processes".

Deadline for manuscript submissions: closed (31 March 2023) | Viewed by 85300

Special Issue Editor


Dr. Kostas Stefanidis
Guest Editor
Faculty of Information Technology and Communication Sciences (ITC), Tampere University, Kalevantie 4, 33100 Tampere, Finland
Interests: big data management; personalization; recommender systems; entity resolution; data exploration; data analytics; responsible data management

Special Issue Information

Dear Colleagues,

Millions of people use natural-language interfaces in many ways and on many devices. This Special Issue on “Novel Methods and Applications in Natural Language Processing” aims to advance natural language processing approaches and make them more widely useful. Building natural language interfaces over data has attracted interest from several areas, including databases, machine learning, and human–computer interaction, offering a rich space of solutions.

Topics of interest include but are not limited to the following: 

  • Parsing and grammatical formalisms
  • Lexical semantics
  • Linguistic resources
  • Statistical and knowledge-based methods
  • Machine translation
  • Dialog systems 
  • Conversational recommendations 
  • Speech recognition and synthesis
  • Computational linguistics and AI
  • Semantics and natural language processing 
  • Sentiment analysis
  • Multilingual natural language processing
  • Personalized natural language processing 
  • Negation processing 
  • Irony or sarcasm detection 
  • Emotion mining in social media
  • Cyber-bullying detection
  • Evaluation of natural language processing approaches

Dr. Kostas Stefanidis
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (27 papers)

Research

10 pages, 254 KiB  
Article
Findings on Ad Hoc Contractions
by Sing Choi and Kazem Taghva
Information 2023, 14(7), 391; https://0-doi-org.brum.beds.ac.uk/10.3390/info14070391 - 10 Jul 2023
Viewed by 902
Abstract
Abbreviations are often overlooked, since their frequency and acceptance are almost second nature in everyday communication. Business names, handwritten notes, online messaging, professional domains, and different languages all have their own sets of abbreviations. The abundance and frequent introduction of new abbreviations create multiple areas of overlap and ambiguity, which means documents often lose their clarity. We reverse engineered the process of creating these ad hoc abbreviations and revealed some preliminary statistics on what makes them easier or harder to define. In addition, we generated candidate definitions among which a word sense disambiguation model found it difficult to select the correct one. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

23 pages, 809 KiB  
Article
Multi-Task Romanian Email Classification in a Business Context
by Alexandru Dima, Stefan Ruseti, Denis Iorga, Cosmin Karl Banica and Mihai Dascalu
Information 2023, 14(6), 321; https://0-doi-org.brum.beds.ac.uk/10.3390/info14060321 - 03 Jun 2023
Cited by 2 | Viewed by 1827
Abstract
Email classification systems are essential for handling and organizing the massive flow of communication, especially in a business context. Although many solutions exist, the lack of standardized classification categories limits their applicability. Furthermore, the lack of Romanian language business-oriented public datasets makes the development of such solutions difficult. To this end, we introduce a versatile automated email classification system based on a novel public dataset of 1447 manually annotated Romanian business-oriented emails. Our corpus is annotated with 5 token-related labels, as well as 5 sequence-related classes. We establish a strong baseline using pre-trained Transformer models for token classification and multi-task classification, achieving F1-scores of 0.752 and 0.764, respectively. We publicly release our code together with the dataset of labeled emails. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

20 pages, 652 KiB  
Article
Quantifying the Dissimilarity of Texts
by Benjamin Shade and Eduardo G. Altmann
Information 2023, 14(5), 271; https://0-doi-org.brum.beds.ac.uk/10.3390/info14050271 - 02 May 2023
Viewed by 2133
Abstract
Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts—vocabularies, word frequency distributions, and vector embeddings—and three simple tasks—clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen–Shannon divergence applied to word frequencies performed strongly across all tasks, that D’s based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D’s when the two texts varied in length by a factor h. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen–Shannon divergence applied to word frequencies. We also found numerically that the Jensen–Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
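
To make the two lexical measures concrete, here is a minimal Python sketch (not the authors' code) of the Jensen–Shannon divergence between word-frequency distributions and the Jaccard distance between vocabularies; the generalised, weighted variant studied in the paper reduces to this symmetric case when the two texts get equal weight.

```python
# Illustrative sketch of two text-dissimilarity measures, not the paper's code.
from collections import Counter
from math import log2

def word_freqs(text):
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def jensen_shannon(p, q):
    vocab = set(p) | set(q)
    m = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in vocab}
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

def jaccard_distance(a, b):
    va, vb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(va & vb) / len(va | vb)

a, b = "the cat sat on the mat", "the dog lay on the rug"
print(jensen_shannon(word_freqs(a), word_freqs(b)))  # 0 only for identical texts
print(jaccard_distance(a, b))
```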

20 pages, 576 KiB  
Article
Decomposed Two-Stage Prompt Learning for Few-Shot Named Entity Recognition
by Feiyang Ye, Liang Huang, Senjie Liang and KaiKai Chi
Information 2023, 14(5), 262; https://0-doi-org.brum.beds.ac.uk/10.3390/info14050262 - 28 Apr 2023
Cited by 1 | Viewed by 2423
Abstract
Named entity recognition (NER) in a few-shot setting is an extremely challenging task, and most existing methods fail to account for the gap between NER tasks and pre-trained language models. Although prompt learning has been successfully applied in few-shot classification tasks, adapting it to token-level classification tasks such as NER presents challenges in terms of time consumption and efficiency. In this work, we propose a decomposed prompt learning NER framework for few-shot settings, decomposing the NER task into two stages: entity locating and entity typing. In training, the location information of distant labels is used to train the entity locating model. A concise but effective prompt template is built to train the entity typing model. In inference, a pipeline approach is used to handle the entire NER task, which elegantly resolves the time-consumption and efficiency problems. Specifically, a well-trained entity locating model is used to predict entity spans for each input. The input is then transformed using prompt templates, and the well-trained entity typing model is used to predict their types in a single step. Experimental results demonstrate that our framework outperforms previous prompt-based methods by an average of 2.3–12.9% in F1 score while achieving the best trade-off between accuracy and inference speed. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
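
The locate-then-type pipeline can be pictured as below; both models are stubbed out here as hypothetical stand-ins (not the trained models from the paper), so the sketch only shows the control flow.

```python
# Sketch of a two-stage (locate, then type) NER inference pipeline; stubs only.
def locate_entities(sentence):
    # Stage 1 (stub): a trained span detector would return (text, start, end).
    start = sentence.find("Tampere")
    return [("Tampere", start, start + len("Tampere"))]

def type_entity(sentence, span_text):
    # Stage 2 (stub): wrap the span in a prompt template; a masked language
    # model would fill the label word. The prediction here is hard-coded.
    prompt = f"{sentence} In this sentence, {span_text} is a [MASK] entity."
    return "location"

sentence = "Tampere hosts the ITC faculty."
for text, start, end in locate_entities(sentence):
    print(f"{text} [{start}:{end}] ->", type_entity(sentence, text))
```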

32 pages, 1269 KiB  
Article
Evaluation of Automatic Legal Text Summarization Techniques for Greek Case Law
by Marios Koniaris, Dimitris Galanis, Eugenia Giannini and Panayiotis Tsanakas
Information 2023, 14(4), 250; https://0-doi-org.brum.beds.ac.uk/10.3390/info14040250 - 21 Apr 2023
Cited by 4 | Viewed by 2488
Abstract
The increasing amount of legal information available online is overwhelming for both citizens and legal professionals, making it difficult and time-consuming to find relevant information and keep up with the latest legal developments. Automatic text summarization techniques can be highly beneficial as they save time, reduce costs, and lessen the cognitive load of legal professionals. However, applying these techniques to legal documents poses several challenges due to the complexity of legal documents and the lack of needed resources, especially in linguistically under-resourced languages, such as the Greek language. In this paper, we address automatic summarization of Greek legal documents. A major challenge in this area is the lack of suitable datasets in the Greek language. In response, we developed a new metadata-rich dataset consisting of selected judgments from the Supreme Civil and Criminal Court of Greece, alongside their reference summaries and category tags, tailored for the purpose of automated legal document summarization. We also adopted several state-of-the-art methods for abstractive and extractive summarization and conducted a comprehensive evaluation of the methods using both human and automatic metrics. Our results: (i) revealed that, while extractive methods exhibit average performance, abstractive methods generate moderately fluent and coherent text, but they tend to receive low scores in relevance and consistency metrics; (ii) indicated the need for metrics that better capture a legal document summary’s coherence, relevance, and consistency; (iii) demonstrated that fine-tuning BERT models on a specific upstream task can significantly improve the model’s performance. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

23 pages, 32221 KiB  
Article
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
by Tilahun Yeshambel, Josiane Mothe and Yaregal Assabie
Information 2023, 14(3), 195; https://0-doi-org.brum.beds.ac.uk/10.3390/info14030195 - 20 Mar 2023
Cited by 5 | Viewed by 3715
Abstract
Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language. Usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used methods for word embeddings, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings. We also analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis on the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than stem-based and root-based text representations, and fastText outperforms the other word embeddings on the word-based corpus. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
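
As a toy illustration of embedding-based query expansion (with made-up two-dimensional vectors, not an Amharic model), each query term is expanded with its nearest neighbours by cosine similarity:

```python
# Toy query expansion over a tiny, hand-made embedding table.
import numpy as np

emb = {"bank": np.array([1.0, 0.2]), "finance": np.array([0.9, 0.3]),
       "river": np.array([0.1, 1.0]), "money": np.array([0.8, 0.4])}

def expand(query_terms, k=1):
    expanded = list(query_terms)
    for term in query_terms:
        v = emb[term]
        sims = {w: float(v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))
                for w, u in emb.items() if w != term}
        expanded += sorted(sims, key=sims.get, reverse=True)[:k]
    return expanded

print(expand(["bank"]))  # ['bank', 'finance']
```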

13 pages, 329 KiB  
Article
Assessing Fine-Grained Explicitness of Song Lyrics
by Marco Rospocher and Samaneh Eksir
Information 2023, 14(3), 159; https://0-doi-org.brum.beds.ac.uk/10.3390/info14030159 - 02 Mar 2023
Cited by 3 | Viewed by 2758
Abstract
Music plays a crucial role in our lives, with growing consumption and engagement through streaming services and social media platforms. However, caution is needed for children, who may be exposed to explicit content through songs. Initiatives such as the Parental Advisory Label (PAL) and similar labelling from streaming content providers aim to protect children from harmful content. However, so far, the labelling has been limited to tagging a song as explicit or not, without providing any additional information on the reasons for the explicitness (e.g., strong language, sexual reference). This paper addresses this issue by developing a system capable of detecting explicit song lyrics and assessing the kind of explicit content detected. The novel contributions of the work include (i) a new dataset of 4000 song lyrics annotated with five possible reasons for content explicitness and (ii) experiments with machine learning classifiers to predict explicitness and the reasons for it. The results demonstrated the feasibility of automatically detecting explicit content and the reasons for explicitness in song lyrics. This work is the first to address explicitness at this level of detail and provides a valuable contribution to the music industry, helping to protect children from exposure to inappropriate content. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

38 pages, 2392 KiB  
Article
Incremental Entity Blocking over Heterogeneous Streaming Data
by Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa and Thiago Pereira da Nóbrega
Information 2022, 13(12), 568; https://0-doi-org.brum.beds.ac.uk/10.3390/info13120568 - 05 Dec 2022
Viewed by 1534
Abstract
Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
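
The blocking step can be sketched as simple schema-agnostic token blocking (illustrative only; the paper's incremental, noise-tolerant technique with attribute selection and top-n neighbourhoods is far more elaborate):

```python
# Schema-agnostic token blocking: any token of any attribute value becomes a
# block key, so entities sharing a token land in the same block.
from collections import defaultdict

def token_blocking(entities):
    blocks = defaultdict(set)
    for eid, attrs in entities.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # keep only blocks that yield candidate comparisons
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

entities = {
    "e1": {"name": "Kostas Stefanidis", "affiliation": "Tampere University"},
    "e2": {"fullName": "K. Stefanidis", "org": "Tampere"},
}
print(token_blocking(entities))  # blocks keyed by 'stefanidis' and 'tampere'
```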

21 pages, 3549 KiB  
Article
Chinese Named Entity Recognition Based on BERT and Lightweight Feature Extraction Model
by Ruisen Yang, Yong Gan and Chenfang Zhang
Information 2022, 13(11), 515; https://0-doi-org.brum.beds.ac.uk/10.3390/info13110515 - 28 Oct 2022
Cited by 7 | Viewed by 3398
Abstract
In early named entity recognition models, most text processing focused only on the representation of individual words and character vectors, paying little attention to the semantic relationships between the preceding and following text in an utterance, which led to an inability to handle words with multiple meanings during recognition. To address this problem, most models introduce the attention mechanism of the Transformer model. However, the traditional Transformer leads to a high computational overhead due to its fully connected structure. Therefore, this paper proposes a new model, the BERT-Star-Transformer-CNN-BiLSTM-CRF model, to address the computational efficiency of the traditional Transformer. First, character vectors for the input text are dynamically generated using a BERT model pre-trained on a large-scale corpus, resolving the problem of words with multiple meanings; then, the lightweight Star-Transformer model is used as the feature extraction module to perform local feature extraction on the word vector sequence, while the CNN-BiLSTM joint model performs global feature extraction on the context in the text. The obtained feature sequences are fused. Finally, the fused feature vector sequences are input to a CRF for prediction of the final results. Experiments show that the model significantly improves precision, recall, and F1 score compared with the traditional model, while computational efficiency is improved by nearly 40%. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

11 pages, 725 KiB  
Article
Exploring the Utility of Dutch Question Answering Datasets for Human Resource Contact Centres
by Chaïm van Toledo, Marijn Schraagen, Friso van Dijk, Matthieu Brinkhuis and Marco Spruit
Information 2022, 13(11), 513; https://0-doi-org.brum.beds.ac.uk/10.3390/info13110513 - 28 Oct 2022
Viewed by 1767
Abstract
We explore the use case of question answering (QA) by a contact centre for 130,000 Dutch government employees in the domain of questions about human resources (HR). HR questions can be answered using personnel files or general documentation, with the latter being the focus of the current research. We created a Dutch HR QA dataset with over 300 questions in the format of the Squad 2.0 dataset, which distinguishes between answerable and unanswerable questions. We applied various BERT-based models, either directly or after finetuning on the new dataset. The F1-scores reached 0.47 for unanswerable questions and 1.0 for answerable questions depending on the topic; however, large variations in scores were observed. We conclude more data are needed to further improve the performance of this task. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

16 pages, 742 KiB  
Article
Shedding Light on the Dark Web: Authorship Attribution in Radical Forums
by Leonardo Ranaldi, Federico Ranaldi, Francesca Fallucchi and Fabio Massimo Zanzotto
Information 2022, 13(9), 435; https://0-doi-org.brum.beds.ac.uk/10.3390/info13090435 - 14 Sep 2022
Cited by 6 | Viewed by 2567
Abstract
Online users tend to hide their real identities by adopting different names on the Internet. On Facebook or LinkedIn, for example, people usually appear with their real names. On other standard websites, such as forums, people often use nicknames to protect their real identities. Aliases are used when users are trying to protect their anonymity. This can be a challenge to law enforcement trying to identify users who often change nicknames. In unmonitored contexts, such as the dark web, users expect strong identity protection. Thus, without censorship, these users may create parallel social networks where they can engage in potentially malicious activities that could pose security threats. In this paper, we propose a solution to the need to recognize people who anonymize themselves behind nicknames—the authorship attribution (AA) task—in the challenging context of the dark web: specifically, an English-language Islamic forum dedicated to discussions of issues related to the Islamic world and Islam, in which members of radical Islamic groups are present. We provide extensive analysis by testing models based on transformers, styles, and syntactic features. Downstream of the experiments, we show how models that analyze syntax and style perform better than pre-trained universal language models. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

19 pages, 9969 KiB  
Article
Analysis of the Correlation between Mass-Media Publication Activity and COVID-19 Epidemiological Situation in Early 2022
by Kirill Yakunin, Ravil I. Mukhamediev, Marina Yelis, Yan Kuchin, Adilkhan Symagulov, Vitaly Levashenko, Elena Zaitseva, Margulan Aubakirov, Nadiya Yunicheva, Elena Muhamedijeva, Viktors Gopejenko and Yelena Popova
Information 2022, 13(9), 434; https://0-doi-org.brum.beds.ac.uk/10.3390/info13090434 - 14 Sep 2022
Cited by 1 | Viewed by 1963
Abstract
The paper presents the results of a correlation analysis between the information trends in the electronic media of Kazakhstan and indicators of the epidemiological situation of COVID-19 according to the World Health Organization (WHO). The developed method is based on topic modeling and some other methods of processing natural language texts. The method allows for calculating the correlations between media topics, moods, the results of full-text search queries, and objective WHO data. The analysis of the results shows how the attitudes of society towards the problems of COVID-19 changed during 2021–2022. Firstly, the results reflect a steady trend of decreasing interest of electronic media in the topic of the pandemic, although to an unequal extent for different thematic groups. Secondly, there has been a tendency to shift the focus of attention to more pragmatic issues, such as remote learning problems, remote work, the impact of quarantine restrictions on the economy, etc. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

16 pages, 275 KiB  
Article
Detection of Racist Language in French Tweets
by Natalia Vanetik and Elisheva Mimoun
Information 2022, 13(7), 318; https://0-doi-org.brum.beds.ac.uk/10.3390/info13070318 - 29 Jun 2022
Cited by 4 | Viewed by 2861
Abstract
Toxic online content has become a major issue in recent years due to the exponential increase in the use of the internet. In France, there has been a significant increase in hate speech against migrant and Muslim communities following events such as Great Britain’s exit from the EU, the Charlie Hebdo attacks, and the Bataclan attacks. Therefore, the automated detection of offensive language and racism is in high demand, and it is a serious challenge. Unfortunately, fewer datasets annotated for racist speech than for general hate speech are available, especially for French. This paper attempts to bridge this gap by (1) proposing and evaluating a new dataset intended for automated racist speech detection in French; (2) performing a case study with multiple supervised models and text representations for the task of racist language detection in French; and (3) performing cross-lingual experiments. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

15 pages, 1886 KiB  
Article
A Novel Multi-View Ensemble Learning Architecture to Improve the Structured Text Classification
by Carlos Adriano Gonçalves, Adrián Seara Vieira, Célia Talma Gonçalves, Rui Camacho, Eva Lorenzo Iglesias and Lourdes Borrajo Diz
Information 2022, 13(6), 283; https://0-doi-org.brum.beds.ac.uk/10.3390/info13060283 - 01 Jun 2022
Cited by 4 | Viewed by 2879
Abstract
Multi-view ensemble learning exploits the information of data views. To test its efficiency for full text classification, a technique has been implemented where the views correspond to the document sections. For classification and prediction, we use a stacking generalization based on the idea that different learning algorithms provide complementary explanations of the data. The present study implements the stacking approach using support vector machine algorithms as the baseline and a C4.5 implementation as the meta-learner. Views are created with OHSUMED biomedical full text documents. Experimental results support the conclusion that the application of multi-view techniques to full texts significantly improves the task of text classification, providing a significant contribution to biomedical text mining research. We also have evidence to conclude that datasets enriched with text from certain sections are better than those using only titles and abstracts. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
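
The stacking setup can be sketched with scikit-learn, using SVM base learners and a decision tree as a stand-in for the C4.5 meta-learner (random toy data, not the OHSUMED views):

```python
# Stacking sketch: SVM base learners combined by a decision-tree meta-learner.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))            # toy features
y = (X[:, 0] + X[:, 10] > 0).astype(int)  # toy binary labels

base = [("view_title", SVC()), ("view_abstract", SVC())]
# In a true multi-view setup each base learner would see only its own
# section's features; both see all columns here for brevity.
clf = StackingClassifier(estimators=base, final_estimator=DecisionTreeClassifier())
clf.fit(X[:150], y[:150])
print(clf.score(X[150:], y[150:]))
```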

25 pages, 1084 KiB  
Article
Transducer Cascades for Biological Literature-Based Discovery
by Denis Maurel, Sandy Chéry, Nicole Bidoit, Philippe Chatalic, Aziza Filali, Christine Froidevaux and Anne Poupon
Information 2022, 13(5), 262; https://0-doi-org.brum.beds.ac.uk/10.3390/info13050262 - 20 May 2022
Viewed by 1875
Abstract
G protein-coupled receptors (GPCRs) control the response of cells to many signals, and as such, are involved in most cellular processes. As membrane receptors, they are accessible at the surface of the cell. GPCRs are also the largest family of membrane receptors, with more than 800 representatives in mammal genomes. For this reason, they are ideal targets for drugs. Although about one third of approved drugs target GPCRs, only about 16% of GPCRs are targeted by drugs. One of the difficulties comes from the lack of knowledge on the intra-cellular events triggered by these molecules. In the last two decades, scientists have started mapping the signaling networks triggered by GPCRs. However, it soon appeared that the system is very complex, which led to the publication of more than 320,000 scientific papers. Clearly, a human cannot take into account such massive sources of information. These papers represent a mine of information about both ontological knowledge and experimental results related to GPCRs, which has to be exploited in order to build signaling networks. The ABLISS project aims at the automatic building of GPCR networks using automated deductive reasoning, allowing all available data to be integrated. We therefore automated the extraction of network information from the literature using Natural Language Processing (NLP). We mainly focused on the experimental results about GPCRs reported in the scientific papers, as so far there is no source gathering all these experimental results. We designed a relational database in order to make them available to the scientific community later. After introducing the more general objectives of the ABLISS project, we describe the formalism in detail. We then explain the NLP program using finite state methods (Unitex graph cascades) that we implemented and discuss the facts extracted. Finally, we present the design of the relational database that stores the facts extracted from the selected papers. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

13 pages, 4631 KiB  
Article
LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
by Pengbin Fu, Daxing Liu and Huirong Yang
Information 2022, 13(5), 250; https://0-doi-org.brum.beds.ac.uk/10.3390/info13050250 - 13 May 2022
Cited by 2 | Viewed by 3171
Abstract
Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that exploits the high correlation among speech frames. Firstly, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attention module, explicitly incorporating prior knowledge to make the training process insensitive to hyperparameters and thus improve performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves a word error rate of 2.3/5.5% by language model fusion without any external data and reduces the word error rate by 17.8/9.8% compared to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

13 pages, 1141 KiB  
Article
Automatic Fake News Detection for Romanian Online News
by Marius Cristian Buzea, Stefan Trausan-Matu and Traian Rebedea
Information 2022, 13(3), 151; https://0-doi-org.brum.beds.ac.uk/10.3390/info13030151 - 14 Mar 2022
Cited by 20 | Viewed by 7549
Abstract
This paper proposes a supervised machine learning system to detect fake news in online sources published in Romanian. Additionally, this work presents a comparison of the obtained results by using recurrent neural networks based on long short-term memory and gated recurrent unit cells, a convolutional neural network, and a Bidirectional Encoder Representations from Transformers (BERT) model, namely RoBERT, a pre-trained Romanian BERT model. The deep learning architectures are compared with the results achieved by two classical classification algorithms: Naïve Bayes and Support Vector Machine. The proposed approach is based on a Romanian news corpus containing 25,841 true news items and 13,064 fake news items. The best result is over 98.20%, achieved by the convolutional neural network, which outperforms the standard classification methods and the BERT models. Moreover, irony detection and sentiment analysis systems reveal additional details about the irony phenomenon and sentiment, which are used to tackle fake news challenges. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

14 pages, 841 KiB  
Article
Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient
by Wandri Jooste, Rejwanul Haque and Andy Way
Information 2022, 13(2), 88; https://0-doi-org.brum.beds.ac.uk/10.3390/info13020088 - 14 Feb 2022
Cited by 11 | Viewed by 4015
Abstract
Neural machine translation (NMT) systems have greatly improved the quality available from machine translation (MT) compared to statistical machine translation (SMT) systems. However, these state-of-the-art NMT models need much more computing power and data than SMT models, a requirement that is unsustainable in the long run and of very limited benefit in low-resource scenarios. To some extent, model compression—more specifically state-of-the-art knowledge distillation techniques—can remedy this. In this work, we investigate knowledge distillation on a simulated low-resource German-to-English translation task. We show that sequence-level knowledge distillation can be used to train small student models on knowledge distilled from large teacher models. Part of this work examines the influence of hyperparameter tuning on model performance when lowering the number of Transformer heads or limiting the vocabulary size. Interestingly, the accuracy of these student models is higher than that of the teachers in some cases even though the student model training times are shorter in some cases. In a novel contribution, we demonstrate for a specific MT service provider that in the post-deployment phase, distilled student models can reduce emissions, as well as cost purely in monetary terms, by almost 50%. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
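
Sequence-level knowledge distillation boils down to re-labelling the source side with the teacher's outputs, as in this toy sketch (the teacher here is a placeholder function, not a real NMT model):

```python
# Toy sketch of sequence-level knowledge distillation for NMT.
def distill_corpus(teacher_translate, sources):
    # The student is trained on (source, teacher output) pairs instead of the
    # original references, which gives it a simpler target distribution.
    return [(src, teacher_translate(src)) for src in sources]

teacher_translate = lambda src: f"<teacher translation of: {src}>"  # stand-in
student_data = distill_corpus(teacher_translate, ["ein Test", "noch ein Satz"])
for pair in student_data:
    print(pair)
```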

14 pages, 1623 KiB  
Article
A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
by Lyuchao Liao, Francis Afedzie Kwofie, Zhifeng Chen, Guangjie Han, Yongqiang Wang, Yuyuan Lin and Dongmei Hu
Information 2022, 13(2), 69; https://0-doi-org.brum.beds.ac.uk/10.3390/info13020069 - 29 Jan 2022
Cited by 5 | Viewed by 2732
Abstract
Transformers have become popular in building end-to-end automatic speech recognition (ASR) systems. However, transformer ASR systems are usually trained to give output sequences in the left-to-right order, disregarding the right-to-left context. Currently, the existing transformer-based ASR systems that employ two decoders for bidirectional decoding are complex in terms of computation and optimization. The existing ASR transformer with a single decoder for bidirectional decoding requires extra methods (such as a self-mask) to resolve the problem of information leakage in the attention mechanism. This paper explores different options for the development of a speech transformer that utilizes a single decoder equipped with bidirectional context embedding (BCE) for bidirectional decoding. The decoding direction, which is set up at the input level, enables the model to attend to different directional contexts without extra decoders and also alleviates any information leakage. The effectiveness of this method was verified with a bidirectional beam search method that generates bidirectional output sequences and determines the best hypothesis according to the output score. We achieved a word error rate (WER) of 7.65%/18.97% on the clean/other LibriSpeech test set, outperforming the left-to-right decoding style in our work by 3.17%/3.47%. The results are also close to, or better than, other state-of-the-art end-to-end models. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

10 pages, 403 KiB  
Article
Performance Study on Extractive Text Summarization Using BERT Models
by Shehab Abdel-Salam and Ahmed Rafea
Information 2022, 13(2), 67; https://0-doi-org.brum.beds.ac.uk/10.3390/info13020067 - 28 Jan 2022
Cited by 34 | Viewed by 8778
Abstract
The task of summarization can be categorized into two methods, extractive and abstractive. Extractive summarization selects the salient sentences from the original document to form a summary, while abstractive summarization interprets the original document and generates the summary in its own words. The task of generating a summary, whether extractive or abstractive, has been studied with different approaches in the literature, including statistical-, graph-, and deep learning-based approaches. Deep learning has achieved promising performance in comparison to the classical approaches, and with the advancement of different neural architectures such as the attention network (commonly known as the transformer), there are potential areas of improvement for the summarization task. The introduction of the transformer architecture and its encoder model “BERT” produced improved performance in downstream tasks in NLP. BERT is a bidirectional encoder representation from a transformer, modeled as a stack of encoders. There are different sizes for BERT, such as BERT-base with 12 encoders and BERT-large with 24 encoders, but we focus on BERT-base for the purpose of this study. The objective of this paper is to produce a study on the performance of variants of BERT-based models on text summarization through a series of experiments, and to propose “SqueezeBERTSum”, a trained summarization model fine-tuned with the SqueezeBERT encoder variant, which achieves competitive ROUGE scores, retaining 98% of the BERTSum baseline model’s performance with 49% fewer trainable parameters. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
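
A generic extractive scorer can be sketched as below, with TF-IDF centrality standing in for the BERT sentence encoder; this is an illustration of the extractive paradigm, not the BERTSum or SqueezeBERTSum implementation.

```python
# Extractive summarization sketch: rank sentences by average similarity to the
# rest of the document and keep the top-k in original order.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(sentences, k=2):
    vectors = TfidfVectorizer().fit_transform(sentences)
    centrality = cosine_similarity(vectors).mean(axis=1)
    top = sorted(range(len(sentences)), key=lambda i: centrality[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order

doc = ["Transformers changed NLP.",
       "BERT is a stack of transformer encoders.",
       "Summaries can be extractive or abstractive.",
       "Extractive methods select salient sentences."]
print(extractive_summary(doc))
```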

19 pages, 871 KiB  
Article
Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive
by Erjon Skenderi, Jukka Huhtamäki and Kostas Stefanidis
Information 2021, 12(12), 491; https://0-doi-org.brum.beds.ac.uk/10.3390/info12120491 - 25 Nov 2021
Cited by 3 | Viewed by 2628
Abstract
In this paper, we consider the task of assigning relevant labels to studies in the social science domain. Manual labelling is an expensive process and prone to human error. Various multi-label text classification machine learning approaches have been proposed to resolve this problem. We introduce a dataset obtained from the Finnish Social Science Archive, comprising the metadata of 2968 research studies. The metadata of each study includes attributes, such as the “abstract” and the “set of labels”. We used the Bag of Words (BoW), TF-IDF term weighting and pretrained word embeddings obtained from FastText and BERT models to generate the text representations for each study’s abstract field. Our selection of multi-label classification methods includes a Naive approach, Multi-label k Nearest Neighbours (ML-kNN), Multi-Label Random Forest (ML-RF), X-BERT and Parabel. The methods were combined with the text representation techniques and their performance was evaluated on our dataset. We measured the classification accuracy of the combinations using Precision, Recall and F1 metrics. In addition, we used the Normalized Discounted Cumulative Gain to measure the label ranking performance of the selected methods combined with the text representation techniques. The results showed that the ML-RF model achieved a higher classification accuracy with the TF-IDF features and, based on the ranking score, the Parabel model outperformed the other methods. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
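
The TF-IDF plus k-nearest-neighbours combination can be sketched with scikit-learn (toy abstracts and labels, not the archive's metadata; scikit-learn's native multi-label kNN is a simplification of ML-kNN):

```python
# Multi-label classification sketch: TF-IDF features, kNN over a label matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

abstracts = ["survey on employment and income",
             "study of voting behaviour and household income",
             "employment statistics and labour survey"]
labels = [{"employment", "income"}, {"politics", "income"}, {"employment"}]

vec = TfidfVectorizer()
X = vec.fit_transform(abstracts)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # binary indicator matrix, one column per label

clf = KNeighborsClassifier(n_neighbors=2).fit(X, Y)
pred = clf.predict(vec.transform(["income and employment survey"]))
print(mlb.inverse_transform(pred))
```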

11 pages, 846 KiB  
Article
Topic Modeling for Analyzing Topic Manipulation Skills
by Seok-Ju Hwang, Yoon-Kyoung Lee, Jong-Dae Kim, Chan-Young Park and Yu-Seop Kim
Information 2021, 12(9), 359; https://0-doi-org.brum.beds.ac.uk/10.3390/info12090359 - 31 Aug 2021
Cited by 1 | Viewed by 1819
Abstract
There are many ways to communicate with people, the most representative of which is a conversation. A smooth conversation should not only be grammatically appropriate, but also deal with the subject of conversation; this is known as language ability. In the past, this ability has been evaluated by language analysis/therapy experts. However, this process is time-consuming and costly. In this study, the researchers developed the Hallym Systematic Analyzer of Korean language to automate the conversation analysis process traditionally conducted by language analysis/treatment experts. However, current morpheme analyzers or parsing analyzers can only evaluate certain elements of a conversation. Therefore, in this paper, we added the ability to analyze topic manipulation skills (the number of topics and the rate of topic maintenance) to the existing Hallym Systematic Analyzer of Korean language. The purpose of this study was to utilize the topic modeling technique to automatically evaluate topic manipulation skills. By quantitatively evaluating the topic management capabilities that were previously assessed manually, it was possible to automatically analyze language ability across a wider range of aspects. The experimental results show that the automatic analysis methodology presented in this study achieved a very high level of correlation with the judgments of language analysis/therapy professionals. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
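
The two reported measures can be approximated with off-the-shelf LDA, as in this sketch (toy English utterances, not the Korean analyzer):

```python
# Topic-manipulation sketch: assign each utterance a topic, then compute the
# number of distinct topics and the topic-maintenance rate between turns.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

utterances = ["we went to the beach on holiday",
              "the beach had warm sand and waves",
              "my new phone battery dies fast",
              "the phone screen is also too dim"]

X = CountVectorizer().fit_transform(utterances)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
topics = lda.transform(X).argmax(axis=1)  # most likely topic per utterance

num_topics = len(set(topics.tolist()))
maintenance = sum(int(a == b) for a, b in zip(topics, topics[1:])) / (len(topics) - 1)
print(num_topics, maintenance)
```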

17 pages, 333 KiB  
Article
A Study of Analogical Density in Various Corpora at Various Granularity
by Rashel Fam and Yves Lepage
Information 2021, 12(8), 314; https://0-doi-org.brum.beds.ac.uk/10.3390/info12080314 - 05 Aug 2021
Cited by 3 | Viewed by 1962
Abstract
In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on the level of form rather than the level of semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. Corpora are tokenised using several tokenisation schemes: character, sub-word and word. For the sub-word tokenisation scheme, we employ two popular sub-word models: unigram language model and byte-pair-encoding. The results show that a corpus with a higher Type-Token Ratio tends to have higher analogical density. We also observe that masking the tokens based on their frequency helps to increase the analogical density. As for the tokenisation scheme, the results show that analogical density decreases from the character level to the word level. However, this is not true when tokens are masked based on their frequencies. We find that tokenising the sentences using sub-word models and masking the least frequent tokens increases analogical density. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)
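
At the level of form, one cheap necessary condition for a formal analogy A : B :: C : D is that symbol counts balance, i.e., the counts of A plus D equal those of B plus C; the sketch below shows this Lepage-style filter (an illustration, not the paper's full counting procedure):

```python
# Necessary condition for a character-level formal analogy A : B :: C : D.
from collections import Counter

def analogy_count_test(a, b, c, d):
    # counts of every symbol must balance: A + D == B + C
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)

print(analogy_count_test("walk", "walked", "talk", "talked"))  # True
print(analogy_count_test("walk", "walked", "talk", "spoke"))   # False
```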

12 pages, 325 KiB  
Article
An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India
by Tharindu Ranasinghe and Marcos Zampieri
Information 2021, 12(8), 306; https://0-doi-org.brum.beds.ac.uk/10.3390/info12080306 - 29 Jul 2021
Cited by 26 | Viewed by 3462
Abstract
The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

18 pages, 697 KiB  
Article
Reinforcement Learning Page Prediction for Hierarchically Ordered Municipal Websites
by Petri Puustinen, Kostas Stefanidis, Jaana Kekäläinen and Marko Junkkari
Information 2021, 12(6), 231; https://0-doi-org.brum.beds.ac.uk/10.3390/info12060231 - 28 May 2021
Viewed by 2467
Abstract
Public websites offer information on a variety of topics and services and are accessed by users with varying skills in browsing such electronic document repositories. However, the complex website structure and the diversity of web browsing behavior make click prediction a challenging task. This paper presents the results of a novel reinforcement learning approach to model user browsing patterns in a hierarchically ordered municipal website. We study how accurately browsing history predicts target pages when the target pages are not the immediate next pages pointed to by hyperlinks, but appear a number of levels down the hierarchy. We compare the performance of traditional baseline classifiers against our reinforcement learning-based training algorithm. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

18 pages, 475 KiB  
Article
A Data-Driven Approach for Video Game Playability Analysis Based on Players’ Reviews
by Xiaozhou Li, Zheying Zhang and Kostas Stefanidis
Information 2021, 12(3), 129; https://0-doi-org.brum.beds.ac.uk/10.3390/info12030129 - 17 Mar 2021
Cited by 14 | Viewed by 4861
Abstract
Playability is a key concept in game studies defining the overall quality of video games. Although its definition and frameworks are widely studied, methods to analyze and evaluate the playability of video games are still limited. Using heuristics for playability evaluation has long been the mainstream, and its usefulness in detecting playability issues during game development is well acknowledged. However, such a method falls short in evaluating the overall playability of video games as published software products and in understanding the genuine needs of players. Thus, this paper proposes an approach to analyze the playability of video games by mining a large number of players’ opinions from their reviews. Guided by the game-as-system definition of playability, the approach is a data mining pipeline where sentiment analysis, binary classification, multi-label text classification, and topic modeling are sequentially performed. We also conducted a case study on a particular video game product with its 99,993 player reviews on the Steam platform. The results show that such a review-data-driven method can effectively evaluate the perceived quality of video games and enumerate their merits and defects in terms of playability. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)

19 pages, 4062 KiB  
Article
Hybrid System Combination Framework for Uyghur–Chinese Machine Translation
by Yajuan Wang, Xiao Li, Yating Yang, Azmat Anwar and Rui Dong
Information 2021, 12(3), 98; https://0-doi-org.brum.beds.ac.uk/10.3390/info12030098 - 25 Feb 2021
Cited by 5 | Viewed by 2186
Abstract
Both the statistical machine translation (SMT) model and the neural machine translation (NMT) model are representative models in Uyghur–Chinese machine translation tasks, each with its own merits. Thus, combining their advantages to further improve translation performance is a promising direction. In this paper, we present a hybrid framework for developing a system combination for the Uyghur–Chinese machine translation task that works in three layers to achieve better translation results. In the first layer, we construct various machine translation systems, including SMT and NMT. In the second layer, the outputs of multiple systems are combined to leverage the advantages of the SMT and NMT models by using a multi-source-based system combination approach and voting-based system combination approaches. Moreover, instead of selecting an individual system’s combined outputs as the final results, we transmit the outputs of the first and second layers into the final layer to make a better prediction. Experiment results on the Uyghur–Chinese translation task show that the proposed framework significantly outperforms the baseline systems in terms of both accuracy and fluency, achieving an improvement of 1.75 BLEU points over the best individual system and 0.66 BLEU points over conventional system combination methods, respectively. Full article
(This article belongs to the Special Issue Novel Methods and Applications in Natural Language Processing)