Advances in Machine Learning Methods for Natural Language Processing and Computational Linguistics

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (15 July 2022) | Viewed by 19147

Special Issue Editors


Guest Editor
Department of Computer Science, University of Vigo, 32004 Ourense, Spain
Interests: theory of formal languages; machine translation; artificial intelligence; information extraction

Guest Editor
Laboratory of Artificial Intelligence & Decision Support, INESC TEC, 4200-465 Porto, Portugal
Interests: machine learning; data mining; metalearning; knowledge discovery in databases; text mining; automatic summarization

Guest Editor
University of Caen Normandy, CNRS, GREYC UMR 6072, F-14032 Caen Cedex, France
Interests: natural language processing; information retrieval; affective computing

Special Issue Information

Dear Colleagues,

Machine learning (ML) algorithms can be used to analyze vast volumes of information, identify patterns and generate models capable of recognizing them in new data instances. This allows us to address complex tasks with the only constraint being the necessity of a suitable training database.

Furthermore, today's digital society provides access to a vast range of raw data, but also creates the need to manage them effectively. This makes natural language processing (NLP), a collective term for the automatic computational treatment of human languages, for which purely symbolic techniques show clear limitations, a popular field for exploiting ML capabilities. The same is true for computational linguistics (CL), which is more concerned with the study of linguistics.

However, this collaborative framework must be based on a formally well-informed strategy to ensure its reliability. In this context, this Special Issue focuses both on the application of ML techniques to solve NLP and CL tasks and on the generation of the linguistic resources that enable this. One example is the construction of syntactic structures without recourse to treebanks for training, which would greatly simplify the implementation of statistical parsers, especially in out-of-domain scenarios or for low-resource languages. On a more applied note, we could address the generation of models that allow efficient contextual representations, a nontrivial task when dealing with large-scale or multiple documents, but one that is essential for language understanding.

Prof. Dr. Manuel Vilares-Ferro
Prof. Dr. Pavel Brazdil
Prof. Dr. Gaël Dias
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • ML-based tools for CL and NLP
  • Domain-specific and low-resource languages
  • Generation of training resources from raw data
  • Halting conditions and over–under-fitting detection
  • Integration of symbolic and model-based processing
  • Reasoning about large and multiple documents
  • Sampling strategies

Published Papers (9 papers)

Research

21 pages, 513 KiB  
Article
The Relation Dimension in the Identification and Classification of Lexically Restricted Word Co-Occurrences in Text Corpora
by Alexander Shvets and Leo Wanner
Mathematics 2022, 10(20), 3831; https://0-doi-org.brum.beds.ac.uk/10.3390/math10203831 - 17 Oct 2022
Cited by 1 | Viewed by 1234
Abstract
The speech of native speakers is full of idiosyncrasies. Especially prominent are lexically restricted binary word co-occurrences of the type high esteem, strong tea, run [an] experiment, war break(s) out, etc. In lexicography, such co-occurrences are referred to as collocations. Due to their semi-decompositional nature, collocations are of high relevance to a large number of natural language processing applications as well as to second language learning. A substantial body of work exists on the automatic recognition of collocations in textual material and, increasingly, also on their semantic classification, even if not yet in mainstream research. In particular, classification with respect to the lexical function (LF) taxonomy, which is the most detailed semantically oriented taxonomy of collocations available to date, has proved to be of real use to human speakers and machines alike. The most recent approaches in the field are based on multilingual neural graph transformer models that use explicit syntactic dependencies. Our goal is to explore whether the extension of such a model by a semantic relation extraction network improves its classification performance or whether it already learns the corresponding semantic relations from the dependencies and the sentential contexts, such that an additional relation extraction network will not improve the overall performance. The experiments show that the semantic relation extraction layer indeed improves the overall performance of a graph transformer. However, this improvement is not very significant, so we can conclude that graph transformers already learn, to a certain extent, the semantics of the dependencies between the collocation elements.

17 pages, 471 KiB  
Article
Surfing the Modeling of pos Taggers in Low-Resource Scenarios
by Manuel Vilares Ferro, Víctor M. Darriba Bilbao, Francisco J. Ribadas Pena and Jorge Graña Gil
Mathematics 2022, 10(19), 3526; https://0-doi-org.brum.beds.ac.uk/10.3390/math10193526 - 27 Sep 2022
Viewed by 982
Abstract
The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to be still competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so in processes involving domains where the training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of pos taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.

24 pages, 365 KiB  
Article
Semi-Automatic Approaches for Exploiting Shifter Patterns in Domain-Specific Sentiment Analysis
by Pavel Brazdil, Shamsuddeen H. Muhammad, Fátima Oliveira, João Cordeiro, Fátima Silva, Purificação Silvano and António Leal
Mathematics 2022, 10(18), 3232; https://0-doi-org.brum.beds.ac.uk/10.3390/math10183232 - 06 Sep 2022
Cited by 3 | Viewed by 1176
Abstract
This paper describes two different approaches to sentiment analysis. The first is a form of symbolic approach that exploits a sentiment lexicon together with a set of shifter patterns and rules. The sentiment lexicon includes single words (unigrams) and is developed automatically by exploiting labeled examples. The shifter patterns include intensification, attenuation/downtoning and inversion/reversal and are developed manually. The second approach exploits a deep neural network, which uses a pre-trained language model. Both approaches were applied to newspaper texts in European Portuguese from the economics and finance domains. We show that the symbolic approach achieves virtually the same performance as the deep neural network. In addition, the symbolic approach provides understandable explanations, and the acquired knowledge can be communicated to others. We release the shifter patterns to motivate future research in this direction.

30 pages, 1090 KiB  
Article
Language Accent Detection with CNN Using Sparse Data from a Crowd-Sourced Speech Archive
by Veranika Mikhailava, Mariia Lesnichaia, Natalia Bogach, Iurii Lezhenin, John Blake and Evgeny Pyshkin
Mathematics 2022, 10(16), 2913; https://0-doi-org.brum.beds.ac.uk/10.3390/math10162913 - 13 Aug 2022
Cited by 4 | Viewed by 3157
Abstract
The problem of accent recognition has received a lot of attention with the development of Automatic Speech Recognition (ASR) systems. The crux of the problem is that conventional acoustic language models adapted to fit standard language corpora are unable to satisfy the recognition requirements for accented speech. In this research, we contribute to the accent recognition task for a group of up to nine European accents in English and try to provide some evidence in favor of specific hyperparameter choices for neural network models, together with the search for the best input speech signal parameters to improve the baseline accent recognition accuracy. Specifically, we used a CNN-based model trained on the audio features extracted from the Speech Accent Archive dataset, which is a crowd-sourced collection of accented speech recordings. We show that harnessing time–frequency and energy features (such as spectrogram, chromagram, spectral centroid, spectral rolloff, and fundamental frequency) to the Mel-frequency cepstral coefficients (MFCC) may increase the accuracy of the accent classification compared to the conventional feature sets of MFCC and/or raw spectrograms. Our experiments demonstrate that the greatest impact comes from amplitude mel-spectrograms on a linear scale fed into the model. Amplitude mel-spectrograms on a linear scale, which are the correlates of the audio signal energy, make it possible to produce state-of-the-art classification results and bring the recognition accuracy for English with Germanic, Romance and Slavic accents to between 0.964 and 0.987, thus outperforming existing accent classification models that use the Speech Accent Archive. We also investigated how the speech rhythm affects the recognition accuracy. Based on our preliminary experiments, we used the audio recordings in their original form (i.e., with all the pauses preserved) for other accent classification experiments.

22 pages, 457 KiB  
Article
Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders
by Francisco J. Ribadas-Pena, Shuyuan Cao and Víctor M. Darriba Bilbao
Mathematics 2022, 10(16), 2867; https://0-doi-org.brum.beds.ac.uk/10.3390/math10162867 - 11 Aug 2022
Viewed by 1450
Abstract
In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.

17 pages, 1163 KiB  
Article
Intent-Controllable Citation Text Generation
by Shing-Yun Jung, Ting-Han Lin, Chia-Hung Liao, Shyan-Ming Yuan and Chuen-Tsai Sun
Mathematics 2022, 10(10), 1763; https://0-doi-org.brum.beds.ac.uk/10.3390/math10101763 - 21 May 2022
Cited by 2 | Viewed by 1875
Abstract
We study the problem of controllable citation text generation by introducing a new concept to generate citation texts. Citation text generation, as an assistive writing approach, has drawn the attention of a number of researchers. However, current research related to citation text generation rarely addresses how to generate citation texts that satisfy the citation intents specified by the paper’s authors, especially at the beginning of paper writing. We propose a controllable citation text generation model that extends pre-trained sequence-to-sequence models, namely BART and T5, by using the citation intent as the control code to generate citation text that meets the paper authors’ citation intent. Experimental results demonstrate that our model can generate citation texts that are semantically similar to the reference citation texts and satisfy the given citation intent. Additionally, the results from human evaluation indicate that incorporating the citation intent may enable the models to generate relevant citation texts almost as scientific paper authors do, even when only a little information from the citing paper is available.

14 pages, 720 KiB  
Article
Incorporating Phrases in Latent Query Reformulation for Multi-Hop Question Answering
by Jiuyang Tang, Shengze Hu, Ziyang Chen, Hao Xu and Zhen Tan
Mathematics 2022, 10(4), 646; https://0-doi-org.brum.beds.ac.uk/10.3390/math10040646 - 19 Feb 2022
Cited by 1 | Viewed by 1594
Abstract
In multi-hop question answering (MH-QA), the machine needs to infer the answer to a given question from multiple documents. Existing models usually apply entities as basic units in the reasoning path. Then they use relevant entities (in the same sentence or document) to expand the path and update the information of these entities to finish the QA. The process might add an entity irrelevant to the answer to the graph and then lead to incorrect predictions. It is further observed that state-of-the-art methods are susceptible to reasoning chains that pivot on compound entities. To make up for this deficiency, we present a viable solution, the incorporate phrases in latent query reformulation method (IP-LQR), which incorporates phrases in the latent query reformulation to improve the cognitive ability of the proposed method for multi-hop question answering. Specifically, IP-LQR utilizes information from relevant contexts to reformulate the question in the semantic space. Then the updated query representations interact with the contexts within which the answer is hidden. We also design a semantic-augmented fusion method based on the phrase graph, which is then used to propagate the information. IP-LQR is empirically evaluated on a popular MH-QA benchmark, HotpotQA, and the results of IP-LQR consistently outperform those of the state of the art, verifying its superiority. In summary, by incorporating phrases in the latent query reformulation and employing semantic-augmented embedding fusion, our proposed model can lead to better performance on MH-QA.

25 pages, 811 KiB  
Article
MisRoBÆRTa: Transformers versus Misinformation
by Ciprian-Octavian Truică and Elena-Simona Apostol
Mathematics 2022, 10(4), 569; https://0-doi-org.brum.beds.ac.uk/10.3390/math10040569 - 12 Feb 2022
Cited by 13 | Viewed by 2862
Abstract
Misinformation is considered a threat to our democratic values and principles. The spread of such content on social media polarizes society and undermines public discourse by distorting public perceptions and generating social unrest while lacking the rigor of traditional journalism. Transformers and transfer learning have proved to be state-of-the-art methods for multiple well-known natural language processing tasks. In this paper, we propose MisRoBÆRTa, a novel transformer-based deep neural ensemble architecture for misinformation detection. MisRoBÆRTa takes advantage of two state-of-the-art transformers, i.e., BART and RoBERTa, to improve the performance of discriminating between real news and different types of fake news. We also benchmarked and evaluated the performances of multiple transformers on the task of misinformation detection. For training and testing, we used a large real-world news articles dataset (i.e., 100,000 records) labeled with 10 classes, thus addressing two shortcomings in the current research: (1) increasing the size of the dataset from small to large, and (2) moving the focus of fake news detection from binary classification to multi-class classification. For this dataset, we manually verified the content of the news articles to ensure that they were correctly labeled. The experimental results show that the accuracy of transformers on the misinformation detection problem was significantly influenced by the method employed to learn the context, dataset size, and vocabulary dimension. We observe empirically that the best accuracy performance among the classification models that use only one transformer is obtained by BART, while DistilRoBERTa obtains the best accuracy in the least amount of time required for fine-tuning and training. However, the proposed MisRoBÆRTa outperforms the other transformer models in the task of misinformation detection. To arrive at this conclusion, we performed ample ablation and sensitivity testing with MisRoBÆRTa on two datasets.

20 pages, 1245 KiB  
Article
Gulf Countries’ Citizens’ Acceptance of COVID-19 Vaccines—A Machine Learning Approach
by Amerah Alabrah, Husam M. Alawadh, Ofonime Dominic Okon, Talha Meraj and Hafiz Tayyab Rauf
Mathematics 2022, 10(3), 467; https://0-doi-org.brum.beds.ac.uk/10.3390/math10030467 - 31 Jan 2022
Cited by 14 | Viewed by 2843
Abstract
The COVID-19 pandemic created a global emergency in many sectors. The spread of the disease can be subdued through timely vaccination. The COVID-19 vaccination process in various countries is ongoing and is slowing down due to multiple factors. Many studies on European countries and the USA have been conducted and have highlighted the public’s concern that over-vaccination results in slowing the vaccination rate. Similarly, we analyzed a collection of data from the Gulf countries’ citizens’ COVID-19 vaccine-related discourse shared on social media websites, mainly via Twitter. People’s feedback regarding different types of vaccines needs to be considered to speed up the vaccination process. In this paper, the concerns of people in the Gulf countries are highlighted to lessen vaccine hesitancy. The proposed approach accurately emphasizes the Gulf region-specific concerns related to COVID-19 vaccination using machine learning (ML)-based methods. The collected data were filtered and tokenized to analyze the sentiments extracted using three different methods: Ratio, TextBlob, and VADER. The sentiment-scored data were classified into positive and negative tweeted data using a proposed LSTM method. Subsequently, to obtain more confidence in classification, the in-depth features from the proposed LSTM were extracted and given to four different ML classifiers. The Ratio, TextBlob, and VADER sentiment scores were separately provided to the LSTM and the four machine learning classifiers. The VADER sentiment scores had the best classification results using fine-KNN and Ensemble boost with 94.01% classification accuracy. Given the improved accuracy, the proposed scheme is robust and confident in classifying and determining sentiments in Twitter discourse.