Machine Learning and Natural Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 June 2021) | Viewed by 81150

Special Issue Editors


Guest Editor
Prof. Maxim Mozgovoy
Department of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-machi, Aizu-Wakamatsu 965-8580, Japan
Interests: AI for computer games; computer-assisted language learning; software engineering

Co-Guest Editor
Dr. Calkin Suero Montero
School of Educational Sciences and Psychology, University of Eastern Finland, Kuopio, Finland
Interests: educational technologies; human-language technologies; digital fabrication; user experience; user-centred design

Special Issue Information

Dear Colleagues,

Recent years have been marked by the growing availability of natural language processing technologies for practical everyday use. The rapid development of open-source instruments such as NLTK, the wide availability of training corpora, and the general recognition of computational linguistics as an established scientific and technological field have resulted in the wide adoption of language processing in a broad range of software products.

Numerous exciting results have recently been obtained with the help of machine learning technologies that make use of available datasets and corpora to support the whole range of tasks addressed in computational linguistics. These approaches are especially beneficial for regional languages, which receive state-of-the-art language processing tools as soon as the necessary corpora are developed.

However, the existing tools and corpora still cannot cover all of the needs of researchers and developers working in the area of language processing. Consequently, the whole community would benefit from new research perspectives on harnessing the available data and from the creation and adaptation of new linguistic resources aimed at advancing natural language processing technologies.

Thus, we believe that today’s scientific and technological landscape is favorable for research efforts based on a combination of machine learning and natural language processing. Important results can be achieved by relying on modern approaches and datasets, and the expected practical impact is higher than ever. This observation motivates us to propose a Special Issue of Applied Sciences dedicated specifically to applications of machine learning in natural language processing tasks. We invite both original research and review articles relevant to the proposed topic.

Prof. Maxim Mozgovoy
Dr. Calkin Suero Montero 
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Lemmatization and part-of-speech tagging
  • Word analysis
  • Syntactic, semantic, and context parsing and analysis
  • Word sense disambiguation
  • Sentence breaking
  • Named entity recognition
  • Machine translation-related tasks
  • Question answering and chatbot development
  • Discourse analysis
  • Speech synthesis and recognition
  • Information retrieval
  • Ontology
  • Corpora development and evaluation
  • Natural language generation
  • Text and speech analysis

Published Papers (21 papers)

Editorial

Jump to: Research, Review

2 pages, 189 KiB  
Editorial
Special Issue on Machine Learning and Natural Language Processing
by Maxim Mozgovoy and Calkin Suero Montero
Appl. Sci. 2022, 12(17), 8894; https://doi.org/10.3390/app12178894 - 05 Sep 2022
Viewed by 1158
Abstract
The task of processing natural language automatically has been on the radar of researchers since the dawn of computing, fostering the rise of fields such as computational linguistics and human–language technologies [...] Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

Research

Jump to: Editorial, Review

16 pages, 333 KiB  
Article
From General Language Understanding to Noisy Text Comprehension
by Buddhika Kasthuriarachchy, Madhu Chetty, Adrian Shatte and Darren Walls
Appl. Sci. 2021, 11(17), 7814; https://doi.org/10.3390/app11177814 - 25 Aug 2021
Cited by 4 | Viewed by 1721
Abstract
Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors combining and extracting various linguistic characteristics from latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks to study noisy text comprehension. Experiments are carried out for classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers from the BERT model. We show that the initial and middle layers of BERT have a better capability for capturing the key linguistic characteristics of noisy texts than its later layers. With complex predictive models, we further show that sentence vector length is less important for capturing linguistic information, and the proposed sentence vectors for noisy texts perform better than the existing state-of-the-art sentence vectors. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
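
The layer-selection idea above is easy to reproduce with off-the-shelf tooling. Below is a minimal sketch (not the authors' code) of deriving a sentence vector from a chosen BERT hidden layer by mean-pooling token states; the model name and pooling choice are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_vector(text: str, layer: int) -> torch.Tensor:
    """Mean-pool one hidden layer (0 = embeddings, 1-12 = encoder layers)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple of 13 tensors
    mask = inputs["attention_mask"].unsqueeze(-1)      # zero out padding
    layer_output = hidden_states[layer]                # (1, seq_len, 768)
    return (layer_output * mask).sum(1) / mask.sum(1)  # (1, 768)

# Per the finding above, early/middle layers often suit noisy tweets better.
vec = sentence_vector("omg this app is sooo gooood :)", layer=6)
```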

20 pages, 403 KiB  
Article
VivesDebate: A New Annotated Multilingual Corpus of Argumentation in a Debate Tournament
by Ramon Ruiz-Dolz, Montserrat Nofre, Mariona Taulé, Stella Heras and Ana García-Fornes
Appl. Sci. 2021, 11(15), 7160; https://doi.org/10.3390/app11157160 - 03 Aug 2021
Cited by 7 | Viewed by 3610
Abstract
The application of the latest Natural Language Processing breakthroughs in computational argumentation has shown promising results, which have raised the interest in this area of research. However, the available corpora with argumentative annotations are often limited to a very specific purpose or are not of adequate size to take advantage of state-of-the-art deep learning techniques (e.g., deep neural networks). In this paper, we present VivesDebate, a large, richly annotated and versatile professional debate corpus for computational argumentation research. The corpus has been created from 29 transcripts of a debate tournament in Catalan and has been machine-translated into Spanish and English. The annotation contains argumentative propositions, argumentative relations, debate interactions and professional evaluations of the arguments and argumentation. The presented corpus can be useful for research on a heterogeneous set of underlying computational argumentation tasks, such as Argument Mining, Argument Analysis, Argument Evaluation or Argument Generation, among others. All this makes VivesDebate a valuable resource for computational argumentation research within the context of massive corpora aimed at Natural Language Processing tasks. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

15 pages, 1984 KiB  
Article
MenuNER: Domain-Adapted BERT Based NER Approach for a Domain with Limited Dataset and Its Application to Food Menu Domain
by Muzamil Hussain Syed and Sun-Tae Chung
Appl. Sci. 2021, 11(13), 6007; https://doi.org/10.3390/app11136007 - 28 Jun 2021
Cited by 19 | Viewed by 4890
Abstract
Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning utilizing contextualized word embeddings from pre-trained language models has shown remarkable results for many NLP tasks, including named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) is gaining prominent attention among various contextualized word embedding models as a state-of-the-art pre-trained language model. However, it is quite expensive to train a BERT model from scratch for a new application domain, since doing so requires a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online user reviews of restaurants and propose a simple but effective approach to NER in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain, based on domain adaptation of word embeddings and fine-tuning of the popular Bi-LSTM+CRF NER network with extended feature vectors. The proposed NER approach (named ‘MenuNER’) consists of two steps: (1) domain adaptation for the target domain, i.e., further pre-training of the off-the-shelf BERT language model (BERT-base) in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the popular Bi-LSTM+CRF network for the downstream task with extended feature vectors obtained by concatenating word embeddings from the domain-adapted BERT model of the first step, character embeddings, and POS tag features. Experimental results on a handcrafted food menu corpus built from customer reviews show that the proposed approach for this domain-specific NER task, food menu named-entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the Yelp dataset for the MenuNER task. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
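
The first MenuNER step is standard domain-adaptive pre-training. Below is a hedged sketch (not the authors' code) of further masked-language-model pre-training of BERT-base on an unlabeled, domain-specific review corpus with Hugging Face Transformers; the file name and hyperparameters are illustrative assumptions.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical corpus file: one restaurant review per line.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="restaurant_reviews.txt",
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="bert-menu-adapted",
                                         num_train_epochs=3,
                                         per_device_train_batch_size=16),
                  data_collator=collator,
                  train_dataset=dataset)
trainer.train()
# In the second, supervised step, the adapted encoder's word embeddings are
# concatenated with character and POS features and fed to a Bi-LSTM+CRF tagger.
```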

22 pages, 442 KiB  
Article
Negation Detection on Mexican Spanish Tweets: The T-MexNeg Corpus
by Gemma Bel-Enguix, Helena Gómez-Adorno, Alejandro Pimentel, Sergio-Luis Ojeda-Trueba and Brian Aguilar-Vizuet
Appl. Sci. 2021, 11(9), 3880; https://doi.org/10.3390/app11093880 - 25 Apr 2021
Cited by 6 | Viewed by 2808
Abstract
In this paper, we introduce the T-MexNeg corpus of Tweets written in Mexican Spanish. It consists of 13,704 Tweets, of which 4895 contain negation structures. We performed an analysis of negation statements embedded in the language employed on social media. This research paper aims to present the annotation guidelines along with a novel resource targeted at the negation detection task. The corpus was manually annotated with labels of negation cue, scope, and event. We report the analysis of the inter-annotator agreement for all the components of the negation structure. This resource is freely available. Furthermore, we performed various experiments to automatically identify negation using the T-MexNeg corpus and the SFU ReviewSP-NEG for training a machine learning algorithm. By comparing two different methodologies, one based on a dictionary and the other based on the Conditional Random Fields algorithm, we found that the results of negation identification on Twitter are lower when the model is trained on the SFU ReviewSP-NEG Corpus. Therefore, this paper shows the importance of having resources built specifically to deal with social media language. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
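
As a rough illustration of the CRF methodology mentioned above, the sketch below tags negation cues and scopes with BIO labels using sklearn-crfsuite; the feature set, label scheme, and toy sentence are simplified assumptions, not the paper's exact configuration.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Hand-crafted features for token i: surface form plus local context."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "prefix3": word[:3],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training pair: "no me gusta nada" with two cues and a scope.
X_train = [[token_features(["no", "me", "gusta", "nada"], i) for i in range(4)]]
y_train = [["B-CUE", "B-SCOPE", "I-SCOPE", "B-CUE"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```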

11 pages, 556 KiB  
Article
A Sequential and Intensive Weighted Language Modeling Scheme for Multi-Task Learning-Based Natural Language Understanding
by Suhyune Son, Seonjeong Hwang, Sohyeun Bae, Soo Jun Park and Jang-Hwan Choi
Appl. Sci. 2021, 11(7), 3095; https://doi.org/10.3390/app11073095 - 31 Mar 2021
Cited by 4 | Viewed by 2291
Abstract
Multi-task learning (MTL) approaches are actively used for various natural language processing (NLP) tasks. The Multi-Task Deep Neural Network (MT-DNN) has contributed significantly to improving the performance of natural language understanding (NLU) tasks. However, one drawback is that confusion about the language representation of various tasks arises during the training of the MT-DNN model. Inspired by the internal-transfer weighting of MTL in medical imaging, we introduce a Sequential and Intensive Weighted Language Modeling (SIWLM) scheme. The SIWLM consists of two stages: (1) sequential weighted learning (SWL), which trains a model to learn entire tasks sequentially and concentrically, and (2) intensive weighted learning (IWL), which enables the model to focus on the central task. We apply this scheme to the MT-DNN model and call this model the MTDNN-SIWLM. Our model achieves higher performance than the existing reference algorithms on six out of the eight GLUE benchmark tasks. Moreover, our model outperforms MT-DNN by 0.77 points on average across all tasks. Finally, we conduct a thorough empirical investigation to determine the optimal weight for each GLUE task. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
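
The weighting idea can be pictured as a task-weighted sum of losses. The sketch below is an interpretation of the intensive weighted learning (IWL) stage, up-weighting a central task during multi-task training; the weights and task names are illustrative assumptions.

```python
import torch

def weighted_mtl_loss(task_losses: dict[str, torch.Tensor],
                      central_task: str,
                      central_weight: float = 2.0) -> torch.Tensor:
    """Sum per-task losses, up-weighting the central task."""
    total = torch.zeros(())
    for name, loss in task_losses.items():
        weight = central_weight if name == central_task else 1.0
        total = total + weight * loss
    return total

# Example: three GLUE-style tasks sharing one encoder, with MNLI emphasized.
losses = {"mnli": torch.tensor(0.9), "sst2": torch.tensor(0.4), "qqp": torch.tensor(0.6)}
print(weighted_mtl_loss(losses, central_task="mnli"))  # 0.9*2 + 0.4 + 0.6 = 2.8
```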

12 pages, 512 KiB  
Article
Entity-Centric Fully Connected GCN for Relation Classification
by Jun Long, Ye Wang, Xiangxiang Wei, Zhen Ding, Qianqian Qi, Fang Xie, Zheman Qian and Wenti Huang
Appl. Sci. 2021, 11(4), 1377; https://doi.org/10.3390/app11041377 - 03 Feb 2021
Cited by 5 | Viewed by 2237
Abstract
Relation classification is an important task in the field of natural language processing, and it is one of the important steps in constructing a knowledge graph, as it can greatly reduce the cost of construction. The Graph Convolutional Network (GCN) is an effective model for accurate relation classification, which models the dependency tree of textual instances to extract the semantic features of relation mentions. Previous GCN-based methods treat each node equally. However, different words contribute differently to expressing a certain relation, especially the entity mentions in the sentence. In this paper, a novel GCN-based relation classifier is proposed, which treats the entity nodes as two global nodes in the dependency tree. These two global nodes directly connect with all other nodes, so they can aggregate information from the whole tree with only one convolutional layer. In this way, the method not only simplifies the model but also generates expressive relation representations. Experimental results on two widely used datasets, SemEval-2010 Task 8 and TACRED, show that our model outperforms all the compared baselines in this paper, which illustrates that the model can effectively utilize the dependencies between nodes and improve the performance of relation classification. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
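
A one-layer aggregation of this kind is straightforward to express. Below is a minimal sketch (an interpretation, not the authors' code) of a GCN layer whose adjacency matrix additionally connects the two entity nodes to every other node, so a single convolution aggregates over the whole tree; sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntityCentricGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj, entity_idx):
        # h: (n, dim) node states; adj: (n, n) dependency-tree adjacency (0/1)
        adj = adj.clone()
        for e in entity_idx:              # make the entity nodes global
            adj[e, :] = 1.0
            adj[:, e] = 1.0
        adj.fill_diagonal_(1.0)           # self-loops
        deg = adj.sum(1, keepdim=True)    # degree normalization
        return torch.relu(self.linear((adj @ h) / deg))

layer = EntityCentricGCNLayer(dim=8)
h = torch.randn(5, 8)                     # 5 tokens
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 0] = 1.0               # one dependency edge, for example
out = layer(h, adj, entity_idx=[0, 3])    # tokens 0 and 3 are the entity heads
```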

20 pages, 499 KiB  
Article
Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models
by Sarang Shaikh, Sher Muhammad Daudpota, Ali Shariq Imran and Zenun Kastrati
Appl. Sci. 2021, 11(2), 869; https://doi.org/10.3390/app11020869 - 19 Jan 2021
Cited by 33 | Viewed by 4900
Abstract
Data imbalance is a frequently occurring problem in classification tasks, where the number of samples in one category exceeds the number in others. Quite often, the minority-class data are of great importance, representing concepts of interest, and are challenging to obtain in real-life scenarios and applications. Imagine a customer dataset for bank loans: the majority of instances belong to the non-defaulter class, and only a small number of customers are labeled as defaulters; however, in such highly imbalanced datasets, performance on the defaulter label matters more than performance on the non-defaulter label. A lack of sufficient data samples across all class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and AdaSyn can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which must retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using the generated text. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
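
The balancing strategy can be approximated with off-the-shelf tooling. The sketch below generates synthetic minority-class samples with a GPT-2 pipeline; the seed prompt and sampling parameters are illustrative assumptions, and the paper additionally couples generation with grammatical validation.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical seed prompt for the minority (defaulter) class.
minority_seeds = ["The loan was not repaid because"]
synthetic = []
for seed in minority_seeds:
    outputs = generator(seed, max_length=40, num_return_sequences=5,
                        do_sample=True, top_k=50)
    synthetic.extend(o["generated_text"] for o in outputs)

# After grammatical validation, the synthetic texts would be added to the
# minority class before training the classifier.
print(len(synthetic), "synthetic minority-class samples")
```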

16 pages, 5227 KiB  
Article
A Deep Learning Approach for Automatic Hate Speech Detection in the Saudi Twittersphere
by Raghad Alshalan and Hend Al-Khalifa
Appl. Sci. 2020, 10(23), 8614; https://doi.org/10.3390/app10238614 - 01 Dec 2020
Cited by 65 | Viewed by 6383
Abstract
With the rise of hate speech phenomena in the Twittersphere, significant research efforts have been undertaken in order to provide automatic solutions for detecting hate speech, varying from simple machine learning models to more complex deep neural network models. Despite this, research investigating the hate speech problem in Arabic is still limited. This paper, therefore, aimed to investigate several neural network models based on convolutional neural networks (CNN) and recurrent neural networks (RNN) to detect hate speech in Arabic tweets. It also evaluated the recent language representation model bidirectional encoder representations from transformers (BERT) on the task of Arabic hate speech detection. To conduct our experiments, we first built a new hate speech dataset that contained 9316 annotated tweets. Then, we conducted a set of experiments on two datasets to evaluate four models: CNN, gated recurrent units (GRU), CNN + GRU, and BERT. Our experimental results on our dataset and an out-of-domain dataset showed that the CNN model gave the best performance, with an F1-score of 0.79 and an area under the receiver operating characteristic curve (AUROC) of 0.89. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

11 pages, 648 KiB  
Article
Memory-Based Deep Neural Attention (mDNA) for Cognitive Multi-Turn Response Retrieval in Task-Oriented Chatbots
by Jenhui Chen, Obinna Agbodike and Lei Wang
Appl. Sci. 2020, 10(17), 5819; https://doi.org/10.3390/app10175819 - 22 Aug 2020
Cited by 7 | Viewed by 2968
Abstract
One of the important criteria used in judging the performance of a chatbot is the ability to provide meaningful and informative responses that correspond with the context of a user’s utterance. Nowadays, the number of enterprises adopting and relying on task-oriented chatbots for profit is increasing. Dialog errors and inappropriate responses to user queries by chatbots can have huge cost implications. To achieve high performance, recent AI chatbot models are increasingly adopting the Transformer positional encoding and the attention-based architecture. While the Transformer performs optimally in sequential generative chatbot models, recent studies have pointed out the occurrence of logical inconsistency and fuzzy error problems when the Transformer technique is adopted in retrieval-based chatbot models. Our investigation discovers that the encountered errors are caused by information losses. Therefore, in this paper, we address this problem by augmenting the Transformer-based retrieval chatbot architecture with a memory-based deep neural attention (mDNA) model, using an approach similar to late data fusion. The mDNA is a simple encoder-decoder neural architecture that comprises a bidirectional long short-term memory (Bi-LSTM), an attention mechanism, and a memory for information retention in the encoder. In our experiments, we trained the model extensively on a large Ubuntu dialog corpus, and recall evaluation scores show that the mDNA augmentation approach slightly outperforms selected state-of-the-art retrieval chatbot models. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

16 pages, 1191 KiB  
Article
A C-BiLSTM Approach to Classify Construction Accident Reports
by Jinyue Zhang, Lijun Zi, Yuexian Hou, Da Deng, Wenting Jiang and Mingen Wang
Appl. Sci. 2020, 10(17), 5754; https://doi.org/10.3390/app10175754 - 20 Aug 2020
Cited by 24 | Viewed by 4340
Abstract
The construction sector is widely recognized as having the most hazardous working environment among the various business sectors, and many research studies have focused on injury prevention strategies for use on construction sites. The risk-based theory emphasizes the analysis of accident causes extracted from accident reports to understand, predict, and prevent the occurrence of construction accidents. The first step in the analysis is to classify the incidents from a massive number of reports into different cause categories, a task which is usually performed on a manual basis by domain experts. The research described in this paper proposes a convolutional bidirectional long short-term memory (C-BiLSTM)-based method to automatically classify construction accident reports. The proposed approach was applied on a dataset of construction accident narratives obtained from the Occupational Safety and Health Administration website, and the results indicate that this model performs better than some of the classic machine learning models commonly used in classification tasks, including support vector machine (SVM), naïve Bayes (NB), and logistic regression (LR). The results of this study can help safety managers to develop risk management strategies. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
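
The C-BiLSTM idea combines convolutional n-gram features with recurrent ordering. Below is a minimal PyTorch sketch of such a classifier over accident-report tokens; all layer sizes and the category count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CBiLSTM(nn.Module):
    def __init__(self, vocab_size=10000, emb=100, conv=64, hidden=64, classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)  # n-gram features
        self.bilstm = nn.LSTM(conv, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, classes)    # accident-cause categories

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, (h, _) = self.bilstm(x)                   # h: (2, batch, hidden)
        return self.out(torch.cat([h[0], h[1]], dim=1))

model = CBiLSTM()
logits = model(torch.randint(0, 10000, (4, 50)))     # 4 reports, 50 tokens each
```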

13 pages, 1055 KiB  
Article
An ERNIE-Based Joint Model for Chinese Named Entity Recognition
by Yu Wang, Yining Sun, Zuchang Ma, Lisheng Gao and Yang Xu
Appl. Sci. 2020, 10(16), 5711; https://doi.org/10.3390/app10165711 - 18 Aug 2020
Cited by 21 | Viewed by 3697
Abstract
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) and the initial step in building a Knowledge Graph (KG). Recently, BERT (Bidirectional Encoder Representations from Transformers), which is a pre-training model, has achieved state-of-the-art (SOTA) results in various NLP tasks, including NER. However, Chinese NER is still a more challenging task for BERT because there are no physical separations between Chinese words, and BERT can only obtain the representations of Chinese characters. Nevertheless, Chinese NER cannot be well handled with character-level representations, because the meaning of a Chinese word is quite different from that of the characters which make up the word. ERNIE (Enhanced Representation through kNowledge IntEgration), an improved pre-training model of BERT, is more suitable for Chinese NER because it is designed to learn language representations enhanced by a knowledge masking strategy. However, the potential of ERNIE has not been fully explored: ERNIE only utilizes token-level features and ignores the sentence-level feature when performing the NER task. In this paper, we propose ERNIE-Joint, a joint model based on ERNIE. ERNIE-Joint can utilize both sentence-level and token-level features by jointly training the NER and text classification tasks. In order to use the raw NER datasets for joint training and avoid additional annotations, we perform the text classification task according to the number of entities in the sentences. The experiments are conducted on two datasets: MSRA-NER and Weibo. These datasets contain Chinese news data and Chinese social media data, respectively. The results demonstrate that ERNIE-Joint not only outperforms BERT and ERNIE but also achieves SOTA results on both datasets. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
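
The joint design amounts to two heads on one shared encoder. The sketch below is an interpretation that uses a BERT-style Chinese encoder as a stand-in for ERNIE, with a token-level NER head and a sentence-level head for the entity-count classes; the model name and label counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointNERClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese", ner_labels=7, sent_classes=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, ner_labels)      # token-level tags
        self.cls_head = nn.Linear(hidden, sent_classes)    # entity-count class

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        ner_logits = self.ner_head(states)                 # (batch, seq, tags)
        cls_logits = self.cls_head(states[:, 0])           # [CLS] sentence feature
        return ner_logits, cls_logits

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["今天北京天气很好"], return_tensors="pt")
ner_logits, cls_logits = JointNERClassifier()(batch["input_ids"], batch["attention_mask"])
# Joint training (sketch): cross-entropy over tags plus cross-entropy over the
# sentence's entity-count class, summed and backpropagated together.
```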

14 pages, 2672 KiB  
Article
A Topical Category-Aware Neural Text Summarizer
by So-Eon Kim, Nazira Kaibalina and Seong-Bae Park
Appl. Sci. 2020, 10(16), 5422; https://doi.org/10.3390/app10165422 - 05 Aug 2020
Cited by 4 | Viewed by 2894
Abstract
The advent of the sequence-to-sequence model and the attention mechanism has increased the comprehension and readability of automatically generated summaries. However, most previous studies on text summarization have focused on generating or extracting sentences only from the original text, even though every text has a latent topic category. That is, even though a topic category can help improve summarization quality, there have been no efforts to utilize such information in text summarization. Therefore, this paper proposes a novel topical category-aware neural text summarizer, differentiated from legacy neural summarizers in that it reflects the topic category of the original text when generating a summary. The proposed summarizer adopts the class activation map (CAM) as the topical influence of the words in the original text. Since the CAM excerpts the words relevant to a specific category from the text, it allows the attention mechanism to be influenced by the topic category. As a result, the proposed neural summarizer reflects both the topical information and the content information of a text in a summary by combining the attention mechanism and CAM. The experiments on The New York Times Annotated Corpus show that the proposed model outperforms the legacy attention-based sequence-to-sequence model, which proves that it is effective at reflecting a topic category in automatic summarization. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

14 pages, 280 KiB  
Article
Error Detection for Arabic Text Using Neural Sequence Labeling
by Nora Madi and Hend Al-Khalifa
Appl. Sci. 2020, 10(15), 5279; https://doi.org/10.3390/app10155279 - 30 Jul 2020
Cited by 12 | Viewed by 3557
Abstract
The English language has, thus far, received the most attention in research concerning automatic grammar error correction and detection. However, these tasks have been less investigated for other languages. In this paper, we present the first experiments using neural network models for the task of error detection for Modern Standard Arabic (MSA) text. We investigate several neural network architectures and report the evaluation results acquired by applying cross-validation on the data. All experiments involve a corpus we created and augmented; the corpus has 494 sentences (620 after augmentation). Our models achieved a maximum precision of 78.09%, recall of 83.95%, and F0.5 score of 79.62% in the error detection task using a SimpleRNN. Using an LSTM, we achieved a maximum precision of 79.21%, recall of 93.8%, and F0.5 score of 79.16%. Finally, the best results were achieved using a BiLSTM, with a maximum precision of 80.74%, recall of 85.73%, and F0.5 score of 81.55%. We compared the results of the three models to a baseline, a commercially available Arabic grammar checker (Microsoft Word 2007). The LSTM, BiLSTM, and SimpleRNN all outperformed the baseline in precision and F0.5. Our work shows preliminary results, demonstrating that neural network architectures for error detection through sequence labeling can successfully be applied to Arabic text. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
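
For reference, the F0.5 values above are the F-beta measure with beta = 0.5, which weights precision more heavily than recall — a sensible choice for error detection, where false alarms are costly. A quick sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Plugging in the rounded BiLSTM P and R reported above gives ~0.8169; the
# small deviation from the reported 81.55 presumably reflects rounding of P/R.
print(round(f_beta(0.8074, 0.8573), 4))
```
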
18 pages, 2531 KiB  
Article
UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study
by Van-Hai Vu, Quang-Phuoc Nguyen, Joon-Choul Shin and Cheol-Young Ock
Appl. Sci. 2020, 10(11), 3904; https://doi.org/10.3390/app10113904 - 04 Jun 2020
Cited by 5 | Viewed by 3139
Abstract
Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes an ambiguous-word issue in MT. To address this problem, we developed a powerful word-sense annotation system based on a combination of sub-word conditional probability and knowledge-based methods, named UTagger. We applied UTagger to UPC and used these corpora to train both statistical-based and deep learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are available for free download and usage. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
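
Scoring such systems is routine with the sacrebleu package. Below is a minimal sketch of computing corpus-level BLEU and TER; the hypothesis/reference pair is toy data, not drawn from UPC.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")
```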

10 pages, 621 KiB  
Article
Integrated Model for Morphological Analysis and Named Entity Recognition Based on Label Attention Networks in Korean
by Hongjin Kim and Harksoo Kim
Appl. Sci. 2020, 10(11), 3740; https://doi.org/10.3390/app10113740 - 28 May 2020
Cited by 4 | Viewed by 2438
Abstract
In well-spaced Korean sentences, morphological analysis is the first step in natural language processing, in which a Korean sentence is segmented into a sequence of morphemes and the parts of speech of the segmented morphemes are determined. Named entity recognition is a natural language processing task carried out to obtain morpheme sequences with specific meanings, such as person, location, and organization names. Although morphological analysis and named entity recognition are closely associated with each other, they have been independently studied and have exhibited the inevitable error propagation problem. Hence, we propose an integrated model based on label attention networks that simultaneously performs morphological analysis and named entity recognition. The proposed model comprises two layers of neural network models that are closely associated with each other. The lower layer performs a morphological analysis, whereas the upper layer performs a named entity recognition. In our experiments using a public gold-labeled dataset, the proposed model outperformed previous state-of-the-art models used for morphological analysis and named entity recognition. Furthermore, the results indicated that the integrated architecture could alleviate the error propagation problem. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

11 pages, 643 KiB  
Article
Knowledge-Grounded Chatbot Based on Dual Wasserstein Generative Adversarial Networks with Effective Attention Mechanisms
by Sihyung Kim, Oh-Woog Kwon and Harksoo Kim
Appl. Sci. 2020, 10(9), 3335; https://doi.org/10.3390/app10093335 - 11 May 2020
Cited by 17 | Viewed by 3453
Abstract
A conversation is based on internal knowledge that the participants already know or external knowledge that they have gained during the conversation. A chatbot that communicates with humans by using its internal and external knowledge is called a knowledge-grounded chatbot. Although previous studies on knowledge-grounded chatbots have achieved reasonable performance, they may still generate unsuitable responses that are not associated with the given knowledge. To address this problem, we propose a knowledge-grounded chatbot model that effectively reflects the dialogue context and given knowledge by using well-designed attention mechanisms. The proposed model uses three kinds of attention: query-context attention, query-knowledge attention, and context-knowledge attention. In our experiments with the Wizard-of-Wikipedia dataset, the proposed model showed better performance than the state-of-the-art model in a variety of measures. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
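
All three attention flows can be written as ordinary scaled dot-product attention. The sketch below is an interpretation, not the authors' code; the tensor shapes and the fusion step are illustrative assumptions.

```python
import math
import torch

def attend(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

d = 64
query = torch.randn(1, 5, d)       # current user utterance (5 tokens)
context = torch.randn(1, 20, d)    # dialogue history
knowledge = torch.randn(1, 40, d)  # retrieved knowledge sentences

q_ctx = attend(query, context, context)         # query-context attention
q_kn = attend(query, knowledge, knowledge)      # query-knowledge attention
ctx_kn = attend(context, knowledge, knowledge)  # context-knowledge attention
fused = torch.cat([q_ctx, q_kn], dim=-1)        # e.g., fuse before decoding
```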

18 pages, 423 KiB  
Article
Predicting Reputation in the Sharing Economy with Twitter Social Data
by Antonio Prada and Carlos A. Iglesias
Appl. Sci. 2020, 10(8), 2881; https://doi.org/10.3390/app10082881 - 21 Apr 2020
Cited by 9 | Viewed by 3458
Abstract
In recent years, the sharing economy has become popular, with outstanding examples such as Airbnb, Uber, or BlaBlaCar, to name a few. In the sharing economy, users provide goods and services in a peer-to-peer scheme and expose themselves to material and personal risks. Thus, an essential component of its success is its capability to build trust among strangers. This goal is achieved usually by creating reputation systems where users rate each other after each transaction. Nevertheless, these systems present challenges such as the lack of information about new users or the reliability of peer ratings. However, users leave their digital footprints on many social networks. These social footprints are used for inferring personal information (e.g., personality and consumer habits) and social behaviors (e.g., flu propagation). This article proposes to advance the state of the art on reputation systems by researching how digital footprints coming from social networks can be used to predict future behaviors on sharing economy platforms. In particular, we have focused on predicting the reputation of users in the second-hand market Wallapop based solely on their users’ Twitter profiles. The main contributions of this research are twofold: (a) a reputation prediction model based on social data; and (b) an anonymized dataset of paired users in the sharing economy site Wallapop and Twitter, which has been collected using the user self-mentioning strategy. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

15 pages, 581 KiB  
Article
Named Entity Recognition for Sensitive Data Discovery in Portuguese
by Mariana Dias, João Boné, João C. Ferreira, Ricardo Ribeiro and Rui Maia
Appl. Sci. 2020, 10(7), 2303; https://doi.org/10.3390/app10072303 - 27 Mar 2020
Cited by 24 | Viewed by 7498
Abstract
The process of protecting sensitive data is continually growing and becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text information in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security purposes. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language, combining several techniques such as rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining entity classes, two statistical models were tested, Conditional Random Fields and Random Forest, and, finally, a bidirectional LSTM approach was evaluated. Regarding the statistical models, we found that Conditional Random Fields obtains the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, SIGARRA News Corpus, and DataSense NER Corpus. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

15 pages, 995 KiB  
Article
Comprehensive Document Summarization with Refined Self-Matching Mechanism
by Biqing Zeng, Ruyang Xu, Heng Yang, Zibang Gan and Wu Zhou
Appl. Sci. 2020, 10(5), 1864; https://doi.org/10.3390/app10051864 - 09 Mar 2020
Cited by 3 | Viewed by 2381
Abstract
Under the constraints of the memory capacity of the neural network and the document length, it is difficult to generate summaries with adequate salient information. In this work, the self-matching mechanism is incorporated into the extractive summarization system at the encoder side, which allows the encoder to optimize the encoding information at the global level and effectively improves the memory capacity of a conventional LSTM. Inspired by the human coarse-to-fine understanding mode, localness is modeled by a Gaussian bias to improve contextualization for each sentence and is merged into the self-matching energy. The refined self-matching mechanism not only establishes global document attention but also perceives associations with neighboring signals. At the decoder side, the pointer network is utilized to perform a two-hop attention on context and extraction state. Evaluations on the CNN/Daily Mail dataset verify that the proposed model outperforms the strong baseline models with statistical significance. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
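
The localness idea adds a distance penalty to the self-matching energies. Below is a minimal sketch (an interpretation, not the authors' code) of self-matching attention with a Gaussian bias of the form -((i - j)^2)/(2*sigma^2); sigma and the tensor shapes are illustrative assumptions.

```python
import math
import torch

def gaussian_self_matching(h: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """h: (n, d) sentence states -> globally contextualized states."""
    n, d = h.shape
    energy = h @ h.t() / math.sqrt(d)                  # global self-matching
    pos = torch.arange(n, dtype=torch.float)
    bias = -((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2)
    return torch.softmax(energy + bias, dim=-1) @ h    # locality-aware mixing

states = torch.randn(12, 128)                # 12 sentences in a document
print(gaussian_self_matching(states).shape)  # torch.Size([12, 128])
```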

Review

Jump to: Editorial, Research

57 pages, 3272 KiB  
Review
A Survey on Machine Reading Comprehension—Tasks, Evaluation Metrics and Benchmark Datasets
by Changchang Zeng, Shaobo Li, Qin Li, Jie Hu and Jianjun Hu
Appl. Sci. 2020, 10(21), 7640; https://doi.org/10.3390/app10217640 - 29 Oct 2020
Cited by 46 | Viewed by 6962
Abstract
Machine Reading Comprehension (MRC) is a challenging Natural Language Processing (NLP) research field with wide real-world applications. The great progress of this field in recent years is mainly due to the emergence of large-scale datasets and deep learning. At present, a lot of MRC models have already surpassed human performance on various benchmark datasets despite the obvious giant gap between existing MRC models and genuine human-level reading comprehension. This shows the need to improve existing datasets, evaluation metrics, and models to move current MRC models toward “real” understanding. To address the current lack of a comprehensive survey of existing MRC tasks, evaluation metrics, and datasets, herein, (1) we analyze 57 MRC tasks and datasets and propose a more precise classification method of MRC tasks with 4 different attributes; (2) we summarize 9 evaluation metrics of MRC tasks as well as 7 attributes and 10 characteristics of MRC datasets; and (3) we discuss key open issues in MRC research and highlight future research directions. In addition, we have collected, organized, and published our data on a companion website where MRC researchers can directly access each MRC dataset, its papers, baseline projects, and the leaderboard. Full article
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
