
Information Theory and Language

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Multidisciplinary Applications".

Deadline for manuscript submissions: closed (31 October 2019) | Viewed by 56264

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Interests: information theory; natural language; ergodic theorems; hidden Markov process; long-range dependence; universal coding; Rényi entropies; recurrence times; subword complexity; repetitions; Kolmogorov complexity; algorithmic randomness

Guest Editor
1. URPP Language and Space, University of Zürich, Freiestrasse 16, CH-8032 Zürich, Switzerland
2. DFG Center for Advanced Studies, University of Tübingen, Rümelinstraße 23, D-72070 Tübingen, Germany
Interests: information theory; linguistic typology; language evolution

Special Issue Information

Dear Colleagues,

The historical roots of information theory lie in statistical investigations of communication in natural language during the 1950s. In the decades that followed, however, linguistics and information theory developed largely independently, due to influential non-probabilistic theories of language. Recently, statistical investigations into natural language(s) have gained momentum again, driven by progress in computational linguistics, machine learning, and cognitive science. These developments are reopening the communication channel between information theorists and language researchers. Both information theory and linguistics have made important discoveries since the 1950s. While the two frameworks have sometimes been framed as irreconcilable, we believe that they are fully compatible. We expect fruitful cross-fertilization between the two fields in the near future.

In this Special Issue, we invite researchers working at the interface of information theory and natural language to present their original and recent developments. Possible topics include but are not limited to the following:

  • Applications of information-theoretic concepts to research on natural language(s);
  • Mathematical work in information theory inspired by natural language phenomena;
  • Empirical and theoretical investigation of quantitative laws of natural language;
  • Empirical and theoretical evaluation of statistical language models.

Dr. Łukasz Dębowski
Dr. Christian Bentz
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language
  • entropy
  • mutual information
  • entropy rate
  • excess entropy
  • maximal repetition
  • power laws
  • minimum description length
  • statistical language models
  • neural networks
  • quantitative linguistics
  • linguistic typology

Published Papers (13 papers)


Editorial

Jump to: Research

6 pages, 200 KiB  
Editorial
Information Theory and Language
by Łukasz Dębowski and Christian Bentz
Entropy 2020, 22(4), 435; https://0-doi-org.brum.beds.ac.uk/10.3390/e22040435 - 11 Apr 2020
Cited by 1 | Viewed by 3585
Abstract
Human language is a system of communication [...] Full article
(This article belongs to the Special Issue Information Theory and Language)

Research

Jump to: Editorial

14 pages, 585 KiB  
Communication
The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies
by Álvaro Corral and Isabel Serra
Entropy 2020, 22(2), 224; https://0-doi-org.brum.beds.ac.uk/10.3390/e22020224 - 17 Feb 2020
Cited by 17 | Viewed by 4331
Abstract
An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no unified framework that encompasses them all. This paper presents a new perspective to establish a connection between different statistical linguistic laws. Characterizing each word type by two random variables, length (in number of characters) and absolute frequency, we show that the corresponding bivariate joint probability distribution shows a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation for the brevity-frequency phenomenon. The type-length distribution turns out to be well fitted by a gamma distribution (much better than with the previously proposed lognormal), and the conditional frequency distributions at fixed length display power-law-decay behavior with a fixed exponent α ≈ 1.4 and a characteristic-frequency crossover that scales as an inverse power δ ≈ 2.8 of length, which implies the fulfillment of a scaling law analogous to those found in the thermodynamics of critical phenomena. As a by-product, we find a possible model-free explanation for the origin of Zipf’s law, which should arise as a mixture of conditional frequency distributions governed by the crossover length-dependent frequency. Full article
(This article belongs to the Special Issue Information Theory and Language)
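
A minimal Python sketch of the bivariate characterization described above, using a toy tokenized text and the SciPy gamma fit as stand-ins (this is not the authors' code): each word type is mapped to its length and absolute frequency, and a gamma distribution is fitted to the type-length marginal.

    # Illustrative sketch: characterize each word type by (length, frequency)
    # and fit a gamma distribution to the type-length marginal.
    from collections import Counter
    import numpy as np
    from scipy import stats

    def type_length_frequency(tokens):
        """Per-type (length, absolute frequency) arrays for a token list."""
        freq = Counter(tokens)
        lengths = np.array([len(w) for w in freq])
        counts = np.array(list(freq.values()))
        return lengths, counts

    tokens = "the quick brown fox jumps over the lazy dog the fox".split()
    lengths, counts = type_length_frequency(tokens)

    # Marginal 1: type-length distribution, fitted here by a gamma distribution.
    shape, loc, scale = stats.gamma.fit(lengths, floc=0)
    print(f"gamma shape={shape:.2f}, scale={scale:.2f}")

    # Conditional view: frequencies of the types at a fixed length.
    print("counts of 3-character types:", counts[lengths == 3])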

34 pages, 1032 KiB  
Article
Asymptotic Analysis of the kth Subword Complexity
by Lida Ahmadi and Mark Daniel Ward
Entropy 2020, 22(2), 207; https://0-doi-org.brum.beds.ac.uk/10.3390/e22020207 - 12 Feb 2020
Cited by 1 | Viewed by 2381
Abstract
Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repeated patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memoryless sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This enables us to obtain an exact expression for the two moments in terms of the patterns’ autocorrelation and correlation polynomials. We then investigate the asymptotic behavior for k = Θ(log n). In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis. Full article
(This article belongs to the Special Issue Information Theory and Language)
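
Since the quantity itself is easy to compute for a concrete string, a short Python sketch may help fix the definition (the example strings are made up; the paper's contribution is the asymptotic analysis, not this computation):

    # The kth subword complexity of a string: the number of distinct
    # substrings of length k that it contains.
    def kth_subword_complexity(s: str, k: int) -> int:
        return len({s[i:i + k] for i in range(len(s) - k + 1)})

    # A periodic string has low complexity; a less regular one scores higher.
    print(kth_subword_complexity("010101010101", 3))  # 2 distinct: '010', '101'
    print(kth_subword_complexity("011010011001", 3))  # 6 distinct substrings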

16 pages, 519 KiB  
Article
Criticality in Pareto Optimal Grammars?
by Luís F. Seoane and Ricard Solé
Entropy 2020, 22(2), 165; https://0-doi-org.brum.beds.ac.uk/10.3390/e22020165 - 31 Jan 2020
Cited by 3 | Viewed by 2729
Abstract
What are relevant levels of description when investigating human language? How are these levels connected to each other? Does one description blend smoothly into the next, such that different models lie naturally along a hierarchy, each contained within the next? Or, instead, are there sharp transitions between one description and the next, such that to gain a little more accuracy it is necessary to change our framework radically? Do different levels describe the same linguistic aspects with increasing (or decreasing) accuracy? Historically, answers to these questions were guided by intuition and resulted in subfields of study, from phonetics to syntax and semantics. The need for research at each level is acknowledged, but seldom are these different aspects brought together (with notable exceptions). Here, we propose a methodology to inspect empirical corpora systematically, and to extract from them, blindly, relevant phenomenological scales and interactions between them. Our methodology is rigorously grounded in information theory, multi-objective optimization, and statistical physics. Salient levels of linguistic description are readily interpretable in terms of energies, entropies, phase transitions, or criticality. Our results suggest a critical point in the description of human language, indicating that several complementary models are simultaneously necessary (and unavoidable) to describe it. Full article
(This article belongs to the Special Issue Information Theory and Language)

14 pages, 636 KiB  
Article
A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics
by Martin Gerlach and Francesc Font-Clos
Entropy 2020, 22(1), 126; https://0-doi-org.brum.beds.ac.uk/10.3390/e22010126 - 20 Jan 2020
Cited by 33 | Viewed by 7893
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval. Full article
(This article belongs to the Special Issue Information Theory and Language)
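
The kind of processing the corpus supports can be sketched in a few lines of Python (an illustration only, not the SPGC pipeline; the file name and the simple alphabetic tokenizer are assumptions):

    # Derive two of the granularity levels mentioned above from one raw text:
    # a time series of word tokens and the corresponding word counts.
    import re
    from collections import Counter

    def tokenize(raw_text: str):
        """Lowercase alphabetic tokenization; the real pipeline may differ."""
        return re.findall(r"[a-z]+", raw_text.lower())

    with open("book.txt", encoding="utf-8") as f:   # hypothetical local file
        tokens = tokenize(f.read())                 # token time series

    counts = Counter(tokens)                        # word counts
    print("tokens:", len(tokens), "types:", len(counts))
    print(counts.most_common(5))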

23 pages, 709 KiB  
Article
How the Probabilistic Structure of Grammatical Context Shapes Speech
by Maja Linke and Michael Ramscar
Entropy 2020, 22(1), 90; https://0-doi-org.brum.beds.ac.uk/10.3390/e22010090 - 11 Jan 2020
Cited by 10 | Viewed by 3270
Abstract
Does systematic covariation in the usage patterns of forms shape the sublexical variance observed in conversational speech? We address this question in terms of a recently proposed discriminative theory of human communication, which argues that the distribution of events in communicative contexts should maintain mutual predictability between language users; we present evidence that the distributions of words in the empirical contexts in which they are learned and used are geometric, thus supporting this theory. Here, we extend this analysis to a corpus of conversational English, showing that the distribution of grammatical regularities and the sub-distributions of tokens discriminated by them are also geometric. Further analyses reveal a range of structural differences in the distribution of types in part-of-speech categories that further support the suggestion that linguistic distributions (and codes) are subcategorized by context at multiple levels of abstraction. Finally, a series of analyses of the variation in spoken language reveals that quantifiable differences in the structure of lexical subcategories appear, in turn, to systematically shape sublexical variation in the speech signal. Full article
(This article belongs to the Special Issue Information Theory and Language)
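
A rough sketch of what "geometric" means operationally (the counts below are made up, and this is not the authors' analysis): fit a geometric distribution to the token counts of the types observed in some context and compare tail probabilities.

    import numpy as np
    from scipy import stats

    counts = np.array([120, 64, 31, 17, 9, 5, 3, 2, 1, 1])  # made-up type counts
    p_hat = 1.0 / counts.mean()            # MLE for a geometric on {1, 2, ...}

    # Compare empirical and fitted survival functions at a few thresholds.
    for k in (1, 5, 20, 60):
        emp = (counts >= k).mean()
        fit = stats.geom.sf(k - 1, p_hat)  # P(X >= k) under the fitted model
        print(f"k={k:3d}  empirical={emp:.2f}  geometric={fit:.2f}")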

10 pages, 297 KiB  
Article
Approximating Information Measures for Fields
by Łukasz Dębowski
Entropy 2020, 22(1), 79; https://0-doi-org.brum.beds.ac.uk/10.3390/e22010079 - 09 Jan 2020
Cited by 4 | Viewed by 2291
Abstract
We supply corrected proofs of the invariance of completion and the chain rule for the Shannon information measures of arbitrary fields, as stated by Dębowski in 2009. Our corrected proofs rest on a number of auxiliary approximation results for Shannon information measures, which may be of independent interest. As also discussed briefly in this article, the generalized calculus of Shannon information measures for fields, including the invariance of completion and the chain rule, is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. Full article
(This article belongs to the Special Issue Information Theory and Language)
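
For readers more familiar with the random-variable setting, the properties in question have the following standard forms (a reminder only; the paper proves their analogues for information measures of arbitrary fields, where completing a field with null sets should leave the measures unchanged):

    % Chain rules in the familiar random-variable form; the field-theoretic
    % analogues are what the corrected proofs establish.
    \begin{align}
      H(X, Y)    &= H(X) + H(Y \mid X), \\
      I(X; Y, Z) &= I(X; Y) + I(X; Z \mid Y).
    \end{align}
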
19 pages, 1056 KiB  
Article
Productivity and Predictability for Measuring Morphological Complexity
by Ximena Gutierrez-Vasques and Victor Mijangos
Entropy 2020, 22(1), 48; https://0-doi-org.brum.beds.ac.uk/10.3390/e22010048 - 30 Dec 2019
Cited by 7 | Viewed by 3415
Abstract
We propose a quantitative, text-based approach for measuring the morphological complexity of a language. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since languages can be complex under one measure but simpler under another. We calculated the complexity measures in two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require the use of linguistically annotated data. Full article
(This article belongs to the Special Issue Information Theory and Language)
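
A toy Python sketch of the predictability dimension, assuming a plain character-bigram model as a stand-in for the sub-word language model used in the paper (the actual estimator differs):

    # Empirical per-character conditional entropy of word-internal structure,
    # estimated from a character bigram model with word-boundary markers.
    import math
    from collections import Counter

    def char_bigram_entropy_rate(words):
        unigrams, bigrams = Counter(), Counter()
        for w in words:
            w = "<" + w + ">"                  # boundary markers
            unigrams.update(w[:-1])
            bigrams.update(zip(w, w[1:]))
        total, n = 0.0, 0
        for (a, b), c in bigrams.items():
            total += c * math.log2(c / unigrams[a])
            n += c
        return -total / n                      # bits per character

    print(char_bigram_entropy_rate(["walk", "walked", "walking", "walks"]))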

15 pages, 2158 KiB  
Article
Entropy Rate Estimation for English via a Large Cognitive Experiment Using Mechanical Turk
by Geng Ren, Shuntaro Takahashi and Kumiko Tanaka-Ishii
Entropy 2019, 21(12), 1201; https://0-doi-org.brum.beds.ac.uk/10.3390/e21121201 - 06 Dec 2019
Cited by 3 | Viewed by 3195
Abstract
The entropy rate h of a natural language quantifies the complexity underlying the language. While recent studies have used computational approaches to estimate this rate, their results rely fundamentally on the performance of the language model used for prediction. On the other hand, in 1951, Shannon conducted a cognitive experiment to estimate the rate without the use of any such artifact. Shannon’s experiment, however, used only one subject, bringing into question the statistical validity of his value of h = 1.3 bits per character for the English language entropy rate. In this study, we conducted Shannon’s experiment on a much larger scale to reevaluate the entropy rate h via Amazon’s Mechanical Turk, a crowd-sourcing service. The online subjects recruited through Mechanical Turk were each asked to guess the succeeding character after being given the preceding characters until obtaining the correct answer. We collected 172,954 character predictions and analyzed these predictions with a bootstrap technique. The analysis suggests that a large number of character predictions per context length, perhaps as many as 10³, would be necessary to obtain a convergent estimate of the entropy rate, and if fewer predictions are used, the resulting h value may be underestimated. Our final entropy estimate was h ≈ 1.22 bits per character. Full article
(This article belongs to the Special Issue Information Theory and Language)
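
The classical calculation behind such guessing experiments can be sketched as follows (made-up counts; the paper's bootstrap and extrapolation analysis goes well beyond this): the entropy of the distribution over the number of guesses needed gives a Shannon-style upper-bound estimate of the entropy rate.

    import math

    def guess_distribution_entropy(guess_counts):
        """Entropy (bits) of the distribution over the number of guesses needed."""
        total = sum(guess_counts)
        probs = [c / total for c in guess_counts if c > 0]
        return -sum(p * math.log2(p) for p in probs)

    # Made-up counts of trials resolved on the 1st, 2nd, 3rd, ... guess.
    counts = [1040, 310, 150, 90, 60, 40, 30, 20, 10]
    print(f"{guess_distribution_entropy(counts):.2f} bits per character")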

21 pages, 1323 KiB  
Article
Semantic Entropy in Language Comprehension
by Noortje J. Venhuizen, Matthew W. Crocker and Harm Brouwer
Entropy 2019, 21(12), 1159; https://0-doi-org.brum.beds.ac.uk/10.3390/e21121159 - 27 Nov 2019
Cited by 16 | Viewed by 6370
Abstract
Language is processed on a more or less word-by-word basis, and the processing difficulty induced by each word is affected by our prior linguistic experience as well as our general knowledge about the world. Surprisal and entropy reduction have been independently proposed as linking theories between word processing difficulty and probabilistic language models. Extant models, however, are typically limited to capturing linguistic experience and hence cannot account for the influence of world knowledge. A recent comprehension model by Venhuizen, Crocker, and Brouwer (2019, Discourse Processes) improves upon this situation by instantiating a comprehension-centric metric of surprisal that integrates linguistic experience and world knowledge at the level of interpretation and combines them in determining online expectations. Here, we extend this work by deriving a comprehension-centric metric of entropy reduction from this model. In contrast to previous work, which has found that surprisal and entropy reduction are not easily dissociated, we do find a clear dissociation in our model. While both surprisal and entropy reduction derive from the same cognitive process—the word-by-word updating of the unfolding interpretation—they reflect different aspects of this process: state-by-state expectation (surprisal) versus end-state confirmation (entropy reduction). Full article
(This article belongs to the Special Issue Information Theory and Language)
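
The standard language-model formulations that the comprehension-centric metrics generalize can be written as follows (a reminder of the usual definitions, not the paper's interpretation-level versions):

    % Surprisal of the word at position t, and entropy reduction as the change
    % in uncertainty over continuations after processing that word.
    \begin{align}
      \mathrm{surprisal}(w_t) &= -\log P(w_t \mid w_1, \ldots, w_{t-1}), \\
      \Delta H_t &= H_{t-1} - H_t, \quad
      H_t = -\sum_{c} P(c \mid w_1, \ldots, w_t) \log P(c \mid w_1, \ldots, w_t),
    \end{align}
    % where c ranges over possible continuations (or interpretations) of the prefix.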

16 pages, 1064 KiB  
Article
Linguistic Laws in Speech: The Case of Catalan and Spanish
by Antoni Hernández-Fernández, Iván G. Torre, Juan-María Garrido and Lucas Lacasa
Entropy 2019, 21(12), 1153; https://0-doi-org.brum.beds.ac.uk/10.3390/e21121153 - 26 Nov 2019
Cited by 14 | Viewed by 5824
Abstract
In this work we consider Glissando Corpus—an oral corpus of Catalan and Spanish—and empirically analyze the presence of the four classical linguistic laws (Zipf’s law, Herdan’s law, Brevity law, and Menzerath–Altmann’s law) in oral communication, and further complement this with the analysis of two recently formulated laws: lognormality law and size-rank law. By aligning the acoustic signal of speech production with the speech transcriptions, we are able to measure and compare the agreement of each of these laws when measured in both physical and symbolic units. Our results show that these six laws are recovered in both languages but considerably more emphatically so when these are examined in physical units, hence reinforcing the so-called ‘physical hypothesis’ according to which linguistic laws might indeed have a physical origin and the patterns recovered in written texts would, therefore, be just a byproduct of the regularities already present in the acoustic signals of oral communication. Full article
(This article belongs to the Special Issue Information Theory and Language)
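
As a small illustration of testing the brevity law in physical units (hypothetical aligned data, not the Glissando corpus analysis): more frequent word types should tend to have shorter acoustic durations, i.e., a negative rank correlation.

    from collections import Counter, defaultdict
    from statistics import median
    from scipy.stats import spearmanr

    # Hypothetical alignment of word tokens with their durations in seconds.
    aligned = [("the", 0.11), ("cat", 0.25), ("the", 0.10), ("sat", 0.28),
               ("on", 0.14), ("the", 0.09), ("mat", 0.30), ("cat", 0.24)]

    freq = Counter(w for w, _ in aligned)
    durations = defaultdict(list)
    for w, d in aligned:
        durations[w].append(d)

    types = sorted(freq)
    rho, p = spearmanr([freq[w] for w in types],
                       [median(durations[w]) for w in types])
    print(f"Spearman rho = {rho:.2f} (brevity law predicts rho < 0)")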

27 pages, 1391 KiB  
Article
Estimating Predictive Rate–Distortion Curves via Neural Variational Inference
by Michael Hahn and Richard Futrell
Entropy 2019, 21(7), 640; https://0-doi-org.brum.beds.ac.uk/10.3390/e21070640 - 28 Jun 2019
Cited by 7 | Viewed by 4265
Abstract
The Predictive Rate–Distortion curve quantifies the trade-off between compressing information about the past of a stochastic process and predicting its future accurately. Existing estimation methods for this curve work by clustering finite sequences of observations or by utilizing analytically known causal states. Neither type of approach scales to processes such as natural languages, which have large alphabets and long dependencies, and where the causal states are not known analytically. We describe Neural Predictive Rate–Distortion (NPRD), an estimation method that scales to such processes, leveraging the universal approximation capabilities of neural networks. Taking only time series data as input, the method computes a variational bound on the Predictive Rate–Distortion curve. We validate the method on processes where Predictive Rate–Distortion is analytically known. As an application, we provide bounds on the Predictive Rate–Distortion of natural language, improving on bounds provided by clustering sequences. Based on the results, we argue that the Predictive Rate–Distortion curve is more useful than the usual notion of statistical complexity for characterizing highly complex processes such as natural language. Full article
(This article belongs to the Special Issue Information Theory and Language)
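
One common way to write the trade-off that such methods bound (a sketch of the generic objective, not necessarily the paper's exact formulation): a summary Z of the past should be as compressed as possible while keeping the prediction loss on the future below a distortion level d.

    \begin{align}
      R(d) \;=\; \min_{P(Z \mid X_{\mathrm{past}})}\; I(X_{\mathrm{past}}; Z)
      \quad \text{subject to} \quad
      \mathbb{E}\!\left[-\log P(X_{\mathrm{future}} \mid Z)\right] \le d .
    \end{align}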

18 pages, 2123 KiB  
Article
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
by Alexander Koplenig, Sascha Wolfer and Carolin Müller-Spitzer
Entropy 2019, 21(5), 464; https://0-doi-org.brum.beds.ac.uk/10.3390/e21050464 - 03 May 2019
Cited by 8 | Viewed by 4482
Abstract
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences, where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages. Full article
(This article belongs to the Special Issue Information Theory and Language)
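
A short Python sketch of the underlying quantity, assuming the Havrda–Charvát/Tsallis family as the generalized entropy of order α (the paper's similarity measure is a divergence built on such entropies, which this sketch omits):

    # Generalized entropy of order alpha from word counts:
    #   H_alpha(p) = (sum_i p_i**alpha - 1) / (1 - alpha),
    # recovering the Shannon entropy (in nats) as alpha -> 1.
    import math
    from collections import Counter

    def generalized_entropy(counts, alpha):
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        if abs(alpha - 1.0) < 1e-9:
            return -sum(p * math.log(p) for p in probs)
        return (sum(p ** alpha for p in probs) - 1.0) / (1.0 - alpha)

    word_counts = Counter("the cat sat on the mat the end".split())
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, round(generalized_entropy(word_counts, alpha), 3))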
