Information Complexity in Structured Data

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Complexity".

Deadline for manuscript submissions: closed (15 February 2022) | Viewed by 10359

Special Issue Editor


Prof. Dr. Fushing Hsieh
Guest Editor
Department of Statistics, University of California, Davis, CA 95616, USA
Interests: data mechanics and integrative pattern inferences on complex systems

Special Issue Information

Structured data, including high-dimensional time series, represented in the form of a data matrix are the most ubiquitous format in data analysis. Nevertheless, fundamental questions in analyzing a data matrix remain wide open. For instance, Euclidean distance is too simplistic a measure of similarity among subjects, while correlation measures only the linear relationship between two continuous features and is not suitable for categorical features, which are considered the most fundamental data type. On the other hand, it is now well known that directional association measures based on conditional Shannon entropy capture non-linear relationships. Since they apply to all data types, such directional associations also allow us to replace linearity- or functionality-based modeling for inferential purposes while accommodating multiple response variables. It is expected that computational approaches based on Shannon entropy can resolve the ultimate task in data analysis: finding a data matrix’s full information content. Such information content surely contains dependency structures constituted collectively by all involved features. Further, patterns of such structural dependency could be identified and gathered into a collection of system states. When their temporal coordinates are recovered, the data’s complexity can be evaluated through various complexity measures, such as Lempel–Ziv complexity. From this perspective, data analysis is free from any stationarity requirement.
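As a small illustration of the last step, the sketch below counts the phrases in a Lempel–Ziv (1976) style parsing of a temporally ordered sequence of system states. It is a minimal, naive implementation under stated assumptions, not a method taken from any paper in this issue: the function names `lz76_complexity` and `_occurs` and the example state labels are hypothetical, and the search is deliberately simple rather than efficient.

```python
def _occurs(needle, haystack):
    """Return True if `needle` appears as a contiguous run inside `haystack`."""
    m = len(needle)
    return any(haystack[k:k + m] == needle for k in range(len(haystack) - m + 1))


def lz76_complexity(states):
    """Count phrases in a Lempel-Ziv (1976) style parsing of a state sequence.

    `states` is any ordered sequence of hashable labels (hypothetical input:
    the system states after their temporal coordinates have been recovered).
    """
    s = tuple(states)
    n = len(s)
    start = 0
    phrases = 0
    while start < n:
        length = 1
        # Grow the current phrase while it can still be copied from the
        # history, i.e. from everything before the phrase's last symbol.
        while start + length <= n and _occurs(s[start:start + length],
                                              s[:start + length - 1]):
            length += 1
        phrases += 1          # one new phrase (possibly cut short at the end)
        start += length
    return phrases


# Hypothetical state sequences: a repetitive one needs few phrases.
binary_example = lz76_complexity("0001101001000101")
constant_example = lz76_complexity("0000000")
```

Here the binary example parses into six phrases (0, 001, 10, 100, 1000, 101) and the constant sequence into two; a higher count for sequences of equal length indicates less repetitive, more complex state dynamics.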

Data analysis on a data matrix is far from settled. Diverse fundamental and interesting issues, arising from many real-world applications in science and industry, remain to be recognized and resolved. Once the categorical nature of all data types is taken into account, combinatorial information theory becomes critically relevant. Specifically, computational approaches based on Shannon entropy would play critical roles in every aspect of data analysis. We emphasize the discovery of visible and explainable pattern-based information content contained in a data matrix. Contributions aiming at this goal are very welcome.

This Special Issue intends to be a forum for developing and applying computational techniques of combinatorial information theory for a better understanding of real-world complex systems. All pattern-discovering approaches based on Shannon information theory are considered to be within the scope of this Special Issue.

Prof. Dr. Fushing Hsieh
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (5 papers)


Research

25 pages, 1739 KiB  
Article
Learned Practical Guidelines for Evaluating Conditional Entropy and Mutual Information in Discovering Major Factors of Response-vs.-Covariate Dynamics
by Ting-Li Chen, Hsieh Fushing and Elizabeth P. Chou
Entropy 2022, 24(10), 1382; https://0-doi-org.brum.beds.ac.uk/10.3390/e24101382 - 28 Sep 2022
Cited by 4 | Viewed by 1176
Abstract
We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. Then we resolve these topics’ data analysis tasks by discovering major factors underlying such Re-Co dynamics, making use only of the data’s categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon’s conditional entropy (CE) and mutual information (I[Re;Co]) as the two key information-theoretic measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re;Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly carry out six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
(This article belongs to the Special Issue Information Complexity in Structured Data)
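For orientation, the two measurements named in this abstract can be written in their standard textbook form on a contingency table. This is a sketch under assumptions, not the paper's own notation: the symbols n_{rc} (count in the cell for response category r and covariate category c), n_{\cdot c} (column total), n_{r\cdot} (row total), and n (grand total) are introduced here for illustration only, and the plug-in proportions are simple empirical frequencies.

```latex
\begin{align}
  H(\mathrm{Re})
    &= -\sum_{r} \frac{n_{r\cdot}}{n} \log \frac{n_{r\cdot}}{n}, \\
  \mathrm{CE} = H(\mathrm{Re} \mid \mathrm{Co})
    &= -\sum_{c} \frac{n_{\cdot c}}{n} \sum_{r} \frac{n_{rc}}{n_{\cdot c}}
       \log \frac{n_{rc}}{n_{\cdot c}}, \\
  I[\mathrm{Re};\mathrm{Co}]
    &= H(\mathrm{Re}) - H(\mathrm{Re} \mid \mathrm{Co}).
\end{align}
```

Cells with n_{rc} = 0 contribute nothing to the sums, following the convention 0 log 0 = 0. A small CE, or equivalently a large I[Re;Co], indicates that the covariate categories remove much of the uncertainty about the response.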

30 pages, 1722 KiB  
Article
Unraveling Hidden Major Factors by Breaking Heterogeneity into Homogeneous Parts within Many-System Problems
by Elizabeth P. Chou, Ting-Li Chen and Hsieh Fushing
Entropy 2022, 24(2), 170; https://0-doi-org.brum.beds.ac.uk/10.3390/e24020170 - 24 Jan 2022
Cited by 4 | Viewed by 1779
Abstract
For a large ensemble of complex systems, a Many-System Problem (MSP) studies how heterogeneity constrains and hides structural mechanisms, and how to uncover and reveal hidden major factors from homogeneous parts. All member systems in an MSP share common governing principles of dynamics, but differ in idiosyncratic characteristics. A typical dynamic is found underlying response features with respect to covariate features of quantitative or qualitative data types. Neither all-system-as-one-whole nor individual system-specific functional structures are assumed in such response-vs-covariate (Re–Co) dynamics. We developed a computational protocol for identifying various collections of major factors of various orders underlying Re–Co dynamics. We first demonstrate the immanent effects of heterogeneity among member systems, which constrain compositions of major factors and even hide essential ones. Secondly, we show that fuller collections of major factors are discovered by breaking heterogeneity into many homogeneous parts. This process further realizes Anderson’s “More is Different” phenomenon. We employ the categorical nature of all features and develop a Categorical Exploratory Data Analysis (CEDA)-based major factor selection protocol. Information theoretical measurements (conditional mutual information and entropy) are heavily used in two selection criteria: C1 (confirmable) and C2 (irreplaceable). All conditional entropies are evaluated through contingency tables with algorithmically computed reliability against the finite sample phenomenon. We study one artificially designed MSP and then two real collectives of Major League Baseball (MLB) pitching dynamics with 62 slider pitchers and 199 fastball pitchers, respectively. Finally, our MSP data analysis techniques are applied to resolve a scientific issue related to the Rosenberg Self-Esteem Scale.
(This article belongs to the Special Issue Information Complexity in Structured Data)

24 pages, 1192 KiB  
Article
Categorical Nature of Major Factor Selection via Information Theoretic Measurements
by Ting-Li Chen, Elizabeth P. Chou and Hsieh Fushing
Entropy 2021, 23(12), 1684; https://0-doi-org.brum.beds.ac.uk/10.3390/e23121684 - 15 Dec 2021
Cited by 7 | Viewed by 1964
Abstract
Without assuming any functional or distributional structure, we select collections of major factors embedded within response-versus-covariate (Re-Co) dynamics via selection criteria [C1: confirmable] and [C2: irreplaceable], which are based on information theoretic measurements. The two criteria are constructed based on the computing paradigm called Categorical Exploratory Data Analysis (CEDA) and linked to Wiener–Granger causality. All the information theoretical measurements, including conditional mutual information and entropy, are evaluated through the contingency table platform, which primarily rests on the categorical nature within all involved features of any data types: quantitative or qualitative. Our selection task identifies one chief collection, together with several secondary collections of major factors of various orders underlying the targeted Re-Co dynamics. Each selected collection is checked with algorithmically computed reliability against the finite sample phenomenon, and so is each member’s major factor individually. The developments of our selection protocol are illustrated in detail through two experimental examples: a simple one and a complex one. We then apply this protocol to two data sets pertaining to two somewhat related but distinct pitching dynamics of two pitch types: slider and fastball. In particular, we refer to a specific Major League Baseball (MLB) pitcher and we consider data of multiple seasons.
(This article belongs to the Special Issue Information Complexity in Structured Data)

34 pages, 11427 KiB  
Article
Categorical Exploratory Data Analysis: From Multiclass Classification and Response Manifold Analytics Perspectives of Baseball Pitching Dynamics
by Fushing Hsieh and Elizabeth P. Chou
Entropy 2021, 23(7), 792; https://0-doi-org.brum.beds.ac.uk/10.3390/e23070792 - 22 Jun 2021
Cited by 7 | Viewed by 2686
Abstract
All features of any data type are universally equipped with categorical nature revealed through histograms. A contingency table framed by two histograms affords directional and mutual associations based on rescaled conditional Shannon entropies for any feature-pair. The heatmap of the mutual association matrix of all features becomes a roadmap showing which features are highly associative with which features. We develop our data analysis paradigm called categorical exploratory data analysis (CEDA) with this heatmap as a foundation. CEDA is demonstrated to provide new resolutions for two topics: multiclass classification (MCC) with one single categorical response variable and response manifold analytics (RMA) with multiple response variables. We compute visible and explainable information contents with multiscale and heterogeneous deterministic and stochastic structures in both topics. MCC involves all feature-group specific mixing geometries of labeled high-dimensional point-clouds. Upon each identified feature-group, we devise an indirect distance measure, a robust label embedding tree (LET), and a series of tree-based binary competitions to discover and present asymmetric mixing geometries. Then, a chain of complementary feature-groups offers a collection of mixing geometric pattern-categories with multiple perspective views. RMA studies a system’s regulating principles via multiple dimensional manifolds jointly constituted by targeted multiple response features and selected major covariate features. This manifold is marked with categorical localities reflecting major effects. Diverse minor effects are checked and identified across all localities for heterogeneity. Both MCC and RMA information contents are computed for data’s information content with predictive inferences as by-products. We illustrate CEDA developments via Iris data and demonstrate its applications on data taken from the PITCHf/x database.
(This article belongs to the Special Issue Information Complexity in Structured Data)
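The first steps described in this abstract (histogram-based categories, a contingency table, and rescaled conditional entropies) might be sketched as follows. This is an illustrative reading, not the authors' code: the rescaling H(Y|X)/H(Y), the averaging of the two directions into a mutual association, the equal-width binning, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def categorize(values, bins=4):
    """Turn a numeric feature into categories via equal-width histogram bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # guard against a constant feature
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (natural log) of a categorical feature."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(y, x):
    """H(Y|X) computed from the contingency table of (x, y) counts."""
    n = len(x)
    table = Counter(zip(x, y))               # cell counts
    x_totals = Counter(x)                    # marginal counts of x
    return -sum(c / n * math.log(c / x_totals[xv])
                for (xv, _), c in table.items())


def directional_association(y, x):
    """Assumed rescaling H(Y|X)/H(Y): near 0 means x almost determines y."""
    hy = entropy(y)
    return conditional_entropy(y, x) / hy if hy > 0 else 0.0


def mutual_association(x, y):
    """Assumed symmetric version: the average of the two directions."""
    return 0.5 * (directional_association(y, x) + directional_association(x, y))


def association_matrix(features):
    """Pairwise mutual associations for a dict of name -> categorical list."""
    names = list(features)
    return {a: {b: mutual_association(features[a], features[b]) for b in names}
            for a in names}


# Hypothetical categorical features: y is a many-to-one function of x.
x = ["a", "b", "c", "d"] * 3
y = ["u", "v", "u", "v"] * 3
matrix = association_matrix({"x": x, "y": y})
```

In this toy case x fully determines y but not the reverse, so the two directional values differ, which is the asymmetry the word "directional" refers to. The nested dictionary returned by `association_matrix` is the kind of table one would render as a heatmap.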

28 pages, 7583 KiB  
Article
Mimicking Complexity of Structured Data Matrix’s Information Content: Categorical Exploratory Data Analysis
by Fushing Hsieh, Elizabeth P. Chou and Ting-Li Chen
Entropy 2021, 23(5), 594; https://0-doi-org.brum.beds.ac.uk/10.3390/e23050594 - 11 May 2021
Cited by 5 | Viewed by 1942
Abstract
We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural dependency with heterogeneity. CEDA is developed upon all features’ categorical nature via histogram and it is guided by all features’ associative patterns (order-2 dependence) in a mutual conditional entropy matrix. Higher-order structural dependency of k (≥3) features is exhibited through block patterns within heatmaps that are constructed by permuting contingency-kD-lattices of counts. By growing k, the resultant heatmap series contains global and large scales of structural dependency that constitute the data matrix’s information content. When involving continuous features, the principal component analysis (PCA) extracts fine-scale information content from each block in the final heatmap. Our mimicking protocol coherently simulates this heatmap series by preserving global-to-fine scales structural dependency. At every step of the mimicking process, each accepted simulated heatmap is subject to constraints with respect to all of the reliable observed categorical patterns. For reliability and robustness in sciences, CEDA with mimicking enhances data visualization by revealing deterministic and stochastic structures within each scale-specific structural dependency. For inferences in Machine Learning (ML) and Statistics, it clarifies which covariate feature-groups have major-vs.-minor predictive powers on response features, and on which scales. For the social justice of Artificial Intelligence (AI) products, it checks whether a data matrix incompletely prescribes the targeted system.
(This article belongs to the Special Issue Information Complexity in Structured Data)