
The Role of Signal Processing and Information Theory in Modern Machine Learning

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Signal and Data Analysis".

Deadline for manuscript submissions: closed (30 November 2020) | Viewed by 27495

Special Issue Editors


Prof. Nariman Farsad
Guest Editor
1. Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
2. Department of Computer Science, Ryerson University, Toronto, ON M5B 2K3, Canada
Interests: machine learning; signal processing; communication systems; data science

Prof. Marco Mondelli
Guest Editor
Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria
Interests: information theory; machine learning; data science; wireless communication systems; coding theory

Dr. Morteza Mardani
Guest Editor
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
Interests: machine learning; statistical signal processing; artificial intelligence; medical imaging

Special Issue Information

Dear Colleagues,

Breakthroughs in modern machine learning are rapidly changing science, industry, and society, yet fundamental understanding in this area has lagged behind. For example, one of the central tenets of the field, the bias–variance trade-off, appears to be at odds with the observed behavior of methods used in practice, and the black-box nature of deep neural network architectures defies explanation. As these technologies are integrated ever more deeply into devices and services used by millions of people worldwide, there is an urgent need to provide theoretical guarantees for machine-learning techniques and to explain, based on empirical observation, why and how these techniques work.

Recently, powerful tools from signal processing, information theory, and statistical mechanics have provided insight into the inner workings of modern machine learning. This Special Issue aims to be a forum for the presentation of new and improved techniques at the intersection of signal processing, information theory, statistical mechanics, and machine learning. In particular, the theory of deep learning, novel uses of signal processing and information theory in machine learning, explainable deep learning, and active and adversarial learning all fall within the scope of this Special Issue.

Prof. Nariman Farsad
Prof. Marco Mondelli
Dr. Morteza Mardani
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Theory of Deep Learning
  • Information Theory in Machine Learning
  • Signal Processing in Machine Learning
  • Active Learning
  • Explainable Deep Learning
  • Adversarial Learning
  • Distributed Machine Learning
  • Statistics
  • Optimization

Published Papers (8 papers)


Research

40 pages, 1007 KiB  
Article
Phase Transitions in Transfer Learning for High-Dimensional Perceptrons
by Oussama Dhifallah and Yue M. Lu
Entropy 2021, 23(4), 400; https://doi.org/10.3390/e23040400 - 27 Mar 2021
Cited by 5 | Viewed by 3422
Abstract
Transfer learning seeks to improve the generalization performance of a target task by exploiting the knowledge learned from a related source task. Central questions include deciding what information one should transfer and when transfer can be beneficial. The latter question is related to the so-called negative transfer phenomenon, where the transferred source information actually reduces the generalization performance of the target task. This happens when the two tasks are sufficiently dissimilar. In this paper, we present a theoretical analysis of transfer learning by studying a pair of related perceptron learning tasks. Despite the simplicity of our model, it reproduces several key phenomena observed in practice. Specifically, our asymptotic analysis reveals a phase transition from negative transfer to positive transfer as the similarity of the two tasks moves past a well-defined threshold. Full article
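As a flavor of the negative-to-positive transfer transition described above, here is a minimal numerical sketch, not the paper's asymptotic analysis: two perceptron teachers with tunable correlation rho, a ridge-style target fit pulled toward the source estimate, and the resulting target generalization error. All function names, the ridge-style transfer mechanism, and the parameter values are illustrative assumptions.

```python
# Minimal sketch: transfer between two perceptron teachers with correlation rho.
# "Transfer" is modeled here as a ridge fit anchored to the source estimate.
import numpy as np

def generalization_error(w_hat, w_teacher):
    # Disagreement probability of two linear classifiers on Gaussian inputs:
    # the angle between the weight vectors divided by pi.
    cos = w_hat @ w_teacher / (np.linalg.norm(w_hat) * np.linalg.norm(w_teacher))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def ridge_fit(X, y, w_anchor, lam):
    # Minimizes ||Xw - y||^2 + lam * ||w - w_anchor||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_anchor)

rng = np.random.default_rng(0)
d, n_src, n_tgt = 200, 2000, 100
for rho in [0.0, 0.5, 0.9, 1.0]:
    w_src = rng.standard_normal(d)
    w_tgt = rho * w_src + np.sqrt(1 - rho**2) * rng.standard_normal(d)
    Xs = rng.standard_normal((n_src, d)); ys = np.sign(Xs @ w_src)
    Xt = rng.standard_normal((n_tgt, d)); yt = np.sign(Xt @ w_tgt)
    w_source_only = ridge_fit(Xs, ys, np.zeros(d), 1.0)      # estimate from the source task
    w_scratch = ridge_fit(Xt, yt, np.zeros(d), 1.0)          # target-only baseline
    w_transfer = ridge_fit(Xt, yt, w_source_only, 10.0)      # target fit anchored to the source
    print(f"rho={rho:.1f}  scratch={generalization_error(w_scratch, w_tgt):.3f}  "
          f"transfer={generalization_error(w_transfer, w_tgt):.3f}")
```

For small rho the anchored fit tends to do worse than training from scratch (negative transfer), while for rho near one it helps, loosely mirroring the threshold behavior the paper characterizes exactly.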

46 pages, 1342 KiB  
Article
Sharp Guarantees and Optimal Performance for Inference in Binary and Gaussian-Mixture Models
by Hossein Taheri, Ramtin Pedarsani and Christos Thrampoulidis
Entropy 2021, 23(2), 178; https://doi.org/10.3390/e23020178 - 30 Jan 2021
Cited by 2 | Viewed by 2053
Abstract
We study convex empirical risk minimization for high-dimensional inference in binary linear classification under both discriminative binary linear models and generative Gaussian-mixture models. Our first result sharply predicts the statistical performance of such estimators in the proportional asymptotic regime under isotropic Gaussian features. Importantly, the predictions hold for a wide class of convex loss functions, which we exploit to prove bounds on the best achievable performance. Notably, we show that the proposed bounds are tight for popular binary models (such as signed and logistic) and for the Gaussian-mixture model by constructing appropriate loss functions that achieve them. Our numerical simulations suggest that the theory is accurate even for relatively small problem dimensions and that it enjoys a certain universality property. Full article
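A toy experiment in the spirit of the abstract, assuming isotropic Gaussian features and a logistic label model: it compares two convex ERM losses (square and lightly regularized logistic) by the correlation of the estimate with the true direction. It illustrates the setting only; the paper's sharp asymptotic predictions and optimal loss constructions are not reproduced here, and all parameter values are arbitrary.

```python
# Toy comparison of two convex ERM losses under isotropic Gaussian features
# and a logistic label model; performance is correlation with the true direction.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, n = 100, 500                                   # proportional regime: n/d kept fixed
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 1 / (1 + np.exp(-5 * X @ w_star)), 1.0, -1.0)

def correlation(w):
    return abs(w @ w_star) / np.linalg.norm(w)

# Square loss has a closed-form minimizer (least squares on the +/-1 labels).
w_square, *_ = np.linalg.lstsq(X, y, rcond=None)

# Logistic loss (with a tiny ridge so the minimizer stays bounded), generic solver.
logistic = lambda w: np.mean(np.logaddexp(0.0, -y * (X @ w))) + 1e-3 * w @ w
w_logistic = minimize(logistic, np.zeros(d), method="L-BFGS-B").x

print(f"correlation with w_star: square={correlation(w_square):.3f}, "
      f"logistic={correlation(w_logistic):.3f}")
```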

24 pages, 456 KiB  
Article
Common Information Components Analysis
by Erixhen Sula and Michael C. Gastpar
Entropy 2021, 23(2), 151; https://doi.org/10.3390/e23020151 - 26 Jan 2021
Cited by 3 | Viewed by 1938
Abstract
Wyner’s common information is a measure that quantifies the commonality between two random variables. Based on this, we introduce a novel two-step procedure to construct features from data, referred to as Common Information Components Analysis (CICA). The first step can be interpreted as an extraction of Wyner’s common information. The second step is a form of back-projection of the common information onto the original variables, leading to the extracted features. A free parameter γ controls the complexity of the extracted features. We establish that, in the case of Gaussian statistics, CICA precisely reduces to Canonical Correlation Analysis (CCA), where the parameter γ determines the number of CCA components that are extracted. In this sense, we establish a novel rigorous connection between information measures and CCA, and CICA is a strict generalization of the latter. It is shown that CICA has several desirable features, including a natural extension beyond just two data sets. Full article
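Since the abstract states that CICA reduces to CCA in the Gaussian case (with γ setting the number of components), a compact CCA sketch conveys what that special case computes: whitening each block followed by an SVD of the cross-covariance. The Wyner common-information step of CICA itself is not implemented here, and the data-generation choices below are arbitrary.

```python
# Compact CCA via whitening + SVD of the cross-covariance; per the abstract,
# CICA reduces to this in the Gaussian case (gamma would set k).
import numpy as np

def cca(X, Y, k):
    """Top-k canonical directions and correlations for data matrices X, Y."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    Cxx, Cyy, Cxy = X.T @ X / len(X), Y.T @ Y / len(Y), X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T   # whitening for the X block
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T   # whitening for the Y block
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))                  # shared latent component
X = z @ rng.standard_normal((2, 5)) + 0.5 * rng.standard_normal((1000, 5))
Y = z @ rng.standard_normal((2, 4)) + 0.5 * rng.standard_normal((1000, 4))
A, B, corr = cca(X, Y, k=2)
print("canonical correlations:", np.round(corr, 3))
```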

28 pages, 775 KiB  
Article
Information-Theoretic Generalization Bounds for Meta-Learning and Applications
by Sharu Theresa Jose and Osvaldo Simeone
Entropy 2021, 23(1), 126; https://doi.org/10.3390/e23010126 - 19 Jan 2021
Cited by 20 | Viewed by 3427
Abstract
Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model-agnostic meta-learning (MAML), or joint within-task training and test sets, like Reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and the corresponding data set to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning. Full article
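For orientation, the conventional-learning bound that this line of work extends (the mutual-information bound of Xu and Raginsky) has the form below for a σ-sub-Gaussian loss; the paper's meta-generalization bounds are analogous in flavor, with the MI taken between the meta-learner's output and the meta-training data, plus per-task MI terms in the joint train/test case. The display reproduces the single-task result only, not this paper's bounds.

```latex
% Single-task MI generalization bound (Xu & Raginsky, 2017): for a
% \sigma-sub-Gaussian loss, an algorithm with output hypothesis W trained
% on the n-sample data set S satisfies
\bigl|\mathbb{E}\bigl[\mathrm{gen}(W,S)\bigr]\bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}\, I(W;S)}{n}},
\qquad
\mathrm{gen}(W,S) \;=\; L_{\mathcal{D}}(W) - L_{S}(W).
```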

34 pages, 612 KiB  
Article
No Statistical-Computational Gap in Spiked Matrix Models with Generative Network Priors
by Jorio Cocola, Paul Hand and Vladislav Voroninski
Entropy 2021, 23(1), 115; https://doi.org/10.3390/e23010115 - 16 Jan 2021
Cited by 4 | Viewed by 2574
Abstract
We provide a non-asymptotic analysis of the spiked Wishart and Wigner matrix models with a generative neural network prior. Spiked random matrices have the form of a rank-one signal plus noise and have been used as models for high dimensional Principal Component Analysis (PCA), community detection and synchronization over groups. Depending on the prior imposed on the spike, these models can display a statistical-computational gap between the information theoretically optimal reconstruction error that can be achieved with unbounded computational resources and the sub-optimal performances of currently known polynomial time algorithms. These gaps are believed to be fundamental, as in the emblematic case of Sparse PCA. In stark contrast to such cases, we show that there is no statistical-computational gap under a generative network prior, in which the spike lies on the range of a generative neural network. Specifically, we analyze a gradient descent method for minimizing a nonlinear least squares objective over the range of an expansive-Gaussian neural network and show that it can recover in polynomial time an estimate of the underlying spike with a rate-optimal sample complexity and dependence on the noise level. Full article
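A small PyTorch sketch of the algorithmic idea, not the paper's analysis or guarantee: gradient descent over the latent code of a fixed random ReLU generator to fit a rank-one spike observed through Wigner noise, reporting the overlap with the planted signal. The generator architecture, objective normalization, optimizer, and step counts are all illustrative assumptions.

```python
# PyTorch sketch: gradient descent over the latent code of a fixed random ReLU
# generator to fit a rank-one spike observed through Wigner noise.
import torch

torch.manual_seed(0)
k, n, snr = 10, 200, 5.0
W1, W2 = torch.randn(64, k), torch.randn(n, 64)      # fixed (untrained) generator weights

def G(z):                                             # expansive ReLU generator
    return W2 @ torch.relu(W1 @ z)

z_star = torch.randn(k)
x_star = G(z_star); x_star = x_star / x_star.norm()
noise = torch.randn(n, n); noise = (noise + noise.T) / (2 * n ** 0.5)
Y = snr * torch.outer(x_star, x_star) + noise         # spiked Wigner observation

z = torch.randn(k, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(2000):
    x = G(z)
    loss = torch.sum((Y - snr * torch.outer(x, x) / (x @ x)) ** 2)  # nonlinear least squares in z
    opt.zero_grad(); loss.backward(); opt.step()

x_hat = G(z).detach(); x_hat = x_hat / x_hat.norm()
print("overlap |<x_hat, x_star>|:", float(torch.abs(x_hat @ x_star)))
```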

18 pages, 2691 KiB  
Article
Deep Task-Based Quantization
by Nir Shlezinger and Yonina C. Eldar
Entropy 2021, 23(1), 104; https://doi.org/10.3390/e23010104 - 13 Jan 2021
Cited by 28 | Viewed by 4757
Abstract
Quantizers play a critical role in digital signal processing systems. Recent works have shown that the performance of acquiring multiple analog signals using scalar analog-to-digital converters (ADCs) can be significantly improved by processing the signals prior to quantization. However, the design of such hybrid quantizers is quite complex, and their implementation requires complete knowledge of the statistical model of the analog signal. In this work, we design data-driven task-oriented quantization systems with scalar ADCs, which determine their analog-to-digital mapping using deep learning tools. These mappings are designed to facilitate the task of recovering underlying information from the quantized signals. By using deep learning, we circumvent the need to explicitly recover the system model and to find the proper quantization rule for it. Our main target application is multiple-input multiple-output (MIMO) communication receivers, which simultaneously acquire a set of analog signals, and are commonly subject to constraints on the number of bits. Our results indicate that, in a MIMO channel estimation setup, the proposed deep task-based quantizer is capable of approaching the optimal performance limits dictated by indirect rate-distortion theory, achievable using vector quantizers and requiring complete knowledge of the underlying statistical model. Furthermore, for a symbol detection scenario, it is demonstrated that the proposed approach can realize reliable bit-efficient hybrid MIMO receivers capable of setting their quantization rule in light of the task. Full article
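A toy end-to-end sketch of the task-based quantization idea, assuming a linear observation model and one-bit scalar quantizers trained with a straight-through gradient estimator; the paper's architecture, quantizer model, and MIMO channel-estimation setting are not reproduced, and all module and parameter names are hypothetical.

```python
# Toy task-based quantizer: a learned analog combiner, one-bit scalar quantizers
# with a straight-through gradient, and a learned digital decoder, trained
# jointly for the task of estimating x from y = Hx + noise.
import torch
from torch import nn

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v):
        return torch.sign(v)                 # hard one-bit quantization
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                      # straight-through gradient estimate

class TaskQuantizer(nn.Module):
    def __init__(self, n_meas, n_bits, n_target):
        super().__init__()
        self.analog = nn.Linear(n_meas, n_bits, bias=False)          # pre-quantization combining
        self.digital = nn.Sequential(nn.Linear(n_bits, 64), nn.ReLU(),
                                     nn.Linear(64, n_target))         # task-oriented decoder
    def forward(self, y):
        return self.digital(SignSTE.apply(self.analog(y)))

torch.manual_seed(0)
n_x, n_y, n_bits = 4, 16, 12
H = torch.randn(n_y, n_x)
model = TaskQuantizer(n_y, n_bits, n_x)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3000):
    x = torch.randn(256, n_x)
    y = x @ H.T + 0.1 * torch.randn(256, n_y)
    loss = nn.functional.mse_loss(model(y), x)       # the "task" is recovering x
    opt.zero_grad(); loss.backward(); opt.step()
print("final task MSE:", float(loss))
```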

13 pages, 792 KiB  
Article
Deep Ensemble of Weighted Viterbi Decoders for Tail-Biting Convolutional Codes
by Tomer Raviv, Asaf Schwartz and Yair Be’ery
Entropy 2021, 23(1), 93; https://doi.org/10.3390/e23010093 - 10 Jan 2021
Cited by 6 | Viewed by 2581
Abstract
Tail-biting convolutional codes extend the classical zero-termination convolutional codes: both encoding schemes force the equality of start and end states, but under tail-biting every state is a valid termination. This paper proposes a machine learning approach to improve the state-of-the-art decoding of tail-biting codes, focusing on the widely employed short-length regime as in the LTE standard. This standard also includes a CRC code. First, we parameterize the circular Viterbi algorithm (CVA), a baseline decoder that exploits the circular nature of the underlying trellis. An ensemble combines multiple such weighted decoders, and each decoder specializes in decoding words from a specific region of the channel words’ distribution. A region corresponds to a subset of termination states; the ensemble covers the entire state space. A non-learnable gating satisfies two goals: it filters easily decoded words and mitigates the overhead of executing multiple weighted decoders. The CRC criterion is employed to choose only a subset of experts for decoding purposes. Our method achieves a frame error rate (FER) improvement of up to 0.75 dB over the CVA in the waterfall region for multiple code lengths, adding negligible computational complexity compared to the circular Viterbi algorithm at high signal-to-noise ratios (SNRs). Full article
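The gating-plus-ensemble control flow described in the abstract can be summarized schematically as below; the circular Viterbi decoder, the learned experts, and the CRC check are placeholders (the weighted CVA itself is not implemented here).

```python
# Schematic of the gated ensemble flow; decoders and CRC check are placeholders.
def decode_with_ensemble(llrs, cva_baseline, experts, crc_ok):
    word = cva_baseline(llrs)            # cheap pass: plain circular Viterbi decode
    if crc_ok(word):                     # gating: easily decoded words stop here
        return word
    for expert in experts:               # hard words: specialized weighted decoders
        candidate = expert(llrs)
        if crc_ok(candidate):            # CRC picks which expert output to trust
            return candidate
    return word                          # fall back to the baseline decision
```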

14 pages, 1523 KiB  
Article
Examining the Causal Structures of Deep Neural Networks Using Information Theory
by Scythia Marrow, Eric J. Michaud and Erik Hoel
Entropy 2020, 22(12), 1429; https://doi.org/10.3390/e22121429 - 18 Dec 2020
Cited by 3 | Viewed by 4620
Abstract
Deep Neural Networks (DNNs) are often examined at the level of their response to input, such as analyzing the mutual information between nodes and data sets. Yet DNNs can also be examined at the level of causation, exploring “what does what” within the layers of the network itself. Historically, analyzing the causal structure of DNNs has received less attention than understanding their responses to input. Yet definitionally, generalizability must be a function of a DNN’s causal structure as it reflects how the DNN responds to unseen or even not-yet-defined future inputs. Here, we introduce a suite of metrics based on information theory to quantify and track changes in the causal structure of DNNs during training. Specifically, we introduce the effective information (EI) of a feedforward DNN, which is the mutual information between layer input and output following a maximum-entropy perturbation. The EI can be used to assess the degree of causal influence nodes and edges have over their downstream targets in each layer. We show that the EI can be further decomposed in order to examine the sensitivity of a layer (measured by how well edges transmit perturbations) and the degeneracy of a layer (measured by how edge overlap interferes with transmission), along with estimates of the amount of integrated information of a layer. Together, these properties define where each layer lies in the “causal plane”, which can be used to visualize how layer connectivity becomes more sensitive or degenerate over time, and how integration changes during training, revealing how the layer-by-layer causal structure differentiates. These results may help in understanding the generalization capabilities of DNNs and provide foundational tools for making DNNs both more generalizable and more explainable. Full article
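A rough sketch of the effective-information idea described above: inject a maximum-entropy (uniform) perturbation at a layer's input and estimate input–output mutual information by histogram binning, here summed over input–output unit pairs as a simple edge-level decomposition. The bin count, perturbation range, activation, and aggregation are illustrative choices; the paper's estimator may differ.

```python
# Sketch: effective-information-style estimate for one feedforward layer, using a
# uniform (max-entropy) input perturbation and histogram MI estimates per edge.
import numpy as np

def mutual_information(a, b, bins=16):
    # Histogram-based MI estimate (in bits) between two scalar samples.
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

def layer_effective_information(W, activation=np.tanh, n_samples=100_000):
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(n_samples, W.shape[1]))   # max-entropy perturbation
    y = activation(x @ W.T)
    return sum(mutual_information(x[:, i], y[:, j])            # edge-level MI terms
               for i in range(W.shape[1]) for j in range(W.shape[0]))

W = np.random.default_rng(1).normal(size=(4, 4))
print("EI estimate (bits):", round(layer_effective_information(W), 2))
```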
