Article

Compressing BERT for Binary Text Classification via Adaptive Truncation before Fine-Tuning

1 Science and Technology on Information Systems Engineering Laboratory, College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
2 Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 10 October 2022 / Revised: 14 November 2022 / Accepted: 23 November 2022 / Published: 25 November 2022

Featured Application

The proposed approach can be used to reduce the memory footprint and the training and inference time of BERT-based binary text classifiers, which can serve as spam or anomaly detectors, sentiment discriminators, and so on.

Abstract

Large-scale pre-trained language models such as BERT have brought much better performance to text classification. However, their large sizes can lead to sometimes prohibitively slow fine-tuning and inference. To alleviate this, various compression methods have been proposed; however, most of these methods solely consider reducing inference time, often ignoring significant increases in training time, and thus are even more resource-consuming. In this article, we focus on lottery ticket extraction for the BERT architecture. Inspired by observations that representations at lower layers are often more useful for text classification, we propose that the winning ticket of BERT for binary text classification can be identified through adaptive truncation, i.e., a process that drops the top-k layers of the pre-trained model based on simple, fast computations. In this way, the cost of compressing and fine-tuning, as well as inference, can be vastly reduced. We present experiments on eight mainstream binary text classification datasets covering different input styles (i.e., single-text and text-pair), as well as different typical tasks (e.g., sentiment analysis, acceptability judgement, textual entailment, semantic similarity analysis and natural language inference). Compared with some strong baselines, our method saved 78.1% of the time and 31.7% of the memory on average, and up to 86.7% and 48% in extreme cases, respectively. We also saw good performance, often outperforming the original language model.

1. Introduction

Binary text classification, abbreviated to BTC throughout this article, is an important and classical research problem that supports a long list of practical applications, such as sentiment analysis [1], paraphrase identification [2] and spam detection [3]. Recently, the state of the art in BTC has advanced significantly through the use of large-scale, pre-trained language models such as BERT (bidirectional encoder representations from transformers) [4]. However, such models commonly have millions, if not billions, of parameters, and therefore require enormous resources for training, fine-tuning and inference. Even for relatively modest-sized language models such as BERTbase [4], fine-tuning sometimes takes several hours per epoch [5]. This prohibits them from being applied in resource-limited, on-device or streaming data scenarios [6,7].
Model compression is one approach to mitigating this issue. Various methods for compressing large-scale language models have been proposed in the last two years [8,9,10,11,12,13]. From the perspective of downstream tasks, current model compression methods can be classified into task-agnostic and task-specific compression. The former aims at compressing pre-trained language models for any downstream NLP task, whereas the latter shrinks models that are fine-tuned for a particular task. As discussed in [8,14,15], task-specific compression has two important advantages over task-agnostic compression: First, it is often faster and more sample-efficient, whereas task-agnostic compression requires (re-)training on very large (unlabeled) datasets and hence is resource-hungry. Second, models obtained by task-specific compression can adopt specialized attention patterns that are useful for the downstream task, but not necessarily useful for general language modeling.
Contemporary task-specific compression methods fall into four categories: quantization, pruning, distillation and matrix decomposition [8,16,17,18]. Among them, distillation and pruning are perhaps the most widely used. The general procedure of these two kinds of methods is illustrated in Figure 1 (upper box). (Note that pruning methods commonly require fine-tuning again after model compression, which is omitted in Figure 1 because this step is very simple and rarely mentioned in the literature.) As shown, knowledge distillation implements compression by transferring knowledge from a large teacher to a much smaller student, whereas pruning methods perform compression on a single model. These methods share two common weaknesses: First, they reduce the computation needed in the inference stage at the cost of a much longer training time (fine-tuning and compressing); in fact, the resources consumed by fine-tuning often prohibit the application of these methods [6,7]. Second, few of them consider the complexity of downstream tasks, or how the language models and target tasks correlate, when determining the architecture of the student model or the units in the language model to be pruned.
To address these weaknesses, we present a new task-specific model compression method and apply it to the BERT language model [4]; it is also illustrated in Figure 1 (lower box) for ease of comparison. As shown, the proposed method relies on adaptively truncating BERT for a given BTC task before fine-tuning. Specifically, by dropping the top-k Transformer layers of BERT, we observe that the full BERT is not always the best choice when fine-tuning on downstream BTC tasks; the classification accuracy achieved by some truncated models can be very close to, or even higher than, that of the full BERT. This indicates that compressing BERT via appropriate truncation may cause no significant loss and can even bring a notable gain in the prediction accuracy of downstream BTC tasks. To determine the truncation position properly and efficiently, we conduct some theoretical analysis of the fine-tuning process, together with a series of exploratory experiments, which motivate us to choose a class separability index as a proxy. The chosen measure can be calculated rapidly before fine-tuning and is found to be highly correlated with the final classification accuracy. As a result, only the truncated BERT, plus a task-specific head, needs to be fine-tuned, and hence the resources consumed in both the fine-tuning and inference stages can be vastly reduced at negligible additional cost. For validation, we conduct experiments on eight commonly used BTC tasks. The results show that our method, albeit simple, is very effective, outperforming both knowledge distillation and weight pruning methods.
In summary, the main contributions of this article include:
  • Proposal of a simple but effective model compression method that adaptively truncates BERT by layer dropping prior to fine-tuning, substantially reducing the cost of fine-tuning and inference. We observe that the full BERT model is not always the best point of departure for fine-tuning models for particular downstream tasks; sometimes shrinking BERT by truncating layers leads to better performance, as well as faster fine-tuning and inference.
  • A criterion to estimate the truncation position from the pre-trained representations of instances. As a result, our method finds winning tickets [19,20,21] before fine-tuning. In addition, the provided criterion makes the balance between the loss in classification accuracy and the reduction in model size controllable, and thus is adaptive to downstream tasks and supports on-demand compression.
  • Experiments showing that, compared with some strong baselines, the proposed method reduces fine-tuning time by 78.1% and memory requirements by 31.7% on average, with up to 86.7% and 48% reductions in some extreme cases, respectively. On average, 39.6% of storage can be saved compared with the original BERT-based models in both fine-tuning and inference. In addition, the accuracy achieved surpasses all our baselines on six out of eight tasks and is higher than that of the full BERT model on five tasks.
The remainder of this article is structured as follows: Section 2 briefly reviews related work; Section 3 discusses how to truncate BERT and how the resulting models perform on BTC tasks; Section 4 presents our compression method of truncating BERT before fine-tuning; Section 5 describes the experimental settings and the results of our approach and the competing methods. Finally, we draw conclusions and outline future work in Section 6.

2. Related Work

Three broad classes of work pertinent to this article are briefly surveyed in this section.

2.1. Pre-Trained Language Models

Large-scale pre-trained language models began to appear around 2017. The widely used BERT, proposed by Devlin et al. [4], achieved strong results on GLUE [22], SQuAD [23] and SWAG [24]. Before long, XLNet [25], RoBERTa [26] and others [27,28,29] emerged and pushed the evaluation scores of NLP tasks much higher. With the development of pre-trained language models, a wide range of NLP tasks repeatedly obtained new state-of-the-art results, including question answering [13], language inference [30,31], text classification [32] and text summarization [33]. Consequently, language models have become an indispensable part of solving NLP tasks. Although powerful, these models often contain huge numbers of parameters and hence are resource-hungry. To deploy them in resource-limited scenarios, compression, i.e., reducing model size without significant accuracy loss, becomes necessary.

2.2. Neural Model Compression

As previously mentioned, contemporary neural model compression methods fall into four main categories: quantization, matrix decomposition, knowledge distillation and model pruning [17]. All have been successfully applied to compressing BERT. Among them, the last category is most closely related to our proposed method. Thus, we only recap the key ideas of the first three, and devote more attention to existing pruning methods below.
Quantization uses a smaller number of unique values to represent model weights, and hence reduces the resources required to store and compute them. Matrix decomposition decomposes large parameter matrices into smaller ones, again reducing resource consumption. Knowledge distillation transfers knowledge from one or more highly regularized large models (the teachers) into a much smaller, distilled model (the student). Pruning accomplishes compression by identifying and dropping unimportant or redundant model weights or units. The lottery ticket hypothesis (LTH), first proposed by Frankle and Carbin [19], substantiates the existence of sub-networks that can reach test accuracy comparable with that of the original network, and has inspired many later works on pruning. Existing methods can be grouped into two classes, viz., structured and unstructured pruning. The former reduces models by dropping entire layers or coherent groups of blocks [10,34,35]. In contrast, unstructured pruning compresses a model by removing individual weights independently [19,20,21,36].
Generally, these existing compression methods, distillation and pruning in particular, require a lot of extra training to transfer knowledge from the teacher(s) to the student or to identify the units or weights to be removed. They sometimes even need to conduct fine-tuning followed by iterative compression. In comparison, our approach, though it looks like a kind of structured pruning in that it drops contiguous model units (viz., Transformer layers), differs significantly from existing compression methods in two main aspects: (i) it only needs to fine-tune the (truncated) model once and requires no extra training to determine the truncation position, and (ii) it decides the number of Transformer layers to be retained for each task using a metric evaluated on the corresponding labeled dataset, and hence is adaptive to downstream tasks. Thus, it not only saves resources for inference, but also those for training. It is noteworthy that the method proposed in this article is complementary to quantization, matrix decomposition, distillation and even some pruning methods, as it can provide a better starting point than the full BERT, in terms of smaller model size and often higher classification accuracy, for these existing compression methods.

2.3. Class Separability

Class separability is a collective geometric property of a set of labeled samples that gauges how well samples with different class labels are separated from each other in the representation space. Various separability indices have been proposed, e.g., those in [37,38,39]. The key idea shared by most of them is to measure and compare the (maximal) intra-class distance and the (minimal) inter-class distance. As it is impossible to know the performance of machine learning models before they are applied, separability indices make it possible to get a prior glimpse of the model outcome [40,41,42]. Moreover, they are used in data-driven analyses of data peculiarities [43,44], providing a better understanding of the data to obtain better prediction results. In the current era of deep learning, they are also used to provide insights into deep neural networks and the spatial distribution of the resulting representations [39,45,46]. In this article, we select a simple separability measure that is found to correlate closely with the classification accuracy of a classifier after fine-tuning. We use this separability measure as a proxy to determine, before fine-tuning, the number of top-most layers in BERT to be dropped for each downstream BTC task, under the constraint that no significant accuracy decline is incurred.

3. Full BERT Does Not Always Perform the Best

This section first provides an overview of BERT, which is the target model to be compressed, and then describes how to truncate it and how the resulting models perform on various BTC tasks.

3.1. A Brief Introduction to BERT

A BERT model [4] consists of multiple stacked Transformer layers [47]. Its size is mainly determined by three hyper-parameters: the number of Transformer layers, the dimensionality of the hidden state vectors and the number of self-attention heads in each Transformer layer. Typical settings are (12, 768, 12) and (24, 1024, 16); the corresponding models are called BERTbase and BERTlarge, contain about 110M and 340M parameters, respectively, and hence need huge computational resources and enormous corpora for training. BERT models are pre-trained in a self-supervised manner on large-scale unlabeled texts, such as those collected from Wikipedia, using two tasks, i.e., masked language modeling (MLM) and next sentence prediction (NSP), and the model parameters learned via pre-training have been made publicly available. Thus, a BERT-based model for a downstream task such as BTC (see Figure 2 for an example) can be obtained easily by (i) appending a task layer to the top of BERT, (ii) loading the pre-trained parameter values as the initialization of BERT, (iii) randomly initializing the parameters of the task layer and (iv) training all model parameters, i.e., those in both BERT and the task layer, on a set of labeled samples of the downstream task. As the dataset involved in step (iv) commonly contains no more than tens of thousands of samples, which is quite small compared with the number of model parameters, the training process usually changes the parameter values only slightly, particularly those in BERT, and hence is called fine-tuning. In short, the training process of a BERT-based model consists of two stages: pre-training and fine-tuning. Subsequently, the trained model can be applied to unseen test samples to predict their labels; the latter process is called inference.
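As a concrete illustration of steps (i)-(iv), the sketch below builds and fine-tunes a BERT-based binary classifier. It is a minimal sketch assuming the HuggingFace transformers and PyTorch libraries; the batching over a labeled downstream dataset is only indicated, and the learning rate shown is a typical fine-tuning value rather than a prescribed one.

```python
# Minimal sketch of steps (i)-(iv): pre-trained BERT encoder plus a randomly
# initialized task head, fine-tuned on labeled downstream samples.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Steps (i)-(iii): the classification head on top of BERT is initialized
# randomly, while the encoder weights come from pre-training.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # typical fine-tuning rate

def fine_tune_step(texts, labels):
    # Step (iv): one gradient step on a batch of labeled downstream samples.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```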
As pre-trained BERT models have been shared online, the resource issues that need to be considered mainly concern fine-tuning and inference. However, as previously mentioned, existing studies chiefly focus on the costs involved in inference while ignoring those involved in fine-tuning. Even worse, they usually reduce the resources needed for inference at the cost of a much larger investment in the training process.

3.2. Truncating BERT for Binary Text Classification

It is widely accepted that BERT is over-parameterized [20,21], which motivates us to conduct a series of experiments to empirically study how this would influence BTC tasks that are the focus of this study. As BERTbase and BERTlarge have identical structures but different sizes, we focus only on the former and refer to it as BERT from now on for notation simplicity, unless otherwise specified. Further study on BERTlarge should be pursued in future work.
As illustrated in Figure 3, a model for BTC can be built by truncating BERT in two steps: the first step is to drop the top-k Transformer layers from the pre-trained BERT and preserve the bottom part as an encoder; the second step is to add a task head, composed of a fully connected (FC) layer and a sigmoid layer, to the top of the resulting encoder to construct a binary classifier. In this way, when k ranges from 11 to 0, we obtain 12 different classifiers, denoted by $M_1, \ldots, M_{12}$, respectively. To train such a classifier $M_m$ ($m = 1, \ldots, 12$), the parameters in its different parts are initialized differently: those in the encoder, i.e., the bottom Transformer layers retained, are initialized with the weights learned during pre-training, whereas those in the task head are randomly initialized. After that, $M_m$ is fine-tuned using the labeled dataset of the downstream BTC task.
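A minimal sketch of this two-step construction is given below, assuming the HuggingFace transformers library. Retaining the bottom m layers by slicing the encoder's layer list is one straightforward way to implement the truncation and is an implementation choice made here for illustration, not necessarily the exact code used in the experiments.

```python
# Sketch of the truncation in Figure 3: keep the bottom m Transformer layers
# of a pre-trained BERT and attach a randomly initialized FC + sigmoid head.
import torch
import torch.nn as nn
from transformers import BertModel

class TruncatedBertClassifier(nn.Module):
    def __init__(self, m, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        # Step 1: drop the top-k layers, i.e., retain only the bottom m layers.
        self.encoder.encoder.layer = self.encoder.encoder.layer[:m]
        self.encoder.config.num_hidden_layers = m
        # Step 2: task head = fully connected layer followed by a sigmoid.
        self.fc = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] representation
        return torch.sigmoid(self.fc(cls)).squeeze(-1)

# M_8, for example, keeps the bottom 8 Transformer layers.
model_m8 = TruncatedBertClassifier(m=8)
```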
To investigate the performance of these truncated models, we fine-tune and test them on the datasets of three different BTC tasks:
  • ChnSentiCorpHou [48] is a Chinese sentiment classification corpus consisting of hotel comments collected by Tan and Zhang [48]. It consists of 3000 positive and 3000 negative reviews of different hotels. Given a comment, the task is to predict the corresponding sentiment polarity.
  • CoLA [49] consists of 10,657 English sentences from published linguistics literature. Given a sentence, the task is to judge its grammatical acceptability. Note that, as the test set is unlabeled, we only utilize the publicly available training and dev sets in our experiments.
  • RTE [22] refers to the Recognizing Textual Entailment dataset, consisting of 2.7 k English sentence pairs. Each pair is labeled with a binary label indicating whether one sentence entails the other or not.
For fairness, different truncated models adopt the same splitting of training (or rather, fine-tuning), validation and test set. The test accuracy each model achieved on these three datasets is given in Table 1, which is evaluated by
$$ \mathrm{accuracy} = \frac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN}, \qquad (1) $$
where $\#TP$, $\#TN$, $\#FP$ and $\#FN$ denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
Note that, as each Transformer layer has more than 9M parameters, fine-tuning and inference times increase dramatically as the number of Transformer layers grows. However, the results in Table 1 show that additional layers come with diminishing returns in terms of classification accuracy: as more layers are added, the accuracy fluctuates and then plateaus. More concretely, for CoLA, truncated models with 9 or 10 Transformer layers outperform the full BERT model, i.e., $M_{12}$, and models with 8 and 11 Transformer layers exhibit competitive performance; for RTE, the model containing 8 Transformer layers attains the best test accuracy; and for ChnSentiCorpHou, no truncated model outperforms the full BERT model, but those with 2 to 11 Transformer layers are competitive. In addition, the optimal number of Transformer layers differs across datasets, as reflected in Table 1 by the fact that the highest accuracies (bold) for different datasets are located in different rows.
These observations indicate that BERT can be compressed for BTC tasks through truncation with very minor loss or even notable gain in terms of classification accuracy, and the number of Transformer layers to be dropped should be determined according to the downstream BTC task, i.e., adaptively.

4. Truncating BERT before Fine-Tuning

Below, we first conduct some theoretical analysis of the feasibility of truncating BERT before fine-tuning, then describe how to use class separability as a criterion to determine the truncation position adaptively for downstream BTC tasks, and finally present the proposed compression method.

4.1. Feasibility Analysis

Recall that, in Section 3, we constructed 12 binary classifiers, denoted by $M_1, \ldots, M_{12}$, each of which contains a truncated BERT with $m$ ($m = 1, \ldots, 12$) Transformer layers, followed by an FC layer and a sigmoid layer. More formally, for a given sample (e.g., a text or text pair) vectorized as $x_n$, the prediction function of $M_m$ is
$$ M_m(x_n) = \sigma\left(g_m\left(h_m(x_n)\right)\right), \qquad (2) $$
$$ h_m(x_n) = f_m\left(f_{m-1}\left(\cdots f_1(x_n)\right)\right), \qquad (3) $$
where $\sigma$ is the sigmoid function, $g_m$ denotes the transform function of the fully connected layer, $h_m$ is the function of the encoder, and $f_j$ is the function of the $j$-th ($j = 1, \ldots, m$) Transformer layer.
For binary classification tasks, $\sigma \circ g_m$ in Equation (2) is a generalized linear transform aimed at finding the hyperplane that best separates the examples without transforming their input representations, namely the embedding vectors of the input examples. Such embeddings are produced by the encoder $h_m$, which plays the more critical role, as it determines the space in which the separating hyperplane is located and the feature vectors of the input examples, and hence the degree to which the examples are (linearly) separable. More precisely, the prediction accuracy of $M_m(\cdot)$, denoted by $acc_m$, on a dataset $S$ containing $N$ examples vectorized as $x_1, \ldots, x_N$ relies mainly on the linear separability of the embedding set $\{h_m(x_1), \ldots, h_m(x_N)\}$ generated by the encoder $h_m$. The higher the separability of the embedding set, the higher the test accuracy $acc_m$ attained by the model $M_m(\cdot)$.
In addition, for BTC tasks, we observed in our experiments that fine-tuning brings relatively small changes to the parameters of the encoder $h_m$, particularly for the deeper layers. We attribute this to three main factors: (i) the labeled dataset used for fine-tuning is small, commonly containing from hundreds to tens of thousands of examples; (ii) gradients of the model parameters, computed via the chain rule, tend to vanish; and (iii) a very small learning rate, e.g., $5 \times 10^{-5}$, is usually applied to the pre-trained encoder. Similar phenomena have previously been observed by Merchant et al. [50] on the MNLI dataset [31]. More recent studies by Zhou and Srikumar [51,52] also found that fine-tuning largely preserves the spatial structure of the data points, i.e., the vectorized examples.
Inspired by the above analyses and observations, we conjecture that the ordering of the separability of the embedding sets generated by different pre-trained encoders remains consistent during fine-tuning. Thus, the separability of the embedding set produced by $h_m$ before fine-tuning can be used as a precursor of the ranking of the accuracy that the corresponding classifier $M_m$ can achieve after being fine-tuned.

4.2. Using Class Separability as a Proxy

We conducted additional experiments to validate the conjecture described above. Before that, three widely used separability indices, viz., the class scatter matrices (CSM) index [53], Thornton's separability index (SI) [54] and the hypothesis margin (HM) [55], are briefly reviewed below. To simplify notation, the embedding vector generated by $h_m$ from an input example $x$ is denoted by $z$ from now on, viz., $z = h_m(x)$, $m = 1, \ldots, 12$, with the model subscript omitted.
CSM is the ratio of the trace of the between-cluster scatter matrix ($S_B$) to that of the within-cluster scatter matrix ($S_W$). For two-class datasets, it can be formulated as
$$ CSM = \mathrm{tr}(S_B) / \mathrm{tr}(S_W), \qquad (4) $$
$$ S_B = \sum_{c=1}^{2} N_c \left(\bar{z}_c - \bar{z}\right)\left(\bar{z}_c - \bar{z}\right)^{T}, \qquad (5) $$
$$ S_W = \sum_{c=1}^{2} \sum_{j=1}^{N_c} \left(z_{cj} - \bar{z}_c\right)\left(z_{cj} - \bar{z}_c\right)^{T}, \qquad (6) $$
where $\mathrm{tr}(\cdot)$ denotes the trace operation, i.e., the sum of the diagonal elements of a given matrix; $S_B$ is the between-cluster scatter, which evaluates the distance between the mean vector of each class and the mean vector of the whole dataset; $S_W$ is the within-cluster scatter, which measures the distance between each instance and the mean vector of the class it belongs to; $N_c$ ($c = 1, 2$) is the number of examples in the $c$-th class; $z_{cj}$ is the $j$-th instance in class $c$; and $\bar{z}_c$ and $\bar{z}$ denote the mean vectors of class $c$ and the whole dataset, respectively.
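A minimal NumPy sketch of the CSM index in Equations (4)-(6) is given below; Z is assumed to be an [N, d] array of embeddings and y an array of binary class labels.

```python
# Class scatter matrices (CSM) index: tr(S_B) / tr(S_W) for a two-class dataset.
import numpy as np

def csm(Z, y):
    z_bar = Z.mean(axis=0)                      # mean vector of the whole dataset
    d = Z.shape[1]
    S_B = np.zeros((d, d))                      # between-cluster scatter
    S_W = np.zeros((d, d))                      # within-cluster scatter
    for c in np.unique(y):
        Z_c = Z[y == c]
        z_c_bar = Z_c.mean(axis=0)              # mean vector of class c
        diff = (z_c_bar - z_bar)[:, None]
        S_B += len(Z_c) * diff @ diff.T
        S_W += (Z_c - z_c_bar).T @ (Z_c - z_c_bar)
    return np.trace(S_B) / np.trace(S_W)
```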
SI measures the inter-class overlap by calculating the fraction of instances that share the same class label as their nearest neighbors. Formally, it is defined as
$$ SI = \frac{\sum_{i=1}^{N} \left( \phi(z_i) + \phi(z_i') + 1 \right) \bmod 2}{N}, \qquad (7) $$
in which $N$ is the total number of instances in the dataset, $z_i$ is the $i$-th instance, and $\phi(z_i) \in \{0, 1\}$ and $z_i'$ denote the class label and the nearest neighbor of $z_i$, respectively.
HM evaluates the sum, over all instances, of the distance between the hypothesis and the closest hypothesis that assigns the alternative label to the given instance:
$$ HM = \frac{1}{2} \sum_{i=1}^{N} \left( \left\| z_i - \mathrm{nearmiss}(z_i) \right\| - \left\| z_i - \mathrm{nearhit}(z_i) \right\| \right), \qquad (8) $$
where $N$ denotes the number of instances in the dataset, $\mathrm{nearhit}(z_i)$ and $\mathrm{nearmiss}(z_i)$ are the nearest instances to $z_i$ with the same and a different class label, respectively, and $\|\cdot\|$ stands for the Euclidean norm.
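For completeness, minimal sketches of SI (Equation (7)) and HM (Equation (8)) follow, assuming NumPy and scikit-learn; again Z is an [N, d] embedding array and y holds the binary labels, and the brute-force loop in hypothesis_margin favors clarity over speed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def thornton_si(Z, y):
    # Nearest neighbor of each point, excluding the point itself.
    nn = NearestNeighbors(n_neighbors=2).fit(Z)
    _, idx = nn.kneighbors(Z)
    neighbor_labels = y[idx[:, 1]]
    # (phi(z_i) + phi(z_i') + 1) mod 2 equals 1 iff the two labels agree.
    return np.mean((y + neighbor_labels + 1) % 2)

def hypothesis_margin(Z, y):
    margins = []
    for i in range(len(Z)):
        dist = np.linalg.norm(Z - Z[i], axis=1)
        dist[i] = np.inf                         # exclude the point itself
        nearhit = dist[y == y[i]].min()          # nearest same-label instance
        nearmiss = dist[y != y[i]].min()         # nearest different-label instance
        margins.append(nearmiss - nearhit)
    return 0.5 * np.sum(margins)
```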
To evaluate the appropriateness of these three separability indices as proxies for truncating BERT before fine-tuning, we computed them on the embedding sets generated by the pre-trained (but not fine-tuned) encoders $h_1, \ldots, h_{12}$ from the three BTC datasets described in Section 3.2. Figure 4 shows the results. For ease of comparison, each subplot shows the results of one separability index together with the classification accuracy of the corresponding classifiers $M_1, \ldots, M_{12}$ on the same dataset. Intuitively, the trend of CSM, compared with SI and HM, is more similar to that of the test accuracy across encoders and datasets. As shown in the first column of Figure 4, the CSM curves closely track the test accuracy achieved by the corresponding classifiers after fine-tuning. This partly validates the conjecture presented at the end of Section 4.1. We further compute Pearson's correlation coefficient between the classification accuracy and each separability index. The results, given in Table 2, quantitatively confirm that CSM correlates most closely with the test accuracy and hence is the best choice among the three indices to be used as a proxy.

4.3. The Compression Method

Now, we describe how to use CSM as a proxy to truncate BERT before fine-tuning, and hence reduce resources to be consumed in both fine-tuning and inference without a significant loss in classification accuracy.
Let $S_{tr}$ be the training set of a BTC task. The maximum and minimum of the CSM values of the embedding sets generated from $S_{tr}$ by encoders with different numbers of Transformer layers are denoted by $CSM^*$ and $CSM_*$, respectively. That is,
$$ CSM^* = \max_{m} CSM\left(h_m(S_{tr})\right), \qquad (9) $$
$$ CSM_* = \min_{m} CSM\left(h_m(S_{tr})\right), \qquad (10) $$
where $h_m(S_{tr})$ is the embedding set generated by $h_m$, a pre-trained encoder consisting of $m$ Transformer layers. The encoder indices corresponding to $CSM^*$ and $CSM_*$ are denoted by $m^*$ and $m_*$, respectively. As the goal of compression is to shrink the model under consideration as much as possible without significantly diminishing its accuracy, we define $\rho \in (0, 1)$ to be the maximally tolerable ratio of (relative) accuracy decline, and determine the number of Transformer layers to be preserved during truncation, denoted by $\hat{m}$, as follows:
$$ \Delta CSM = CSM^* - CSM_*, \qquad (11) $$
$$ \hat{m} = \min \left\{ m \in \{1, \ldots, m^*\} \ \middle|\ CSM\left(h_m(S_{tr})\right) \geq CSM^* - \rho\, \Delta CSM \right\}. \qquad (12) $$
By definition, $\rho$ should be a small constant; otherwise, significant decreases in accuracy may occur. In our experiments (see Section 5.3.1), we find $0.20 \leq \rho \leq 0.25$ to be a good choice that ensures a very small accuracy loss while achieving a considerable compression ratio. As an additional advantage, $\rho$ makes the balance between accuracy loss and model size reduction controllable, and hence makes on-demand compression possible.
To summarize, adaptively truncating BERT before fine-tuning for a specified BTC task can be implemented in three steps: (i) vectorize each example in the training set $S_{tr}$ of the target task as $x_n$ ($n = 1, \ldots, N$); (ii) feed each $x_n$ into a pre-trained BERT and collect the hidden states $z_n^m$ ($m = 1, \ldots, 12$) generated by each Transformer layer into embedding sets $Z_m = h_m(S_{tr}) = \{z_1^m, \ldots, z_N^m\}$; and (iii) calculate the CSM of each embedding set $Z_m$ for $m = 1, \ldots, 12$ and find the truncation position according to Equation (12), as sketched below.
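The three steps can be put together as in the following sketch, which assumes the HuggingFace transformers library and reuses the csm() function sketched in Section 4.2. Taking the [CLS] hidden state of each layer as the per-example embedding is an assumption made here for illustration; any other pooling of the layer's hidden states could be substituted.

```python
# Sketch of the three-step procedure: collect per-layer [CLS] embeddings of the
# training set from the pre-trained (not fine-tuned) BERT, compute the CSM of
# each layer's embedding set, and pick the truncation position via Eq. (12).
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

def choose_truncation(texts, labels, rho=0.20, model_name="bert-base-uncased"):
    tokenizer = BertTokenizer.from_pretrained(model_name)
    bert = BertModel.from_pretrained(model_name, output_hidden_states=True).eval()
    y = np.array(labels)

    # Steps (i)-(ii): vectorize every training example and collect the hidden
    # state of each Transformer layer (hidden_states[1:] skips the embedding layer).
    layer_embeddings = [[] for _ in range(bert.config.num_hidden_layers)]
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, truncation=True, return_tensors="pt")
            hidden = bert(**enc).hidden_states[1:]
            for m, h in enumerate(hidden):
                layer_embeddings[m].append(h[0, 0].numpy())   # [CLS] vector

    # Step (iii): CSM per layer, then Equation (12).
    csms = np.array([csm(np.stack(Z), y) for Z in layer_embeddings])
    csm_max, csm_min = csms.max(), csms.min()
    m_star = int(csms.argmax()) + 1
    threshold = csm_max - rho * (csm_max - csm_min)
    for m in range(1, m_star + 1):
        if csms[m - 1] >= threshold:
            return m                  # number of Transformer layers to keep
    return m_star
```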

5. Experiments

To validate the effectiveness of the proposed method, we conduct extensive experiments on eight BTC datasets, including CoLA, ChnSentiCorpHou and RTE, described in Section 3, and five additional ones. In addition, we empirically compare our method to some strong baselines to analyze their relative merits and complementarity.

5.1. Datasets

In the following, we provide a brief introduction to the five additional datasets used in our experiments:
  • WaiMai (https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/waimai_10k (accessed on 8 October 2022)) is a Chinese sentiment corpus consisting of comments on delivery services and takeaway food. It contains about 4000 positive and 8000 negative examples.
  • SST-2 refers to the Stanford Sentiment Treebank [56]. It consists of about 70 k English sentences excerpted from movie reviews, each of which is assigned to a positive or negative label.
  • Nikon-JD is a Chinese aspect-level sentiment classification dataset constructed by ourselves [57]. It consists of 2.6 k product reviews from an E-commerce platform, namely JingDong (https://www.jd.com (accessed on 8 October 2022)). Given a review and a target aspect category, three human annotators were asked to judge whether the review expresses a positive sentiment toward the specified aspect; a label was then assigned according to the majority judgement.
  • AFQMC stands for the Ant Financial Question Matching Corpus [58], released by the Ant Technology Exploration Conference Developer competition. There are 42.5 k instances in this corpus, each of which is a sentence pair. The task is to judge whether a given pair of sentences are semantically similar.
  • MRPC is the Microsoft Research Paraphrase Corpus [59], consisting of English sentence pairs automatically extracted from online news platforms. Each pair is annotated with a label indicating whether the sentences in a pair are semantically equivalent. There are approximately 5.8 k pairs in this corpus.
Table 3 gives the statistics of these datasets, together with those described in Section 3. For better diversity, these datasets involve English and Chinese texts, single text and text-pair inputs, and various typical BTC tasks such as sentiment polarity identification and grammatical acceptability judgement. Note that the ordering of these datasets has been rearranged to place together tasks dealing with identical languages and input formats.
Among these datasets, all samples in ChnSentiCorpHou, WaiMai, SST-2, Nikon-JD and MRPC are available and hence used in the experiments. In contrast, only the training and dev sets of CoLA, AFQMC and RTE are used, as their test sets are not publicly available. For the two larger ones, i.e., CoLA and AFQMC, we take the original dev sets as our new test sets and randomly sample 10% of the instances from the original training sets to form the new dev sets, while the remaining examples form the new training sets. We then fine-tune each classifier on the training and dev sets and measure its accuracy on the test set for WaiMai, SST-2, CoLA, AFQMC and MRPC. We conduct five-fold cross-validation on the remaining three smaller datasets, namely ChnSentiCorpHou, Nikon-JD and RTE: the examples are randomly partitioned into five (nearly equal) folds, and the following process is repeated five times: a different fold is chosen as the test set and its accuracy is calculated, while the remaining four folds form the training and dev sets. We report the average test accuracy over the five rounds.
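The five-fold protocol can be summarized as in the following sketch, assuming scikit-learn; build_and_finetune and evaluate_accuracy are hypothetical placeholders for the model construction, fine-tuning and evaluation code, not functions from this article.

```python
# Five-fold cross-validation: each fold serves once as the test set, while the
# remaining four folds are split into training and dev sets.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def five_fold_accuracy(texts, labels, seed=42):
    texts, labels = np.array(texts, dtype=object), np.array(labels)
    accs = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(texts):
        # The remaining four folds are split again into training and dev sets.
        tr_idx, dev_idx = train_test_split(train_idx, test_size=0.1, random_state=seed)
        model = build_and_finetune(texts[tr_idx], labels[tr_idx],      # hypothetical helper
                                   texts[dev_idx], labels[dev_idx])
        accs.append(evaluate_accuracy(model, texts[test_idx], labels[test_idx]))  # hypothetical helper
    return float(np.mean(accs))
```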

5.2. Implementation Details and Comparative Methods

We use the 12-layer Chinese BERT-wwm-ext (https://huggingface.co/hfl/chinese-bert-wwm-ext (accessed on 8 October 2022)) and BERT-base-uncased (https://huggingface.co/bert-base-uncased (accessed on 8 October 2022)) as our basic language models for Chinese and English tasks, respectively. All classification models are fine-tuned for five epochs. The batch size and other implementation details are provided in Table 4. After each training epoch, we evaluate the model on the dev set; the checkpoint that performs best on the dev set is saved and applied.
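The epoch-level model selection described above can be sketched as follows; train_one_epoch and evaluate_accuracy are hypothetical helpers standing in for the actual training and dev-set evaluation code.

```python
# Fine-tune for a fixed number of epochs, keeping the checkpoint that performs
# best on the dev set.
import copy

def fine_tune_with_dev_selection(model, train_loader, dev_loader, epochs=5):
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        train_one_epoch(model, train_loader)            # hypothetical helper
        acc = evaluate_accuracy(model, dev_loader)       # hypothetical helper
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)   # the best model on dev is saved and applied
    return model, best_acc
```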
For comparison, we set up the following baselines:
  • Full BERT: We construct binary classification models based on BERTbase [4] as the standard model.
  • Distil-Pretrain: We build up classification models based on the distilled language models (DistilBERT [12] and ALBERT-Chinese-tiny (https://huggingface.co/clue/albert_chinese_tiny (accessed on 10 October 2022))).
  • Distil-task-biLSTM: We fine-tune a BERT-based classification model as the teacher and construct a student that uses a biLSTM encoder with an attention mechanism. Then, we let the student imitate the predictions of the teacher.
  • Distil-task-kimCNN: The teacher is the same as above, but the student is constructed based on kimCNN [60].
  • Pruning-AGP: We employ the automated gradual pruner (AGP) [61] to implement task-specific magnitude pruning on our tasks. Here, we refer to the implementation provided in the Distiller library [62].
The BERT embeddings, batch size and max sequence length adopted in all these baselines are the same as those of our models (see Table 4). As an exception, we set the learning rate to 1 × 10−3 during task-specific distillation for both Distil-task-biLSTM and Distil-task-kimCNN. In the following sections, we compare these methods with ours in terms of prediction accuracy, training time and number of parameters. Note that, to ensure fairness, the above baselines and the proposed truncation method adopt the same splitting of the training (or rather, fine-tuning), validation and test sets on each dataset.

5.3. Results and Analysis

5.3.1. The Balance between Accuracy Decline and Model Size Reduction

We investigate how the classification accuracy and model size vary with $\rho$ (the maximally tolerable ratio of relative accuracy decline) to find an appropriate trade-off between them. To check whether there exists some general regularity, we first treat the aforementioned eight BTC tasks as a whole by computing the average accuracy loss and compression ratio over them for different values of $\rho$. Figure 5 depicts how the average accuracy loss (measured as the mean accuracy loss ratio relative to the full BERT; see the left vertical axis) and the average compression ratio achieved (measured as the average number of Transformer layers dropped during compression; see the right vertical axis) vary as $\rho$ changes from 0 to 1 with a stride of 0.05. It can be observed that both the accuracy loss and the compression ratio increase as $\rho$ grows, which is consistent with our expectations. Notably, when $\rho > 0.40$, the mean accuracy loss ratio (the red curve in Figure 5) starts to rise more rapidly, whereas the increase in the compression ratio (the blue curve in Figure 5) slows down. Both curves are nearly horizontal in the interval $\rho \in [0.35, 0.40]$; for a better trade-off, we would therefore limit $\rho \leq 0.35$. If we prefer nearly no accuracy loss, we may limit the mean loss ratio to less than 0.02 (i.e., within 2%); correspondingly, $\rho$ should be no more than 0.25. Based on the figure, $\rho$ should thus be limited to $[0.20, 0.25]$ if we also want to obtain a compression ratio as large as possible. In addition, as the red curve is steeper than the blue one when $\rho \in [0.20, 0.25]$, we consider $\rho = 0.20$ an even better choice, as it restricts the accuracy loss more rigidly while achieving a similar compression ratio.
Figure 6 further depicts how the test accuracy achieved (the left vertical axis in each subplot) and the number of Transformer layers retained (the right vertical axis in each subplot) covary with $\rho$ on each dataset. We highlight the test accuracy and the size of the compressed model corresponding to $\rho = 0.20$ with a vertical dashed line in each subplot. The results show that this setting of $\rho$ finds a compressed model with the highest accuracy, or accuracy very close to the highest, on all datasets but AFQMC, which verifies $\rho = 0.20$ to be a generally good choice. Moreover, the resulting models are of different complexities, containing different numbers of Transformer layers for different BTC datasets; in other words, the number of Transformer layers to be preserved in the encoder is determined adaptively for the downstream BTC task.

5.3.2. Performance of the Compressed Models

To validate the effectiveness of our proposed method, we comprehensively evaluate the performance of the compressed model it produces when adopting the optimal balance between accuracy loss and compression ratio, namely $\rho = 0.2$. The performance is measured from different angles, including the test accuracy achieved, the number of parameters retained and the training time spent. The results are shown in Table 5, Table 6 and Table 7, respectively, in which our compressed model is called Truncated-CSM. For the convenience of comparison, we also provide the results of the baseline models on identical datasets in these tables, obtained using the same devices as those listed in Table 4.
As shown in Table 5, compared with the other compressed models, ours achieves the highest accuracy on six out of eight datasets. For the other two datasets, it ranks second on AFQMC and third on MRPC, and performs very close to Pruning-AGP, with an accuracy difference of no more than one point. It is also noteworthy that, although our model is often much smaller than the full BERT (see the corresponding columns in Table 6), it outperforms the full BERT on five datasets in terms of test accuracy. This confirms that BERT is over-parameterized, as previously found, and justifies our approach of compressing it via truncation.
We further compare the numbers of parameters in the different compressed models, together with that of the full BERT, which need to be stored and updated during training. Note that, for inference, distillation-based methods only load and use the student, which is often much lighter and runs very fast. However, as shown in Table 6, such model reduction and speed-up are gained at the cost of far more resources invested in training. Turning to training, the results in Table 6 show that our compressed model is much smaller than all competing models, including the full BERT, which leads to significantly less training time. This is confirmed by the results in Table 7, which show that our model consumes the least training time on all datasets. It is worth highlighting that the training time of our model is only 46~92% of that needed to fine-tune a full BERT. In contrast, the other compression methods, including both distillation and pruning, require far longer training times than the full BERT, as they need extra training to learn the pruning mask or to transfer knowledge from the teacher to the student, sometimes iteratively.

5.4. Discussion

Overall, the above results show that our method can achieve an average compression ratio of about 40% relative to the full BERT while keeping the classification accuracy at a very high level. Specifically, our method achieved the highest accuracy among all models, including the full BERT, on five out of eight datasets, and beat the other compressed models on six out of eight datasets while ranking second and third on the remaining two datasets.
Moreover, our method is simple to implement, requires very little additional cost for compression, only needs to fine-tune a truncated BERT with fewer Transformers, and thus enjoys much shorter training times compared with the full BERT. This is impossible for other conventional compression methods, including distillation and pruning, because they need to transfer knowledge from teacher to student or learn the pruning weights/masks in addition to fine-tuning. This generally leads to significant increases in the time and space consumption of the training process.
Currently, we concentrate only on BTC tasks. Although these tasks support a long list of practical applications, many real-world problems need to be formulated as more complicated tasks, such as multi-class classification and sequence labeling. As the separability indices also work for multi-class classification, we believe the work presented in this study could provide a basis for future studies on compression methods for more complex (downstream) tasks. In addition, we note in the experiments that distillation methods often use only the learned students, e.g., a biLSTM-based binary classifier, for inference, and hence have shorter inference times than other compression methods, including ours. This merit could be exploited in the future through a combined paradigm, in which our method is applied first to find a lighter and more accurate teacher for distillation.

6. Conclusions and Future Work

In this study, we conduct a series of experiments and some theoretical analysis to explore whether there exist winning tickets when transferring language models to downstream BTC tasks and how to find them conveniently and effectively before fine-tuning. Unlike conventional model compression methods, we truncate BERT by dropping several consecutive top Transformer layers at once. In addition, a simple way to locate an appropriate truncation position is provided, which uses class separability as a proxy and manages to balance the compression ratio against the accuracy decline appropriately. To validate the proposed method, we conduct extensive experiments on eight BTC datasets covering two languages (Chinese and English), different input formats (single text and text pair) and various tasks (sentiment analysis, acceptability judgement, textual entailment, semantic similarity analysis and natural language inference). The experimental results reveal that, using our compression method, the resources consumed in inference, as well as in fine-tuning, can be significantly reduced without losing much accuracy or adding extra computational burden or memory footprint. Our method outperforms some strong baselines, including the full BERT, distillation methods and structured pruning, in terms of classification accuracy, parameter quantity and training time. In addition, the classification accuracy achieved by our method is even higher than that of the full BERT on most of the datasets, whereas the well-known conventional compression methods cannot reach the accuracy level of the full language model. We believe this not only verifies the effectiveness and superiority of our method, but also demonstrates that BERT is over-parameterized for certain downstream tasks and should be adaptively shrunk.
In future studies, we aim to extend these experiments to other, more complex NLP tasks, e.g., multi-class classification and sequence labeling, and to other language models, e.g., BERTlarge, and to explore more effective ways of finding winning tickets in language models. In addition, we intend to find indicators that correlate more closely with classification accuracy than the CSM index used in this article, and hence further improve the performance of the compressed models on some difficult BTC datasets, such as AFQMC and MRPC.

Author Contributions

Conceptualization, X.Z.; formal analysis, X.Z. and J.F.; investigation, J.F.; methodology, X.Z. and J.F.; validation, X.Z., J.F. and M.H.; visualization, X.Z., J.F. and M.H.; writing—original draft, X.Z. and J.F.; writing—review and editing, X.Z., J.F. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China under Grant No. 62102431 and the Research Plan Project of National University of Defense Technology under Grant No. ZK21-32.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available in [22,48,49,56,57,58,59] and the access to them has been described in Section 3.2 and Section 5.1.

Acknowledgments

The authors would like to thank Anders Søgaard for his suggestions and help during preparation of the preliminary version of this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, B. Synthesis Lectures on Human Language Technologies. In Sentiment Analysis and Opinion Mining; Morgan & Claypool Publishers: San Rafael, CA, USA, 2012; Volume 5, p. 167. [Google Scholar]
  2. Lan, W.; Xu, W. Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering. In Proceedings of the International Conference on Computational Linguistics (COLING), Santa Fe, NM, USA, 20–26 August 2018; ACL: Santa Fe, NM, USA, 2018. [Google Scholar]
  3. Jindal, N.; Liu, B. Review Spam Detection. In Proceedings of the International Conference on World Wide Web (WWW), Banff, AB, Canada, 8–12 May 2007; ACM: Banff, AB, Canada, 2007. [Google Scholar]
  4. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; ACL: Minneapolis, MN, USA, 2019. [Google Scholar]
  5. Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; ACL: Florence, Italy, 2019. [Google Scholar]
  6. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [Google Scholar] [CrossRef]
  7. Ahia, O.; Kreutzer, J.; Hooker, S. The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2021; ACL: Punta Cana, Dominican Republic, 2021. [Google Scholar]
  8. Chen, D.; Li, Y.; Qiu, M.; Wang, Z.; Li, B.; Ding, B.; Deng, H.; Huang, J.; Lin, W.; Zhou, J. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  9. Chia, Y.K.; Witteveen, S.; Andrews, M. Transformer to CNN: Label-Scarce Distillation for Efficient Text Classification. In Proceedings of the NIPS 2018 Workshop CDNNRIA, Montreal, QC, Canada, 12 December 2018. [Google Scholar]
  10. Fan, A.; Grave, E.; Joulin, A. Reducing Transformer Depth on Demand with Structured Dropout. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  11. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 8–12 November 2020. [Google Scholar]
  12. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing, Vancouver, BC, Canada, 13 December 2019. [Google Scholar]
  13. McCarley, J.S.; Chakravarti, R.; Sil, A. Structured Pruning of a BERT-based Question Answering Model. arXiv 2019, arXiv:1910.06360. [Google Scholar]
  14. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 1 August 2019. [Google Scholar]
  15. Tenney, I.; Das, D.; Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019. [Google Scholar]
  16. Boo, Y.; Sung, W. Fixed-point optimization of transformer neural network. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  17. Ganesh, P.; Chen, Y.; Lou, X.; Khan, M.A.; Yang, Y.; Sajjad, H.; Nakov, P.; Chen, D.; Winslett, M. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Trans. Assoc. Comput. Linguist. 2021, 9, 1061–1080. [Google Scholar] [CrossRef]
  18. Sanh, V.; Wolf, T.; Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Adv. Neural Inf. Process. Syst. 2020, 33, 20378–20389. [Google Scholar]
  19. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  20. Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang, Z.; Carbin, M. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  21. Prasanna, S.; Rogers, A.; Rumshisky, A. When BERT Plays the Lottery, All Tickets Are Winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
  22. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018. [Google Scholar]
  23. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conf. on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA, 1–5 November 2016. [Google Scholar]
  24. Zellers, R.; Bisk, Y.; Schwartz, R.; Choi, Y. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In Proceedings of the EMNLP, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  25. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  26. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  27. Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  28. Wang, X.; Gao, T.; Zhu, Z.; Zhang, Z.; Liu, Z.; Li, J.; Tang, J. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Trans. Assoc. Comput. Linguist. 2021, 9, 176–194. [Google Scholar] [CrossRef]
Figure 1. The general procedures of conventional task-specific compression (upper box) and the adaptive truncation compression proposed in this article (lower box). Conventional methods commonly conduct compression after fine-tuning, whereas ours conducts compression prior to fine-tuning. Moreover, knowledge distillation requires an additional training process in which the teacher teaches the student to make predictions, and pruning methods need to evaluate the importance of each model unit or learned weight to determine which should be masked and then perform another round of fine-tuning; our method demands no additional training.
Figure 2. An illustration of typical BERT-based text classification models. (a) BERT-based model for single text classification, which takes a single sequence of tokens as input. (b) BERT-based model for text-pair classification, which converts two input sequences of tokens into one by inserting a separator token [SEP] between them. Experiments presented in this article adopt one of these two models according to the input format.
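To make the two input formats in Figure 2 concrete, the snippet below shows how a single text and a text pair are typically converted into one BERT input sequence. It is a minimal sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; the article does not prescribe a particular toolkit, and the example sentences are ours.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single-text input (Figure 2a): [CLS] tokens [SEP]
single = tokenizer("the movie was surprisingly good")

# Text-pair input (Figure 2b): the two sequences are joined into one,
# [CLS] tokens_a [SEP] tokens_b [SEP], with segment ids telling them apart
pair = tokenizer("a man is playing a guitar", "someone is making music")

print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
```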
Figure 3. The main stages involved in building a model using truncated BERT for a BTC task. The left is the full BERT model, the middle is the truncated BERT after dropping the top-k Transformer layers of full BERT, and the right is a binary classification model consisting of an encoder (the truncated BERT) and a task-specific head (a fully connected layer followed by a sigmoid layer).
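The stages in Figure 3 translate into only a few lines of code. The sketch below, which assumes PyTorch and the HuggingFace transformers library, drops the top-k Transformer layers of a pre-trained BERT and attaches a task-specific head (a fully connected layer followed by a sigmoid). The helper names truncate_bert and TruncatedBertClassifier are ours and are not taken from the authors' implementation.

```python
import torch
from torch import nn
from transformers import BertModel

def truncate_bert(model_name: str = "bert-base-uncased", keep_layers: int = 8) -> BertModel:
    """Load a pre-trained BERT and keep only the bottom `keep_layers` Transformer layers."""
    bert = BertModel.from_pretrained(model_name)
    bert.encoder.layer = nn.ModuleList(bert.encoder.layer[:keep_layers])  # drop the top-k layers
    bert.config.num_hidden_layers = keep_layers
    return bert

class TruncatedBertClassifier(nn.Module):
    """Truncated BERT encoder plus a binary classification head, as in Figure 3 (right)."""
    def __init__(self, encoder: BertModel):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.encoder(input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]      # [CLS] representation
        return torch.sigmoid(self.head(cls))   # probability of the positive class

model = TruncatedBertClassifier(truncate_bert(keep_layers=8))
```

Because the truncated encoder is built before fine-tuning, no teacher training or pruning pass is required; a model constructed this way is fine-tuned directly on the downstream data.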
Figure 4. The test accuracy (black) vs. CSM (red), SI (blue) and HM (green) curves on 3 BTC datasets. We juxtapose the separability index and classification accuracy for all 12 Transformer layers. The horizontal axis is the layer index of BERT, the left vertical axis is the classification accuracy and the right vertical axis is the separability index. Each row corresponds to a different dataset, and each column corresponds to a different separability index. Best viewed in color.
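Figure 4 relies on layer-wise separability indices (CSM, SI and HM). As one concrete illustration, the sketch below computes a simple nearest-neighbour separability score, in the spirit of Thornton's separability index, on the [CLS] vectors extracted from a given layer. This is an illustrative formulation under our own assumptions and is not necessarily the exact definition used in the experiments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_separability(features: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of instances whose nearest neighbour (excluding themselves) shares their label."""
    knn = NearestNeighbors(n_neighbors=2).fit(features)
    _, idx = knn.kneighbors(features)              # idx[:, 0] is each point itself
    return float(np.mean(labels[idx[:, 1]] == labels))

# `features` would hold the layer-l [CLS] embeddings of the training set and `labels`
# the binary class labels; computing the score once per layer yields a curve
# comparable to those plotted in Figure 4.
```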
Figure 5. The mean accuracy loss ratio (the left vertical axis) and the number of dropped Transformer layers (the right vertical axis) vary when ρ changes from 0 to 1 with a stride of 0.05. The horizontal axis is the value of ρ. The vertical dashed line corresponds to the upper bound of ρ that can bring good trade-offs between accuracy loss and compression ratio. The gray interval corresponds to ρ ∈ [0.20, 0.25], which can achieve the largest compression ratio without notable accuracy loss. Best viewed in color.
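Figures 5 and 6 sweep the threshold ρ from 0 to 1 in steps of 0.05 and report how many layers are kept or dropped at each value. Purely for illustration, the sketch below implements one plausible mapping from ρ to a truncation depth (keep the shallowest layer whose separability score is within a fraction ρ of the best layer's score). This rule is our assumption and should not be read as the authors' published criterion.

```python
import numpy as np

def layers_to_keep(layer_scores, rho: float) -> int:
    """Hypothetical selection rule: keep the shallowest layer whose separability
    score is within a fraction `rho` of the best layer's score.
    NOTE: an assumption for illustration, not the authors' criterion."""
    scores = np.asarray(layer_scores, dtype=float)   # one score per layer, bottom to top
    threshold = (1.0 - rho) * scores.max()
    return int(np.argmax(scores >= threshold)) + 1   # number of bottom layers to preserve

# Sweeping rho as in Figures 5 and 6:
# for rho in np.arange(0.0, 1.05, 0.05):
#     print(rho, layers_to_keep(scores, rho))
```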
Figure 6. The test accuracy (left vertical axis) and number of Transformer layers preserved in the truncated model (right vertical axis) vary when ρ changes from 0 to 1 with a stride of 0.05. The horizontal axis is the value of ρ. Each subplot corresponds to a different dataset. The vertical dashed line in each subplot corresponds to ρ = 0.20. Best viewed in color.
Table 1. Test accuracy (%) of M1, …, M12 on three BTC datasets. Each row corresponds to a classifier whose encoder is a truncated BERT with a different number of Transformer layers. Columns 2–4 show the classification accuracies these classifiers achieve on the three datasets. Bold denotes the highest accuracy.
Classifiers   ChnSentiCorpHou   CoLA    RTE
M1            88.08             69.07   52.00
M2            91.08             66.22   52.29
M3            91.52             69.64   56.38
M4            92.07             72.87   63.06
M5            92.65             75.52   64.73
M6            92.45             77.80   63.57
M7            93.05             77.99   64.29
M8            92.07             80.83   66.03
M9            93.28             84.25   63.68
M10           92.02             85.96   63.71
M11           93.02             83.49   64.94
M12           93.30             83.87   65.81
Table 2. The Pearson’s correlation coefficients between test accuracy and each of the three separability indices.
Table 2. The Pearson’s correlation coefficients between test accuracy and each of the three separability indices.
Corpus            CSM      SI       HM
ChnSentiCorpHou   0.7942   0.8204   0.4208
CoLA              0.9655   0.5444   0.8269
RTE               0.8720   0.1090   −0.2889
Average           0.8772   0.4913   0.3196
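Table 2 reports Pearson's correlation between the layer-wise test accuracies of Table 1 and the corresponding separability indices. The computation itself is a one-liner; the sketch below uses the ChnSentiCorpHou accuracies from Table 1 together with a placeholder array standing in for the per-layer CSM values, which are not tabulated in this article and would need to be substituted with the actual scores.

```python
import numpy as np
from scipy.stats import pearsonr

# Layer-wise test accuracies on ChnSentiCorpHou (Table 1, classifiers M1..M12).
accuracy = np.array([88.08, 91.08, 91.52, 92.07, 92.65, 92.45,
                     93.05, 92.07, 93.28, 92.02, 93.02, 93.30])

# Placeholder: replace with the actual per-layer CSM scores for this corpus.
csm = np.random.rand(12)

r, _ = pearsonr(accuracy, csm)
print(f"Pearson correlation between accuracy and CSM: {r:.4f}")
```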
Table 3. Statistics of the eight datasets used in this article. Pos. Rate indicates the ratio of positive instances in the corresponding dataset.
Corpus            Samples   Language   Input         Task                                  Pos. Rate
ChnSentiCorpHou   6 k       Chinese    Single text   Sentiment classification              0.50
WaiMai            12 k      Chinese    Single text   Sentiment classification              0.33
SST-2             67.8 k    English    Single text   Sentiment classification              0.50
CoLA              9 k       English    Single text   Grammatical acceptability judgement   0.70
Nikon-JD          2.6 k     Chinese    Text pair     Semantic similarity                   0.78
AFQMC             38.6 k    Chinese    Text pair     Semantic similarity                   0.31
MRPC              5.8 k     English    Text pair     Semantic similarity                   0.67
RTE               2.7 k     English    Text pair     Textual entailment                    0.50
Table 4. Parameter settings in our implementations.
Corpus            Batch Size   Learning Rate   Max Seq. Length   Device
ChnSentiCorpHou   32           1 × 10⁻⁵        512               RTX3090 × 2
WaiMai            32           1 × 10⁻⁵        256               RTX3090 × 2
SST-2             32           2 × 10⁻⁵        128               V100 × 1
CoLA              32           2 × 10⁻⁵        128               RTX3090 × 2
Nikon-JD          32           1 × 10⁻⁵        256               RTX3090 × 2
AFQMC             32           5 × 10⁻⁶        256               RTX3090 × 2
MRPC              32           1 × 10⁻⁵        128               RTX3090 × 2
RTE               32           1 × 10⁻⁵        256               V100 × 1
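The settings in Table 4 plug directly into a standard fine-tuning loop. The fragment below shows the wiring for the ChnSentiCorpHou row (batch size 32, learning rate 1 × 10⁻⁵, maximum sequence length 512). It uses a stock BertForSequenceClassification purely to illustrate the hyperparameters; the checkpoint name and the toy batch are our assumptions, not part of the original setup.

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

# Settings from the ChnSentiCorpHou row of Table 4.
BATCH_SIZE, LEARNING_RATE, MAX_SEQ_LENGTH = 32, 1e-5, 512

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

# One illustrative training step on a toy batch of identical placeholder reviews.
texts = ["placeholder hotel review"] * BATCH_SIZE
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=MAX_SEQ_LENGTH, return_tensors="pt")
labels = torch.ones(BATCH_SIZE, dtype=torch.long)

loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```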
Table 5. Test accuracy (%) on the 8 tasks in this article. Bolded values denote the highest among all models (i.e., including full BERT), and underlined values denote the best among compressed models.
Model                ChnSentiCorpHou   WaiMai   SST-2   CoLA    Nikon-JD   AFQMC   MRPC    RTE
Full BERT            93.30             90.12    92.26   83.87   97.01      74.12   83.59   65.81
Distil-Pretrain      91.17             89.82    91.71   81.40   90.72      51.88   82.09   59.56
Distil-task-biLSTM   87.50             89.57    87.10   68.31   89.32      64.18   64.17   50.88
Distil-task-kimCNN   86.48             88.45    85.23   70.59   91.29      63.51   67.59   52.08
Pruning-AGP          92.65             87.11    91.32   82.73   95.23      69.95   77.33   57.50
Truncated-CSM        92.65             90.78    93.03   84.25   97.31      68.95   76.52   66.03
Table 6. Number of parameters that need to be stored and updated during training/fine-tuning for the 8 BTC tasks in this study. All values are given in millions of parameters. Bold denotes the smallest values. The parameter count of Distil-Pretrain is not available and is therefore not shown in the table. For the distillation-based models, the count includes the parameters of both the teacher and the student.
Model                ChnSentiCorpHou   WaiMai    SST-2     CoLA      Nikon-JD   AFQMC     MRPC      RTE
Full BERT            102               102       109       109       102        102       109       109
Distil-Pretrain      -                 -         -         -         -          -         -         -
Distil-task-biLSTM   102 + 9           102 + 9   109 + 9   109 + 9   102 + 9    102 + 9   109 + 9   109 + 9
Distil-task-kimCNN   102 + 7           102 + 7   109 + 7   109 + 7   102 + 7    102 + 7   109 + 7   109 + 7
Pruning-AGP          102               102       109       109       102        102       109       109
Truncated-CSM        53                53        95        88        95         53        60        81
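The full-BERT counts in Table 6 can be reproduced approximately by summing the sizes of all parameter tensors. A short sketch, assuming PyTorch and the standard HuggingFace checkpoints (the checkpoint names are our assumption):

```python
from transformers import BertModel

def params_in_millions(model) -> float:
    """Total number of parameters, in millions, matching the units of Table 6."""
    return sum(p.numel() for p in model.parameters()) / 1e6

full_en = BertModel.from_pretrained("bert-base-uncased")    # English tasks
full_zh = BertModel.from_pretrained("bert-base-chinese")    # Chinese tasks
print(f"English BERT: {params_in_millions(full_en):.0f}M")  # roughly 109M
print(f"Chinese BERT: {params_in_millions(full_zh):.0f}M")  # roughly 102M
```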
Table 7. Training time (s) for the 8 BTC tasks in this study. Bold denotes the shortest times. For the distillation-based methods, the time includes the fine-tuning time of the teacher model plus the distillation time of the student model. For the pruning method, it includes fine-tuning time and weight-pruning time, whereas for the truncation-based method, it includes the time to compute the separability measure and the fine-tuning time. Note that the total training time of Distil-Pretrain is not available and hence is not given here.
Model                ChnSentiCorpHou   WaiMai      SST-2         CoLA         Nikon-JD    AFQMC         MRPC         RTE
Full BERT            347               439         3918          334          104         1617          159          177
Distil-Pretrain      -                 -           -             -            -           -             -            -
Distil-task-biLSTM   347 + 841         439 + 613   3918 + 5493   334 + 761    104 + 248   1617 + 3933   159 + 371    177 + 284
Distil-task-kimCNN   347 + 738         439 + 570   3918 + 5052   334 + 1312   104 + 455   1617 + 3836   159 + 1113   177 + 261
Pruning-AGP          10,516            5175        21,389        2326         1970        16,643        2467         2004
Truncated-CSM        158               202         3286          268          96          771           83           122
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
