1. Introduction
With the emergence of deep learning algorithms (neural networks), significant work has been done in the legal domain and predictive justice. While classification tasks are important topics [1], several applications have been developed, such as information extraction [2], legal norms classification [3] or topic classification [4]. Predicting the outcome of a case from factual data is a major topic in this literature, as it can offer great benefits to practitioners and can also be useful for citizens. If a lawyer or a judge can estimate the likelihood of an outcome, it can translate into greater efficiency, better access to justice, and greater fairness, assuming this tool plays a supporting role.
The task of predicting the outcome has received much attention recently. In specific situations, high accuracies have been achieved [5], mainly using deep neural networks [6,7], which are able to significantly outperform standard and widely used algorithms like logistic regression, tree-based models or SVM [8]. Some attention has recently been paid to the European Court of Human Rights (ECtHR) for a number of reasons. One of them is the accessibility of court decisions; another has to do with the ability to create an automatic annotation process for these judgments. Solving tasks related to the legal domain usually requires experts to manually label documents, which is often expensive and time consuming, leading to scarce data. ECtHR decisions possess a specific structure which can be exploited using simple regular expressions, making them a good candidate for automatic legal analysis and forecasting. As the literature focuses on English decisions, some performance comparisons have already been made between standard classification algorithms [9,10] and deep learning approaches using state-of-the-art models [11,12], which come at a high computational cost compared to linear classifiers.
In this paper, we compare several algorithms on French ECtHR decisions on different binary tasks where the target is the outcome (violation of a given article). Another task is built as a multiclass problem and aims to find which article has potentially been violated. We voluntarily do not perform any meaningful preprocessing beyond tokenization and extraction of facts and circumstances, in order to deal with sparse inputs. Sparse inputs are addressed using a modified and efficient PLS algorithm to improve overall performance, stabilize results and ease convergence. Finally, we take advantage of a pretrained word embedding and the transformer architecture [13] to build a neural network approach.
2. Data
Our main task is to predict the violation of an article of the Convention given a set of facts as inputs. We use the published ECtHR judgments in French (see https://hudoc.echr.coe.int/) as our main data source. The choice of language has two motivations: one can pretrain a large model on French legal documents, and state-of-the-art English embeddings are not able to process long sequences without huge memory consumption.
2.1. Structure
In theory, judgments are structured with titles and paragraphs for ease of readability, since these decisions typically contain hundreds of sentences. A document can be divided into four parts. The first is called the procedure and provides general information about the procedure followed before the Court. This section is usually the smallest and only lists past results from a local court. Note that in the ECtHR, the parties are individuals against a state.
The second section is the facts section, which is the main input of our models. It provides background about the case itself and everything unrelated to legal arguments (i.e., articles from the European Convention on Human Rights, ECHR). This part is generally divided into two subsections: the circumstances of the case and the relevant laws. The first relates the factual background, which is a crucial element of a legal document. This section is formulated by the Court itself, but we consider that it provides a reasonable representation of the facts. It is also the most heterogeneous, because its size and vocabulary may vary a lot even for two similar cases. The part on relevant laws adds information on domestic laws and legal elements, with the exception of articles of the ECHR.
The third section is the law section and relies on legal arguments to consider the merits of the case. To pronounce an outcome, the Court must justify its decision using rules and principles selected according to the alleged violation of an ECHR article and the arguments provided by the parties.
Finally, the last section gives the results of the case. It enumerates all potential violations of ECHR articles and whether they actually took place. The overall structure is presented in Figure 1.
The majority of documents follow this structure, in both the French and English versions. Note that the available decisions are not always translated, resulting in differences between the datasets coming from the two languages. Related papers [9,10] which focus on English already have different inputs, because the extraction process is tricky and differs slightly between the two studies; new decisions are also periodically added to the database, which may explain differences in size.
2.2. Extraction
Extraction is simple in theory because we can easily extract headings and paragraphs. In practice, we observe a certain variability between documents which makes automatic extraction more challenging. Some titles may change, or some sections may move or disappear for no specific reason. For example, we sometimes find the procedure inside the facts section, or the facts section exists but we cannot extract the subsections which should feed our models. The second challenge is removing ambiguous outcomes. As our main task is to find out whether a given article of the Convention has been violated or not, we need to filter out the decisions which have different outcomes for the same article. Everything is done using plain regular expressions to first detect headings and paragraphs, check their validity, extract all the outcomes, and compare them with their associated articles. The overall extraction procedure follows seven steps:
1. We scrape all available French decisions.
2. Using multiple regexes, we check that the four section titles (and some variations) can be found inside the decision, to make sure everything is well located and available.
3. We check that both fact subsections, namely the circumstances and the relevant domestic laws, are well defined, also using regexes.
4. We process the outcome section (operative provisions) of the case by extracting sentences containing the words "violation" and "article".
5. We process all these sentences by listing, with a regex, all named articles that belong to the European Convention.
6. For each retrieved article, we infer the outcome (violation or non-violation). This is simple given that these sentences always have the same structure (e.g., "Holds that there has been a violation of Article 6 of the Convention." in English).
7. We remove all decisions where a given article has been both violated and not violated at the same time. This happens when a decision contains multiple claims based on the same article.
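Steps 4-7 can be sketched as follows. The exact regular expressions and sentence patterns used for extraction are not given in the paper, so the ones below are illustrative assumptions:

```python
import re

# Hypothetical patterns: the real operative provisions follow a fixed
# sentence structure, but these exact regexes are assumptions.
ARTICLE_RE = re.compile(r"[Aa]rticle\s+(\d+)")
NO_VIOLATION_RE = re.compile(r"no violation|has not been a violation", re.I)

def extract_outcomes(operative_sentences):
    """Return {article_number: 'violation' | 'non-violation'}.

    Articles that receive both labels in the same decision are
    dropped as ambiguous (step 7).
    """
    outcomes = {}
    ambiguous = set()
    for sent in operative_sentences:
        # Step 4: keep only sentences mentioning both keywords.
        if "violation" not in sent.lower() or "article" not in sent.lower():
            continue
        label = "non-violation" if NO_VIOLATION_RE.search(sent) else "violation"
        # Step 5: list all named articles in the sentence.
        for art in ARTICLE_RE.findall(sent):
            if art in outcomes and outcomes[art] != label:
                ambiguous.add(art)  # step 7: conflicting labels
            outcomes[art] = label
    return {a: l for a, l in outcomes.items() if a not in ambiguous}
```

On the example sentence from step 6, this returns a single violation of article 6; a decision holding both a violation and a non-violation of the same article yields an empty result and is discarded.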
After extraction, we only keep decisions related to six articles which are also the most frequent ones to ensure a sufficient number of observations: article 3 (torture or inhuman and degrading treatment), article 5 (right to liberty and security), article 6 (right to a fair trial), article 8 (respect for private and family life), article 10 (freedom of expression) and article 13 (access to justice).
2.3. Datasets
For each selected article, we build a dataset consisting of valid decisions and their associated binary label (violation or non-violation of the article). The results of the extraction process are summarized in Table 1.
As shown, the outcome is highly unbalanced and most of the time favors the applicant against the respondent state. It is, however, influenced by our extraction process, as we have removed ambiguous cases where both a violation and a non-violation of a given article appear at the same time. For each selected set of judgments, we extract the circumstances of the case and the entire facts section, as we only want to predict a fact-based outcome. The stack dataset is used to test a multiclass problem, trying to find, from a set of facts, which article is relevant to the case.
To measure model performance more accurately and follow an approach similar to the related papers [9,10], we balance the datasets to ensure the same number of violations and non-violations. Note that circumstances and facts can exceed a hundred sentences in a few situations. They are also closely related: the facts are actually the circumstances plus domestic legal arguments, which are very heterogeneous as the applicants come from different countries.
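The balancing step can be sketched as uniform downsampling of the majority class; the paper does not specify the exact sampling scheme, so this is an assumption:

```python
import random

def balance_binary(texts, labels, seed=0):
    """Downsample the majority class so that violations (1) and
    non-violations (0) are equally frequent.

    Minimal sketch; uniform random downsampling is an assumption.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [texts[i] for i in keep], [labels[i] for i in keep]
```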
We also ensure that the stack dataset is a single-label problem by removing cases where two or more articles are relevant; it is, however, not balanced, with each article representing 13%, 10%, 35%, 22%, 11% and 9% of the stack, respectively.
In order to solve these tasks, we compare various models and a transformer-based neural network architecture. We also test a modified and optimized partial least squares (PLS) algorithm suitable for classification tasks and sparse inputs, as we rely on it to skip any preprocessing other than tokenization.
4. Experiments and Results
We first compare different linear models using different inputs to show how sparsity can affect performance. We focus on two main tasks. The first, a binary classification task, concerns the violation of an article given a set of facts or circumstances. The second aims to find which of the Convention articles is appropriate given factual information; in this case it is a multiclass problem.
All these tasks are challenging, even more so for those with a large base vocabulary and a small sample size. For example, the dataset related to article 5 only contains 122 documents for a vocabulary greater than 40,000 words for the circumstances and 50,000 for the facts, which shows how heterogeneous the input can be. Most of the time, there is no obvious relation between documents from the same source. A few of the datasets are more manageable: the one related to article 13 has a smaller vocabulary and rare words are less common, which may improve overall performance.
For the first set of tasks, we use a simple preprocessing. We first tokenize at the word level by splitting the sequences on whitespace and punctuation, then we build a binary count matrix and a TF-IDF representation. Since sequences are very long, we limit the vocabulary size to the 25,000 most frequent words. While smaller sizes like 15 or 20 thousand words give close performance, sparsity is then less pronounced for simple classifiers. Choosing a size of 25,000 offered the best compromise between performance and sparsity to fully show how the PLS model handles this effect.
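This preprocessing can be reproduced with scikit-learn's vectorizers, which tokenize on whitespace and punctuation by default; the example documents below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for extracted circumstances; the real inputs are
# full ECtHR fact sections.
docs = [
    "Le requérant allègue une violation de l'article 6.",
    "La Cour conclut à la non-violation de l'article 6.",
]

# Binary presence/absence matrix, capped at the 25,000 most frequent words.
binary_vec = CountVectorizer(binary=True, max_features=25_000)
X_bin = binary_vec.fit_transform(docs)

# TF-IDF representation over the same vocabulary cap.
tfidf_vec = TfidfVectorizer(max_features=25_000)
X_tfidf = tfidf_vec.fit_transform(docs)
```

Both matrices are sparse, which is exactly the regime the PLS reduction is meant to handle.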
For the multiclass problem, we extend the limit to 50,000 words to further increase sparsity. Note that we only rely on unigrams and do not add any additional preprocessing step compared to the related literature [9,10]; punctuation and stop words are not removed in this experiment. We also assume that the modified PLS algorithm is able to discriminate and keep the most informative components without the need for vocabulary selection, hand-crafted features or heuristics.
We compare three simple linear models: a logistic regression, a ridge regression where the regularization parameter has been optimized, and a linear SVM. The choice of linear models is similar to related works, which rely on a linear SVM. Since the sample size is very limited most of the time, these models are less prone to overfitting than tree-based approaches or kernel-based approaches, which are expensive to compute for a large feature space.
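The three baselines can be set up with scikit-learn roughly as described; the hyperparameter grids below are assumptions, not the paper's exact settings:

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifierCV
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Ridge regularization is optimized via built-in cross-validation;
# the alpha grid is an illustrative assumption.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifierCV(alphas=[0.1, 1.0, 10.0]),
    "svm": LinearSVC(),
}

def compare(X, y, cv=10):
    """Mean cross-validated accuracy for each baseline."""
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```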
We intentionally limit the number of PLS components to 8 in order to show the efficiency of the model given thousands of features. All experiments are compared using accuracy and 10-fold cross-validation.
For information, we copy the results from the related literature [9], obtained on English documents and similar tasks using a linear SVM (Table 2).
Aside from a significant difference on article 8, which is difficult to explain for the circumstances, the results presented in the following section on French documents are similar to what has been achieved in the literature, at least for linear models. Note that English is known to be grammatically simpler, which may improve overall performance.
4.1. Binary Input
As expected, at least for the binary cases, running the models without a PLS reduction brings slightly better results overall, but at a higher computational cost. Computing components and then estimating a model is significantly faster than running the model on raw inputs for the logistic regression and the SVM. Note that we only use 8 components and already achieve very close performance. For the multiclass problem, standard algorithms fail to generalize to new inputs due to sparsity (50,000 features). Regularization does not fix the problem, as ridge regression does not bring strong performance either.
Running the models on the full facts significantly increases sequence length, which provides additional information but also adds some noise due to domestic legal arguments. In most situations, performance increases slightly, except for article 5, where standard models struggle to keep their previous accuracy.
For the multiclass problem, we do not observe any significant difference in accuracy using different inputs. As shown in Table 5, some articles are easier to predict, such as article 10, which refers to freedom of expression, and article 3, relating to torture or inhuman treatment. The reason is linked to the structure of their respective vocabularies, with predominant themes such as violence or the media making them easier to detect. On the other hand, article 5, related to the right to liberty, is generally more vague with a much broader vocabulary. Finally, one can observe that, as in the binary case, classes with many rare words tend to be more difficult to recognize.
4.2. TF-IDF Input
We now run the same experiment but replace the binary input with a TF-IDF representation. The input size is the same as before, but it provides different information as it relies on frequencies and inverse document frequencies. The results for the binary tasks are reported in Table 6 and Table 7.
The additional variance due to the nature of the input significantly affects overall performance without PLS reduction. Ridge regression has slightly higher accuracy thanks to its regularization parameter. We observe that the PLS algorithm is still consistent and even shows better scores, because the TF-IDF representation is generally more informative than a simple binary matrix. Accuracy on the multiclass problem is even worse and is very close to the frequency of the largest class of this dataset, which is related to article 6.
We observe the same behavior as before: PLS reduction brings efficiency and consistency. Computing components and adding a linear classifier on top provides very similar performance regardless of the model, a consequence of the low number of components. Note that increasing h did not provide significant gains on average. The choice should be based on the dataset, but we keep h the same to make fair comparisons.
On the multiclass task (Table 8), the same behavior as in Table 5 is generally observed, with slightly better performance with facts as inputs; this was not the case before.
4.3. Neural Approach
The neural network approach is different as it takes advantage of the temporal aspect thanks to the attention mechanism. The choice of attention is indeed natural considering the current state of the NLP literature, which relies heavily on transformer-based models. Training an RNN as an alternative is not appropriate here because convergence is extremely slow when sequences are very long. Another approach would have been to only use feedforward layers, but it requires some averaging to remove the sequential aspect. Using a mean embedding over thousands of tokens provides a very noisy representation, as all words are equally weighted in this case.
We estimate our models by mixing two different inputs. The first comes from a pretrained FastText embedding (128 dimensions) trained on 10 GB of French legal documents (2 billion tokens). The second is either a binary input or a TF-IDF input. We use the attention mechanism on the embedding inputs, while we simply rely on feedforward layers for the other one. We stack everything up before making our prediction. Using this combination generally improves speed of convergence, as one part of the network focuses on word discrimination and acts like a simple linear model, while the second part is oriented toward disambiguation and the handling of rare words. We rely on cross-entropy as the loss function, stack 3 layers of transformers and use Adam [25] as the optimizer with a fixed learning rate and weight decay.
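The two-branch architecture can be sketched in PyTorch as follows. Layer sizes other than the 128-dimensional embedding and the 3 transformer layers are illustrative assumptions, and the embedding here is randomly initialized rather than loaded from the pretrained FastText model:

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Sketch: a 3-layer transformer encoder over 128-d word embeddings,
    plus a feedforward branch over the binary / TF-IDF vector, with both
    representations stacked before the prediction head."""

    def __init__(self, vocab_size, sparse_dim, n_classes, d_model=128):
        super().__init__()
        # Stand-in for the FastText embedding pretrained on French legal text.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # Feedforward branch for the binary or TF-IDF input.
        self.sparse_branch = nn.Sequential(nn.Linear(sparse_dim, 64), nn.ReLU())
        self.head = nn.Linear(d_model + 64, n_classes)

    def forward(self, token_ids, sparse_vec):
        h = self.encoder(self.embed(token_ids)).mean(dim=1)  # pool over time
        s = self.sparse_branch(sparse_vec)
        return self.head(torch.cat([h, s], dim=-1))  # logits for CE loss
```

Training would pair the returned logits with `nn.CrossEntropyLoss` and `torch.optim.Adam` (the paper's learning rate and weight decay values are not reproduced here).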
We also train similar models with a dimensionality reduction provided by the PLS algorithm. All results are reported in Table 9 and Table 10.
The use of neural networks greatly improves performance, but the computational cost is also much higher. Applying a reduction does hurt performance, with some loss of accuracy on average, but the model is also easier and much faster to train, which can be a good trade-off in some situations.
We observe a similar behavior on facts. In either case, the binary or TF-IDF input has little effect on performance, due to the embedding, which may already incorporate similar information.
In the multiclass task, one can observe huge performance gains over simple linear models, with more than 10 points difference on average for each individual accuracy (Table 11). The most difficult classes to predict remain the same as those observed without the neural network and the attention mechanism.
4.4. General Observations
Overall, article 13 is the easiest to predict while article 8 is the most difficult. For attention-based approaches, article 5 seems easier to predict than with simple linear models, as we observe a significant performance gain of 10 to 15 points. Looking at the overall vocabulary size, datasets with many rare words tend to be harder to predict for simpler models, as they cannot properly handle words that appear only once. On the other hand, word embeddings are able to infer these words because they are pretrained on a large corpus and may have seen them during pretraining. This advantage is also visible for the multiclass task, where the overall vocabulary is very large (more than 100,000 words) with a majority of rare words.
The natural ability of neural networks to disambiguate expressions also helps a lot when simpler models cannot properly handle sequences. However, while attention-based approaches are in theory able to provide some explainability, exploiting attention weights makes little sense for very long sequences (thousands of words). Most of the weights are actually very small, because they are positive and sum to one; they also provide very noisy outputs which cannot be properly exploited by practitioners.