Transfer Learning example: Legal Text Classification
import pandas as pd
import numpy as np
from transformers import pipeline
from datasets import load_dataset
Let's load the dataset: LegalBench (nguha/legalbench), a legal-text benchmark containing several tasks.
For now, we focus on a single binary classification task: given a sentence of legal text, classify whether or not it is a (legal) definition.
dataset = load_dataset("nguha/legalbench", "definition_classification")['test'].to_pandas()
dataset = dataset[dataset.answer != "other"] # filter other class to make it binary
# show the full column width when displaying DataFrames
pd.set_option('display.max_colwidth', None)
dataset.head()
| | answer | index | text |
|---|---|---|---|
| 0 | Yes | 0 | The respondents do not dispute that the trustee meets the usual definition of the word “assignee” in both ordinary and legal usage. See Webster's Third New International Dictionary 132 (1986) (defining an “assignee” as “one to whom a right or property is legally transferred”); Black's Law Dictionary 118-119 (6th ed. 1990) (defining an “assignee” as “[a] person to whom an assignment is made” and an “assignment” as “[t]he act of transferring to another all or part of one's property, interest, or rights”); cf. 26 CFR § 301.6036-1(a)(3) (1991) (defining an “assignee for the benefit of ... creditors” as any person who takes possession of and liquidates property of a debtor for distribution to creditors). |
| 1 | Yes | 1 | The term 'exemption' is ordinarily used to denote relief from a duty or service." |
| 2 | Yes | 2 | A prisoner's voluntary decision to deliver property for transfer to another facility, for example, bears a greater similarity to a “bailment”—the delivery of personal property after being held by the prison in trust, see American Heritage Dictionary, supra, at 134—than to a “detention.” |
| 3 | Yes | 3 | Publishing by outcry, in the market-place and streets of towns, as suggested by Chitty, has, we apprehend, fallen into disuse in England. It is certainly unknown in this country. While it is said the proclamation always appears in the gazette, he does not say that it cannot become operative until promulgated in that way. |
| 4 | Yes | 4 | In Bouvier's Law Dictionary, (1 Bouv. Law Dict. p. 581,) ‘hearing’ is thus defined: ‘The examination of a prisoner charged with a crime or misdemeanor, and of the witnesses for the accuser.’ In 9 Amer. & Eng. Enc. Law, p. 324, it is said to be ‘the preliminary examination of a prisoner charged with a crime, and of witnesses for the prosecution and defense.’ See, also, Whart. Crim. Pl. & Pr. § 70. |
dataset.tail()
| | answer | index | text |
|---|---|---|---|
| 1332 | No | 1332 | Unless the Court is ashamed of its new rule, it is inexplicable that the Court seeks to limit its damage by hoping that defense counsel will be derelict in their duty to insist that the prosecution prove its case. |
| 1333 | No | 1333 | Hearings on H. R. 7902 before the House Committee on Indian Affairs, 73d Cong., 2d Sess., 17 (1934); see also D. Otis, The Dawes Act and the Allotment of Indian Lands 124-155 (Prucha ed. 1973) (discussing results of the allotments by 1900). |
| 1334 | No | 1334 | No Member of the Court suggested that the absence of a pending criminal proceeding made the Self-Incrimination Clause inquiry irrelevant. |
| 1335 | No | 1335 | Due process requires notice “reasonably calculated, under all the circumstances, to apprise interested parties of the pendency of the action and afford them an opportunity to present their objections.” |
| 1336 | No | 1336 | In 2004, the most recent year for which data are available, drug possession and trafficking resulted in 362,850 felony convictions in state courts across the country. |
dataset.answer.value_counts().plot(kind='bar')
(Output: a bar chart of the class counts for the answer column; the "Yes" and "No" classes are roughly balanced.)
The next step is to load the feature-extraction pipeline from the transformers library, which wraps a pre-trained model that can be used to extract features (embeddings) from text. We will use the distilbert-base-uncased model, a smaller, distilled version of the original BERT model.
from transformers import pipeline
feature_extractor = pipeline('feature-extraction', model="distilbert-base-uncased")
# apply the feature extractor to the whole dataset; truncation prevents errors (and memory issues) when an example exceeds the model's maximum sequence length
features = feature_extractor(list(dataset.text), truncation=True)
# sanity check: we get one set of features per example
len(features) == len(dataset)
True
Question: For each example, what are the dimensions of the extracted features?
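One way to inspect this, reusing the features list computed above (a quick sketch; the middle dimension varies with the length of each sentence, and 768 is the hidden size of distilbert-base-uncased):
np.array(features[0]).shape  # (1, number_of_tokens_in_the_first_sentence, 768)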
# we average along the token dimension so that every sentence is represented by a fixed-size vector
# (other aggregations are possible, e.g. taking just the first token's embedding)
features_averaged_per_sentence = [np.mean(sentence_embedding, axis=1) for sentence_embedding in features]
# convert to numpy array
features_averaged_per_sentence = np.array(features_averaged_per_sentence).squeeze()
features_averaged_per_sentence.shape
(1337, 768)
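As an aside, here is a minimal sketch of the alternative aggregation mentioned in the comment above: keeping only the first token's embedding (the [CLS]-style position) instead of averaging. The variable name features_first_token is just for illustration.
# alternative pooling: keep only the first token's embedding for each sentence
features_first_token = np.array([np.array(emb)[0, 0, :] for emb in features])
features_first_token.shape  # expected to also be (1337, 768)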
As usual, we can now use scikit-learn to split the data and train a specialized classifier on top of the extracted features.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_averaged_per_sentence, dataset.answer, test_size=0.2, random_state=1, stratify=dataset.answer)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

          No       0.96      0.97      0.97       129
         Yes       0.97      0.96      0.97       139

    accuracy                           0.97       268
   macro avg       0.97      0.97      0.97       268
weighted avg       0.97      0.97      0.97       268
Exercise 1: What would happen if you used a different pre-trained model, such as the ones below? (A minimal comparison sketch follows the list.)
bert-large-uncased
bert-base-uncased
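One way to approach this is to rerun the same feature-extraction + logistic-regression recipe for each model. The sketch below reuses the imports and the dataset from the cells above; the helper name evaluate_model is just for illustration. Note that bert-large-uncased produces 1024-dimensional embeddings (instead of 768) and is considerably slower.
# hypothetical helper: rerun the same recipe for a given pre-trained model
def evaluate_model(model_name):
    extractor = pipeline('feature-extraction', model=model_name)
    feats = extractor(list(dataset.text), truncation=True)
    feats = np.array([np.mean(emb, axis=1) for emb in feats]).squeeze()
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, dataset.answer, test_size=0.2, random_state=1, stratify=dataset.answer)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(model_name)
    print(classification_report(y_te, clf.predict(X_te)))

for name in ["bert-base-uncased", "bert-large-uncased"]:
    evaluate_model(name)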
Exercise 2: Compute the classification metrics using a bag-of-words approach (i.e., one not based on deep learning).
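A minimal starting sketch, assuming a TF-IDF bag-of-words representation from scikit-learn is acceptable (a plain CountVectorizer would work similarly); the resulting metrics will differ from the ones above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# bag-of-words baseline: TF-IDF features + logistic regression directly on the raw text
X_train_text, X_test_text, y_train_bow, y_test_bow = train_test_split(
    dataset.text, dataset.answer, test_size=0.2, random_state=1, stratify=dataset.answer)
bow_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
bow_model.fit(X_train_text, y_train_bow)
print(classification_report(y_test_bow, bow_model.predict(X_test_text)))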