{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transfer Learning example: Legal Text Classification"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from transformers import pipeline\n",
"from datasets import load_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get the dataset, which is a legal text benchmark containing several tasks at [nguha/legalbench](https://huggingface.co/datasets/nguha/legalbench)\n",
"\n",
"For the moment, we just focus on a binary classification task: given a legal text sentence, classify whether it is a (legal) definition or not."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/victorgallego/miniforge3/lib/python3.9/site-packages/datasets/load.py:1461: FutureWarning: The repository for nguha/legalbench contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/nguha/legalbench\n",
"You can avoid this message in future by passing the argument `trust_remote_code=True`.\n",
"Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.\n",
" warnings.warn(\n"
]
}
],
"source": [
"dataset = load_dataset(\"nguha/legalbench\", \"definition_classification\")['test'].to_pandas()\n",
"dataset = dataset[dataset.answer != \"other\"] # filter other class to make it binary"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
answer
\n",
"
index
\n",
"
text
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Yes
\n",
"
0
\n",
"
The respondents do not dispute that the trustee meets the usual definition of the word “assignee” in both ordinary and legal usage. See Webster's Third New International Dictionary 132 (1986) (defining an “assignee” as “one to whom a right or property is legally transferred”); Black's Law Dictionary 118-119 (6th ed. 1990) (defining an “assignee” as “[a] person to whom an assignment is made” and an “assignment” as “[t]he act of transferring to another all or part of one's property, interest, or rights”); cf. 26 CFR § 301.6036-1(a)(3) (1991) (defining an “assignee for the benefit of ... creditors” as any person who takes possession of and liquidates property of a debtor for distribution to creditors).
\n",
"
\n",
"
\n",
"
1
\n",
"
Yes
\n",
"
1
\n",
"
The term 'exemption' is ordinarily used to denote relief from a duty or service.\"
\n",
"
\n",
"
\n",
"
2
\n",
"
Yes
\n",
"
2
\n",
"
A prisoner's voluntary decision to deliver property for transfer to another facility, for example, bears a greater similarity to a “bailment”—the delivery of personal property after being held by the prison in trust, see American Heritage Dictionary, supra, at 134—than to a “detention.”
\n",
"
\n",
"
\n",
"
3
\n",
"
Yes
\n",
"
3
\n",
"
Publishing by outcry, in the market-place and streets of towns, as suggested by Chitty, has, we apprehend, fallen into disuse in England. It is certainly unknown in this country. While it is said the proclamation always appears in the gazette, he does not say that it cannot become operative until promulgated in that way.
\n",
"
\n",
"
\n",
"
4
\n",
"
Yes
\n",
"
4
\n",
"
In Bouvier's Law Dictionary, (1 Bouv. Law Dict. p. 581,) ‘hearing’ is thus defined: ‘The examination of a prisoner charged with a crime or misdemeanor, and of the witnesses for the accuser.’ In 9 Amer. & Eng. Enc. Law, p. 324, it is said to be ‘the preliminary examination of a prisoner charged with a crime, and of witnesses for the prosecution and defense.’ See, also, Whart. Crim. Pl. & Pr. § 70.
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" answer index \\\n",
"0 Yes 0 \n",
"1 Yes 1 \n",
"2 Yes 2 \n",
"3 Yes 3 \n",
"4 Yes 4 \n",
"\n",
" text \n",
"0 The respondents do not dispute that the trustee meets the usual definition of the word “assignee” in both ordinary and legal usage. See Webster's Third New International Dictionary 132 (1986) (defining an “assignee” as “one to whom a right or property is legally transferred”); Black's Law Dictionary 118-119 (6th ed. 1990) (defining an “assignee” as “[a] person to whom an assignment is made” and an “assignment” as “[t]he act of transferring to another all or part of one's property, interest, or rights”); cf. 26 CFR § 301.6036-1(a)(3) (1991) (defining an “assignee for the benefit of ... creditors” as any person who takes possession of and liquidates property of a debtor for distribution to creditors). \n",
"1 The term 'exemption' is ordinarily used to denote relief from a duty or service.\" \n",
"2 A prisoner's voluntary decision to deliver property for transfer to another facility, for example, bears a greater similarity to a “bailment”—the delivery of personal property after being held by the prison in trust, see American Heritage Dictionary, supra, at 134—than to a “detention.” \n",
"3 Publishing by outcry, in the market-place and streets of towns, as suggested by Chitty, has, we apprehend, fallen into disuse in England. It is certainly unknown in this country. While it is said the proclamation always appears in the gazette, he does not say that it cannot become operative until promulgated in that way. \n",
"4 In Bouvier's Law Dictionary, (1 Bouv. Law Dict. p. 581,) ‘hearing’ is thus defined: ‘The examination of a prisoner charged with a crime or misdemeanor, and of the witnesses for the accuser.’ In 9 Amer. & Eng. Enc. Law, p. 324, it is said to be ‘the preliminary examination of a prisoner charged with a crime, and of witnesses for the prosecution and defense.’ See, also, Whart. Crim. Pl. & Pr. § 70. "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# full column width\n",
"pd.set_option('display.max_colwidth', None)\n",
"\n",
"dataset.head()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
answer
\n",
"
index
\n",
"
text
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1332
\n",
"
No
\n",
"
1332
\n",
"
Unless the Court is ashamed of its new rule, it is inexplicable that the Court seeks to limit its damage by hoping that defense counsel will be derelict in their duty to insist that the prosecution prove its case.
\n",
"
\n",
"
\n",
"
1333
\n",
"
No
\n",
"
1333
\n",
"
Hearings on H. R. 7902 before the House Committee on Indian Affairs, 73d Cong., 2d Sess., 17 (1934); see also D. Otis, The Dawes Act and the Allotment of Indian Lands 124-155 (Prucha ed. 1973) (discussing results of the allotments by 1900).
\n",
"
\n",
"
\n",
"
1334
\n",
"
No
\n",
"
1334
\n",
"
No Member of the Court suggested that the absence of a pending criminal proceeding made the Self-Incrimination Clause inquiry irrelevant.
\n",
"
\n",
"
\n",
"
1335
\n",
"
No
\n",
"
1335
\n",
"
Due process requires notice “reasonably calculated, under all the circumstances, to apprise interested parties of the pendency of the action and afford them an opportunity to present their objections.”
\n",
"
\n",
"
\n",
"
1336
\n",
"
No
\n",
"
1336
\n",
"
In 2004, the most recent year for which data are available, drug possession and trafficking resulted in 362,850 felony convictions in state courts across the country.
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" answer index \\\n",
"1332 No 1332 \n",
"1333 No 1333 \n",
"1334 No 1334 \n",
"1335 No 1335 \n",
"1336 No 1336 \n",
"\n",
" text \n",
"1332 Unless the Court is ashamed of its new rule, it is inexplicable that the Court seeks to limit its damage by hoping that defense counsel will be derelict in their duty to insist that the prosecution prove its case. \n",
"1333 Hearings on H. R. 7902 before the House Committee on Indian Affairs, 73d Cong., 2d Sess., 17 (1934); see also D. Otis, The Dawes Act and the Allotment of Indian Lands 124-155 (Prucha ed. 1973) (discussing results of the allotments by 1900). \n",
"1334 No Member of the Court suggested that the absence of a pending criminal proceeding made the Self-Incrimination Clause inquiry irrelevant. \n",
"1335 Due process requires notice “reasonably calculated, under all the circumstances, to apprise interested parties of the pendency of the action and afford them an opportunity to present their objections.” \n",
"1336 In 2004, the most recent year for which data are available, drug possession and trafficking resulted in 362,850 felony convictions in state courts across the country. "
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.tail()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD+CAYAAADBCEVaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/P9b71AAAACXBIWXMAAAsTAAALEwEAmpwYAAAPlElEQVR4nO3df6zddX3H8edrVNDgj/LjrunazpLZaIiJ0N0gxmVxNC6Am2WLEtgcDWlytwSnRrPZ/c42l2A2ZWIMSSNqMSoyHKFzxI1UzbYsMC/KUETDlcnaBugVoU6JP9D3/jif6qHe9p7bnnsP/fB8JCfn831/P9/7fZ9QXv320+/3NFWFJKkvPzPpBiRJ42e4S1KHDHdJ6pDhLkkdMtwlqUOrJt0AwJlnnlkbN26cdBuSdEK56667vlFVUwvte1qE+8aNG5mdnZ10G5J0Qkny4JH2uSwjSR0y3CWpQ4a7JHVo0XBP8uIkdw+9vpXkLUlOT3J7kvvb+2ltfpJcm2QuyT1JNi//x5AkDVs03Kvqq1V1TlWdA/wi8ARwC7AD2FNVm4A9bRvgImBTe80A1y1D35Kko1jqsswW4GtV9SCwFdjV6ruAS9p4K3BDDdwBrE6ydhzNSpJGs9Rwvwz4WBuvqaqH2vhhYE0brwP2Dh2zr9WeIslMktkks/Pz80tsQ5J0NCOHe5KTgdcC/3D4vhp8b/CSvju4qnZW1XRVTU9NLXgPviTpGC3lyv0i4PNV9UjbfuTQckt7P9Dq+4ENQ8etbzVJ0gpZyhOql/OTJRmA3cA24Or2futQ/Y1JbgReDhwcWr45oW3c8c+TbqErX7/6NZNuQerWSOGe5FTg1cDvDpWvBm5Ksh14ELi01W8DLgbmGNxZc+XYupUkjWSkcK+q7wBnHFZ7lMHdM4fPLeCqsXQnSTomPqEqSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUNL+eIwSU9TfqndePXwpXZeuUtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nq0EjhnmR1kpuTfCXJfUlekeT0JLcnub+9n9bmJsm1SeaS3JNk8/J+BEnS4Ua9cn8P8KmqegnwMuA+YAewp6o2AXvaNsBFwKb2mgGuG2vHkqRFLRruSV4A/DJwPUBVfb+qHge2ArvatF3AJW28FbihBu4AVidZO+a+JUlHMcqV+1nAPPDBJF9I8v4kpwJrquqhNudhYE0brwP2Dh2/r9WeIslMktkks/Pz88f+CSRJP2WUcF8FbAauq6pzge/wkyUYAKqqgFrKiatqZ1VNV9X01NTUUg6VJC1ilHDfB+yrqjvb9s0Mwv6RQ8st7f1A278f2DB0/PpWkyStkEXDvaoeBvYmeXErbQG+DOwGtrXaNuDWNt4NXNHumjkfODi0fCNJWgGj/ktMvw98JMnJwAPAlQx+Y7gpyXbgQeDSNvc24GJgDniizZUkraCRwr2q7gamF9i1ZYG5BVx1fG1Jko6HT6hKUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdWikcE/y9SRfTHJ3ktlWOz3J7Unub++ntXqSXJtkLsk9STYv5weQJP20pVy5/0pVnVNV0217B7CnqjYBe9o2wEXApvaaAa4bV7OSpNEcz7LMVmBXG+8CLhmq31ADdwCrk6w9jvNIkpZo1HAv4F+T3JVkptXWVNVDbfwwsKaN1wF7h47d12pPkWQmyWyS2fn5+WNoXZJ0JKtGnPdLVbU/yc8Ctyf5yvDOqqoktZQTV9VOYCfA9PT0ko6VJB3dSFfuVbW/vR8AbgHOAx45tNzS3g+06fuBDUOHr281SdIKWTTck5ya5HmHxsCvAl8CdgPb2r
RtwK1tvBu4ot01cz5wcGj5RpK0AkZZllkD3JLk0PyPVtWnknwOuCnJduBB4NI2/zbgYmAOeAK4cuxdS5KOatFwr6oHgJctUH8U2LJAvYCrxtKdJOmY+ISqJHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUodGDvckJyX5QpJPtu2zktyZZC7Jx5Oc3OqntO25tn/jMvUuSTqCpVy5vxm4b2j7ncA1VfUi4DFge6tvBx5r9WvaPEnSChop3JOsB14DvL9tB7gAuLlN2QVc0sZb2zZt/5Y2X5K0Qka9cv974A+BH7XtM4DHq+rJtr0PWNfG64C9AG3/wTb/KZLMJJlNMjs/P39s3UuSFrRouCf5NeBAVd01zhNX1c6qmq6q6ampqXH+aEl6xls1wpxXAq9NcjHwbOD5wHuA1UlWtavz9cD+Nn8/sAHYl2QV8ALg0bF3Lkk6okWv3Kvqj6pqfVVtBC4DPl1Vvw18Bnhdm7YNuLWNd7dt2v5PV1WNtWtJ0lEdz33ubwfemmSOwZr69a1+PXBGq78V2HF8LUqSlmqUZZkfq6rPAp9t4weA8xaY813g9WPoTZJ0jHxCVZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDi4Z7kmcn+a8k/53k3iR/2epnJbkzyVySjyc5udVPadtzbf/GZf4MkqTDjHLl/j3ggqp6GXAOcGGS84F3AtdU1YuAx4Dtbf524LFWv6bNkyStoEXDvQa+3Taf1V4FXADc3Oq7gEvaeGvbpu3fkiTjaliStLiR1tyTnJTkbuAAcDvwNeDxqnqyTdkHrGvjdcBegLb/IHDGAj9zJslsktn5+fnj+hCSpKcaKdyr6odVdQ6wHjgPeMnxnriqdlbVdFVNT01NHe+PkyQNWdLdMlX1OPAZ4BXA6iSr2q71wP423g9sAGj7XwA8Oo5mJUmjGeVumakkq9v4OcCrgfsYhPzr2rRtwK1tvLtt0/Z/uqpqjD1LkhaxavEprAV2JTmJwW8GN1XVJ5N8GbgxyTuALwDXt/nXAx9OMgd8E7hsGfqWJB3FouFeVfcA5y5Qf4DB+vvh9e8Crx9Ld5KkY+ITqpLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdWjTck2xI8pkkX05yb5I3t/rpSW5Pcn97P63Vk+TaJHNJ7kmyebk/hCTpqUa5cn8SeFtVnQ2cD1yV5GxgB7CnqjYBe9o2wEXApvaaAa4be9eSpKNaNNyr6qGq+nwb/x9wH7AO2ArsatN2AZe08Vbghhq4A1idZO24G5ckHdmS1tyTbATOBe4E1lTVQ23Xw8CaNl4H7B06bF+rHf6zZpLMJpmdn59fat+SpKMYOdyTPBf4BPCWqvrW8L6qKqCWcuKq2llV01U1PTU1tZRDJUmLGCnckzyLQbB/pKr+sZUfObTc0t4PtPp+YMPQ4etbTZK0Qka5WybA9cB9VfXuoV27gW1tvA24dah+Rbtr5nzg4NDyjSRpBawaYc4rgd8Bvpjk7lb7Y+Bq4KYk24EHgUvbvtuAi4E54AngynE2LEla3KLhXlX/AeQIu7csML+Aq46zL0nScfAJVUnqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOLRruST6Q5ECSLw3VTk
9ye5L72/tprZ4k1yaZS3JPks3L2bwkaWGjXLl/CLjwsNoOYE9VbQL2tG2Ai4BN7TUDXDeeNiVJS7FouFfVvwHfPKy8FdjVxruAS4bqN9TAHcDqJGvH1KskaUTHuua+pqoeauOHgTVtvA7YOzRvX6v9lCQzSWaTzM7Pzx9jG5KkhRz3X6hWVQF1DMftrKrpqpqempo63jYkSUOONdwfObTc0t4PtPp+YMPQvPWtJklaQcca7ruBbW28Dbh1qH5Fu2vmfODg0PKNJGmFrFpsQpKPAa8CzkyyD/gL4GrgpiTbgQeBS9v024CLgTngCeDKZehZkrSIRcO9qi4/wq4tC8wt4KrjbUqSdHx8QlWSOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ8sS7kkuTPLVJHNJdizHOSRJRzb2cE9yEvA+4CLgbODyJGeP+zySpCNbjiv384C5qnqgqr4P3AhsXYbzSJKOYNUy/Mx1wN6h7X3Ayw+flGQGmGmb307y1WXo5ZnqTOAbk25iMXnnpDvQBPhrc7xeeKQdyxHuI6mqncDOSZ2/Z0lmq2p60n1Ih/PX5spZjmWZ/cCGoe31rSZJWiHLEe6fAzYlOSvJycBlwO5lOI8k6QjGvixTVU8meSPwL8BJwAeq6t5xn0dH5XKXnq78tblCUlWT7kGSNGY+oSpJHTLcJalDhrskdchw70CSX0hyShu/KsmbkqyecFuSJshw78MngB8meRGDuxE2AB+dbEvSQJL1SW5JMp/kQJJPJFk/6b56Z7j34UdV9STwG8B7q+oPgLUT7kk65IMMnnVZC/wc8E+tpmVkuPfhB0kuB7YBn2y1Z02wH2nYVFV9sKqebK8PAVOTbqp3hnsfrgReAfxNVf1PkrOAD0+4J+mQR5O8IclJ7fUG4NFJN9U7H2LqRJLnAD9fVX67pp5WkrwQeC+DC5AC/hN4U1X970Qb65zh3oEkvw78HXByVZ2V5Bzgr6rqtZPtTNKkGO4dSHIXcAHw2ao6t9W+VFUvnWxneiZL8udH2V1V9dcr1swz0MS+z11j9YOqOphkuPajSTUjNd9ZoHYqsB04AzDcl5HhfgJLchtwFXBvkt8CTkqyCXgTg3VNaWKq6l2HxkmeB7yZwV/+3wi860jHaTy8W+bE9kEGX638deClwPcYPLx0kMH/SNJEJTk9yTuAexhcTG6uqrdX1YEJt9Y919xPcEmeC/wZcCGD2x8P/Qetqnr3xBrTM16SvwV+k8FT0++rqm9PuKVnFJdlTnzfZ7C2eQrwXH4S7tKkvY3Bnyb/FPiTob8TCoOLj+dPqrFnAsP9BJbkQuDdDB7t3lxVT0y4JenHqspl3wlyWeYEluTfgd/znzGUdDjDXZI65B+bJKlDhrskdchwl6QOGe6S1CHDXZI69P+rSO2nyoPJaQAAAABJRU5ErkJggg==",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"dataset.answer.value_counts().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to load the `feature-extraction` pipeline from the `transformers` library, which is a pre-trained model that can be used to extract features from the text. We will use the `bert-base-uncased` model, which is the original BERT model."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8bbd91df184b45c1ba654e0b7e9e754a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"config.json: 0%| | 0.00/483 [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8f787413cf5d40c2b5cbf00bd06a7fe5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model.safetensors: 0%| | 0.00/268M [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "be325e8f0dae4200bed7587554a38920",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/28.0 [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3e98f15bab2a4fd4abdd8580fd3d958d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"vocab.txt: 0%| | 0.00/232k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a3f4d1d04a8242d18770463b37ce58a0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.json: 0%| | 0.00/466k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import pipeline\n",
"\n",
"feature_extractor = pipeline('feature-extraction', model=\"distilbert-base-uncased\")"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# apply the feature extractor to the dataset. Truncation is used to avoid memory issues in cases the example is too long.\n",
"features = feature_extractor(list(dataset.text), truncation=True)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# useful check\n",
"len(features) == len(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question** For each example, what are the dimensions of the extracted features?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# we can average along the token dimension to make every vector have the same dimensions\n",
"# but other aggregations could be possible (like just getting the first position).\n",
"\n",
"features_averaged_per_sentence = [np.mean(sentence_embedding, axis=1) for sentence_embedding in features]"
]
},
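{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comment above mentions that averaging is only one way to aggregate the token vectors. As a minimal sketch, the cell below compares mean pooling with keeping the first token's vector on a toy array. The array `toy_embedding`, its random values, and the sizes used (6 tokens, 768 hidden units) are assumptions made for illustration; they are not outputs of the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Toy stand-in for a single pipeline output, with shape (1, n_tokens, hidden_size).\n",
"# The values are random (an assumption for illustration); only the shapes matter.\n",
"rng = np.random.default_rng(0)\n",
"toy_embedding = rng.normal(size=(1, 6, 768))\n",
"\n",
"# Mean pooling: average over the token axis.\n",
"mean_pooled = toy_embedding.mean(axis=1)\n",
"\n",
"# First-token pooling: keep only the vector at position 0.\n",
"first_token = toy_embedding[:, 0, :]\n",
"\n",
"# Both aggregations yield one fixed-size vector per sentence.\n",
"print(mean_pooled.shape, first_token.shape)"
]
},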
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1337, 768)"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# convert to numpy array\n",
"features_averaged_per_sentence = np.array(features_averaged_per_sentence).squeeze()\n",
"\n",
"features_averaged_per_sentence.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual, now we can apply scikit-learn to split the data and train a specialized classifier on the extracted features."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(features_averaged_per_sentence, dataset.answer, test_size=0.2, random_state=1, stratify=dataset.answer)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" No 0.96 0.97 0.97 129\n",
" Yes 0.97 0.96 0.97 139\n",
"\n",
" accuracy 0.97 268\n",
" macro avg 0.97 0.97 0.97 268\n",
"weighted avg 0.97 0.97 0.97 268\n",
"\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"model = LogisticRegression(max_iter=1000)\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"from sklearn.metrics import classification_report\n",
"\n",
"y_pred = model.predict(X_test)\n",
"\n",
"print(classification_report(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 1**: What would happen if you use a different pre-trained model, such as:\n",
"\n",
" * `bert-large-uncased`\n",
"\n",
" * `distilbert-base-uncased`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 2** Compute the classification metrics using a bag-of-words approach (i.e., not based on deep learning)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n"
]
},
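{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Exercise 2, a bag-of-words baseline could be set up roughly as follows. This is a hedged sketch rather than a solution: `toy_texts` and `toy_labels` are hypothetical examples invented for illustration (they are not taken from the dataset), and `CountVectorizer` followed by `LogisticRegression` is just one possible choice of bag-of-words model. To complete the exercise, the same pipeline would still need to be fitted on a train split of the real data and evaluated with `classification_report`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Hypothetical toy corpus, invented for illustration only.\n",
"toy_texts = [\n",
"    'The term lease means a contract granting the use of property.',\n",
"    'A tort is defined as a civil wrong that causes harm.',\n",
"    'The court denied the motion on Tuesday.',\n",
"    'The defendant appealed the ruling last year.',\n",
"]\n",
"toy_labels = ['Yes', 'Yes', 'No', 'No']\n",
"\n",
"# CountVectorizer builds word-count vectors; the classifier is trained on top of them.\n",
"bow_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))\n",
"bow_model.fit(toy_texts, toy_labels)\n",
"\n",
"# Predict the label of a new, equally hypothetical sentence.\n",
"print(bow_model.predict(['The term bailment means the delivery of personal property.']))"
]
},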
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}