{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to the Transformers Library for NLP: pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's install the required libraries to make sure everything works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install datasets evaluate transformers[sentencepiece]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformers are everywhere!\n", "\n", "Transformer models are used to solve all kinds of NLP tasks, like the ones mentioned in the previous section. Here are some of the companies and organizations using Hugging Face and Transformer models, who also contribute back to the community by sharing their models:\n", "\n", "![tr](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/companies.PNG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [🤗 Transformers library](https://github.com/huggingface/transformers) provides the functionality to create and use those shared models. The [Model Hub](https://huggingface.co/models) contains thousands of pretrained models that anyone can download and use. You can also upload your own models to the Hub!\n", "\n", "\n", "⚠️ The Hugging Face Hub is not limited to Transformer models. Anyone can share any kind of model or dataset they want! Create a huggingface.co account to benefit from all available features!\n", "\n", "\n", "Before diving into how Transformer models work under the hood, let's look at a few examples of how they can be used to solve some interesting NLP problems.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with pipelines\n", "\n", "\n", "The most basic object in the 🤗 Transformers library is the `pipeline()` function. 
It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n" ] }, { "data": { "text/plain": [ "[{'label': 'NEGATIVE', 'score': 0.9015460014343262}]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "classifier = pipeline(\"sentiment-analysis\")\n", "\n", "text = \"\"\"Dear Amazon, last week I ordered an Optimus Prime action figure \\\n", "from your online store in Germany. Unfortunately, when I opened the package, \\\n", "I discovered to my horror that I had been sent an action figure of Megatron \\\n", "instead! As a lifelong enemy of the Decepticons, I hope you can understand my \\\n", "dilemma. To resolve the issue, I demand an exchange of Megatron for the \\\n", "Optimus Prime figure I ordered. Enclosed are copies of my records concerning \\\n", "this purchase. I expect to hear from you soon. Sincerely, Bumblebee.\"\"\"\n", "\n", "classifier(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even pass several sentences!" 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'label': 'NEGATIVE', 'score': 0.9994558691978455},\n", " {'label': 'NEGATIVE', 'score': 0.9992402791976929},\n", " {'label': 'POSITIVE', 'score': 0.9998764991760254}]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "texts = [\n", " \"I hate this so much!\",\n", " \"I'm not sure how I feel about this.\",\n", " \"I love this!\",\n", "]\n", "\n", "classifier(texts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.\n", "\n", "There are three main steps involved when you pass some text to a pipeline:\n", "\n", "1. The text is preprocessed into a format the model can understand.\n", "2. The preprocessed inputs are passed to the model.\n", "3. The predictions of the model are post-processed, so you can make sense of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines) are:\n", "\n", "* feature-extraction (get the vector representation of a text)\n", "* fill-mask\n", "* ner (named entity recognition)\n", "* question-answering\n", "* sentiment-analysis / text-classification\n", "* summarization\n", "* text-generation\n", "* translation\n", "* zero-shot-classification\n", "\n", "Let’s have a look at a few of these!" 
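] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One pipeline from the list that is not demonstrated below is zero-shot-classification, which lets you score a text against arbitrary candidate labels without fine-tuning. A minimal sketch (relying on the pipeline's default model, so exact scores will vary):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "# Zero-shot classification: rank candidate labels the model was never trained on\n", "classifier = pipeline(\"zero-shot-classification\")\n", "classifier(\n", "    \"This is a course about the Transformers library\",\n", "    candidate_labels=[\"education\", \"politics\", \"business\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is a dictionary with the candidate labels ranked by score, highest first."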
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Named Entity Recognition (NER)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n", "Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']\n", "- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "ner = pipeline(\"ner\", grouped_entities=True)\n", "prediction = ner(\"My name is Victor and I am a student at IE University in Madrid.\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
entity_groupscorewordstartend
0PER0.998917Victor1117
1ORG0.997800IE University4053
2LOC0.995458Madrid5763
\n", "
" ], "text/plain": [ " entity_group score word start end\n", "0 PER 0.998917 Victor 11 17\n", "1 ORG 0.997800 IE University 40 53\n", "2 LOC 0.995458 Madrid 57 63" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(prediction)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the model correctly identified Victor as a person (PER), IE University as an organization (ORG), and Madrid as a location (LOC)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question Answering (QA)\n", "\n", "The question-answering pipeline answers questions using information from a given context:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scorestartendanswer
00.631292335358an exchange of Megatron
\n", "
" ], "text/plain": [ " score start end answer\n", "0 0.631292 335 358 an exchange of Megatron" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reader = pipeline(\"question-answering\")\n", "\n", "text = \"\"\"Dear Amazon, last week I ordered an Optimus Prime action figure \\\n", "from your online store in Germany. Unfortunately, when I opened the package, \\\n", "I discovered to my horror that I had been sent an action figure of Megatron \\\n", "instead! As a lifelong enemy of the Decepticons, I hope you can understand my \\\n", "dilemma. To resolve the issue, I demand an exchange of Megatron for the \\\n", "Optimus Prime figure I ordered. Enclosed are copies of my records concerning \\\n", "this purchase. I expect to hear from you soon. Sincerely, Bumblebee.\"\"\"\n", "\n", "question = \"What does the customer want?\"\n", "\n", "outputs = reader(question=question, context=text)\n", "pd.DataFrame([outputs]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this pipeline works by extracting information from the provided context; it does not generate the answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summarization\n", "\n", "Summarization is the task of condensing a text into a shorter version while preserving the most important points of the original. 
Here’s an example:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "24b5da92bd7944278bb667a7a2a0ea92", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/1.80k [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
translation_text
0Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.
\n", "" ], "text/plain": [ " translation_text\n", "0 Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee." ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_colwidth', None)\n", "\n", "pd.DataFrame(outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Text generation\n", "\n", "Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2ff18063bf344133bda180ceb4f7a9c4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/665 [00:00