Using GPUs with transformers#
This notebook is available on Google Colab at this url: https://colab.research.google.com/drive/1-B5Y_x5TXJLb2qlV-b5h4Ak6m8Ioej95?usp=sharing
Before executing this notebook on Google Colab, make sure to change the runtime to a GPU in the menu Runtime > Change runtime type.
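Once the runtime has been switched, you can check that PyTorch actually sees the GPU. A minimal check (the device name you get depends on the GPU Colab assigns you):
import torch
print(torch.cuda.is_available())  # True if a CUDA GPU is attached
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. a Tesla T4 on Colab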
Example: Zero-Shot classification#
Let’s measure the inference time, first on the CPU and then on a GPU.
from transformers import pipeline
pipe = pipeline("zero-shot-classification")
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
In a Jupyter notebook, we can measure the execution time of a cell with the magic command %%timeit. It runs the cell several times and reports the mean and the standard deviation across the runs.
%%timeit
example = "I have a problem with my iphone that needs to be resolved asap!"
pipe(example, candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer", "Mac"], multi_label=False)
The `multi_class` argument has been deprecated and renamed to `multi_label`. `multi_class` will be removed in a future version of Transformers.
2.95 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On average, it took around 3 seconds per example…
To use the GPU, we need to load the pipeline again, specifying the argument device='cuda'. This will move all of the model’s parameters to the GPU’s VRAM.
pipe = pipeline("zero-shot-classification", device='cuda')
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
%%timeit
example = "I have a problem with my iphone that needs to be resolved asap!"
pipe(example, candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer", "Mac"], multi_label=False)
/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py:1157: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
warnings.warn(
218 ms ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now it only takes around 0.2 seconds per example, which is more than a 10X speed-up ⚡️
By using a GPU, inference (and training) times typically speed up by a factor of 10-20X.
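Note the warning printed above: calling the pipeline one example at a time keeps the GPU under-utilized. A minimal sketch of the batched usage it suggests (the extra example sentences and the batch_size value are made up for illustration):
examples = [
    "I have a problem with my iphone that needs to be resolved asap!",
    "My tablet screen is cracked, can you help?",
    "When does the new Mac come out?",
]
# Passing a list lets the pipeline batch the forward passes on the GPU
results = pipe(
    examples,
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer", "Mac"],
    batch_size=8,
)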
Example: Text to Speech#
When the model is not used through a pipeline, we need to move both the inputs and the model to the GPU with the to('cuda') method:
from transformers import AutoProcessor, AutoModel
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small").to('cuda')
inputs = processor(
text=["Hello, my name is Suno. And, uh — and I like pizza [laughs] But I also have other interests such as playing tic tac toe."],
return_tensors="pt",
).to('cuda')
speech_values = model.generate(**inputs, do_sample=True)
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
from IPython.display import Audio
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
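Besides playing the waveform inline, you can also write it to disk. A minimal sketch using scipy (the output filename is arbitrary):
import scipy.io.wavfile
# The generated tensor lives on the GPU, so move it to the CPU before converting to numpy
audio_array = speech_values.cpu().numpy().squeeze()
scipy.io.wavfile.write("bark_sample.wav", rate=sampling_rate, data=audio_array)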
inputs = processor(
text=["Hello, my name is Suno. And, uh — and I like pizza [laughs] But I also have other interests such as playing tic tac toe."],
return_tensors="pt",
voice_preset="v2/en_speaker_3"
).to('cuda')
speech_values = model.generate(**inputs, do_sample=True)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
from IPython.display import Audio
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
Exercise: podcast summarizer 🎙️#
Develop a podcast summarizer, which takes an mp3 file as input and generates a text summary.
As an example, you can use files/podcast_sample.mp3 from the course materials.
import torch
from transformers import pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
"automatic-speech-recognition", model="openai/whisper-small", device=device
)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
import IPython
IPython.display.Audio("./podcast_sample.mp3")
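One possible way to complete the exercise (a sketch, not the only solution): transcribe the mp3 with the Whisper pipeline defined above, then feed the transcript to a summarization pipeline. The summarization model and the length limits below are arbitrary choices for illustration:
# 1. Transcribe the podcast; chunk_length_s lets Whisper handle audio longer than 30 s
transcription = pipe("./podcast_sample.mp3", chunk_length_s=30)
text = transcription["text"]

# 2. Summarize the transcript (truncation=True guards against transcripts longer than the model's input limit)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
summary = summarizer(text, max_length=150, min_length=30, truncation=True)
print(summary[0]["summary_text"])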