# Automatic Speech Recognition


We’ll take a look at how Transformers can be used to convert speech into text, a task known as speech recognition.

**Automatic Speech Recognition (ASR) is a task that involves transcribing a speech audio recording into text. This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.**

![](https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/asr_diagram.png)

The audio models we’ll cover in this course typically have a standard transformer architecture as shown above, but with a slight modification on the input or output side to allow for audio data instead of text. Since all these models are transformers at heart, they will have most of their architecture in common and the main differences are in how they are trained and used.

![](https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/transformers_blocks.png)

For audio tasks, the input and/or output sequences may be audio instead of text:

* Automatic speech recognition (ASR): The input is speech, the output is text.

* Speech synthesis (TTS): The input is text, the output is speech.

Speech recognition is a challenging task as it requires joint knowledge of audio and text. The input audio might have lots of background noise and be spoken by speakers with different accents, making it difficult to pick out the spoken speech. The written text might have characters which don’t have an acoustic sound, such as punctuation, which are difficult to infer from audio alone. These are all hurdles we have to tackle when building effective speech recognition systems!

## Pre-trained Models for ASR: Whisper

The need for large amounts of training data has been a bottleneck in the advancement of Transformer architectures for speech. Labelled speech data is difficult to come by, with the largest annotated datasets at the time clocking in at just 10,000 hours. This all changed in 2022 upon the release of Whisper. Whisper is a pre-trained model for speech recognition published [in September 2022 by the authors Alec Radford et al. from OpenAI](https://openai.com/research/whisper). Unlike its predecessors, which were pre-trained entirely on un-labelled audio data, Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise.

When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to many datasets and domains. The pre-trained checkpoints achieve results competitive with state-of-the-art ASR systems. Of particular importance are Whisper’s ability to handle long-form audio samples, its robustness to input noise, and its ability to predict cased and punctuated transcriptions. This makes it a viable candidate for real-world speech recognition systems.

The Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The checkpoints are summarised in the following table with links to the models on the Hub. “VRAM” denotes the required GPU memory to run the model with the minimum batch size of 1. “Rel Speed” is the relative speed of a checkpoint compared to the largest model. Based on this information, you can select a checkpoint that is best suited to your hardware.


| Size   | Parameters | VRAM / GB | Rel Speed | English-only                                         | Multilingual                                        |
|--------|------------|-----------|-----------|------------------------------------------------------|-----------------------------------------------------|
| tiny   | 39 M       | 1.4       | 32        | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)     |
| base   | 74 M       | 1.5       | 16        | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
| small  | 244 M      | 2.3       | 6         | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
| medium | 769 M      | 4.2       | 2         | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
| large  | 1550 M     | 7.5       | 1         | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |


Let's start with the whisper-small model!
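
If you would rather choose a checkpoint from your GPU memory budget, the table above can be turned into a small lookup. This is only a minimal sketch, not part of the Whisper or Transformers API: `CHECKPOINTS` copies the VRAM figures from the table (batch size 1), and `pick_checkpoint` is a hypothetical helper that returns the largest checkpoint whose listed VRAM fits the budget.

```python
# (size, approximate VRAM in GB) copied from the table above, smallest first
CHECKPOINTS = [
    ("tiny", 1.4),
    ("base", 1.5),
    ("small", 2.3),
    ("medium", 4.2),
    ("large-v2", 7.5),
]


def pick_checkpoint(vram_gb, english_only=False):
    """Hypothetical helper: largest checkpoint whose listed VRAM fits the budget."""
    best = None
    for size, needed in CHECKPOINTS:
        if english_only and size == "large-v2":
            continue  # the largest checkpoint is multilingual only
        if needed <= vram_gb:
            best = size
    if best is None:
        return None  # nothing in the table fits this budget
    suffix = ".en" if english_only else ""
    return f"openai/whisper-{best}{suffix}"


print(pick_checkpoint(3.0))                     # openai/whisper-small
print(pick_checkpoint(8.0, english_only=True))  # openai/whisper-medium.en
```

The returned string can be passed as the `model` argument when creating the pipeline.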

In [1]:
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-small", device=device
)



As a first example, let's load a legal conversation. Notice that there is some noise in the recording.

In [24]:
import IPython.display

# Render an inline audio player for the recording
IPython.display.Audio("files/legal_audio.mp3")

In [10]:
with open("files/legal_audio.mp3", "rb") as f:
    audio = f.read()

This does the actual transcription. Note that on a CPU it takes only about 20 seconds, whereas the audio is over 2 minutes long, so it could work for near-real-time scenarios. It could be accelerated further with a GPU (see the last notebook in the session).

In [11]:
transcription = pipe(audio)
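
The speed observation above can be expressed as a real-time factor (RTF): processing time divided by audio duration, where values below 1 mean faster than real time. The following is a minimal sketch; `real_time_factor` and `fake_transcribe` are hypothetical names introduced here, and the stand-in transcriber exists only so that the snippet runs without downloading a model. In the notebook you would pass `pipe` and the audio bytes instead.

```python
import time


def real_time_factor(transcribe, audio, audio_duration_s):
    """Hypothetical helper: run `transcribe` and return (result, RTF)."""
    start = time.perf_counter()
    result = transcribe(audio)
    elapsed = time.perf_counter() - start
    return result, elapsed / audio_duration_s


def fake_transcribe(audio):
    # Stand-in for `pipe`; returns a dict shaped like the pipeline output
    return {"text": "hello"}


result, rtf = real_time_factor(fake_transcribe, b"", audio_duration_s=120.0)
print(result["text"], rtf < 1.0)

# Rough figure from the text: about 20 s of compute for roughly 120 s of audio
approx_rtf = 20 / 120
print(f"{approx_rtf:.2f}")  # 0.17
```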

In [17]:
import pandas as pd

transcription = pd.DataFrame([transcription])


In [19]:
pd.set_option('display.max_colwidth', None)

transcription

Unnamed: 0,text
0,"What I'm trying to say is that the appeal of the supplement is a separate one. Oh you mean that communication announcement at the internal extoll would be brought into the appeal process? Right, because it can enter the appeal. I hope you understand there are issues of reputation and I have to deal with those in the appeal process rather than the commercial issues around settlement. And then there's this legal claim to defamation. I believe that the implication would be to proceed with the defamation. Yeah, I'll investigate to what we have as written evidence of what we do not have. Okay, great. I'm thankful for you to do that. Absolutely. Evidence is important, you know. In order to understand what a party question has after the program, it's processed and there is great possibility of definition to everybody's services like this. Exactly. I hope you feel that I've been listened to. Yeah, absolutely. Thank you. What I've also noted down from this is that there are a number of other people that I need to speak to, which I'll do as quickly as I possibly can. It evolved to raise to this conversation number of points which I now need to go about and test from the other perspective to try and make them a judgement about working off the evidence side. I mean strictly because it's a lot of confidential personal stuff and in fact my team is also something that's sorted out on my mind. It's like the very start to solve process. I thought it's good, you know, it's good to be dealt with very, very quickly and very, very smoothly or it could just be a traumatic long drawn up process. Well I appreciate that, the fact that you have of course your personal interest and are convinced that you have an intention to make this work if at all possible and the way it makes sense to the other party too. So I appreciate that. The truth is I went to a period if there wasn't a genuine interest in helping understand. Ok Justin, thank you for struggling so far. 
What else would you want to cover? The thing really, the biggest, not the rest of the age for you to talk about the timeline, maybe it's going to be quicker than we originally thought. We're not going to worry about any back-phone. But let me know if we've got time at some point. I think I'd go back to everyone that I've even got to understand the evidence we spoke about and how the trading thing got to serious. Written evidences would be important. However, I'll be challenging them with many other points we've made this conversation. All right, sounds cool. Okay, thank you very much. And have a good day. You too, thank you."


As a second example, let's transcribe a doctor diagnosing a patient.

In [21]:
IPython.display.Audio("files/medical_audio.mp3")

In [22]:
with open("files/medical_audio.mp3", "rb") as f:
    audio = f.read()

transcription = pipe(audio)

In [23]:
transcription = pd.DataFrame([transcription])


transcription

Unnamed: 0,text
0,"The patient is a 14-year-old white female, established patient to dermatology, last in an office on the 8th of the 10th 2004. She comes in today for re-evaluation of her acne plus she has how about she caused a rash for the past two months now on her chest, stomach, neck and back. On examination this is a flaring of her acne with two small folliculitis lesions. The patient has been taking a moxaline 500 mg PID and using Tasawak cream 0.1 and her face is doing well but she has been out of her medication now for three days also. She has been getting photofacials at healing waters and was wondering about what we could offer as far as cosmetic procedures and skincare products etc. The patient is married, she is the secretary. Family, social and allergy history, shares hay fever, eczema, sinus and hives, shares no melanoma or skin cancers or psoriasis. Her mother had oral cancer, the patient is non-smoker and no blood tests. I had some sunburn in the past, is on benzoal peroxide and deprol. Current medications, Xapro, Afexor, Dytropine, Aspirin, Vitamins. Physical examination that patient is well developed appears state of age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are lots of small folliculitis type lesions on her abdomen, chest and back. Impression? Acne-ray folliculitis. Treatment? 1. Discussed condition with the patient to continue the amoxin 500 mg, 2 at bedtime, 3. Add Sceptre DS every morning with extra water, 4. Continue the Tesseract Cream 0.1 which is urge to use on the back and chest also, and 5. Refer to ABC clinic for an aesthetic consult, return in two months for a follow-up evaluation of her acne."


Notice that there are some mistakes in the transcription, mostly in domain-specific names:

* Tesseract Cream -> Tazorac Cream (https://www.drugs.com/tazorac.html)
* Tasawak cream -> Tazorac cream
* Xapro -> Lexapro (https://www.drugs.com/lexapro.html)

**Question**: Describe an idea to avoid these errors on domain-specific terms.

## Evaluation metrics for ASR



For the previous example, suppose a human expert has transcribed the audio, so that we can compare Whisper's transcription with this ground truth.

In [4]:
true_transcription = """SUBJECTIVE:  The patient is a 49-year-old white female, established patient of Dermatology, last seen in office on 08/10/2004. She comes in today for reevaluation of her acne, plus she has had what she calls a rash for the past two months now on her chest, stomach, neck, and back. On examination, there is flaring of her acne with small folliculitis lesions. The patient has been taking amoxicillin 500 mg b.i.d. and using Tazorac cream 0.1. Her face is doing well, but she has been out of her medication now for three days also. She has been getting photo facials at Healing Waters. I was wondering about what we could offer as far as cosmetic procedures and skin care products, etcetera. The patient is married. She is a secretary.
FAMILY, SOCIAL, AND ALLERGY HISTORY:  She has hay fever, eczema, sinus, and hives. She has no melanoma or skin cancers or psoriasis. Her mother had oral cancer. The patient is a nonsmoker. No blood tests. She had some sunburn in the past. She is on benzoyl peroxide and Daypro.
CURRENT MEDICATIONS:  Lexapro, Effexor, Ditropan, aspirin, and vitamins.
PHYSICAL EXAMINATION:  The patient is well developed and appears stated age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are a lot of small folliculitis-like lesions on her abdomen, chest, and back.
IMPRESSION:  Acne with folliculitis.
TREATMENT:
1. Discussed condition and treatment with the patient.
2. Continue amoxicillin 500 mg two at bedtime.
3. Add Septra DS every morning with extra water.
4. Continue Tazorac cream 0.1. It is okay to use on back and chest also.
5. Referred to ABC Clinic for anesthetic consult. Return in two months for a followup evaluation of her acne."""

In [31]:
transcription['true_transcription'] = true_transcription
transcription.rename(columns={'text': 'predicted_transcription'}, inplace=True)

In [32]:
transcription

Unnamed: 0,predicted_transcription,true_transcription
0,"The patient is a 14-year-old white female, established patient to dermatology, last in an office on the 8th of the 10th 2004. She comes in today for re-evaluation of her acne plus she has how about she caused a rash for the past two months now on her chest, stomach, neck and back. On examination this is a flaring of her acne with two small folliculitis lesions. The patient has been taking a moxaline 500 mg PID and using Tasawak cream 0.1 and her face is doing well but she has been out of her medication now for three days also. She has been getting photofacials at healing waters and was wondering about what we could offer as far as cosmetic procedures and skincare products etc. The patient is married, she is the secretary. Family, social and allergy history, shares hay fever, eczema, sinus and hives, shares no melanoma or skin cancers or psoriasis. Her mother had oral cancer, the patient is non-smoker and no blood tests. I had some sunburn in the past, is on benzoal peroxide and deprol. Current medications, Xapro, Afexor, Dytropine, Aspirin, Vitamins. Physical examination that patient is well developed appears state of age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are lots of small folliculitis type lesions on her abdomen, chest and back. Impression? Acne-ray folliculitis. Treatment? 1. Discussed condition with the patient to continue the amoxin 500 mg, 2 at bedtime, 3. Add Sceptre DS every morning with extra water, 4. Continue the Tesseract Cream 0.1 which is urge to use on the back and chest also, and 5. Refer to ABC clinic for an aesthetic consult, return in two months for a follow-up evaluation of her acne.","SUBJECTIVE: The patient is a 49-year-old white female, established patient of Dermatology, last seen in office on 08/10/2004. She comes in today for reevaluation of her acne, plus she has had what she calls a rash for the past two months now on her chest, stomach, neck, and back. 
On examination, there is flaring of her acne with small folliculitis lesions. The patient has been taking amoxicillin 500 mg b.i.d. and using Tazorac cream 0.1. Her face is doing well, but she has been out of her medication now for three days also. She has been getting photo facials at Healing Waters. I was wondering about what we could offer as far as cosmetic procedures and skin care products, etcetera. The patient is married. She is a secretary.\nFAMILY, SOCIAL, AND ALLERGY HISTORY: She has hay fever, eczema, sinus, and hives. She has no melanoma or skin cancers or psoriasis. Her mother had oral cancer. The patient is a nonsmoker. No blood tests. She had some sunburn in the past. She is on benzoyl peroxide and Daypro.\nCURRENT MEDICATIONS: Lexapro, Effexor, Ditropan, aspirin, and vitamins.\nPHYSICAL EXAMINATION: The patient is well developed and appears stated age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are a lot of small folliculitis-like lesions on her abdomen, chest, and back.\nIMPRESSION: Acne with folliculitis.\nTREATMENT:\n1. Discussed condition and treatment with the patient.\n2. Continue amoxicillin 500 mg two at bedtime.\n3. Add Septra DS every morning with extra water.\n4. Continue Tazorac cream 0.1. It is okay to use on back and chest also.\n5. Referred to ABC Clinic for anesthetic consult. Return in two months for a followup evaluation of her acne."


When assessing speech recognition systems, we compare the system's predictions to the target text transcriptions,
annotating any errors that are present. We categorise these errors into one of three categories:
1. Substitutions (S): where we transcribe the **wrong word** in our prediction ("sit" instead of "sat")
2. Insertions (I): where we add an **extra word** in our prediction
3. Deletions (D): where we **remove a word** in our prediction

These error categories are the same for all speech recognition metrics. What differs is the level at which we compute
these errors: we can either compute them on the _word level_ or on the _character level_.

We'll use a running example for each of the metric definitions. Here, we have a _ground truth_ or _reference_ text sequence:

```python
reference = "the cat sat on the mat"
```

And a predicted sequence from the speech recognition system that we're trying to assess:

```python
prediction = "the cat sit on the"
```

We can see that the prediction is pretty close, but some words are not quite right. We'll evaluate this prediction
against the reference for the three most popular speech recognition metrics and see what sort of numbers we get for each.

## Word Error Rate
The *word error rate (WER)* metric is the 'de facto' metric for speech recognition. It calculates substitutions,
insertions and deletions on the *word level*. This means errors are annotated on a word-by-word basis. Take our example:


| Reference:  | the | cat | sat     | on  | the | mat |
|-------------|-----|-----|---------|-----|-----|-----|
| Prediction: | the | cat | **sit** | on  | the |     |
| Label:      | ✅   | ✅   | S       | ✅   | ✅   | D   |

Here, we have:
* 1 substitution ("sit" instead of "sat")
* 0 insertions
* 1 deletion ("mat" is missing)

This gives 2 errors in total. To get our error rate, we divide the number of errors by the total number of words in our
reference (N), which for this example is 6:

$$
\begin{aligned}
WER &= \frac{S + I + D}{N} \\
&= \frac{1 + 0 + 1}{6} \\
&= 0.333
\end{aligned}
$$

Alright! So we have a WER of 0.333, or 33.3%. Notice how the word "sit" only has one character that is wrong, but the
entire word is marked incorrect. This is a defining feature of the WER: spelling errors are penalised heavily, no matter
how minor they are.

The WER is defined such that *lower is better*: a lower WER means there are fewer errors in our prediction, so a perfect
speech recognition system would have a WER of zero (no errors).
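
The calculation above can be reproduced with a small dynamic-programming alignment. This is a from-scratch sketch for illustration, not the implementation used by the `evaluate` library; `word_errors` is a hypothetical helper name.

```python
def word_errors(reference, prediction):
    """Count substitutions, insertions and deletions via word-level edit distance."""
    ref, hyp = reference.split(), prediction.split()
    # table[i][j] = (total errors, S, I, D) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, 0, i)  # empty prediction: only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, j, 0)  # empty reference: only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # correct word, no new error
                continue
            n, s, ins, d = table[i - 1][j - 1]
            substitution = (n + 1, s + 1, ins, d)
            n, s, ins, d = table[i][j - 1]
            insertion = (n + 1, s, ins + 1, d)
            n, s, ins, d = table[i - 1][j]
            deletion = (n + 1, s, ins, d + 1)
            # keep whichever option has the fewest total errors
            table[i][j] = min(substitution, insertion, deletion, key=lambda t: t[0])
    _, S, I, D = table[len(ref)][len(hyp)]
    return S, I, D, len(ref)


reference = "the cat sat on the mat"
prediction = "the cat sit on the"

S, I, D, N = word_errors(reference, prediction)
wer = (S + I + D) / N
print(S, I, D, N)     # 1 0 1 6
print(round(wer, 3))  # 0.333
```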

In [33]:
# The "wer" metric in evaluate depends on the jiwer package
#!pip install evaluate jiwer

In [15]:
from evaluate import load

wer_metric = load("wer")

In [40]:
wer = wer_metric.compute(references=transcription['true_transcription'], predictions=transcription['predicted_transcription'])

In [41]:
wer

0.41843971631205673

If the punctuation is not very important for our task, we can normalize the text by removing punctuation and converting to lowercase. This is a common practice in ASR evaluation, as it makes the evaluation more robust to small errors in punctuation and capitalization.
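
To see why normalisation changes the score, consider a hand-rolled stand-in. This is a simplified sketch and not the Whisper normalizer itself; `simple_normalize` is a hypothetical helper that only lowercases, replaces punctuation with spaces and collapses whitespace.

```python
import re


def simple_normalize(text):
    """Simplified stand-in: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


reference = "She is on benzoyl peroxide and Daypro."
prediction = "she is on benzoyl peroxide and daypro"

# Before normalisation "She"/"she" and "Daypro."/"daypro" count as different words
print(reference.split() == prediction.split())  # False

# After normalisation the two word sequences are identical
print(simple_normalize(reference).split() == simple_normalize(prediction).split())  # True
```

Differences in casing and punctuation alone would otherwise be counted as word substitutions.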

In [13]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

In [43]:
# apply the normalizer to the text

transcription['predicted_transcription'] = transcription['predicted_transcription'].apply(normalizer)
transcription['true_transcription'] = transcription['true_transcription'].apply(normalizer)

In [44]:
transcription

Unnamed: 0,predicted_transcription,true_transcription
0,the patient is a 14 year old white female established patient to dermatology last in an office on the 8th of the 10th 2004 she comes in today for re evaluation of her acne plus she has how about she caused a rash for the past two months now on her chest stomach neck and back on examination this is a flaring of her acne with two small folliculitis lesions the patient has been taking a moxaline 500 mg pid and using tasawak cream 0 1 and her face is doing well but she has been out of her medication now for three days also she has been getting photofacials at healing waters and was wondering about what we could offer as far as cosmetic procedures and skincare products etc the patient is married she is the secretary family social and allergy history shares hay fever eczema sinus and hives shares no melanoma or skin cancers or psoriasis her mother had oral cancer the patient is non smoker and no blood tests i had some sunburn in the past is on benzoal peroxide and deprol current medications xapro afexor dytropine aspirin vitamins physical examination that patient is well developed appears state of age overall health is good she has a couple of acne lesions one on her face and neck but there are lots of small folliculitis type lesions on her abdomen chest and back impression acne ray folliculitis treatment 1 discussed condition with the patient to continue the amoxin 500 mg 2 at bedtime 3 add sceptre ds every morning with extra water 4 continue the tesseract cream 0 1 which is urge to use on the back and chest also and 5 refer to abc clinic for an aesthetic consult return in two months for a follow up evaluation of her acne,subjective the patient is a 49 year old white female established patient of dermatology last seen in office on 08 10 2004 she comes in today for reevaluation of her acne plus she has had what she calls a rash for the past two months now on her chest stomach neck and back on examination there is flaring of her acne with small folliculitis lesions the 
patient has been taking amoxicillin 500 mg b i d and using tazorac cream 0 1 her face is doing well but she has been out of her medication now for three days also she has been getting photo facials at healing waters i was wondering about what we could offer as far as cosmetic procedures and skin care products etcetera the patient is married she is a secretary family social and allergy history she has hay fever eczema sinus and hives she has no melanoma or skin cancers or psoriasis her mother had oral cancer the patient is a nonsmoker no blood tests she had some sunburn in the past she is on benzoyl peroxide and daypro current medications lexapro effexor ditropan aspirin and vitamins physical examination the patient is well developed and appears stated age overall health is good she has a couple of acne lesions one on her face and neck but there are a lot of small folliculitis like lesions on her abdomen chest and back impression acne with folliculitis treatment 1 discussed condition and treatment with the patient 2 continue amoxicillin 500 mg two at bedtime 3 add septra ds every morning with extra water 4 continue tazorac cream 0 1 it is okay to use on back and chest also 5 referred to abc clinic for anesthetic consult return in two months for a followup evaluation of her acne


In [45]:
wer = wer_metric.compute(references=transcription['true_transcription'], predictions=transcription['predicted_transcription'])
wer

0.2425249169435216

**Exercise** Can you improve the WER of the previous example by using a different model?

In [6]:
with open("files/medical_audio.mp3", "rb") as f:
    audio = f.read()

In [3]:
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-medium", device=device
)


In [9]:
transcription = pipe(audio)
transcription = pd.DataFrame([transcription])

In [10]:
transcription['true_transcription'] = true_transcription
transcription.rename(columns={'text': 'predicted_transcription'}, inplace=True)

In [12]:
pd.set_option('display.max_colwidth', None)

transcription

Unnamed: 0,predicted_transcription,true_transcription
0,"The patient is a 49 year old white female, established patient of dermatology, last in office on the 8th of the 10th 2004. She comes in today for re-evaluation of her acne, plus she has had what she calls a rash for the past two months now on her chest, stomach, neck and back. On examination, this is a flaring of her acne with two small folliculitis lesions. The patient has been taking Amoxanine 500MgbID and using Tazeracream 0.1 and her face is doing well but she has been out of her medication now for 3 days also. She has been getting photofacial at healing waters and was wondering about what we could offer as far as cosmetic procedures and skin care products etc. The patient is married, she is a secretary. Social and Allergy History. Shares hay fever, eczema, sinus and hives. Shares no melanoma or skin cancers or psoriasis. Her mother had oral cancer. Her patient is non-smoker, no blood tests. Had some sunburn in the past. Is on benzoyl peroxide and DAPRO. Current Current medications Lexapro, Afexor, Dytropine, Aspirin, Vitamins. Physical examination that patient is well developed appears state of age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are a lot of small folliculitis type lesions on her abdomen, chest and back. Impression? Acne with folliculitis. 1. Discuss the condition with the patient to continue the Amoxin 500mg to at bedtime. 3. Add SEPTRA-DS every morning with extra water. 4. Continue the Tazeracrim 0.1, it is okay to use on the back and chest also. 5. Refer to ABC clinic for an aesthetic consult. Return in 2 months for a follow up evaluation of her acne.","SUBJECTIVE: The patient is a 49-year-old white female, established patient of Dermatology, last seen in office on 08/10/2004. She comes in today for reevaluation of her acne, plus she has had what she calls a rash for the past two months now on her chest, stomach, neck, and back. 
On examination, there is flaring of her acne with small folliculitis lesions. The patient has been taking amoxicillin 500 mg b.i.d. and using Tazorac cream 0.1. Her face is doing well, but she has been out of her medication now for three days also. She has been getting photo facials at Healing Waters. I was wondering about what we could offer as far as cosmetic procedures and skin care products, etcetera. The patient is married. She is a secretary.\nFAMILY, SOCIAL, AND ALLERGY HISTORY: She has hay fever, eczema, sinus, and hives. She has no melanoma or skin cancers or psoriasis. Her mother had oral cancer. The patient is a nonsmoker. No blood tests. She had some sunburn in the past. She is on benzoyl peroxide and Daypro.\nCURRENT MEDICATIONS: Lexapro, Effexor, Ditropan, aspirin, and vitamins.\nPHYSICAL EXAMINATION: The patient is well developed and appears stated age. Overall health is good. She has a couple of acne lesions, one on her face and neck, but there are a lot of small folliculitis-like lesions on her abdomen, chest, and back.\nIMPRESSION: Acne with folliculitis.\nTREATMENT:\n1. Discussed condition and treatment with the patient.\n2. Continue amoxicillin 500 mg two at bedtime.\n3. Add Septra DS every morning with extra water.\n4. Continue Tazorac cream 0.1. It is okay to use on back and chest also.\n5. Referred to ABC Clinic for anesthetic consult. Return in two months for a followup evaluation of her acne."


In [14]:
# apply the normalizer to the text
transcription['predicted_transcription'] = transcription['predicted_transcription'].apply(normalizer)
transcription['true_transcription'] = transcription['true_transcription'].apply(normalizer)

In [16]:
wer = wer_metric.compute(references=transcription['true_transcription'], predictions=transcription['predicted_transcription'])
wer

0.2159468438538206
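
To put the two normalised scores side by side, we can compute the absolute and relative WER reduction. This is a small arithmetic sketch using the values printed in this notebook; the variable names are introduced here for illustration.

```python
wer_small = 0.2425249169435216   # whisper-small, normalised text (from above)
wer_medium = 0.2159468438538206  # whisper-medium, normalised text (from above)

absolute_reduction = wer_small - wer_medium
relative_reduction = absolute_reduction / wer_small

print(f"absolute: {absolute_reduction:.4f}")  # absolute: 0.0266
print(f"relative: {relative_reduction:.1%}")  # relative: 11.0%
```

On this single recording, the larger model removes roughly one in nine of the word errors made by the smaller one.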