Text to Speech#

The task we will study in this notebook is called “Text-to-speech” (TTS). Models capable of converting text into audible human speech have a wide range of potential applications:

  • Assistive apps: tools that leverage these models can enable visually impaired people to access digital content through sound.

  • Audiobook narration: converting written books into audio form makes literature more accessible to individuals who prefer to listen or have difficulty with reading.

  • Virtual assistants: TTS models are a fundamental component of virtual assistants like Siri, Google Assistant, or Amazon Alexa. Once they have used a classification model to detect the wake word and an ASR model to transcribe your request, they can use a TTS model to voice the response.

  • Entertainment, gaming and language learning: give voice to your NPCs, narrate game events, or help language learners with examples of correct pronunciation and intonation of words and phrases.

The Bark model 🐶#

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.

Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce nonverbal sounds like laughing, sighing, and crying.

On the Hugging Face Hub, there are two sizes of the Bark model:

  • suno/bark-small.

  • suno/bark.

Let’s use the smaller one, since running the larger model is computationally expensive and requires a GPU.

from transformers import AutoProcessor, AutoModel

# Load the processor (tokenizer + voice-preset handling) and the small checkpoint
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

# Sampling (do_sample=True) makes generation stochastic
speech_values = model.generate(**inputs, do_sample=True)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
from IPython.display import Audio

# Bark outputs audio at the sampling rate stored in the generation config (24 kHz)
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
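
If you do have a GPU available, generation runs considerably faster there. Below is a minimal sketch of moving the model and inputs to CUDA; it assumes a CUDA-enabled PyTorch install, and the short prompt is just an illustration:

import torch

# Pick the GPU if one is present, otherwise stay on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = processor(
    text=["Hello, my name is Suno."],
    return_tensors="pt",
)
# Move every input tensor onto the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}

speech_values = model.generate(**inputs, do_sample=True)

The cells in this notebook already call .cpu() on the output before playback, so they work unchanged whether generation happened on CPU or GPU.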

If we run the previous inference again, the voice will be different each time. We can fix the voice by specifying the voice_preset parameter.

# voice_preset pins the speaker identity across runs
inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
    voice_preset="v2/en_speaker_3",
)

speech_values = model.generate(**inputs, do_sample=True)

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
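
To keep a generated sample, you can write the waveform to disk. A minimal sketch using scipy.io.wavfile (scipy is an extra dependency here, not something this notebook has imported so far):

from scipy.io import wavfile

# speech_values is a float32 waveform; wavfile.write accepts float arrays directly
wavfile.write(
    "bark_out.wav",
    rate=model.generation_config.sample_rate,
    data=speech_values.cpu().numpy().squeeze(),
)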

The full list of voices is available at: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c

Note that you can mix languages, e.g. reading English text aloud with a French accent:

# A French voice preset reading an English prompt
inputs = processor(
    "[clears throat] This is a test ... and I just took a long pause [laughs]",
    voice_preset="v2/fr_speaker_1",
)

speech_values = model.generate(**inputs)
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
inputs = processor(
    "I will conquer the world! [laughter]",
    voice_preset="v2/en_speaker_3",
)

speech_values = model.generate(**inputs)
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

Below is a list of some known non-speech sounds, but people are finding more every day.

  • [laughter]

  • [laughs]

  • [sighs]

  • [music]

  • [gasps]

  • [clears throat]

  • — or ... for hesitations

  • ♪ for song lyrics

  • CAPITALIZATION for emphasis of a word

  • [MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively
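
For example, you can combine several of these cues in a single prompt. The lyric below is the example from Bark’s own README; as with all sampled generation, the acoustic result will vary from run to run:

# ♪ marks the text as song lyrics; the preset keeps the voice stable
inputs = processor(
    "♪ In the jungle, the mighty jungle, the lion barks tonight ♪",
    voice_preset="v2/en_speaker_3",
)

speech_values = model.generate(**inputs)
Audio(speech_values.cpu().numpy().squeeze(), rate=model.generation_config.sample_rate)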

Evaluating Text to Speech models#

Unlike many other computational tasks that can be objectively measured using quantitative metrics, such as accuracy or precision, evaluating TTS relies heavily on subjective human analysis.

One of the most commonly employed evaluation methods for TTS systems is conducting qualitative assessments using mean opinion scores (MOS). MOS is a subjective scoring system that allows human evaluators to rate the perceived quality of synthesized speech on a scale from 1 to 5. These scores are typically gathered through listening tests, where human participants listen to and rate the synthesized speech samples.
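
The aggregation itself is simple arithmetic: average the ratings, usually reported alongside a confidence interval. A minimal sketch with made-up illustrative ratings:

import numpy as np

# Hypothetical 1-5 ratings from eight listeners for one synthesized sample
ratings = np.array([4, 5, 3, 4, 4, 5, 4, 3])

mos = ratings.mean()
# 95% confidence interval via the normal approximation
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))

print(f"MOS = {mos:.2f} ± {ci95:.2f}")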

One of the main reasons why objective metrics are challenging to develop for TTS evaluation is the subjective nature of speech perception. Human listeners have diverse preferences and sensitivities to various aspects of speech, including pronunciation, intonation, naturalness, and clarity. Capturing these perceptual nuances with a single numerical value is a daunting task. At the same time, the subjectivity of the human evaluation makes it challenging to compare and benchmark different TTS systems.

Furthermore, objective metrics may overlook certain important aspects of speech synthesis, such as naturalness, expressiveness, and emotional impact. These qualities are difficult to quantify objectively but are highly relevant in applications where the synthesized speech needs to convey human-like qualities and evoke appropriate emotional responses.

In summary, evaluating text-to-speech models is a complex task due to the absence of one truly objective metric. The most common evaluation method, mean opinion scores (MOS), relies on subjective human analysis. While MOS provides valuable insights into the quality of synthesized speech, it also introduces variability and subjectivity.