Fundamentals of the Transformer Architecture#
In this section, we will take a high-level look at the architecture of Transformer models.
A bit of Transformer history#
The Transformer architecture was introduced in June 2017 in the paper "Attention Is All You Need". The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:
June 2018: GPT, the first pretrained Transformer model, which was fine-tuned on various NLP tasks and obtained state-of-the-art results
October 2018: BERT, another large pretrained model, this one designed to produce better representations of sentences (more on this in the next chapter!)
February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning).
March 2023: GPT-4, the latest version of the GPT series, even bigger and more powerful than GPT-3.
This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:
GPT-like (also called decoder Transformer models)
BERT-like (also called encoder Transformer models)
BART/T5-like (also called encoder-decoder Transformer models)
We will dive into these families in more depth later on.
Transformers are big models#
Apart from a few outliers (like DistilBERT), the general strategy for achieving better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.
Unfortunately, training a model, especially a large one, requires a very large amount of data. This becomes very costly in terms of time and compute resources, and it even translates into environmental impact, as can be seen in the following graph.
Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!
This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.
Transformers are language models#
All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!
This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.
Another example is masked language modeling, in which the model predicts a masked word in the sentence.
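As a rough illustration of masked language modeling, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the library, the bert-base-uncased checkpoint, and the example sentence are our own choices for illustration, not something prescribed by this section.

```python
# A minimal sketch of masked language modeling, assuming the `transformers`
# library is installed; "bert-base-uncased" is just one common BERT-like
# checkpoint, and the sentence is an arbitrary example.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model proposes the most likely tokens for the [MASK] position.
for prediction in unmasker("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```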
Essentially, a language model simply learns to predict the next word (token) in a sentence. This means learning the probability distribution of the next token given the previous tokens as context:
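In the usual notation, writing the tokens of a sequence as $w_1, w_2, \dots, w_T$, this is the conditional distribution

$$
P(w_t \mid w_1, w_2, \dots, w_{t-1}),
$$

and the probability of the whole sequence factorizes as the product

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}).
$$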
Generating text#
Once the model has been trained, it can be used to generate text, for example. This is done by sampling from the probability distribution of the next word given the context. This is called autoregressive generation; see an example in the video: images/generation_example.mp4
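As a minimal sketch of what this sampling loop looks like in code, assuming the transformers library and PyTorch are installed; the gpt2 checkpoint, the prompt, and the number of generated tokens are illustrative choices, not the method shown in the video.

```python
# A minimal sketch of autoregressive generation, assuming the `transformers`
# library and PyTorch are installed; "gpt2" is just one example of a
# decoder-only (GPT-like) checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                     # generate 20 new tokens
        logits = model(input_ids).logits[:, -1, :]          # scores for the next token
        probs = torch.softmax(logits, dim=-1)               # probability distribution
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token from it
        input_ids = torch.cat([input_ids, next_id], dim=-1) # append and repeat

print(tokenizer.decode(input_ids[0]))
```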
General Architecture#
The model is primarily composed of two blocks:
Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.
Each of these parts can be used independently, depending on the task:
Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
Decoder-only models: Good for generative tasks such as text generation.
Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
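As a rough illustration of these three families, the sketch below loads one model of each kind through the Hugging Face pipeline API; the specific tasks and checkpoints (the default sentiment model, gpt2, t5-small) are our own example choices, not ones prescribed by this section.

```python
# A minimal sketch, assuming the `transformers` library is installed.
# The checkpoints are example choices for each family.
from transformers import pipeline

# Encoder-only (BERT-like): understanding tasks such as classification.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers are remarkably flexible."))

# Decoder-only (GPT-like): generative tasks such as text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20))

# Encoder-decoder (BART/T5-like): sequence-to-sequence tasks such as translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers are powerful models."))
```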
The original architecture#
The Transformer architecture was originally designed for translation. It mainly consists of two types of neural network layers:
Multi-layer perceptron layers: these are built from the same linear layers we already know, stacked with a non-linearity in between and applied to each token position independently.
Self-attention layers: these layers take as input the token representations from the previous layer (one vector per token) and learn correlations between tokens at different positions. This is the key innovation of the Transformer model: it allows the model to learn dependencies between tokens and produce contextual representations (the meaning of a word/token depends on the context in which it appears).
During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.
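To make the attention computation and the decoder’s restriction concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with an optional causal mask. The shapes, random projection matrices, and function name are illustrative assumptions only; a real Transformer uses learned weights, multiple heads, and additional layers.

```python
# A minimal NumPy sketch of scaled dot-product self-attention, with an
# optional causal mask as used in the decoder. Everything here (shapes,
# random weights) is illustrative only.
import numpy as np

def self_attention(x, causal=False):
    """x: (seq_len, d_model) token representations from the previous layer."""
    seq_len, d_model = x.shape
    rng = np.random.default_rng(0)

    # In a real model these projection matrices are learned parameters.
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Attention scores between every pair of token positions.
    scores = q @ k.T / np.sqrt(d_model)

    if causal:
        # Each position may only attend to itself and earlier positions.
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)

    # Softmax over the last axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # new, context-dependent token representations

tokens = np.random.default_rng(1).standard_normal((5, 16))  # 5 tokens, d_model=16
encoder_out = self_attention(tokens)                # full attention (encoder style)
decoder_out = self_attention(tokens, causal=True)   # causal attention (decoder style)
```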
Further information about the architecture#
Visual in-depth explanation of the Transformer architecture: http://jalammar.github.io/illustrated-transformer/