What Happens at the End of a Transformer?

Anyone who uses an LLM API for inference runs into this topic sooner or later. It may look peripheral, but understanding it in depth is essential for designing an LLM-based application.

Let's explore a few concepts with clear, in-depth intuition. I'll aim to keep it jargon-free and conceptually transparent.

Architecture

I'll begin slightly before the main point to set the stage. First, remember that the stack of Transformer blocks (whether in an encoder-decoder or a decoder-only architecture) always produces one hidden state for every position of every sequence in the batch. Thus, the shape (dimension) of the output is:

$$\text{shape}(h_t) = (B, S, D)$$

B = Batch Size, the number of input sequences processed simultaneously.
S = Sequence Length, the number of token positions in each (padded) sequence of the batch. Its upper bound is the maximum length the model can handle (sometimes referred to as the context window).
D = Hidden Size, the length of the hidden-state vector. That vector represents the semantic knowledge, fingerprint, and meaning of the corresponding token.

To see this in practice, Hugging Face's open-source transformers library provides the AutoModel class, which loads a model without any head and exposes the raw hidden state (as last_hidden_state in its output).

Head (Linear Output projection)

There are several types of heads positioned over the Transformer blocks to fulfill different purposes. Some common ones include:

Language Modeling Head (Causal/Masked)
Classification Head
Token-Classification Head
Sequence-to-Sequence Head

Today, we will focus specifically on the Language Modeling Head.

For practical purposes, we need the logits: the raw, unnormalized scores produced by the final layer, one for every token in the vocabulary, before they are converted into probabilities. Hugging Face exposes them through AutoModelForCausalLM, a sibling of AutoModel. It is essentially the same API, but with a causal language modeling head added on top of the base model.

Understanding the Head in More Detail

Let's delve into the head in more detail. I'll outline a few noteworthy points in sequence for better clarity.

Weight Tying: When the input embedding matrix (Embedding Layer) and the output vocabulary projection matrix (the linear projection) share the same weights, the weights are said to be tied. When the two are separate matrices, the weights are untied. A head may use either variant.

The Embedding Layer: It is a learnable component that encodes the semantic meaning of all tokens in the vocabulary into vectors. It has the following dimensions:

$$E \in \mathbb{R}^{V \times D}$$

Where:

V: Vocabulary size.
D: Embedding dimension (the size of the vector representing each token).
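
To make the embedding matrix and weight tying concrete, here is a minimal NumPy sketch. The sizes, the random values, and the variable names (`token_ids`, `W_out_tied`, `W_out_untied`) are assumptions chosen for illustration; real models learn these matrices during training.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4   # toy vocabulary size and embedding dimension (assumed values)
B, S = 2, 3    # toy batch size and sequence length (assumed values)

# Embedding matrix E: one D-dimensional row per vocabulary token.
E = rng.normal(size=(V, D))

# A batch of token ids, shape (B, S).
token_ids = rng.integers(low=0, high=V, size=(B, S))

# Embedding lookup: each id selects its row of E, giving shape (B, S, D).
embedded = E[token_ids]

# Tied weights: the output projection is just a view of E transposed,
# so it adds no extra parameters.
W_out_tied = E.T                        # shape (D, V)

# Untied weights: the output projection is a separate matrix
# with its own V * D parameters.
W_out_untied = rng.normal(size=(D, V))  # shape (D, V)

print(embedded.shape)
print(np.shares_memory(W_out_tied, E))
print(np.shares_memory(W_out_untied, E))
```

With tied weights, any update to `E` changes both the input embeddings and the output projection, which is exactly what sharing means here.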

Working of the head and Token Probabilities

The following steps occur in sequence:

Layer Normalization: The final hidden state is passed through a LayerNorm to stabilize its variance. This layer sits in a gray area: it bridges the core Transformer output and the head, so it is arguably part of neither.
Linear Output Projection: The normalized hidden state is then multiplied by the transpose of the learned linear projection matrix (the embedding matrix E itself when the weights are tied). Since the hidden state dimension and the embedding dimension are both D, the shapes line up and the multiplication is well defined.

$$h_t E^T:\ (B, S, D) \times (D, V) \rightarrow (B, S, V)$$

Softmax Layer: The output from the linear projection layer is converted into a probability distribution using the softmax function:

$$\sigma(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$$

Where:
- $\sigma(z_i)$: The resulting probability of class i
- $z_{i}$: The raw score (logit) for class i
- $e^{z_{i}}$ and $e^{z_{j}}$: The standard exponential function applied to the scores
- K: The total number of classes; for the language modeling head each class is a vocabulary token, so K = V
- $\sum_{j=1}^{K} e^{z_j}$: The sum of all exponentiated scores, which acts as the normalizing constant (denominator)

This function transforms the logits into probabilities, where each value represents the likelihood of a token being selected from the vocabulary.
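
The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not a reference implementation: the LayerNorm has no learned scale or shift, the weights are tied (the projection reuses `E`), and the hidden states are random stand-ins for real Transformer output. The function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_head(h, E):
    """h: (B, S, D) hidden states, E: (V, D) tied embedding matrix.

    Returns (logits, probs), each of shape (B, S, V).
    """
    logits = layer_norm(h) @ E.T   # (B, S, D) x (D, V) -> (B, S, V)
    return logits, softmax(logits)

rng = np.random.default_rng(0)
B, S, D, V = 2, 3, 4, 10           # toy sizes (assumed values)
h = rng.normal(size=(B, S, D))     # stand-in for the final hidden state
E = rng.normal(size=(V, D))        # stand-in for the embedding matrix

logits, probs = lm_head(h, E)
print(logits.shape, probs.shape)
print(probs.sum(axis=-1))          # each position's probabilities sum to 1
```

Subtracting the row maximum before exponentiating does not change the result of softmax; it only keeps the exponentials in a numerically safe range.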

The way softmax is applied depends on whether the model is in the training or inference phase.

During training, the entire input sequence is processed in parallel. Since the model learns to predict the next token at every position simultaneously, softmax is applied independently at each sequence position.

During inference, generation happens autoregressively, meaning one token is generated at a time. Although the model processes the full sequence context available so far, only the logits from the final sequence position are used to predict the next token. Therefore, softmax is applied only to the last position during token generation.
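
The difference between the two phases comes down to which slice of the logits is passed to softmax. The sketch below uses random logits as a stand-in for the head's output; the sizes and variable names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, with the max subtracted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
B, S, V = 2, 5, 10                    # toy sizes (assumed values)
logits = rng.normal(size=(B, S, V))   # stand-in for the LM head output

# Training: one probability distribution per position, all computed at once.
train_probs = softmax(logits)                  # shape (B, S, V)

# Inference: only the last position is needed to pick the next token.
next_token_probs = softmax(logits[:, -1, :])   # shape (B, V)

print(train_probs.shape, next_token_probs.shape)
```

Because softmax acts on each position independently, `next_token_probs` is identical to the last slice of `train_probs`; inference simply skips the positions it does not need.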

From this point onward, we will refer to the softmax output as (token) probabilities.

Token Probabilities

After the LM head, we obtain the probabilities, which are used for either training or inference.

The key difference (already mentioned in the previous section) is that during training, probabilities for the entire sequence are generated simultaneously for loss computation and backpropagation. In contrast, inference occurs autoregressively, following these high-level steps, which will be discussed in detail in a separate article:

Probabilities are generated, and the probabilities corresponding to the latest token position are used for prediction.
The next token is determined.
The predicted token is appended back to the input sequence for predicting the subsequent token.

This process continues until the EOS (end-of-sequence) token is generated or the maximum sequence length is reached.
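
The loop described above can be sketched as follows. Everything model-related here is a hypothetical stand-in: `toy_model` replaces the real Transformer and head with a running mean of embeddings, and the next token is chosen greedily (the highest-probability token), which is only one possible way to determine it. The EOS id and maximum length are assumed values.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, with the max subtracted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_model(token_ids, E):
    """Hypothetical stand-in for Transformer + LM head: returns logits of shape (S, V).

    The "hidden state" at each position is the mean of the embeddings seen so far,
    so position t only depends on tokens 0..t, like a causal model.
    """
    x = E[token_ids]                                       # (S, D)
    counts = np.arange(1, len(token_ids) + 1)[:, None]     # (S, 1)
    h = np.cumsum(x, axis=0) / counts                      # (S, D)
    return h @ E.T                                         # (S, V)

def generate(prompt_ids, E, eos_id, max_len):
    """Greedy autoregressive generation (greedy choice is an assumption)."""
    tokens = list(prompt_ids)
    while len(tokens) < max_len:
        logits = toy_model(np.array(tokens), E)
        probs = softmax(logits[-1])       # use the latest position only
        next_id = int(np.argmax(probs))   # determine the next token
        tokens.append(next_id)            # append it to the input sequence
        if next_id == eos_id:             # stop at EOS
            break
    return tokens

rng = np.random.default_rng(0)
V, D = 10, 4                              # toy sizes (assumed values)
E = rng.normal(size=(V, D))
out = generate([1, 2, 3], E, eos_id=0, max_len=8)
print(out)
```

The loop stops on whichever condition comes first: the EOS id being produced or the sequence reaching `max_len`.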

Below is a visual representation of the entire process.

Next Agenda

In the next article, we will delve into how token probabilities culminate in the actual prediction of tokens, with a hint already provided in the visual representation above.
