📚 AI Fundamentals & Basics

The Transformer Architecture Explained

A comprehensive guide to the transformer architecture, the revolutionary neural network design that underlies modern LLMs like GPT, BERT, and Claude.

June 3, 2026
16 min read
Visualization of transformer architecture showing connected attention layers
#Transformer#Architecture#Large Language Models#Deep Learning

Introduction to the Transformer Architecture

The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. From Google Brain, represents a fundamental breakthrough in neural network design. Prior to the Transformer, the dominant architectures for sequence processing tasks were recurrent neural networks (RNNs) and their variants (LSTMs, GRUs), which processed data sequentially and struggled with long-range dependencies and parallel computation. The Transformer replaced recurrence entirely with an attention mechanism, enabling models to process entire sequences in parallel and capture relationships between any two positions regardless of their distance in the sequence.

The impact of this architectural innovation cannot be overstated. The Transformer is the foundation upon which virtually all modern large language models (LLMs) are built — GPT-4, Claude, Gemini, LLaMA, Mistral, and hundreds of others. Beyond language, Transformers have become dominant in computer vision (Vision Transformers or ViTs), speech processing, protein folding (AlphaFold2), code generation, and multimodal AI. Understanding the transformer architecture is essential for anyone working with modern AI, as it provides the conceptual foundation for understanding how today's most capable models process, understand, and generate information across virtually every domain and modality.

The Self-Attention Mechanism

It works.

At the heart of the transformer architecture lies the self-attention mechanism, also called scaled dot-product attention. Self-attention allows each element in a sequence to attend to every other element, computing a weighted representation that captures contextual relationships. The mechanism works by transforming each input into three vectors: a Query (Q), a Key (K), and a Value (V). These are computed by multiplying the input embeddings with learned weight matrices. The attention score between two positions is computed as the dot product of the Query at one position with the Key at another, scaled by the square root of the key dimension to prevent excessively large values that push softmax into regions with very small gradients.

These attention scores are then passed through a softmax function to produce attention weights that sum to 1, which are used to compute a weighted sum of the Value vectors. The result is that each position's output is a context-aware mixture of information from all other positions. This formulation has several crucial properties. It captures long-range dependencies effortlessly — information can flow directly between any two positions, regardless of distance. It is permutation-equivariant — the output for a given token depends on all other tokens through the attention weights. It enables parallel computation — all attention scores for all positions can be computed simultaneously using matrix operations, unlike RNNs which require sequential processing. The "multi-head" aspect of multi-head attention runs multiple attention operations in parallel with different learned projections, allowing the model to attend to information from different representation subspaces at different positions.

Transformer architecture diagram showing the encoder-decoder structure with attention mechanisms

The Encoder-Decoder Structure

Really well, actually.

The original transformer architecture follows an encoder-decoder structure, designed for sequence-to-sequence tasks like machine translation. The encoder processes the input sequence and produces a sequence of continuous representations (hidden states). It consists of a stack of identical layers (typically 6 or 12), each containing two sublayers: a multi-head self-attention mechanism and a position-wise fully connected feedforward network. Each sublayer employs a residual connection (skip connection) followed by layer normalization, which helps gradients flow through the network and stabilizes training.

The decoder also consists of stacked identical layers, but each decoder layer has three sublayers: a masked multi-head self-attention mechanism (which prevents positions from attending to future positions, preserving the auto-regressive property), a cross-attention mechanism that attends to the encoder's output, and a feedforward network. The cross-attention mechanism is what allows the decoder to incorporate information from the input sequence — the Queries come from the decoder's previous layer, while the Keys and Values come from the encoder's output. This structural separation between encoder (understanding the input) and decoder (generating the output) was crucial for tasks like translation, but modern transformer variants often use only the encoder (BERT-style) for understanding tasks or only the decoder (GPT-style) for generative tasks.

So where does that leave us?

Positional Encoding

Since the self-attention mechanism processes all positions in parallel and is inherently permutation-equivariant (the output doesn't change if the input order is shuffled), the transformer needs a way to incorporate information about the position of each element in the sequence. This is accomplished through positional encodings, which are added to the input embeddings before they enter the first transformer layer.

The original transformer used sinusoidal positional encodings based on sine and cosine functions of different frequencies. Each position gets a unique encoding vector, and the sinusoidal formulation allows the model to easily attend to relative positions (since any position's encoding can be expressed as a linear function of other positions' encodings). Modern transformers have largely moved to learned positional embeddings, where each position index has a corresponding trainable embedding vector. More recent approaches like RoPE (Rotary Position Embedding), used in models like LLaMA and Mistral, encode position by rotating the Query and Key vectors by an angle proportional to the position, allowing the model to attend based on relative distance. ALiBi (Attention with Linear Biases) takes a different approach, adding a bias term to attention scores that penalizes attention to distant positions. These innovations have been crucial for extending context lengths — modern models can handle contexts of 128K tokens or more, compared to the original transformer's 512-token limit. For more on how these attention mechanisms power language understanding, see our guide on embeddings in machine learning.

Too good to be true?

From Transformer to Foundation Models

The original transformer architecture has been adapted and extended in numerous ways to create the foundation models that dominate modern AI. The GPT (Generative Pre-trained Transformer) family, introduced by OpenAI, uses only the decoder portion of the transformer and is trained autoregressively — predicting the next token given all previous tokens. This simple objective, when applied at scale, produces models with remarkable language understanding and generation capabilities. GPT-2 (2019) demonstrated that scaling up the transformer leads to significant capability improvements, and GPT-3 (2020) showed that very large transformers exhibit emergent abilities like few-shot learning, translation, and basic reasoning without explicit training.

Let me give you a concrete example. BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder portion and is trained with a masked language modeling objective (predicting randomly masked words in a sentence) and next-sentence prediction. BERT's bidirectional nature allows it to build rich contextual representations that excel at understanding tasks like question answering and text classification. T5 (Text-to-Text Transfer Transformer) treats all NLP tasks as text-to-text problems, using a full encoder-decoder structure. Vision Transformers (ViT) apply the transformer architecture to image patches, treating an image as a sequence of patches and processing them with a standard transformer encoder. Modern state-of-the-art LLMs like GPT-4, Claude, and Gemini incorporate numerous improvements including Grouped Query Attention (GQA), Flash Attention for efficient computation, Mixture of Experts (MoE) for scaling, and novel normalization techniques. The explosion of open-source models like LLaMA, Mistral, and Falcon has democratized access to transformer-based AI. Explore how these models are evaluated in our article on AI benchmarks and evaluation.

Scaling Laws and Emergent Abilities

Game changer.

After testing this extensively, one of the most important discoveries in transformer research is the existence of scaling laws — predictable relationships between model size, dataset size, compute budget, and model performance. Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla scaling laws) demonstrated that transformer performance follows smooth power-law scaling with model parameters, training tokens, and compute. The Chinchilla scaling law, in particular, established that most models were significantly undertrained and that for optimal performance, the number of training tokens should scale proportionally with model parameters — roughly 20 tokens per parameter.

As transformers scale up (models now reach hundreds of billions or even trillions of parameters), they exhibit emergent abilities — capabilities that aren't present in smaller models but appear suddenly at larger scales. These include few-shot learning, arithmetic reasoning, multi-step planning, code generation, and translation between languages. The mechanisms behind emergent abilities are not fully understood and are the subject of active research. Some abilities appear to arise from the increasing depth and breadth of the model's internal representations, while others may result from the model learning generalizable patterns at sufficient scale. The transformer's efficient use of parallel computation through self-attention has enabled this unprecedented scaling, making it possible to train models on trillions of tokens across thousands of GPUs. As training techniques continue to improve and hardware evolves, transformer-based models are expected to become even more capable, though questions about efficiency, reasoning, alignment, and safety become increasingly important at larger scales.

Does this live up to the hype?

The TL;DR

I learned this the hard way: - The transformer architecture replaces recurrence with self-attention, enabling parallel processing and modeling of long-range dependencies.

  • Self-attention computes contextual representations by allowing each position to attend to all others through Query, Key, and Value transformations.
  • The original encoder-decoder structure has been adapted into encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) variants. — game changer in my workflow
  • Positional encodings provide sequence order information, with modern approaches (RoPE, ALiBi) enabling much longer context windows. — wish I'd known this six months ago
  • The transformer is the foundation of all modern LLMs and has been extended to vision, speech, and multimodal domains. — game changer in my workflow
  • Scaling laws predict transformer performance based on model size and data, while emergent abilities at sufficient scale unlock unprecedented AI capabilities. — game changer in my workflow

Too good to be true?