What Happens When You Talk to an AI?
How AI chat systems like ChatGPT, Claude, Gemini, and others actually work — from your keyboard to the AI's response
The User Types a Prompt
Where every AI interaction begins
Everything starts here. You type a question in plain English. But the AI does not understand English — it understands numbers. Before anything else, your text needs to be converted into a format the model can process. This conversion happens in several stages.
Text Encoding
Characters become numbers via Unicode
Before the AI even gets involved, your computer converts each character into a number using a standard called Unicode. The letter 'H' becomes 72. A space becomes 32. This is how all digital text works — it is not AI-specific. But the AI does not consume these raw character codes. It needs a smarter representation.
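As a quick illustration of this step, Python's built-in `ord` and `chr` expose the same Unicode code points. The helper name `code_points` is made up for this sketch.

```python
def code_points(text: str) -> list[int]:
    """Return the Unicode code point of every character in `text`."""
    return [ord(ch) for ch in text]


print(code_points("H i"))  # [72, 32, 105]

# The mapping is reversible: chr() turns a code point back into a character.
print("".join(chr(n) for n in code_points("How")))  # How
```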
Tokenization
Breaking text into subword pieces
The AI does not read word by word or letter by letter. It breaks text into chunks called tokens. A token is usually a common word or a piece of a word. Notice how spaces are attached to the beginning of words — that is part of how the tokenizer works.
A token is not a word and not a character — it is a subword unit produced by an algorithm called Byte Pair Encoding (BPE).
Most modern LLMs use BPE or a close variant. Vocabulary sizes vary: LLaMA 3 uses 128,256 tokens, GPT-4 uses ~100,000, and GPT-4o uses ~200,000. Common words like 'the' are a single token. Uncommon words get split into multiple pieces: 'uncharacteristically' might become ['un', 'character', 'istic', 'ally'] in a simplified illustration. In practice, BPE splits follow frequency patterns, not morphology, and in a large modern vocabulary this word might be split differently or even be a single token. With byte-level BPE, rare strings fall back all the way to raw bytes.
How BPE Builds a Vocabulary
Starting from individual characters
When the tokenizer is built, it examines massive amounts of text and finds which pair of adjacent symbols (single characters at first) appears together most often. It merges that pair into a single token, then repeats the process on the updated text, over and over, until it has built a vocabulary of 100,000 to 200,000+ tokens.
This process happens once, before the AI model is even trained. After that, the same fixed vocabulary is used for every prompt and every response.
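The merge loop described above can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on characters, not bytes, it treats each word separately, and the function name `train_bpe` is invented for illustration. Production tokenizers add pre-tokenization rules and many optimizations.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a toy corpus of words."""
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


print(train_bpe(["ab", "ab", "ab", "abc"], num_merges=2))
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls the vocabulary size.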
Token IDs
The numbers the model actually sees
Each token in the vocabulary has a unique ID number. From this point forward, the AI works entirely with these numbers, not with text. Your question has become a sequence of 8 integers. The original English text is gone.
Token Embedding
From integer IDs to dense vectors
Each token ID gets looked up in a giant table called the embedding matrix — a learned table of shape [100,000 × 4,096]. This lookup converts each token integer into a dense floating-point vector — a list of thousands of decimal numbers.
These vectors are entirely learned during training. The geometry that emerges is remarkable: semantically similar words cluster together in this high-dimensional space.
A historical breakthrough
vector("king") − vector("man") + vector("woman") ≈ vector("queen")
This was an early breakthrough from Word2Vec (2013), which gave each word a fixed position in vector space. It showed that vector geometry can encode meaning.
Modern LLMs go further with contextual embeddings: the vector for "bank" is different in "river bank" vs. "bank account." The embedding lookup shown here is just a starting point; these vectors are continuously refined by attention in the layers above.
The resulting matrix
Your 8 token IDs become a matrix of shape [8 × 4,096] — a stack of vectors, one per token.
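The lookup itself is just row selection. Here is a minimal sketch with a tiny random table standing in for the learned one; the sizes (50 × 4), the token IDs, and the function names are all made up for illustration.

```python
import random


def make_embedding_matrix(vocab_size: int, d_model: int, seed: int = 0) -> list[list[float]]:
    """Random stand-in for a learned embedding table of shape [vocab_size x d_model]."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]


def embed(token_ids: list[int], matrix: list[list[float]]) -> list[list[float]]:
    """Look up one row of the table per token ID."""
    return [matrix[token_id] for token_id in token_ids]


matrix = make_embedding_matrix(vocab_size=50, d_model=4)
token_ids = [7, 3, 42, 7, 19, 0, 11, 5]  # hypothetical IDs for an 8-token prompt
vectors = embed(token_ids, matrix)
print(len(vectors), len(vectors[0]))  # 8 4
```

Note that the two occurrences of ID 7 get identical vectors here. Telling them apart is the job of positional encoding and attention.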
Positional Encoding
Teaching the model about word order
A transformer has no inherent sense of order: it processes all tokens simultaneously in parallel, not left to right. Without intervention, 'I bake cookies' and 'cookies bake I' would look identical to the model.
To fix this, a positional encoding is added to each embedding vector. This encodes where each token appears in the sequence.
The original Transformer paper (2017) used fixed sinusoidal functions: mathematical wave patterns based on position index. Most modern decoder-only LLMs use Rotary Position Embedding (RoPE), including LLaMA, Mistral, Qwen, and DeepSeek. RoPE encodes each token's position by rotating its Query and Key vectors using position-dependent rotation matrices. The result is that the positional part of each attention score depends only on the relative distance between tokens, not their absolute positions. This helps models generalize to sequence lengths they were not trained on, which matters for the 128K+ context windows common in 2026.
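The relative-distance property of RoPE can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common convention (some implementations pair dimensions differently); the base of 10,000 and the function names are assumptions for illustration.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative distance (3 tokens apart) at very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True
```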
The maximum number of tokens a model can process at once is called its context window. Early models had windows of just 2,000–4,000 tokens. Modern models in 2026 can handle 128,000 to over 1,000,000 tokens — enough to process entire books or codebases at once.
After positional encoding: each token is now represented by a vector that encodes both what it is and where it is.
Entering the Transformer
The engine that powers modern AI
Your tokens now enter the Transformer, the engine that powers modern AI. All major AI chat models (GPT-4, Claude, Gemini, LLaMA, Mistral, DeepSeek) are built on the decoder-only transformer architecture. This means each is designed specifically for text generation: predicting the next token given all previous tokens.
The transformer is built from a stack of identical blocks. A typical large model has 32 to 128+ of these blocks, applied sequentially. Each block contains two main operations:
- Multi-Head Self-Attention — lets every token look at every other token
- Feed-Forward Network — processes each token individually
Both operations are wrapped in residual connections and RMSNorm for training stability.
The architecture described so far, where every token passes through every parameter, is called a dense transformer. But since 2024, many of the largest and most capable models use Mixture of Experts (MoE). In an MoE model, the FFN is replaced with multiple smaller FFNs called experts (typically 16 to 256). A lightweight router selects just a handful of experts per token (typically 2–8), so a model can have 671B total parameters but only activate 37B per token. DeepSeek-V3, for example, has 256 routed experts per layer but sends each token to only 8 of them plus 1 shared expert.
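Top-k routing can be sketched as follows. This is a simplified illustration, not any particular model's router: the router scores are passed in directly instead of being computed from the token by a learned matrix, softmax gating is assumed, and the "experts" are toy functions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Four toy experts that just scale their input by 1, 2, 3, or 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.0, 2.0, -1.0, 1.0]  # made-up scores for a single token
chosen = route_top_k(router_logits, k=2)
print([idx for idx, _ in chosen])  # [1, 3]
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from.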
Self-Attention
The core mechanism that separates transformers from everything before them
Showing attention pattern for the token "bake" — line thickness represents attention weight
Self-attention is the mechanism that separates transformers from everything that came before them. It allows every token to directly attend to every other token in the context, with learned weights determining how much each token should influence each other token.
This illustration shows semantic relationship strengths between tokens. During actual inference, a causal mask restricts each token to only attend to itself and earlier tokens (see Step 3: Causal Mask). The arcs to later tokens here illustrate the semantic affinity that the model would leverage when those tokens are accessible.
This is why the same word can mean different things in different contexts. A word like "bank" in "river bank" versus "bank account" will produce completely different output vectors after attention, because the surrounding tokens pull different information into it.
Query, Key, Value
Three learned projections from every token's embedding
Every token produces its own Q, K, V vectors simultaneously
For each token, three vectors are computed by multiplying the token's embedding by three separate learned weight matrices (Wq, Wk, Wv):
- Query (Q): "What am I looking for?"
- Key (K): "What do I advertise about myself?"
- Value (V): "What actual information do I carry?"
These are separate learned transformations. The model learned what makes a good query, a good key, and a good value during training.
Scaled Dot-Product Attention
Step-by-step computation for "bake" attending to all tokens
Step 1: Dot Products
Take the dot product of the Q of "bake" with the K of every token (including "bake" itself) to produce raw similarity scores.
Step 2: Scale
Each score is divided by the square root of the key dimension. This is a stability fix: it prevents the dot products from growing too large in high dimensions and saturating the softmax.
Step 3: Causal Mask
Tokens after the current position are masked out. "bake" is at position 3, so it can only see positions 0 through 3.
During text generation, the attention is masked so that each token can only see tokens that came before it, plus itself. This enforces the rule that the model can only use past context to predict the next token.
Step 4: Softmax
The remaining scores pass through softmax, normalizing into a probability distribution summing to 1.0.
Step 5: Weighted Sum of Values
The output for each token is a weighted average of the Value vectors it is allowed to see. After this step, the vector for "bake" is no longer just about the word "bake": it now carries contextual information from the tokens it attended to.
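The five steps above fit in one short function. This is a minimal single-head sketch in plain Python with made-up 2-dimensional vectors; real implementations do the same arithmetic as batched matrix operations on a GPU.

```python
import math


def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask over lists of vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Steps 1-3: dot products, scaling, and the causal mask
        # (only positions 0..i are scored; later ones are left out).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Step 4: softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps] + [0.0] * (len(K) - i - 1)
        # Step 5: weighted sum of Value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Four tokens with invented Q, K, V vectors.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
outputs, weights = causal_attention(Q, K, V)
print(weights[1])  # the last two entries are 0.0: positions 2 and 3 are masked
```

The first token can only see itself, so its output is exactly its own Value vector.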
Multi-Head Attention
32 parallel attention computations, each learning different patterns
Rather than running one attention computation, the model runs multiple attention operations in parallel — each called a head. Each head uses its own Wq, Wk, Wv matrices and learns to attend to different types of relationships.
One head might track syntactic dependencies (subject-verb agreement). Another might track semantic similarity. Another might focus on proximity. A model with 32 attention heads is attending to 32 different relationship structures simultaneously per layer.
The outputs of all heads are concatenated and projected back to the original dimension through a final linear layer.
Beyond Standard Multi-Head Attention
Modern optimizations that reduce memory cost while preserving quality
The original multi-head attention gives each head its own Query, Key, and Value projections. This is powerful but expensive — storing all those Key and Value vectors for every previous token (the "KV cache") consumes enormous amounts of memory during generation. Modern models use several strategies to reduce this cost.
Grouped-Query Attention (GQA): Used in LLaMA 2 70B, LLaMA 3, and Mistral. Instead of giving every head its own Key and Value projections, groups of query heads share the same K/V. For example, 32 query heads might share 8 sets of K/V projections, reducing KV cache size by 4x with minimal quality loss.
Multi-Head Latent Attention (MLA): Used in DeepSeek-V2/V3. Instead of caching full Key and Value vectors, MLA compresses them into a small "latent" vector (for example, 512 dimensions vs. 7,168). When needed, the full K/V are reconstructed from this compressed representation. DeepSeek reports a KV cache reduction of about 93% for DeepSeek-V2 relative to its earlier dense model, one of the most dramatic compressions among production attention variants.
Sliding Window Attention (SWA): Used in Mistral. Each token only attends to a fixed window of nearby tokens (for example, 4,096 tokens) rather than the full context, with information propagating through stacked layers.
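To make the memory savings concrete, here is a back-of-the-envelope calculation for the GQA example. The configuration (32 layers, head dimension 128, 8,192 tokens, 2 bytes per value) is an assumed example, not the published spec of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory needed to cache Keys and Values for one sequence."""
    # The leading 2 accounts for storing both a Key and a Value per position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30)  # 4.0  (GiB with one K/V set per query head)
print(gqa / 2**30)  # 1.0  (GiB with 8 shared K/V sets)
print(mha / gqa)    # 4.0
```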
Feed-Forward Network
Each token is processed independently through a gated neural network
After attention, each token's vector passes independently through the Feed-Forward Network (FFN). This is where most of the model's raw capacity lives.
The original Transformer (2017) used a two-layer FFN with ReLU activation and a 4x expansion ratio. BERT and GPT-2/3 switched to GELU. Modern frontier models (LLaMA, Mistral, DeepSeek, and most others released since 2023) use SwiGLU — a gated activation that uses three weight matrices instead of two. The gating mechanism lets the network learn to selectively pass or block information, improving model quality. Because of the extra matrix, the inner dimension is scaled to ~2.67x (instead of 4x) to keep the total parameter count comparable. For a 4,096-dimension model with SwiGLU, the inner dimension is approximately 11,008 instead of 16,384.
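A SwiGLU feed-forward layer and the inner-dimension arithmetic can be sketched like this. The rounding rule (two-thirds of 4x, rounded up to a multiple of 256) is modeled on openly published LLaMA reference code and is an assumption here; other models may size the layer differently. The weight matrix names are illustrative.

```python
import math


def silu(x: float) -> float:
    """SiLU (also called Swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Three-matrix gated FFN: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # the gate scales each hidden unit
    return matvec(W_down, hidden)


def swiglu_inner_dim(d_model: int, multiple_of: int = 256) -> int:
    """Two-thirds of the classic 4x expansion, rounded up to a hardware-friendly size."""
    raw = int(2 * 4 * d_model / 3)
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)


print(swiglu_inner_dim(4096))  # 11008
```

When the gate output is zero, the corresponding hidden unit contributes nothing, which is the "selectively pass or block" behavior described above.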
Research suggests the FFN layers function as a kind of key-value memory store, where specific factual associations are encoded in the weight matrices. This is where the model "stores" learned facts about the world.
Residual Connections & RMSNorm
Critical techniques that make deep transformers trainable
Around every sub-component (attention and FFN), the model uses two critical techniques:
Residual Connections: The input to each sub-layer is added back to its output. Think of it as:
output = sublayer(input) + input. This lets information flow directly through the network without being forced through every transformation.
Without residual connections, training deep models (80+ layers) would fail because gradients would vanish — the learning signal would evaporate before reaching early layers.
RMSNorm (Pre-Norm): The original 2017 Transformer placed normalization after the residual addition (post-norm). Modern models place it before each sub-layer (pre-norm), which improves training stability. Most current models also use RMSNorm — a simpler, faster variant that normalizes using only the root mean square of activations, omitting the mean-centering step of full LayerNorm.
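Both techniques are small enough to write out. In this sketch the attention and feed-forward sub-layers are passed in as plain functions, and the names `rms_norm`, `pre_norm_residual`, and `transformer_block` are chosen for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square of the activations; no mean-centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def pre_norm_residual(x, sublayer, gain):
    """output = sublayer(norm(input)) + input"""
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x, gain)))]


def transformer_block(x, attention, ffn, gain_attn, gain_ffn):
    """One pre-norm block: normalize, attend, add; normalize, feed forward, add."""
    x = pre_norm_residual(x, attention, gain_attn)
    x = pre_norm_residual(x, ffn, gain_ffn)
    return x


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(sum(v * v for v in normed) / len(normed))  # very close to 1.0
```

If a sub-layer outputs all zeros, the block returns its input unchanged, which is the direct path that keeps gradients flowing in deep stacks.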
Stacking It All Together
The full forward pass through all transformer layers
This process — RMSNorm, self-attention, residual add, RMSNorm, feed-forward, residual add — repeats identically for every layer in the model. Large models typically have 32 to 128+ transformer layers.
Each token's vector changes as it passes through each layer, accumulating contextual information. By the time a token exits the final layer, its vector is no longer about the word in isolation: it encodes the meaning of that word in the full context of every other word in the prompt.
After all transformer blocks, the final vector for each token position contains a rich, contextualized representation conditioned on the entire input.
Section 11
The Language Model Head
Figure: top predicted next tokens (about 100,000 total entries, most near 0%)
After the final transformer block, the output vector at the last token position is projected through one final linear layer called the LM Head. This produces a vector of ~100,000 raw scores, one for every token in the vocabulary. These scores are called logits.
Softmax converts these logits into a probability distribution over all possible next tokens. This is the actual output of the model: not a single predicted word, but a full probability distribution over the entire vocabulary.
The model is not "choosing" a word. It is assigning a probability to every single one of its 100,000 tokens simultaneously. "To" might get 12.3%, "First" might get 8.7%, and a random token like "Zamboni" might get 0.00001%.
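Here is the LM head and softmax on a toy scale. The three-token vocabulary, the weights, and the hidden vector are all invented; a real model does the same projection with a matrix of roughly 100,000 rows.

```python
import math


def lm_head(hidden, W_vocab):
    """Project the final hidden vector to one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_vocab]


def softmax(logits):
    m = max(logits)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["To", "First", "Zamboni"]                # hypothetical 3-token vocabulary
W_vocab = [[1.0, 0.5], [0.8, 0.4], [-3.0, -2.0]]  # made-up weights, shape [3 x 2]
hidden = [2.0, 1.0]                               # made-up final hidden vector

logits = lm_head(hidden, W_vocab)
probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.6f}")
```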
Section 12
Sampling and Decoding
Greedy decoding: always takes the single highest-probability token. Fast, but repetitive and prone to local maxima.
The model produces a probability distribution. How it selects the next token from that distribution is called the decoding strategy. Different strategies trade off between predictability and creativity.
In practice, production AI systems combine multiple strategies, for example temperature with top-p sampling. The exact settings are tuned to produce responses that are coherent and helpful without being repetitive.
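One way to combine temperature with top-p (nucleus) sampling is sketched below. The default values of 0.8 and 0.9 are arbitrary illustrations, not the settings of any real product, and treating a temperature of 0 as greedy decoding is a common convention assumed here.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Return the index of the sampled token."""
    rng = rng or random.Random()
    if temperature == 0:
        # Greedy decoding: just take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the most likely tokens until their probabilities reach top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens in proportion to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [5.0, 1.0, 0.0, -3.0]  # made-up scores for a 4-token vocabulary
print(sample_next_token(logits, temperature=0))  # 0
```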
Section 13
Autoregressive Generation
The full pipeline runs for every single token generated. After sampling a token, it gets appended to the sequence and the entire forward pass runs again.
Growing Sequence
Generation is autoregressive: after sampling one token, it is appended to the input sequence and the entire forward pass runs again to generate the next token.
This is why AI responses appear word by word (or token by token). Each token requires a full pass through the entire model, all 80+ transformer layers. A 500-token response requires 500 separate forward passes. Each pass after the first is far cheaper thanks to KV caching (Section 14) — only the single new token is processed through the FFN layers, and its Query attends to the cached Keys and Values from all previous tokens.
Generation scales roughly linearly with response length — each new token requires one forward pass. However, each pass grows slightly more expensive as the sequence lengthens, because the new token must attend to all previous tokens.
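The generation loop itself is short. In this sketch `next_token_fn` stands in for "one forward pass plus one decoding step", and `toy_model` is a hypothetical placeholder that counts upward so the loop can run without a real network.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_id):
    """Append one token at a time until EOS or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)  # the model sees everything generated so far
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(ids):
    """Hypothetical stand-in for a model: counts upward, then emits EOS (ID 0)."""
    return 0 if ids[-1] >= 5 else ids[-1] + 1


print(generate(toy_model, prompt_ids=[1, 2], max_new_tokens=10, eos_id=0))
# [1, 2, 3, 4, 5, 0]
```

Generation stops either when the model emits its end-of-sequence token or when the token budget runs out, whichever comes first.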
Section 14
The KV Cache
Running the full attention computation from scratch for every new token would be computationally prohibitive. The KV cache solves this.
Without a KV cache: every token gets recomputed from scratch on each step. This is wasted computation, because all previous K and V vectors are recalculated even though they have not changed.
With a KV cache: only the new token computes fresh Q, K, V vectors. Its Query attends to all previously cached Keys and Values. The new K/V pair is then appended to the cache.
Cache grows linearly with each generated token
The KV Cache stores the K and V vectors from all previous tokens so they do not need to be recomputed. Only the new token's Q, K, V vectors are computed per step. Its Q attends to all cached K/V pairs.
This is why prompt caching, reusing the KV cache across requests with the same prefix, gives large latency and cost reductions. The cached key-value pairs from the shared prefix are stored and reused directly.
This cache grows with every generated token, which is why very long conversations eventually become slow or expensive. The memory required scales linearly with sequence length, multiplied by the number of layers, the number of key/value heads, and the size of each head.
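The sketch below checks that decoding with a cache gives the same result as recomputing everything. It models a single attention head with tiny made-up weight matrices; the class name `KVCache` and its `step` method are illustrative, not an API from any library.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Attention output for one query over a list of keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]


class KVCache:
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x):
        """Process one new token, reusing the cached K/V of earlier tokens."""
        self.keys.append(matvec(self.Wk, x))    # compute K only for the new token
        self.values.append(matvec(self.Wv, x))  # compute V only for the new token
        return attend(matvec(self.Wq, x), self.keys, self.values)


def full_recompute(xs, Wq, Wk, Wv):
    """No cache: rebuild every K and V, then attend from the last token."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.5], [-0.5, 0.5]]
Wv = [[2.0, 0.0], [0.0, 3.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors

cache = KVCache(Wq, Wk, Wv)
for x in xs:
    cached_out = cache.step(x)

print(cached_out == full_recompute(xs, Wq, Wk, Wv))  # True
```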
Detokenization
Converting token IDs back to human-readable text
Example: the generated token IDs are mapped back to their text pieces and assembled into the final text:
To bake chocolate chip cookies, you will need the following ingredients...
Once generation is complete (or at each step during streaming), the token IDs are mapped back to their text representations using the same vocabulary table from tokenization. The text fragments are concatenated to form the final human-readable response.
This is why AI responses sometimes have unusual word breaks or spacing artifacts. The model operates on tokens, not words, and the boundaries do not always align with what we think of as a "word."
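Detokenization is a table lookup followed by concatenation. The vocabulary and IDs below are invented for illustration. Byte-level tokenizers concatenate raw bytes and then decode them as UTF-8, which this string-based sketch skips.

```python
# Hypothetical slice of a vocabulary. Note the leading spaces stored inside tokens.
id_to_token = {
    0: "To",
    1: " bake",
    2: " chocolate",
    3: " chip",
    4: " cookies",
    5: ",",
}


def detokenize(token_ids: list[int]) -> str:
    """Map each ID to its text piece and join the pieces with no separator."""
    return "".join(id_to_token[token_id] for token_id in token_ids)


print(detokenize([0, 1, 2, 3, 4, 5]))  # To bake chocolate chip cookies,
```

Because the spaces live inside the tokens, no separator is added when joining, and punctuation like the comma attaches directly to the previous word.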
Training
Where all the weights came from
16a. Pre-Training: Next-Token Prediction
The training objective
loss = -log(probability assigned to the correct next token)
The model was initialized with random weights and trained on hundreds of billions to trillions of tokens of text. The training objective is simple: next-token prediction. Given tokens 1 through N, predict token N+1.
The model's predicted distribution is compared to the actual next token using cross-entropy loss. Backpropagation computes how every single weight contributed to the error. An optimizer (typically Adam) nudges each weight slightly in the direction that reduces the loss.
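The loss formula is easy to try by hand. The probability values below are made up; the point is only that a confident correct prediction gives a small loss and an unsure one gives a larger loss.

```python
import math


def next_token_loss(probs: list[float], correct_id: int) -> float:
    """Cross-entropy for one position: -log of the probability given to the right token."""
    return -math.log(probs[correct_id])


def sequence_loss(per_position_probs, correct_ids) -> float:
    """Average the per-position losses over a training sequence."""
    losses = [next_token_loss(p, t) for p, t in zip(per_position_probs, correct_ids)]
    return sum(losses) / len(losses)


confident = next_token_loss([0.05, 0.90, 0.05], correct_id=1)
unsure = next_token_loss([0.40, 0.20, 0.40], correct_id=1)
print(confident < unsure)  # True
```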
This runs for weeks to months on clusters of thousands of specialized chips (GPUs or TPUs). LLaMA 3.1 405B was trained on up to 16,384 H100 GPUs, and its technical report describes a 54-day snapshot of that pre-training run. The result: a model that has compressed the statistical structure of human language and knowledge into its weight matrices.
16b. What the Weights Encode
Figure: conceptual view of a weight matrix (billions of parameters in total)
There is no database being queried at inference time. The training dataset is gone. What remains is a compression of its statistical structure baked into billions of static floating-point numbers.
Knowledge is distributed. Factual associations are patterns spread across millions of weights, not stored at discrete addresses. This is why the model can interpolate, generalize to novel contexts... and also hallucinate (generate a plausible-sounding pattern that has no factual basis).
The model is, at its most reduced level, a very large function that maps a sequence of token integers to a probability distribution over the next token. Everything else (the apparent reasoning, the knowledge, the style) is what emerges from scaling that objective.
16c. RLHF: Making It Actually Helpful
Raw pre-trained models are not yet useful as assistants. A multi-stage process converts the raw text predictor into a helpful assistant.
Stage 1: Supervised Fine-Tuning (SFT)
Human contractors write ideal (prompt, response) pairs
- Example prompts paired with ideal responses
- Model learns instruction-following format
- Establishes baseline helpful behavior
Stage 2: Reward Modeling
Humans compare model outputs and choose the better one
- Two model responses shown side by side
- Human raters select the preferred output
- Separate reward model trained on preferences
Stage 3: Reinforcement Learning (PPO)
Model optimizes for reward while staying near SFT baseline
- Model generates candidate responses
- Reward model scores each response
- KL penalty prevents drift from SFT baseline
This pipeline is what converts a raw text predictor into something that behaves like a coherent, helpful, refusal-capable assistant. Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences through supervised examples, reward modeling, and policy optimization.
Modern variants like Direct Preference Optimization (DPO) skip the separate reward model and optimize preferences directly, simplifying the pipeline while often achieving comparable results.
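Two pieces of the RLHF pipeline reduce to simple formulas. The sketch below uses a commonly published formulation (a pairwise logistic loss for the reward model, and a reward minus a scaled log-probability gap for the KL penalty). It is an assumption that any given production system uses exactly these forms; the function names and the `beta` value are illustrative.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss: small when the human-preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))  # same as -log(sigmoid(margin))


def kl_penalized_reward(reward, logprob_policy, logprob_sft, beta=0.1):
    """Reward used in the RL stage: penalize drifting away from the SFT baseline."""
    return reward - beta * (logprob_policy - logprob_sft)


# The reward model is pushed to score the preferred response higher.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True

# No drift from the SFT model means no penalty.
print(kl_penalized_reward(1.0, logprob_policy=-2.0, logprob_sft=-2.0))  # 1.0
```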
The Complete Pipeline
Every step, from keypress to displayed response
All of this happens in a fraction of a second per token: every matrix multiplication, every attention computation, every layer. A typical response of 500 tokens means the model runs this entire pipeline 500 times. On modern hardware, this takes just a few seconds.
The Complete Response
From keypress to answer
How do I bake chocolate chip cookies?
To bake chocolate chip cookies, you will need the following ingredients...
Every AI response you have ever received was produced by this process — a sequence of mathematical operations, running on numbers, producing probabilities, selecting tokens one at a time.
The core model itself is a stateless mathematical function — it has no built-in memory between forward passes and no inherent internet connection. But in production, AI assistants are wrapped in systems that add persistent memory across conversations, real-time web search, tool use, code execution, and more. These capabilities are not part of the neural network itself — they are engineered around it. The transformer at the center is still just predicting the next token.
But scaled to billions of parameters and trillions of training tokens, something remarkable emerges: the ability to generate text that appears knowledgeable, coherent, and helpful.
Accuracy note
This visualization explains how large language models work using general principles of the decoder-only transformer architecture. Specific numbers (layer counts, dimensions, vocabulary sizes) are drawn from openly documented models like LLaMA 3 and DeepSeek-V3. The exact architecture details of proprietary models like Claude and GPT-4 have not been publicly disclosed by their creators, but they share the same fundamental building blocks shown here.
This visualization covers the text-processing pipeline — the core of how language models work. Modern AI models in 2026 are also multimodal: they can process images, audio, video, and code using the same transformer architecture with additional input encoders. The fundamental mechanism — attention, feed-forward layers, autoregressive generation — remains the same across modalities.
Built as an educational visualization