What Happens When You Talk to an AI?
A visual journey through every step — from your keyboard to the AI's response
The User Types a Prompt
Where every AI interaction begins
Everything starts here. You type a question in plain English. But the AI does not understand English — it understands numbers. Before anything else, your text needs to be converted into a format the model can process. This conversion happens in several stages.
Text Encoding
Characters become numbers via Unicode
Before the AI even gets involved, your computer converts each character into a number using a standard called Unicode. The letter 'H' becomes 72. A space becomes 32. This is how all digital text works — it is not AI-specific. But the AI does not consume these raw character codes. It needs a smarter representation.
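This first step can be seen directly in Python, whose built-in ord returns a character's Unicode code point (the string below is just an example):

```python
# Unicode code points for each character (any text works)
text = "Hi there"
code_points = [ord(ch) for ch in text]
# 'H' -> 72, 'i' -> 105, ' ' (space) -> 32, ...
print(code_points)
```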
Tokenization
Breaking text into subword pieces
The AI does not read word by word or letter by letter. It breaks text into chunks called tokens. A token is usually a common word or a piece of a word. Notice how spaces are attached to the beginning of words — that is part of how the tokenizer works.
A token is not a word and not a character — it is a subword unit produced by an algorithm called Byte Pair Encoding (BPE).
Claude and most modern LLMs use Byte Pair Encoding (BPE) with a vocabulary of roughly 65,000 tokens. Common words like 'the' are a single token. Uncommon words get split into multiple pieces — 'uncharacteristically' might become ['un', 'character', 'istic', 'ally']. Rare strings reduce all the way down to raw bytes.
How BPE Builds a Vocabulary
Starting from individual characters
Vocabulary
During training, the tokenizer examines massive amounts of text and finds which pair of characters appears together most often. It merges that pair into a single token. Then it repeats — over and over — until it has built a vocabulary of about 65,000 tokens.
This process happens once, before the AI model is even trained. After that, the same fixed vocabulary is used for every prompt and every response.
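The merge loop can be sketched in a few lines of Python. This is a minimal toy version (real tokenizers operate on bytes, run tens of thousands of merges, and record the merge order), but the core idea is the same:

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count every adjacent pair across all sequences; return the most common."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of the pair with a single merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus: words as character sequences
corpus = [list("lower"), list("lowest"), list("low"), list("low")]
vocab = {c for seq in corpus for c in seq}
for _ in range(3):  # a real tokenizer runs tens of thousands of merges
    pair = most_frequent_pair(corpus)
    corpus = [merge_pair(seq, pair) for seq in corpus]
    vocab.add(pair[0] + pair[1])
```

After three merges the common substring "low" has become a single vocabulary entry, exactly the behavior described above for frequent words.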
Token IDs
The numbers the model actually sees
Each token in the vocabulary has a unique ID number. From this point forward, the AI works entirely with these numbers — not with text. Your question has become a sequence of 8 integers. The original English text is gone.
Token Embedding
From integer IDs to dense vectors
Each token ID gets looked up in a giant table called the embedding matrix — a learned table of shape [65,000 × 4,096]. This lookup converts each token integer into a dense floating-point vector — a list of thousands of decimal numbers.
These vectors are entirely learned during training. The geometry that emerges is remarkable: semantically similar words cluster together in this high-dimensional space.
A quick analogy
vector("king") − vector("man") + vector("woman") ≈ vector("queen")
This is not programmed — it emerges purely from learning to predict text.
The resulting matrix
Your 8 token IDs become a matrix of shape [8 × 4,096] — a stack of vectors, one per token.
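As a sketch, the lookup is nothing more than row indexing. The shapes below are scaled down so it runs instantly, and the token IDs are made up; a real model would use something like [65,000 × 4,096]:

```python
import numpy as np

# Scaled-down shapes so the sketch runs instantly; a real embedding
# matrix would be on the order of [65_000 x 4_096]
vocab_size, d_model = 1_000, 8
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([17, 392, 5, 998, 41, 7, 256, 3])  # 8 hypothetical IDs
embeddings = embedding_matrix[token_ids]                 # plain row lookup, no arithmetic
```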
Positional Encoding
Teaching the model about word order
A transformer has no inherent sense of order — it processes all tokens simultaneously in parallel, not left to right. Without intervention, 'I bake cookies' and 'cookies bake I' would look identical to the model.
To fix this, a positional encoding is added to each embedding vector. This encodes where each token appears in the sequence.
The original Transformer paper (2017) used fixed sinusoidal functions — mathematical wave patterns based on position index. Modern models like Claude use Rotary Position Embedding (RoPE), which encodes the relative distance between tokens rather than their absolute position. This helps the model generalize to sequence lengths it was not trained on.
After positional encoding: each token is now represented by a vector that encodes both what it is and where it is.
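For concreteness, here is the fixed sinusoidal scheme from the original paper (RoPE is more involved, but serves the same purpose):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings from the 2017 Transformer paper."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10_000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
# The encoding is simply added to the embedding matrix of the same shape:
#   x = embeddings + pe
```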
Entering the Transformer
The engine that powers modern AI
Your tokens now enter the Transformer — the engine that powers modern AI. Claude is a decoder-only transformer. This means it is designed specifically for text generation — predicting the next token given all previous tokens.
The transformer is built from a stack of identical blocks. A typical large model has 32 to 96 of these blocks, applied sequentially. Each block contains two main operations:
- Multi-Head Self-Attention — lets every token look at every other token
- Feed-Forward Network — processes each token individually
Both operations are wrapped in residual connections and layer normalization for training stability. Click any block in the tower to see its internal structure.
Self-Attention
The core mechanism that separates transformers from everything before them
Showing attention pattern for the token "bake" — line thickness represents attention weight
Self-attention is the mechanism that separates transformers from everything that came before them. It allows every token to directly attend to every other token in the context, with learned weights determining how much each token should influence each other token.
When processing "bake," the model pays strong attention to "cookies," "chocolate," and "chip" — because those words provide crucial context. It pays less attention to "How" and "do."
This is why the same word can mean different things in different contexts. A word like "bank" in "river bank" versus "bank account" will produce completely different output vectors after attention, because the surrounding tokens pull different information into it.
Query, Key, Value
Three learned projections from every token's embedding
Every token produces its own Q, K, V vectors simultaneously
For each token, three vectors are computed by multiplying the token's embedding by three separate learned weight matrices (Wq, Wk, Wv):
- Query (Q): "What am I looking for?"
- Key (K): "What do I advertise about myself?"
- Value (V): "What actual information do I carry?"
These are separate learned transformations. The model learned what makes a good query, a good key, and a good value during training.
Scaled Dot-Product Attention
Step-by-step computation for "bake" attending to all tokens
Step 1: Dot Products
Multiply Q of "bake" with K of every token (including itself) to produce raw similarity scores.
Step 2: Scale
The division by the square root of the dimension is a stability fix — it prevents the dot products from growing too large in high dimensions and saturating the softmax.
Step 3: Causal Mask
Tokens after the current position are masked out. "bake" is at position 3, so it can only see positions 0 through 3.
During text generation, the attention is masked so that each token can only see tokens that came before it, plus itself. This enforces the rule that the model can only use past context to predict the next token.
Step 4: Softmax
The remaining scores pass through softmax, normalizing into a probability distribution summing to 1.0.
Step 5: Weighted Sum of Values
The output for each token is a weighted average of all Value vectors. After this step, the vector for "bake" is no longer just about the word "bake" — it now carries contextual information from the tokens it attended to.
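The five steps above can be condensed into a short NumPy sketch with toy dimensions; the weight matrices here are random stand-ins for learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # learned projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 1-2: dot products, scale
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                           # step 3: causal mask
    weights = softmax(scores)                        # step 4: each row sums to 1
    return weights @ V, weights                      # step 5: weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 8, 16, 16
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, weights = causal_attention(X, Wq, Wk, Wv)
```

Note how the mask makes every weight above the diagonal exactly zero: token 0 can only attend to itself, token 3 to positions 0 through 3, and so on.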
Multi-Head Attention
32 parallel attention computations, each learning different patterns
Rather than running one attention computation, the model runs multiple attention operations in parallel — each called a head. Each head uses its own Wq, Wk, Wv matrices and learns to attend to different types of relationships.
One head might track syntactic dependencies (subject-verb agreement). Another might track semantic similarity. Another might focus on proximity. A model with 32 attention heads is attending to 32 different relationship structures simultaneously per layer.
The outputs of all heads are concatenated and projected back to the original dimension through a final linear layer.
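A minimal sketch of the multi-head wiring, again with random stand-in weights and toy sizes (the causal mask is omitted here to keep the focus on the split-concatenate-project structure):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 8, 64, 4
d_head = d_model // n_heads                       # 16 dims per head

X = rng.standard_normal((seq_len, d_model))
head_outputs = []
for _ in range(n_heads):                          # each head has its own Wq, Wk, Wv
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax
    head_outputs.append(w @ V)                    # (seq_len, d_head)

concat = np.concatenate(head_outputs, axis=-1)    # back to (seq_len, d_model)
Wo = rng.standard_normal((d_model, d_model))
output = concat @ Wo                              # final output projection
```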
Feed-Forward Network
Each token is processed independently through a two-layer neural network
After attention, each token's vector passes independently through a two-layer neural network called the Feed-Forward Network (FFN). This is where most of the model's raw capacity lives.
In large models, the FFN is typically 4 times wider than the embedding dimension. For a 4,096-dimension model, the hidden layer has 16,384 neurons.
Research suggests the FFN layers function as a kind of key-value memory store, where specific factual associations are encoded in the weight matrices. This is where the model "stores" learned facts about the world.
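The FFN itself is only two matrix multiplications with a nonlinearity in between. A sketch with toy dimensions; the GELU activation is a common choice, though specific models vary:

```python
import numpy as np

def gelu(x):
    """Smooth activation (tanh approximation) used in many transformers."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256                       # hidden layer is 4x wider
W1, b1 = rng.standard_normal((d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_hidden, d_model)), np.zeros(d_model)

def ffn(x):
    """Expand, apply the nonlinearity, project back down."""
    return gelu(x @ W1 + b1) @ W2 + b2

tokens = rng.standard_normal((8, d_model))
out = ffn(tokens)  # applied to each token row independently
```

Because the FFN touches each row of the input on its own, processing one token alone gives the same result as processing it inside a batch — that is what "each token independently" means above.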
Residual Connections & Layer Norm
Critical techniques that make deep transformers trainable
Around every sub-component (attention and FFN), the model uses two critical techniques:
Residual Connections: The input to each sub-layer is added back to its output. Think of it as:
output = sublayer(input) + input. This lets information flow directly through the network without being forced through every transformation.
Without residual connections, training deep models (80+ layers) would fail because gradients would vanish — the learning signal would evaporate before reaching early layers.
Layer Normalization: After each addition, the values are normalized — centered and scaled to a consistent range. This keeps the numbers stable as they flow through dozens of layers. Without it, values would explode or collapse.
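Both techniques together, as a sketch (learned gain and bias parameters of layer norm omitted; whether normalization comes before or after the residual addition varies by model, and the post-norm wiring below matches the description above):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Center and scale each token vector to a consistent range."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_with_residual(x, sublayer):
    # residual: the input is added back to the sublayer's output
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((8, 16))
y = sublayer_with_residual(x, lambda v: v * 0.1)  # stand-in sublayer
```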
Stacking It All Together
The full forward pass through all transformer layers
This process — self-attention, then feed-forward, with residual connections and normalization — repeats identically for every layer in the model. A large model like Claude has dozens to over 100 layers.
Each token's vector changes as it passes through each layer, accumulating contextual information. By the time a token exits the final layer, its vector is no longer about the word in isolation — it encodes the meaning of that word in the full context of every other word in the prompt.
After all transformer blocks, the final vector for each token position contains a rich, contextualized representation conditioned on the entire input.
Section 11
The Language Model Head
Top Predicted Next Tokens
... 65,000 total entries, most near 0%
After the final transformer block, the output vector at the last token position is projected through one final linear layer called the LM Head. This produces a vector of ~65,000 raw scores, one for every token in the vocabulary. These scores are called logits.
Softmax converts these logits into a probability distribution over all possible next tokens. This is the actual output of the model: not a single predicted word, but a full probability distribution over the entire vocabulary.
The model is not "choosing" a word. It is assigning a probability to every single one of its 65,000 tokens simultaneously. "To" might get 12.3%, "First" might get 8.7%, and a random token like "Zamboni" might get 0.00001%.
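A sketch with a six-token toy vocabulary and made-up logits (the percentages here are illustrative, not real model outputs):

```python
import numpy as np

# Toy logits for a 6-token vocabulary; a real model emits ~65,000
vocab = ["To", "First", "You", "Pre", "Bak", "Zamboni"]
logits = np.array([4.1, 3.7, 2.9, 1.2, 0.5, -8.0])

# Softmax: subtract the max for numerical stability, exponentiate, normalize
probs = np.exp(logits - logits.max())
probs /= probs.sum()
for tok, p in zip(vocab, probs):
    print(f"{tok:>8}: {p:.4%}")
```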
Section 12
Sampling and Decoding
Greedy decoding: always take the single highest-probability token. Fast, but repetitive and prone to local maxima.
The model produces a probability distribution. How it selects the next token from that distribution is called the decoding strategy. Different strategies trade off between predictability and creativity.
In practice, production systems like Claude combine multiple strategies. For example, using temperature with top-p sampling. The exact settings are tuned to produce responses that are coherent and helpful without being repetitive.
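A sketch combining temperature scaling with top-p (nucleus) sampling; the cutoff values and logits are illustrative, not any system's actual settings:

```python
import numpy as np

def sample(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature                 # <1 sharpens, >1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most to least likely
    cumulative = np.cumsum(probs[order])
    # Smallest set of tokens whose probabilities sum to at least top_p
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    nucleus = probs[keep] / probs[keep].sum()     # renormalize over the nucleus
    return rng.choice(keep, p=nucleus)

logits = np.array([4.1, 3.7, 2.9, 1.2, 0.5, -8.0])
token_id = sample(logits, rng=np.random.default_rng(0))
```

Temperature reshapes the distribution; top-p then discards the long tail of barely-plausible tokens before sampling, which is how extremely unlikely tokens get excluded entirely.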
Section 13
Autoregressive Generation
The full pipeline runs for every single token generated. After sampling a token, it gets appended to the sequence and the entire forward pass runs again.
Growing Sequence
Generation is autoregressive: after sampling one token, it is appended to the input sequence and the entire forward pass runs again to generate the next token.
This is why AI responses appear word by word (or token by token). Each token requires a full pass through the entire model, all 80+ transformer layers. A 500-token response requires 500 separate forward passes.
This is also why generation scales linearly with response length. Twice as many tokens means roughly twice as much computation.
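The generation loop itself is simple. The sketch below uses a stub in place of the real transformer forward pass, hypothetical token IDs, and greedy decoding for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, eos_id = 100, 99

def forward_pass(token_ids):
    """Stand-in for the full transformer: returns fake next-token logits."""
    return rng.standard_normal(vocab_size)

prompt = [17, 42, 8]                      # hypothetical prompt token IDs
sequence = list(prompt)
for _ in range(20):                       # one full forward pass per new token
    logits = forward_pass(sequence)
    next_id = int(np.argmax(logits))      # greedy decoding for simplicity
    sequence.append(next_id)
    if next_id == eos_id:                 # a stop token ends generation
        break
```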
Section 14
The KV Cache
Running the full attention computation from scratch for every new token would be computationally prohibitive. The KV cache solves this.
Without the cache: every token gets recomputed from scratch on each step. The red flash marks wasted computation: all previous K and V vectors are recalculated even though they have not changed.
With the cache: only the new token computes fresh Q, K, V vectors. Its Query attends to all previously cached Keys and Values. The new K/V pair is then appended to the cache.
Cache grows linearly with each generated token
The KV cache stores the K and V vectors from all previous tokens so they do not need to be recomputed. Each step computes only the new token's Q, K, and V, and its Query attends to every cached K/V pair.
This is why prompt caching, reusing the KV cache across requests with the same prefix, gives large latency and cost reductions. The cached key-value pairs from the shared prefix are stored and reused directly.
This cache grows with every generated token, which is why very long conversations eventually become slow or expensive. The memory required scales linearly with sequence length, multiplied by the number of layers and attention heads.
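A single-head sketch of cached generation, with random stand-in weights; each call processes one new token and reuses everything cached so far:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 16
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
k_cache, v_cache = [], []                        # grows by one entry per token

def attend_with_cache(x_new):
    """Process one new token: compute its Q/K/V, reuse cached K/V."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    k_cache.append(k)                            # append, never recompute
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)  # all positions so far
    scores = K @ q / np.sqrt(d_k)                # new Q against every cached K
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax over past positions
    return w @ V

for _ in range(5):                               # five generation steps
    out = attend_with_cache(rng.standard_normal(d_model))
```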
Detokenization
Converting token IDs back to human-readable text
Generated token IDs
Each ID flips to reveal its text representation
Final assembled text
To bake chocolate chip cookies, you will need the following ingredients...
Once generation is complete (or at each step during streaming), the token IDs are mapped back to their text representations using the same vocabulary table from tokenization. The text fragments are concatenated to form the final human-readable response.
This is why AI responses sometimes have unusual word breaks or spacing artifacts. The model operates on tokens, not words, and the boundaries do not always align with what we think of as a "word."
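As a sketch, detokenization is just a reverse lookup and concatenation. The IDs and vocabulary entries below are hypothetical; note how the leading spaces baked into tokens produce the word boundaries:

```python
# Hypothetical vocabulary fragment (real IDs and token splits would differ)
id_to_token = {314: "To", 9857: " bake", 511: " chocolate", 412: " chip", 77: " cookies"}

token_ids = [314, 9857, 511, 412, 77]
text = "".join(id_to_token[i] for i in token_ids)
# -> "To bake chocolate chip cookies"
```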
Training
Where all the weights came from
16a. Pre-Training: Next-Token Prediction
The training objective
loss = -log(probability assigned to the correct next token)
The model was initialized with random weights and trained on hundreds of billions to trillions of tokens of text. The training objective is simple: next-token prediction. Given tokens 1 through N, predict token N+1.
The model's predicted distribution is compared to the actual next token using cross-entropy loss. Backpropagation computes how every single weight contributed to the error. An optimizer (typically Adam) nudges each weight slightly in the direction that reduces the loss.
This runs for months on clusters of thousands of GPUs. The result: a model that has compressed the statistical structure of human language and knowledge into its weight matrices.
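The loss from the formula above, computed for a single position over a toy six-token vocabulary:

```python
import numpy as np

# Model's predicted distribution over a toy 6-token vocabulary,
# and the token that actually came next in the training text
probs = np.array([0.123, 0.087, 0.40, 0.25, 0.139, 0.001])
correct_id = 2

loss = -np.log(probs[correct_id])   # cross-entropy loss at this position
# A perfect prediction (probability 1.0) gives loss 0; a confident
# wrong prediction makes the loss very large.
```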
16b. What the Weights Encode
Conceptual view of a weight matrix
... billions of parameters total
There is no database being queried at inference time. The training dataset is gone. What remains is a compression of its statistical structure baked into billions of static floating-point numbers.
Knowledge is distributed. Factual associations are patterns spread across millions of weights, not stored at discrete addresses. This is why the model can interpolate, generalize to novel contexts... and also hallucinate (generate a plausible-sounding pattern that has no factual basis).
The model is, at its most reduced level, a very large function that maps a sequence of token integers to a probability distribution over the next token. Everything else (the apparent reasoning, the knowledge, the style) is what emerges from scaling that objective.
16c. RLHF: Making It Actually Helpful
Raw pre-trained models are not yet useful as assistants. A multi-stage process converts the raw text predictor into a helpful assistant.
Stage 1
Supervised Fine-Tuning (SFT)
Human contractors write ideal (prompt, response) pairs
- Example prompts paired with ideal responses
- Model learns instruction-following format
- Establishes baseline helpful behavior
Stage 2
Reward Modeling
Humans compare model outputs and choose the better one
- Two model responses shown side by side
- Human raters select the preferred output
- Separate reward model trained on preferences
Stage 3
Reinforcement Learning (PPO)
Model optimizes for reward while staying near SFT baseline
- Model generates candidate responses
- Reward model scores each response
- KL penalty prevents drift from SFT baseline
This pipeline is what converts a raw text predictor into something that behaves like a coherent, helpful, refusal-capable assistant. Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences through supervised examples, reward modeling, and policy optimization.
Modern variants like Direct Preference Optimization (DPO) skip the separate reward model and optimize preferences directly, simplifying the pipeline while achieving comparable results.
The Complete Pipeline
Every step, from keypress to displayed response
All of this happens in a fraction of a second per token. Every matrix multiplication, every attention computation, every layer. A typical response of 500 tokens means the model runs this entire pipeline 500 times. On modern hardware, this takes just a few seconds.
The Complete Response
From keypress to answer
How do I bake chocolate chip cookies?
To bake chocolate chip cookies, you will need the following ingredients...
Every AI response you have ever received was produced by this process — a sequence of mathematical operations, running on numbers, producing probabilities, selecting tokens one at a time.
The model has no understanding, no memory of previous conversations (unless explicitly engineered), and no access to the internet during generation. It is, at its core, a very large function — one whose parameters were learned by reading the equivalent of millions of books.
But scaled to billions of parameters and trillions of training tokens, something remarkable emerges: the ability to generate text that appears knowledgeable, coherent, and helpful.
Accuracy note
This visualization uses Claude (by Anthropic) as its primary example. Anthropic has not publicly disclosed the exact architecture details, parameter count, or training data of Claude. The descriptions in this visualization are based on the confirmed fact that Claude is a decoder-only transformer using BPE tokenization, combined with well-documented principles of transformer architecture that apply to all models in this family. Specific numbers (layer counts, dimensions, vocabulary sizes) are illustrative and representative of large language models in general.
Built as an educational visualization