What Happens When You Talk to an AI?

A visual journey through every step — from your keyboard to the AI's response

The User Types a Prompt

Where every AI interaction begins

Chat

Everything starts here. You type a question in plain English. But the AI does not understand English — it understands numbers. Before anything else, your text needs to be converted into a format the model can process. This conversion happens in several stages.

Text Encoding

Characters become numbers via Unicode

H → U+0048 (72)   o → U+006F (111)   w → U+0077 (119)   ␣ → U+0020 (32)
d → U+0064 (100)   o → U+006F (111)   ␣ → U+0020 (32)
I → U+0049 (73)   ␣ → U+0020 (32)
b → U+0062 (98)   a → U+0061 (97)   k → U+006B (107)   e → U+0065 (101)   ␣ → U+0020 (32)
c → U+0063 (99)   h → U+0068 (104)   o → U+006F (111)   c → U+0063 (99)   o → U+006F (111)   l → U+006C (108)   a → U+0061 (97)   t → U+0074 (116)   e → U+0065 (101)   ␣ → U+0020 (32)
c → U+0063 (99)   h → U+0068 (104)   i → U+0069 (105)   p → U+0070 (112)   ␣ → U+0020 (32)
c → U+0063 (99)   o → U+006F (111)   o → U+006F (111)   k → U+006B (107)   i → U+0069 (105)   e → U+0065 (101)   s → U+0073 (115)   ? → U+003F (63)

[72, 111, 119, 32, 100, 111, 32, 73, 32, 98, 97, 107, 101, 32, 99, 104, 111, 99, 111, 108, 97, 116, 101, 32, 99, 104, 105, 112, 32, 99, 111, 111, 107, 105, 101, 115, 63]

Before the AI even gets involved, your computer converts each character into a number using a standard called Unicode. The letter 'H' becomes 72. A space becomes 32. This is how all digital text works — it is not AI-specific. But the AI does not consume these raw character codes. It needs a smarter representation.
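You can reproduce this encoding step yourself with Python's built-in `ord()` function, no AI involved:

```python
# Convert each character of the prompt to its Unicode codepoint,
# exactly as shown in the figure above.
prompt = "How do I bake chocolate chip cookies?"
codepoints = [ord(ch) for ch in prompt]
print(codepoints)
# [72, 111, 119, 32, 100, 111, 32, 73, 32, 98, 97, 107, 101, 32, 99, ...]
```

For plain ASCII text like this, the UTF-8 bytes produced by `prompt.encode("utf-8")` are the same numbers.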

Tokenization

Breaking text into subword pieces

How do I bake chocolate chip cookies?

["How", " do", " I", " bake", " chocolate", " chip", " cookies", "?"]

The AI does not read word by word or letter by letter. It breaks text into chunks called tokens. A token is usually a common word or a piece of a word. Notice how spaces are attached to the beginning of words — that is part of how the tokenizer works.

A token is not a word and not a character — it is a subword unit produced by an algorithm called Byte Pair Encoding (BPE).

Claude and most modern LLMs use Byte Pair Encoding (BPE) with a vocabulary of roughly 65,000 tokens. Common words like 'the' are a single token. Uncommon words are split into multiple pieces — 'uncharacteristically' might become ['un', 'character', 'istic', 'ally']. Rare strings fall back all the way to raw bytes.

How BPE Builds a Vocabulary

Starting from individual characters

Example corpus: "low lower lowest"
Starting characters:
l
o
w
e
r
s
t
1. Most common pair: l + o → merged into "lo"
2. Most common pair: lo + w → merged into "low"
3. Most common pair: low + e → merged into "lowe"

Vocabulary

Base: l, o, w, ␣, e, r, s, t
+lo
+low
+lowe

During training, the tokenizer examines massive amounts of text and finds which pairs of characters appear together most often. It merges the most frequent pair into a single token, then repeats, over and over, until it has built a vocabulary of about 65,000 tokens.

This process happens once, before the AI model is even trained. After that, the same fixed vocabulary is used for every prompt and every response.
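The merge loop described above can be sketched in a few lines of Python. This is a toy reimplementation of BPE training on the three-word corpus from the figure, not any production tokenizer:

```python
from collections import Counter

def most_common_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the figure: each word starts as individual characters.
corpus = {tuple("low"): 1, tuple("lower"): 1, tuple("lowest"): 1}
vocab = sorted({ch for word in corpus for ch in word})
for _ in range(3):
    pair = most_common_pair(corpus)
    corpus = merge_pair(corpus, pair)
    vocab.append(pair[0] + pair[1])  # each merge adds one new vocabulary entry
```

Running the loop three times appends exactly the merges shown in the figure: "lo", then "low", then "lowe".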

Token IDs

The numbers the model actually sees

How
2437
do
656
I
314
bake
19832
chocolate
14693
chip
12851
cookies
21487
?
30
[2437, 656, 314, 19832, 14693, 12851, 21487, 30]

Each token in the vocabulary has a unique ID number. From this point forward, the AI works entirely with these numbers, not with text. Your question has become a sequence of 8 integers. The original English text is gone.
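The lookup itself is nothing more than a dictionary from token strings to integers. The IDs below match the figure and are illustrative — real vocabularies are not public for Claude:

```python
# Illustrative vocabulary fragment. Note the leading spaces: the
# tokenizer attaches the space to the start of most words.
vocab = {"How": 2437, " do": 656, " I": 314, " bake": 19832,
         " chocolate": 14693, " chip": 12851, " cookies": 21487, "?": 30}

tokens = ["How", " do", " I", " bake", " chocolate", " chip", " cookies", "?"]
token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [2437, 656, 314, 19832, 14693, 12851, 21487, 30]
```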

Token Embedding

From integer IDs to dense vectors

2437
656
314
19832
14693
12851
21487
30
4,096+ dimensions per token

Each token ID gets looked up in a giant table called the embedding matrix — a learned table of shape [65,000 × 4,096]. This lookup converts each token integer into a dense floating-point vector — a list of thousands of decimal numbers.

These vectors are entirely learned during training. The geometry that emerges is remarkable: semantically similar words cluster together in this high-dimensional space.
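The lookup itself is a single row-indexing operation. A toy NumPy sketch, with a tiny random matrix standing in for the real [65,000 × 4,096] learned table:

```python
import numpy as np

# Toy embedding table: [100 x 8] instead of [65,000 x 4,096].
# Random values stand in for learned weights.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(100, 8))

token_ids = [17, 3, 42]              # three hypothetical token IDs
x = embedding_matrix[token_ids]      # row lookup: one vector per token
print(x.shape)                       # (3, 8)
```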

A quick analogy

royaltyroyaltymanwomankingqueen

vector("king") − vector("man") + vector("woman") ≈ vector("queen")

This is not programmed — it emerges purely from learning to predict text.
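The analogy can be demonstrated with tiny hand-built vectors. The three dimensions below (a "royalty" axis, a "gender" axis, and one unused axis) are invented purely for illustration; real models learn such directions implicitly across thousands of dimensions:

```python
import numpy as np

# Hand-made 3-d "embeddings" for illustration only.
vec = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.2]),
    "woman": np.array([0.1, -0.8, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

result = vec["king"] - vec["man"] + vec["woman"]
nearest = max(vec, key=lambda w: cosine(vec[w], result))
print(nearest)  # queen
```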

The resulting matrix

Your 8 token IDs become a matrix of shape [8 × 4,096] — a stack of vectors, one per token.

...
...
...
...
...
...
...
...
8 × 4,096
[tokens × embedding dimensions]

Positional Encoding

Teaching the model about word order

Embedding Vectors
How
do
I
bake
chocolate
chip
cookies
?
+
+
+
+
+
+
+
+
Positional Encoding Vectors
=
=
=
=
=
=
=
=
Position-Aware Vectors

A transformer has no inherent sense of order — it processes all tokens simultaneously in parallel, not left to right. Without intervention, 'I bake cookies' and 'cookies bake I' would look identical to the model.

To fix this, a positional encoding is added to each embedding vector. This encodes where each token appears in the sequence.

The original Transformer paper (2017) used fixed sinusoidal functions — mathematical wave patterns based on position index. Modern models like Claude use Rotary Position Embedding (RoPE), which encodes the relative distance between tokens rather than their absolute position. This helps the model generalize to sequence lengths it was not trained on.

After positional encoding: each token is now represented by a vector that encodes both what it is and where it is.
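For concreteness, here is the fixed sinusoidal scheme from the 2017 paper, sketched in NumPy. (RoPE, which modern models reportedly prefer, works differently; this is simply the easiest variant to show.)

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=8, d_model=16)
# Each token's embedding row simply has its position row added to it:
# x = token_embeddings + pe
```

Every position gets a distinct vector, so after the addition, identical tokens at different positions are no longer identical to the model.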

Entering the Transformer

The engine that powers modern AI

Repeated N times (large models use 32 to 96+ layers)
Transformer Block 7
Transformer Block 6
Transformer Block 5
Transformer Block 4
Transformer Block 3
Multi-Head Self-Attention
Add & Layer Norm
Feed-Forward Network (FFN)
Add & Layer Norm
+ Residual connections around each sub-layer
Transformer Block 2
Transformer Block 1
position-encoded embeddings enter here

Your tokens now enter the Transformer, the engine that powers modern AI. Claude is a decoder-only transformer, meaning it is designed specifically for text generation: predicting the next token given all previous tokens.

The transformer is built from a stack of identical blocks. A typical large model has 32 to 96 of these blocks, applied sequentially. Each block contains two main operations:

  1. Multi-Head Self-Attention, which lets every token look at every other token
  2. A Feed-Forward Network, which processes each token individually

Both operations are wrapped in residual connections and layer normalization for training stability. Click any block in the tower to see its internal structure.

Self-Attention

The core mechanism that separates transformers from everything before them

Showing the attention pattern for the token "bake"; line thickness represents attention weight

How do I bake chocolate chip cookies?

Self-attention is the mechanism that separates transformers from everything that came before them. It allows every token to directly attend to every other token in the context, with learned weights determining how much each token should influence each other token.

When processing "bake," the model pays strong attention to "cookies," "chocolate," and "chip" because those words provide crucial context. It pays less attention to "How" and "do."

This is why the same word can mean different things in different contexts. A word like "bank" in "river bank" versus "bank account" will produce completely different output vectors after attention, because the surrounding tokens pull different information into it.

Query, Key, Value

Three learned projections from every token's embedding

"bake" embedding × Wq → Query (Q): "What am I looking for?"
"bake" embedding × Wk → Key (K): "What do I contain?"
"bake" embedding × Wv → Value (V): "What info do I carry?"
How
do
I
bake
chocolate
chip
cookies
?

every token produces its own Q, K, V vectors simultaneously

For each token, three vectors are computed by multiplying the token's embedding by three separate learned weight matrices (Wq, Wk, Wv):

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I advertise about myself?"
  • Value (V): "What actual information do I carry?"

These are separate learned transformations. The model learned what makes a good query, a good key, and a good value during training.

Scaled Dot-Product Attention

Step-by-step computation for "bake" attending to all tokens

Step 1: Dot Products

Multiply Q of "bake" with K of every other token to produce raw similarity scores.

bake.Q · How.K = 3.2
bake.Q · do.K = 1.1
bake.Q · I.K = 0.8
bake.Q · bake.K = 2.5
bake.Q · chocolate.K = 4.7
bake.Q · chip.K = 4.1
bake.Q · cookies.K = 5.3
bake.Q · ?.K = 0.3

Step 2: Scale

score / √dk, where dk = 128 and √128 ≈ 11.3

The division by the square root of the dimension is a stability fix: it prevents the dot products from growing too large in high dimensions and saturating the softmax.

Step 3: Causal Mask

Tokens after the current position are masked out. "bake" is at position 3, so it can only see positions 0 through 3.

How: 0.050
do: 0.017
I: 0.013
bake: 0.039
chocolate: 0.073
chip: 0.064
cookies: 0.083
?: 0.005

During text generation, the attention is masked so that each token can only see tokens that came before it, plus itself. This enforces the rule that the model can only use past context to predict the next token.

Step 4: Softmax

The remaining scores pass through softmax, normalizing into a probability distribution summing to 1.0.

How
17.9%
do
28.6%
I
10.7%
bake
42.9%

Step 5: Weighted Sum of Values

output_bake = 0.18 × V_How + 0.29 × V_do + 0.11 × V_I + 0.43 × V_bake

The output for each token is a weighted average of all Value vectors. After this step, the vector for "bake" is no longer just about the word "bake"; it now carries contextual information from the tokens it attended to.
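Steps 1 through 5 can be collapsed into one short NumPy function. Sizes here are toy values and the weight matrices are random stand-ins for learned parameters (real models use a d_k around 128):

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Minimal single-head causal self-attention, mirroring Steps 1-5."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # three learned projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # Steps 1-2: dot products, scale
    mask = np.triu(np.ones_like(scores), k=1)  # Step 3: hide future positions
    scores = np.where(mask == 1, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 4: softmax per row
    return weights @ V                         # Step 5: weighted sum of Values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 8, 16, 4
x = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv)          # shape (8, 4)
```

Changing a later token's embedding leaves the earlier rows of the output untouched — exactly the causal-mask property described above.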

Multi-Head Attention

32 parallel attention computations, each learning different patterns

Head 1: Syntax
Head 2: Semantics
Head 3: Proximity
Head 4: Position
Head 5: Context
Head 6: Global
... 32 heads total, each with its own Wq, Wk, Wv matrices
...
Concatenate
Linear Projection 4,096 dim

Rather than running one attention computation, the model runs multiple attention operations in parallel, each called a head. Each head uses its own Wq, Wk, Wv matrices and learns to attend to different types of relationships.

One head might track syntactic dependencies (subject-verb agreement). Another might track semantic similarity. Another might focus on proximity. A model with 32 attention heads is attending to 32 different relationship structures simultaneously per layer.

The outputs of all heads are concatenated and projected back to the original dimension through a final linear layer.
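A sketch of that wiring, with toy sizes (4 heads rather than 32, random weights in place of learned ones, and the causal mask omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 8, 32, 4
d_head = d_model // n_heads            # each head works in a smaller subspace

x = rng.normal(size=(n_tokens, d_model))
Wo = rng.normal(size=(d_model, d_model))   # final linear projection

head_outputs = []
for _ in range(n_heads):
    # Each head has its own Wq, Wk, Wv (random stand-ins here).
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax per row
    head_outputs.append(w @ V)

concat = np.concatenate(head_outputs, axis=-1)  # (n_tokens, d_model)
out = concat @ Wo                               # project back to model dim
```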

Feed-Forward Network

Each token is processed independently through a two-layer neural network

input vectors
8 × 4,096 dim
Feed-Forward Network
Linear Expand: 4,096 → 16,384
GELU Activation: non-linear transform
Linear Compress: 16,384 → 4,096
applied to each token independently
output vectors
8 × 4,096 dim

After attention, each token's vector passes independently through a two-layer neural network called the Feed-Forward Network (FFN). This is where most of the model's raw capacity lives.

In large models, the FFN is typically 4 times wider than the embedding dimension. For a 4,096-dimension model, the hidden layer has 16,384 neurons.

Research suggests the FFN layers function as a kind of key-value memory store, where specific factual associations are encoded in the weight matrices. This is where the model "stores" learned facts about the world.
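A minimal sketch of the FFN, scaled down from 4,096 → 16,384 to toy dimensions, using the common tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, widely used in transformer code."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64                  # stand-ins for 4,096 and 16,384
W1 = rng.normal(size=(d_model, d_hidden))   # expand
W2 = rng.normal(size=(d_hidden, d_model))   # compress

x = rng.normal(size=(8, d_model))           # 8 token vectors
out = gelu(x @ W1) @ W2                     # expand -> nonlinearity -> compress
```

Note that each row (token) is transformed independently — the FFN mixes dimensions, not positions.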

Residual Connections & Layer Norm

Critical techniques that make deep transformers trainable

Input → Multi-Head Self-Attention → (+) → Layer Norm → Feed-Forward Network → (+) → Layer Norm → Output

Each (+) is an element-wise addition where a residual (skip) connection rejoins the main path.

Around every sub-component (attention and FFN), the model uses two critical techniques:

Residual Connections: The input to each sub-layer is added back to its output. Think of it as: output = sublayer(input) + input. This lets information flow directly through the network without being forced through every transformation.

Without residual connections, training deep models (80+ layers) would fail because gradients would vanish: the learning signal would evaporate before reaching the early layers.

Layer Normalization: After each addition, the values are normalized (centered and scaled) to a consistent range. This keeps the numbers stable as they flow through dozens of layers. Without it, values would explode or collapse.
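Both techniques fit in a few lines. This sketch uses the post-norm arrangement (normalize after the residual addition) and omits the learned gain and bias parameters for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Center and scale each token vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_with_residual(x, sublayer):
    """output = LayerNorm(sublayer(x) + x): the skip path lets information
    flow past the sub-layer unchanged."""
    return layer_norm(sublayer(x) + x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
out = sublayer_with_residual(x, lambda v: v * 0.5)   # dummy sub-layer
```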

Stacking It All Together

The full forward pass through all transformer layers

~80+ layers total
Block 6
Block 5
Block 4
Block 3
Block 2
Block 1
8 token vectors

This process (self-attention, then feed-forward, with residual connections and normalization) repeats identically for every layer in the model. A large model like Claude has dozens to over 100 layers.

Each token's vector changes as it passes through each layer, accumulating contextual information. By the time a token exits the final layer, its vector is no longer about the word in isolation; it encodes the meaning of that word in the full context of every other word in the prompt.

After all transformer blocks, the final vector for each token position contains a rich, contextualized representation conditioned on the entire input.

The Language Model Head

Last token position output vector
?
vector [4,096 dims]
LM Head: Linear Projection [4,096 × 65,000]
65,000 Logits: raw unnormalized scores
Softmax
Probability Distribution: 65,000 values summing to 1.0

Top Predicted Next Tokens

To
12.3%
First
8.7%
Here
7.1%
You
5.9%
Start
4.2%
The
3.8%
Baking
3.1%
Pre
2.9%
Sure
2.4%
Great
1.9%

... 65,000 total entries, most near 0%

After the final transformer block, the output vector at the last token position is projected through one final linear layer called the LM Head. This produces a vector of ~65,000 raw scores, one for every token in the vocabulary. These scores are called logits.

Softmax converts these logits into a probability distribution over all possible next tokens. This is the actual output of the model: not a single predicted word, but a full probability distribution over the entire vocabulary.

The model is not "choosing" a word. It is assigning a probability to every single one of its 65,000 tokens simultaneously. "To" might get 12.3%, "First" might get 8.7%, and a random token like "Zamboni" might get 0.00001%.
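Softmax itself is a small computation. The logits below are invented for illustration, including a deliberately low score for an implausible token:

```python
import math

# A handful of made-up logits; a real model emits ~65,000 per step.
logits = {"To": 2.8, "First": 2.45, "Here": 2.25, "You": 2.06, "Zamboni": -9.0}

# Subtract the max before exponentiating -- a standard numerical
# stability trick that leaves the result unchanged.
max_logit = max(logits.values())
exps = {tok: math.exp(v - max_logit) for tok, v in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}
# probs now sums to 1.0; "Zamboni" ends up vanishingly small
```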

Sampling and Decoding

Greedy decoding: always takes the single highest-probability token

To
12.3%
First
8.7%
Here
7.1%
You
5.9%
Start
4.2%
The
3.8%
Baking
3.1%
Pre
2.9%
Sure
2.4%
Great
1.9%

Always picks the top token. Fast, but repetitive and prone to local maxima.

The model produces a probability distribution. How it selects the next token from that distribution is called the decoding strategy. Different strategies trade off between predictability and creativity.

In practice, production systems like Claude combine multiple strategies. For example, using temperature with top-p sampling. The exact settings are tuned to produce responses that are coherent and helpful without being repetitive.
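Sketches of three common strategies (greedy decoding, temperature scaling, and top-p sampling) over a toy distribution taken from the figure. The probabilities are illustrative and do not sum to 1 because the long tail is omitted:

```python
import random

probs = {"To": 0.123, "First": 0.087, "Here": 0.071, "You": 0.059,
         "Start": 0.042, "The": 0.038}

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def apply_temperature(probs, temperature):
    """Temperature < 1 sharpens the distribution; > 1 flattens it.
    Raising each p to 1/T and renormalizing is equivalent to dividing
    the logits by T before softmax."""
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    z = sum(scaled.values())
    return {t: v / z for t, v in scaled.items()}

def top_p_sample(probs, p=0.25, rng=None):
    """Nucleus sampling: sample only from the smallest set of tokens
    whose cumulative probability reaches p."""
    rng = rng or random.Random(0)
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)
    r, acc = rng.random() * total, 0.0
    for tok, pr in nucleus:
        acc += pr
        if acc >= r:
            return tok
    return nucleus[-1][0]
```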

Autoregressive Generation

The full pipeline runs for every single token generated. After sampling a token, it gets appended to the sequence and the entire forward pass runs again.

Tokenize
Embed
Positional Enc.
Transformer Layers
LM Head
Softmax
Sample

Growing Sequence

How
do
I
bake
chocolate
chip
cookies
?
Generated Response

Generation is autoregressive: after sampling one token, it is appended to the input sequence and the entire forward pass runs again to generate the next token.

This is why AI responses appear word by word (or token by token). Each token requires a full pass through the entire model, all 80+ transformer layers. A 500-token response requires 500 separate forward passes.

This is also why generation scales linearly with response length. Twice as many tokens means roughly twice as much computation.

The KV Cache

Running the full attention computation from scratch for every new token would be computationally prohibitive. The KV cache solves this.

Naive: Recompute Everything
How
do
I
bake
chocolate
chip
cookies
?
To

Every token gets recomputed from scratch on each step. The red flash marks wasted computation: all previous K and V vectors are recalculated even though they have not changed.

Optimized: KV Cache
Cached K/V Vectors
K/V How
K/V do
K/V I
K/V bake
K/V chocolate
K/V chip
K/V cookies
K/V ?
New Token (compute Q, K, V)

Only the new token computes fresh Q, K, V vectors. Its Query attends to all previously cached Keys and Values. The new K/V pair is then appended to the cache.

Cache Memory Growth
Token 1 → Token 12

Cache grows linearly with each generated token

The KV Cache stores the K and V vectors from all previous tokens so they do not need to be recomputed. Only the new token's Q, K, V vectors are computed per step. Its Q attends to all cached K/V pairs.

This is why prompt caching (reusing the KV cache across requests that share the same prefix) gives large latency and cost reductions. The cached key-value pairs from the shared prefix are stored and reused directly.

This cache grows with every generated token, which is why very long conversations eventually become slow or expensive. The memory required scales linearly with sequence length, multiplied by the number of layers and attention heads.
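A minimal sketch of the cache in action: each call processes one new token, computes its Q, K, V once, and appends K and V to the cache. Weights and inputs are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

k_cache, v_cache = [], []

def attend_next(x_new):
    """Process one new token vector, reusing cached K/V from prior steps."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)   # this token's K and V are computed once
    v_cache.append(x_new @ Wv)   # and never recomputed on later steps
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_k)          # new Q against all cached Keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax over past + self
    return w @ V                           # weighted sum of cached Values

for _ in range(5):                         # feed 5 tokens one at a time
    out = attend_next(rng.normal(size=d_model))
```

The cache grows by one K/V pair per token, which is exactly the linear memory growth the text describes.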

Detokenization

Converting token IDs back to human-readable text

Generated token IDs

[1061, 19832, 14693, 12851, 21487, 11, 499, 690, 1184, 279, 2768, 14293]

Each ID flips to reveal its text representation

1061 → "To"
19832 → " bake"
14693 → " chocolate"
12851 → " chip"
21487 → " cookies"
11 → ","
499 → " you"
690 → " will"
1184 → " need"
279 → " the"
2768 → " following"
14293 → " ingredients"

Final assembled text

To bake chocolate chip cookies, you will need the following ingredients...

Once generation is complete (or at each step during streaming), the token IDs are mapped back to their text representations using the same vocabulary table from tokenization. The text fragments are concatenated to form the final human-readable response.

This is why AI responses sometimes have unusual word breaks or spacing artifacts. The model operates on tokens, not words, and the boundaries do not always align with what we think of as a "word."
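Detokenization is a reverse dictionary lookup plus string concatenation. The IDs below match the figure and are illustrative, not a real vocabulary:

```python
# Reverse of the tokenizer's vocabulary table (illustrative IDs).
id_to_text = {1061: "To", 19832: " bake", 14693: " chocolate",
              12851: " chip", 21487: " cookies", 11: ",", 499: " you",
              690: " will", 1184: " need", 279: " the",
              2768: " following", 14293: " ingredients"}

generated = [1061, 19832, 14693, 12851, 21487, 11, 499, 690,
             1184, 279, 2768, 14293]

# Concatenate the text fragments; the leading spaces inside the tokens
# themselves produce the correct word spacing.
text = "".join(id_to_text[i] for i in generated)
print(text)
# To bake chocolate chip cookies, you will need the following ingredients
```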

Training

Where all the weights came from

16a. Pre-Training: Next-Token Prediction

Text corpusWeb pages, books, code
TokenizeBreak into token IDs
Predict nextModel outputs distribution
CompareCheck against actual token
Compute lossCross-entropy error
Update weightsBackpropagation + Adam
Repeattrillions of times

The training objective

loss = -log(probability assigned to the correct next token)
High loss: P(correct) = 1% → loss = 4.605
Low loss: P(correct) = 80% → loss = 0.223

The model was initialized with random weights and trained on hundreds of billions to trillions of tokens of text. The training objective is simple: next-token prediction. Given tokens 1 through N, predict token N+1.

The model's predicted distribution is compared to the actual next token using cross-entropy loss. Backpropagation computes how every single weight contributed to the error. An optimizer (typically Adam) nudges each weight slightly in the direction that reduces the loss.

This runs for months on clusters of thousands of GPUs. The result: a model that has compressed the statistical structure of human language and knowledge into its weight matrices.
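The loss computation itself is one line; this reproduces the two cases shown above:

```python
import math

def next_token_loss(p_correct):
    """Cross-entropy for next-token prediction: the negative log of the
    probability the model assigned to the correct token."""
    return -math.log(p_correct)

high = next_token_loss(0.01)   # confident in the wrong tokens -> ~4.605
low = next_token_loss(0.80)    # mostly right -> ~0.223
```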

16b. What the Weights Encode

Conceptual view of a weight matrix

 0.023  -0.189   0.452  -0.003   0.781  -0.346
-0.568   0.901  -0.234   0.679  -0.012   0.346
 0.123  -0.789   0.234  -0.568   0.890  -0.123
-0.346   0.568  -0.901   0.234  -0.679   0.012
 0.789  -0.123   0.568  -0.890   0.123  -0.457
-0.234   0.679  -0.012   0.346  -0.789   0.234

... billions of parameters total

There is no database being queried at inference time. The training dataset is gone. What remains is a compression of its statistical structure baked into billions of static floating-point numbers.

Knowledge is distributed. Factual associations are patterns spread across millions of weights, not stored at discrete addresses. This is why the model can interpolate, generalize to novel contexts... and also hallucinate (generate a plausible-sounding pattern that has no factual basis).

The model is, at its most reduced level, a very large function that maps a sequence of token integers to a probability distribution over the next token. Everything else (the apparent reasoning, the knowledge, the style) is what emerges from scaling that objective.

16c. RLHF: Making It Actually Helpful

Raw pre-trained models are not yet useful as assistants. A multi-stage process converts the raw text predictor into a helpful assistant.

1

Stage 1

Supervised Fine-Tuning (SFT)

Human contractors write ideal (prompt, response) pairs

  • Example prompts paired with ideal responses
  • Model learns instruction-following format
  • Establishes baseline helpful behavior
2

Stage 2

Reward Modeling

Humans compare model outputs and choose the better one

  • Two model responses shown side by side
  • Human raters select the preferred output
  • Separate reward model trained on preferences
3

Stage 3

Reinforcement Learning (PPO)

Model optimizes for reward while staying near SFT baseline

  • Model generates candidate responses
  • Reward model scores each response
  • KL penalty prevents drift from SFT baseline

This pipeline is what converts a raw text predictor into something that behaves like a coherent, helpful, refusal-capable assistant. Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences through supervised examples, reward modeling, and policy optimization.

Modern variants like Direct Preference Optimization (DPO) skip the separate reward model and optimize preferences directly, simplifying the pipeline while achieving comparable results.

The Complete Pipeline

Every step, from keypress to displayed response

1
User types text
2
Unicode encoding
3
BPE Tokenization
4
Token IDs
5
Embedding lookup
6
Positional encoding
7
Transformer Block 1
8
Transformer Block 2
9
Transformer Block N
10
LM Head projection
11
Softmax
12
Sampling
13
Selected token
14
Append and loop
15
Detokenization
16
Display response
Response delivered to user

All of this happens in a fraction of a second per token. Every matrix multiplication, every attention computation, every layer. A typical response of 500 tokens means the model runs this entire pipeline 500 times. On modern hardware, this takes just a few seconds.

The Complete Response

From keypress to answer

AI Assistant

How do I bake chocolate chip cookies?

To bake chocolate chip cookies, you will need the following ingredients...

Every AI response you have ever received was produced by this process — a sequence of mathematical operations, running on numbers, producing probabilities, selecting tokens one at a time.

The model has no understanding, no memory of previous conversations (unless explicitly engineered), and no access to the internet during generation. It is, at its core, a very large function — one whose parameters were learned by reading the equivalent of millions of books.

But scaled to billions of parameters and trillions of training tokens, something remarkable emerges: the ability to generate text that appears knowledgeable, coherent, and helpful.

Accuracy note

This visualization uses Claude (by Anthropic) as its primary example. Anthropic has not publicly disclosed the exact architecture details, parameter count, or training data of Claude. The descriptions in this visualization are based on the confirmed fact that Claude is a decoder-only transformer using BPE tokenization, combined with well-documented principles of transformer architecture that apply to all models in this family. Specific numbers (layer counts, dimensions, vocabulary sizes) are illustrative and representative of large language models in general.

Built as an educational visualization