What Happens When You Talk to an AI?

How AI chat systems like ChatGPT, Claude, Gemini, and others actually work — from your keyboard to the AI's response

The User Types a Prompt

Where every AI interaction begins

Everything starts here. You type a question in plain English. But the AI does not understand English — it understands numbers. Before anything else, your text needs to be converted into a format the model can process. This conversion happens in several stages.

Text Encoding

Characters become numbers via Unicode

"How": H = U+0048 (72), o = U+006F (111), w = U+0077 (119)
"do": d = U+0064 (100), o = U+006F (111)
"I": I = U+0049 (73)
"bake": b = U+0062 (98), a = U+0061 (97), k = U+006B (107), e = U+0065 (101)
"chocolate": c = U+0063 (99), h = U+0068 (104), o = U+006F (111), c = U+0063 (99), o = U+006F (111), l = U+006C (108), a = U+0061 (97), t = U+0074 (116), e = U+0065 (101)
"chip": c = U+0063 (99), h = U+0068 (104), i = U+0069 (105), p = U+0070 (112)
"cookies": c = U+0063 (99), o = U+006F (111), o = U+006F (111), k = U+006B (107), i = U+0069 (105), e = U+0065 (101), s = U+0073 (115)
"?": ? = U+003F (63)
Each space between words = U+0020 (32)
[72, 111, 119, 32, 100, 111, 32, 73, 32, 98, 97, 107, 101, 32, 99, 104, 111, 99, 111, 108, 97, 116, 101, 32, 99, 104, 105, 112, 32, 99, 111, 111, 107, 105, 101, 115, 63]

Before the AI even gets involved, your computer converts each character into a number using a standard called Unicode. The letter 'H' becomes 72. A space becomes 32. This is how all digital text works — it is not AI-specific. But the AI does not consume these raw character codes. It needs a smarter representation.
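
The character-to-number step can be reproduced in a few lines. This is a minimal sketch using Python's built-in `ord`; the variable names are illustrative, and the UTF-8 line is an added observation that holds for plain ASCII text like this prompt.

```python
text = "How do I bake chocolate chip cookies?"

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in text]

# The same values in the U+XXXX notation used above.
labels = [f"U+{cp:04X}" for cp in code_points]

# Text is usually stored and transmitted as UTF-8 bytes. For plain ASCII
# characters like these, each byte equals the code point.
utf8_bytes = list(text.encode("utf-8"))

print(code_points[:4])  # [72, 111, 119, 32]
print(labels[:4])       # ['U+0048', 'U+006F', 'U+0077', 'U+0020']
```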

Tokenization

Breaking text into subword pieces

How do I bake chocolate chip cookies?
"How" | " do" | " I" | " bake" | " chocolate" | " chip" | " cookies" | "?"

The AI does not read word by word or letter by letter. It breaks text into chunks called tokens. A token is usually a common word or a piece of a word. Notice how spaces are attached to the beginning of words — that is part of how the tokenizer works.

A token is not a word and not a character — it is a subword unit produced by an algorithm called Byte Pair Encoding (BPE).

Most modern LLMs use Byte Pair Encoding (BPE). Vocabulary sizes vary: LLaMA 3 uses 128,256 tokens, GPT-4 uses ~100,000, and GPT-4o uses ~200,000. Common words like 'the' are a single token. Uncommon words get split into multiple pieces — 'uncharacteristically' might become ['un', 'character', 'istic', 'ally'] in a simplified illustration (in practice, BPE splits follow frequency patterns, not morphology — in a large modern vocabulary this word might actually be a single token). Rare strings reduce all the way down to raw bytes.

How BPE Builds a Vocabulary

Starting from individual characters

Example corpus: "low lower lowest"
Starting characters: l, o, w, e, r, s, t
Step 1: most common pair l + o → lo
Step 2: most common pair lo + w → low
Step 3: most common pair low + e → lowe

Vocabulary

Base: l, o, w, ␣, e, r, s, t
+lo
+low
+lowe

During training, the tokenizer examines massive amounts of text and finds which pair of adjacent symbols appears together most often. It merges that pair into a single token, then repeats the process, over and over, until it has built a vocabulary of 100,000 to 200,000+ tokens.

This process happens once, before the AI model is even trained. After that, the same fixed vocabulary is used for every prompt and every response.
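
The merge loop can be sketched in plain Python. This is a simplified illustration, not a production tokenizer: the function name `train_bpe` is made up here, words are split on whitespace before merging (so the space symbol is left out), and ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties go to the pair counted first (Counter keeps insertion order).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, words = train_bpe("low lower lowest", 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On the example corpus this yields the same three merges shown above: lo, then low, then lowe.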

Token IDs

The numbers the model actually sees

"How" → 2437
" do" → 656
" I" → 314
" bake" → 19832
" chocolate" → 14693
" chip" → 12851
" cookies" → 21487
"?" → 30
[2437, 656, 314, 19832, 14693, 12851, 21487, 30]

Each token in the vocabulary has a unique ID number. From this point forward, the AI works entirely with these numbers, not with text. Your question has become a sequence of 8 integers. The original English text is gone.

Token Embedding

From integer IDs to dense vectors

2437
656
314
19832
14693
12851
21487
30
4,096+ dimensions per token

Each token ID gets looked up in a giant table called the embedding matrix — a learned table of shape [100,000 × 4,096]. This lookup converts each token integer into a dense floating-point vector — a list of thousands of decimal numbers.

These vectors are entirely learned during training. The geometry that emerges is remarkable: semantically similar words cluster together in this high-dimensional space.

A historical breakthrough

Diagram: man, woman, king, and queen plotted as points, with the "royalty" direction linking man to king and woman to queen.

vector("king") − vector("man") + vector("woman") ≈ vector("queen")

An early breakthrough from Word2Vec (2013), which gave each word a fixed position in vector space. This showed that vector geometry encodes meaning.

Modern LLMs go further with contextual embeddings: the vector for "bank" is different in "river bank" vs. "bank account." The embedding lookup shown here is just a starting point; these vectors are continuously refined by attention in the layers above.

The resulting matrix

Your 8 token IDs become a matrix of shape [8 × 4,096] — a stack of vectors, one per token.

Matrix shape: 8 × 4,096 [tokens × embedding dimensions]
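
The lookup itself is nothing more than row indexing. The sketch below uses toy sizes and random values so it runs instantly; `VOCAB_SIZE`, `EMBED_DIM`, and the token IDs are stand-ins chosen for illustration, and a real model would use learned values at the sizes quoted above.

```python
import random

random.seed(0)
VOCAB_SIZE = 50  # toy value; the text uses about 100,000
EMBED_DIM = 4    # toy value; the text uses 4,096

# The embedding matrix: one row of EMBED_DIM floats per vocabulary entry.
# In a real model these values are learned; here they are random.
embedding_matrix = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    # The "lookup" is plain row indexing: token ID i selects row i.
    return [embedding_matrix[i] for i in token_ids]

toy_ids = [7, 12, 3, 41, 19, 28, 33, 0]  # 8 hypothetical IDs below VOCAB_SIZE
vectors = embed(toy_ids)
print(len(vectors), len(vectors[0]))  # 8 4
```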

Positional Encoding

Teaching the model about word order

Embedding Vectors (How, do, I, bake, chocolate, chip, cookies, ?) + Positional Encoding Vectors = Position-Aware Vectors, one per token

A transformer has no inherent sense of order: it processes all tokens simultaneously in parallel, not left to right. Without intervention, 'I bake cookies' and 'cookies bake I' would look identical to the model.

To fix this, a positional encoding is added to each embedding vector. This encodes where each token appears in the sequence.

The original Transformer paper (2017) used fixed sinusoidal functions: mathematical wave patterns based on position index. Most modern decoder-only LLMs use Rotary Position Embedding (RoPE), including LLaMA, Mistral, Qwen, and DeepSeek. RoPE encodes each token's position by rotating its Query and Key vectors using position-dependent rotation matrices. The result is that position enters the attention score only through the relative distance between tokens, not their absolute positions. This helps models generalize to sequence lengths they were not trained on, which is critical for the 128K+ context windows common in 2026.
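
The relative-distance property of RoPE can be checked numerically. The sketch below is a deliberately reduced version: it rotates a single 2-D vector with one assumed frequency `theta`, whereas a real model rotates many dimension pairs at different frequencies. The function names and numbers are illustrative.

```python
import math

def rotate(vec, position, theta=0.1):
    # Rotate a 2-D vector by an angle proportional to its position.
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(q, k, q_pos, k_pos):
    # Attention score: dot product of the rotated query and rotated key.
    qx, qy = rotate(q, q_pos)
    kx, ky = rotate(k, k_pos)
    return qx * kx + qy * ky

q, k = (1.0, 0.5), (0.3, -0.8)

near_start = score(q, k, q_pos=5, k_pos=2)       # distance 3
far_along = score(q, k, q_pos=1005, k_pos=1002)  # distance 3, shifted by 1000
different = score(q, k, q_pos=5, k_pos=4)        # distance 1

print(abs(near_start - far_along) < 1e-9)  # True: same distance, same score
print(abs(near_start - different) > 1e-3)  # True: different distance, different score
```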

The maximum number of tokens a model can process at once is called its context window. Early models had windows of just 2,000 to 4,000 tokens. Modern models in 2026 can handle 128,000 to over 1,000,000 tokens, enough to process entire books or codebases at once.

After positional encoding: each token is now represented by a vector that encodes both what it is and where it is.

Entering the Transformer

The engine that powers modern AI

Repeated N times (large models use 32 to 128+ layers)
Transformer Block 7
Transformer Block 6
Transformer Block 5
Transformer Block 4
Transformer Block 3 (expanded):
RMS Norm
Multi-Head Self-Attention
Add (Residual)
RMS Norm
Feed-Forward Network (FFN)
Add (Residual)
+ Residual connections around each sub-layer
Transformer Block 2
Transformer Block 1
position-encoded embeddings enter here

Your tokens now enter the Transformer, the engine that powers modern AI. All major AI chat models (GPT-4, Claude, Gemini, LLaMA, Mistral, DeepSeek) are built on the decoder-only transformer architecture. This means each is designed specifically for text generation: predicting the next token given all previous tokens.

The transformer is built from a stack of identical blocks. A typical large model has 32 to 128+ of these blocks, applied sequentially. Each block contains two main operations:

  1. Multi-Head Self-Attention: lets every token look at every other token
  2. Feed-Forward Network: processes each token individually

Both operations are wrapped in residual connections and RMSNorm for training stability.

The architecture described so far, where every token passes through every parameter, is called a dense transformer. But since 2024, many of the largest and most capable models use Mixture of Experts (MoE). In an MoE model, the FFN is replaced with multiple smaller FFNs called experts (typically 16 to 256). A lightweight router selects just a handful of experts per token (typically 2 to 8), so a model can have 671B total parameters but activate only 37B per token, as DeepSeek-V3 does. DeepSeek-V3 has 256 routed experts per layer but sends each token to only 8 of them plus 1 shared expert.
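
A top-k router can be sketched in a few lines. Everything here is a toy: the four "experts" are simple scaling functions, the router logits are made up, and the softmax-over-selected-experts weighting is one common choice rather than the scheme of any particular model.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k):
    # Pick the top_k experts with the highest router scores for this token.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the chosen experts' scores so their weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    return chosen, weights

def moe_layer(x, experts, router_logits, top_k=2):
    chosen, weights = route(router_logits, top_k)
    # Only the chosen experts run; all others are skipped for this token.
    out = [0.0] * len(x)
    for idx, w in zip(chosen, weights):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen

# Four toy "experts": each just scales the input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
x = [1.0, -1.0]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router output for one token

out, chosen = moe_layer(x, experts, router_logits, top_k=2)
print(chosen)  # [1, 3]
```

Only two of the four experts execute for this token, which is the source of the gap between total and active parameters.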

Self-Attention

The core mechanism that separates transformers from everything before them

Showing attention pattern for the token "bake" (line thickness represents attention weight)

How do I bake chocolate chip cookies?

Self-attention is the mechanism that separates transformers from everything that came before them. It allows every token to directly attend to every other token in the context, with learned weights determining how much each token should influence each other token.

This illustration shows semantic relationship strengths between tokens. During actual inference, a causal mask restricts each token to only attend to itself and earlier tokens (shown in section 7c). The arcs to later tokens here illustrate the semantic affinity that the model would leverage when those tokens are accessible.

This is why the same word can mean different things in different contexts. A word like "bank" in "river bank" versus "bank account" will produce completely different output vectors after attention, because the surrounding tokens pull different information into it.

Query, Key, Value

Three learned projections from every token's embedding

"bake" embedding × Wq → Query (Q): What am I looking for?
"bake" embedding × Wk → Key (K): What do I contain?
"bake" embedding × Wv → Value (V): What info do I carry?
How
do
I
bake
chocolate
chip
cookies
?

every token produces its own Q, K, V vectors simultaneously

For each token, three vectors are computed by multiplying the token's embedding by three separate learned weight matrices (Wq, Wk, Wv):

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I advertise about myself?"
  • Value (V): "What actual information do I carry?"

These are separate learned transformations. The model learned what makes a good query, a good key, and a good value during training.

Scaled Dot-Product Attention

Step-by-step computation for "bake" attending to all tokens

Step 1: Dot Products

Take the dot product of the Query of "bake" with the Key of every token (including itself) to produce raw similarity scores.

bake.Q · How.K=3.2
bake.Q · do.K=1.1
bake.Q · I.K=0.8
bake.Q · bake.K=2.5
bake.Q · chocolate.K=4.7
bake.Q · chip.K=4.1
bake.Q · cookies.K=5.3
bake.Q · ?.K=0.3

Step 2: Scale

score / √dk, where dk = 128 and √128 ≈ 11.3

The division by the square root of the dimension is a stability fix: it prevents the dot products from growing too large in high dimensions and saturating the softmax.

Step 3: Causal Mask

Tokens after the current position are masked out. "bake" is at position 3, so it can only see positions 0 through 3.

Scaled scores (raw score / 11.3); positions after "bake" are masked out:

How 0.283
do 0.097
I 0.071
bake 0.221
chocolate masked
chip masked
cookies masked
? masked

During text generation, the attention is masked so that each token can only see tokens that came before it, plus itself. This enforces the rule that the model can only use past context to predict the next token.

Step 4: Softmax

The remaining scores pass through softmax, normalizing into a probability distribution summing to 1.0.

How
27.9%
do
23.2%
I
22.6%
bake
26.3%

Step 5: Weighted Sum of Values

output_bake = 0.28 × V_How + 0.23 × V_do + 0.23 × V_I + 0.26 × V_bake

The output for each token is a weighted average of all Value vectors. After this step, the vector for "bake" is no longer just about the word "bake": it now carries contextual information from the tokens it attended to.
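
Steps 1 through 5 can be traced in code. This sketch starts from the raw scores listed in Step 1 and uses dk = 128 as stated in Step 2; the 2-D Value vectors are invented purely so the weighted sum has something to operate on.

```python
import math

tokens = ["How", "do", "I", "bake", "chocolate", "chip", "cookies", "?"]
raw_scores = [3.2, 1.1, 0.8, 2.5, 4.7, 4.1, 5.3, 0.3]  # Step 1, from the text
d_k = 128
query_pos = tokens.index("bake")  # position 3

# Step 2: scale by the square root of the head dimension.
scaled = [s / math.sqrt(d_k) for s in raw_scores]

# Step 3: causal mask. Later positions get -inf so softmax gives them 0.
masked = [s if i <= query_pos else float("-inf") for i, s in enumerate(scaled)]

# Step 4: softmax over the masked scores.
peak = max(masked)
exps = [math.exp(s - peak) for s in masked]
total = sum(exps)
weights = [e / total for e in exps]

# Step 5: weighted sum of Value vectors (toy 2-D values, made up here).
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0],
          [0.0, 2.0], [3.0, 3.0], [1.0, 2.0], [2.0, 1.0]]
output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

for tok, w in zip(tokens, weights):
    print(f"{tok:>10s} {w:.3f}")
```

Because the scaled scores for the four visible tokens are close together, the resulting weights are fairly even; masked tokens receive exactly zero weight.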

Multi-Head Attention

32 parallel attention computations, each learning different patterns

Head 1: Syntax
Head 2: Semantics
Head 3: Proximity
Head 4: Position
Head 5: Context
Head 6: Global
... 32 heads total, each with its own Wq, Wk, Wv matrices
Concatenate
Linear Projection (4,096 dim)

Rather than running one attention computation, the model runs multiple attention operations in parallel, each called a head. Each head uses its own Wq, Wk, Wv matrices and learns to attend to different types of relationships.

One head might track syntactic dependencies (subject-verb agreement). Another might track semantic similarity. Another might focus on proximity. A model with 32 attention heads is attending to 32 different relationship structures simultaneously per layer.

The outputs of all heads are concatenated and projected back to the original dimension through a final linear layer.

Beyond Standard Multi-Head Attention

Modern optimizations that reduce memory cost while preserving quality

The original multi-head attention gives each head its own Query, Key, and Value projections. This is powerful but expensive: storing all those Key and Value vectors for every previous token (the "KV cache") consumes enormous amounts of memory during generation. Modern models use several strategies to reduce this cost.

Grouped-Query Attention (GQA): Used in the larger LLaMA 2 models, LLaMA 3, and Mistral. Instead of giving every head its own Key and Value projections, groups of query heads share the same K/V. For example, 32 query heads might share 8 sets of K/V projections, reducing KV cache size by 4x with minimal quality loss.

Multi-Head Latent Attention (MLA): Used in DeepSeek-V2/V3. Instead of caching full Key and Value vectors, MLA compresses them into a small "latent" vector (for example, 512 dimensions vs. 7,168). When needed, the full K/V are reconstructed from this compressed representation. This achieves ~93% KV cache reduction, one of the most dramatic compressions among production attention variants.

Sliding Window Attention (SWA): Used in Mistral. Each token only attends to a fixed window of nearby tokens (for example, 4,096 tokens) rather than the full context, with information propagating through stacked layers.
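
To make the GQA saving concrete, the arithmetic below compares cache sizes for 32 versus 8 sets of K/V. The model shape (32 layers, 128-dimensional heads, 16-bit values, an 8,192-token context) is a hypothetical configuration chosen for round numbers, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one Key and one Value vector per token, layer, and KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical 32-layer model, 128-dim heads, 16-bit values, 8,192-token context.
common = dict(seq_len=8192, n_layers=32, head_dim=128)

mha = kv_cache_bytes(n_kv_heads=32, **common)  # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **common)   # 32 query heads share 8 K/V sets

print(mha / 2**30, "GiB")  # 4.0 GiB
print(gqa / 2**30, "GiB")  # 1.0 GiB
```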

Feed-Forward Network

Each token is processed independently through a gated neural network

input vectors
8 × 4,096 dim
Feed-Forward Network
Gate Projection: 4,096 → 11,008 (gate path)
Value Projection: 4,096 → 11,008 (value path)
SwiGLU Activation: Swish(gate) ⊙ value
Output Projection: 11,008 → 4,096
applied to each token independently
output vectors
8 × 4,096 dim

After attention, each token's vector passes independently through the Feed-Forward Network (FFN). This is where most of the model's raw capacity lives: in a typical dense model, the FFN holds roughly two-thirds of each layer's parameters.

The original Transformer (2017) used a two-layer FFN with ReLU activation and a 4x expansion ratio. BERT and GPT-2/3 switched to GELU. Modern frontier models (LLaMA, Mistral, DeepSeek, and most others released since 2023) use SwiGLU, a gated activation that uses three weight matrices instead of two. The gating mechanism lets the network learn to selectively pass or block information, improving model quality. Because of the extra matrix, the inner dimension is scaled to ~2.67x (instead of 4x) to keep the total parameter count comparable. For a 4,096-dimension model with SwiGLU, the inner dimension is approximately 11,008 instead of 16,384.

Research suggests the FFN layers function as a kind of key-value memory store, where specific factual associations are encoded in the weight matrices. This is where the model "stores" learned facts about the world.
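
The SwiGLU data flow (gate path, value path, element-wise product, output projection) can be sketched with tiny matrices. The weights below are arbitrary toy values, and the helper names are illustrative; the final two lines simply redo the parameter-count comparison using the sizes quoted above.

```python
import math

def swish(x):
    # Swish (also called SiLU): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_value, w_out):
    gate = [swish(g) for g in matvec(w_gate, x)]   # gate path
    value = matvec(w_value, x)                     # value path
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise product
    return matvec(w_out, hidden)                   # project back down

# Toy sizes: model dimension 2, inner dimension 3 (the text uses 4,096 and 11,008).
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_value = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_out = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
y = swiglu_ffn([1.0, 2.0], w_gate, w_value, w_out)

# Parameter count per FFN, using the sizes quoted above.
d_model = 4096
relu_ffn_params = 2 * d_model * (4 * d_model)  # two matrices, 4x expansion
swiglu_params = 3 * d_model * 11008            # three matrices, ~2.67x expansion
print(relu_ffn_params, swiglu_params)  # 134217728 135266304
```

The two counts differ by less than 1%, which is the sense in which the smaller inner dimension keeps the parameter budget comparable.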

Residual Connections & RMSNorm

Critical techniques that make deep transformers trainable

Input → RMS Norm → Multi-Head Self-Attention → + (skip) → RMS Norm → Feed-Forward Network → + (skip) → Output
skip = residual (skip) connection; + = element-wise addition

Around every sub-component (attention and FFN), the model uses two critical techniques:

Residual Connections: The input to each sub-layer is added back to its output. Think of it as: output = sublayer(input) + input. This lets information flow directly through the network without being forced through every transformation.

Without residual connections, training deep models (80+ layers) would fail because gradients would vanish: the learning signal would evaporate before reaching early layers.

RMSNorm (Pre-Norm): The original 2017 Transformer placed normalization after the residual addition (post-norm). Modern models place it before each sub-layer (pre-norm), which improves training stability. Most current models also use RMSNorm, a simpler, faster variant that normalizes using only the root mean square of activations, omitting the mean-centering step of full LayerNorm.
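
A pre-norm block with residual connections can be written compactly. In this sketch the attention and FFN sub-layers are replaced by trivial stand-in functions, and the `gain` vector (a learned per-dimension scale in real models) defaults to all ones; the names are illustrative.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    # Divide by the root mean square of the activations; no mean subtraction.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, attention, ffn):
    # output = sublayer(norm(input)) + input, applied once per sub-layer.
    x = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x)))]
    return x

# Stand-in sub-layers (hypothetical): real ones are attention and the FFN.
def toy_attention(v):
    return [0.1 * vi for vi in v]

def toy_ffn(v):
    return [-0.2 * vi for vi in v]

normed = rms_norm([3.0, 4.0])
out = pre_norm_block([3.0, 4.0], toy_attention, toy_ffn)
print([round(v, 3) for v in normed])  # [0.849, 1.131]
```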

Stacking It All Together

The full forward pass through all transformer layers

~80+ layers total
Block 6
Block 5
Block 4
Block 3
Block 2
Block 1
8 token vectors

This process (RMSNorm, self-attention, residual add, RMSNorm, feed-forward, residual add) repeats with the same structure, but different learned weights, for every layer in the model. Large models typically have 32 to 128+ transformer layers.

Each token's vector changes as it passes through each layer, accumulating contextual information. By the time a token exits the final layer, its vector is no longer about the word in isolation: it encodes the meaning of that word in the context of the tokens it could attend to.

After all transformer blocks, the final vector for each token position contains a rich, contextualized representation conditioned on that token and everything before it; the last position is conditioned on the entire input.

Section 11

The Language Model Head

Last token position ("?") output vector [4,096 dims]
LM Head: Linear Projection [4,096 × 100,000]
100,000 Logits: raw unnormalized scores
Softmax
Probability Distribution: 100,000 values summing to 1.0

Top Predicted Next Tokens

To
12.3%
First
8.7%
Here
7.1%
You
5.9%
Start
4.2%
The
3.8%
Baking
3.1%
Pre
2.9%
Sure
2.4%
Great
1.9%

... 100,000 total entries, most near 0%

After the final transformer block, the output vector at the last token position is projected through one final linear layer called the LM Head. This produces a vector of ~100,000 raw scores, one for every token in the vocabulary. These scores are called logits.

Softmax converts these logits into a probability distribution over all possible next tokens. This is the actual output of the model: not a single predicted word, but a full probability distribution over the entire vocabulary.

The model is not "choosing" a word. It is assigning a probability to every single one of its 100,000 tokens simultaneously. "To" might get 12.3%, "First" might get 8.7%, and a random token like "Zamboni" might get 0.00001%.

Section 12

Sampling and Decoding

Greedy decoding: always takes the single highest-probability token

To
12.3%
First
8.7%
Here
7.1%
You
5.9%
Start
4.2%
The
3.8%
Baking
3.1%
Pre
2.9%
Sure
2.4%
Great
1.9%

Greedy decoding always picks the top token. It is fast, but it tends to be repetitive and can get stuck in locally optimal choices.

The model produces a probability distribution. How it selects the next token from that distribution is called the decoding strategy. Different strategies trade off between predictability and creativity.

In practice, production AI systems combine multiple strategies, for example temperature scaling with top-p sampling. The exact settings are tuned to produce responses that are coherent and helpful without being repetitive.
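
Greedy decoding and a temperature plus top-p combination can be sketched as follows. The logits, the default `temperature` and `top_p` values, and the function names are all illustrative assumptions, not the settings of any production system.

```python
import math
import random

def apply_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    candidates = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

tokens = ["To", "First", "Here", "You", "Start"]
logits = [2.5, 2.1, 1.9, 1.7, 1.4]  # hypothetical logits for five candidates

print(tokens[greedy(logits)])  # To
print(tokens[sample(logits, rng=random.Random(0))])  # one of the kept tokens
```

Greedy decoding returns the same token every time, while `sample` can return any token that survives the top-p cut, weighted by its renormalized probability.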

Section 13

Autoregressive Generation

The full pipeline runs for every single token generated. After sampling a token, it gets appended to the sequence and the entire forward pass runs again.

Tokenize
Embed
Positional Enc.
Transformer Layers
LM Head
Softmax
Sample

Growing Sequence

How
do
I
bake
chocolate
chip
cookies
?

Generation is autoregressive: after sampling one token, it is appended to the input sequence and the entire forward pass runs again to generate the next token.

This is why AI responses appear word by word (or token by token). Each token requires a full pass through the entire model, all 80+ transformer layers. A 500-token response requires 500 separate forward passes. Each pass after the first is far cheaper thanks to KV caching (Section 14): only the single new token is processed through the FFN layers, and its Query attends to the cached Keys and Values from all previous tokens.

Generation scales roughly linearly with response length: each new token requires one forward pass. However, each pass grows slightly more expensive as the sequence lengthens, because the new token must attend to all previous tokens.
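
The append-and-loop structure looks like this in code. The "model" here is a hypothetical stand-in that replays a canned reply; in a real system `next_token_fn` would run the full forward pass and sampling step, and the end-of-sequence ID would come from the tokenizer.

```python
def generate(prompt_ids, next_token_fn, max_new_tokens, eos_id):
    # next_token_fn stands in for the whole forward pass plus sampling.
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(sequence)  # one forward pass per new token
        sequence.append(next_id)           # append, then loop
        if next_id == eos_id:
            break
    return sequence[len(prompt_ids):]

# Toy stand-in "model" (hypothetical): replays a canned reply, then emits EOS.
EOS = -1
canned_reply = [1061, 19832, 14693]

def toy_next_token(sequence, prompt_len=8):
    step = len(sequence) - prompt_len
    return canned_reply[step] if step < len(canned_reply) else EOS

prompt = [2437, 656, 314, 19832, 14693, 12851, 21487, 30]
generated = generate(prompt, toy_next_token, max_new_tokens=10, eos_id=EOS)
print(generated)  # [1061, 19832, 14693, -1]
```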

Section 14

The KV Cache

Running the full attention computation from scratch for every new token would be computationally prohibitive. The KV cache solves this.

Naive: Recompute Everything
How
do
I
bake
chocolate
chip
cookies
?
To

Every token gets recomputed from scratch on each step. This is wasted computation: all previous K and V vectors are recalculated even though they have not changed.

Optimized: KV Cache
Cached K/V Vectors
K/V How
K/V do
K/V I
K/V bake
K/V chocolate
K/V chip
K/V cookies
K/V ?
New Token (compute Q, K, V)

Only the new token computes fresh Q, K, V vectors. Its Query attends to all previously cached Keys and Values. The new K/V pair is then appended to the cache.

Cache Memory Growth
Token 1 → Token 12

Cache grows linearly with each generated token

The KV Cache stores the K and V vectors from all previous tokens so they do not need to be recomputed. Only the new token's Q, K, V vectors are computed per step. Its Q attends to all cached K/V pairs.

This is why prompt caching, reusing the KV cache across requests with the same prefix, gives large latency and cost reductions. The cached key-value pairs from the shared prefix are stored and reused directly.

This cache grows with every generated token, which is why very long conversations eventually become slow or expensive. The memory required scales linearly with sequence length, multiplied by the number of layers, the number of key/value heads, and the dimension of each head.
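
The claim that caching changes the cost but not the result can be checked on a toy example. The sketch below uses single-head attention with fixed stand-in functions in place of the learned Wq, Wk, Wv projections; all names and numbers are illustrative.

```python
import math

def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]

# Toy projections (hypothetical): fixed functions in place of Wq, Wk, Wv.
def to_q(x):
    return [x[0] + x[1], x[0] - x[1]]

def to_k(x):
    return [2.0 * x[0], x[1]]

def to_v(x):
    return [x[1], x[0]]

token_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

# Naive: at every step, recompute K and V for all tokens so far.
naive_outputs = []
for t in range(len(token_vectors)):
    seen = token_vectors[: t + 1]
    naive_outputs.append(attend(to_q(seen[-1]),
                                [to_k(x) for x in seen],
                                [to_v(x) for x in seen]))

# Cached: compute K and V once per token and append them to the cache.
k_cache, v_cache, cached_outputs = [], [], []
for x in token_vectors:
    k_cache.append(to_k(x))
    v_cache.append(to_v(x))
    cached_outputs.append(attend(to_q(x), k_cache, v_cache))

print(naive_outputs == cached_outputs)  # True
```

The naive loop calls `to_k` and `to_v` 1 + 2 + 3 + 4 = 10 times each, while the cached loop calls them 4 times each, and the gap widens quadratically with sequence length.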

Detokenization

Converting token IDs back to human-readable text

Generated token IDs

[1061, 19832, 14693, 12851, 21487, 11, 499, 690, 1184, 279, 2768, 14293]

Each ID flips to reveal its text representation

1061 → "To"
19832 → " bake"
14693 → " chocolate"
12851 → " chip"
21487 → " cookies"
11 → ","
499 → " you"
690 → " will"
1184 → " need"
279 → " the"
2768 → " following"
14293 → " ingredients"

Final assembled text

To bake chocolate chip cookies, you will need the following ingredients...

Once generation is complete (or at each step during streaming), the token IDs are mapped back to their text representations using the same vocabulary table from tokenization. The text fragments are concatenated to form the final human-readable response.

This is why AI responses sometimes have unusual word breaks or spacing artifacts. The model operates on tokens, not words, and the boundaries do not always align with what we think of as a "word."
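
Detokenization is a lookup followed by concatenation. This sketch uses only the IDs and strings from the example above, with the assumption that each word token carries its own leading space; real tokenizers map IDs to byte sequences and decode those, which this toy table skips.

```python
# A slice of the vocabulary table, using the IDs and strings from the example.
id_to_text = {
    1061: "To", 19832: " bake", 14693: " chocolate", 12851: " chip",
    21487: " cookies", 11: ",", 499: " you", 690: " will", 1184: " need",
    279: " the", 2768: " following", 14293: " ingredients",
}

def detokenize(token_ids):
    # Look up each ID and concatenate; tokens carry their own leading spaces.
    return "".join(id_to_text[i] for i in token_ids)

generated_ids = [1061, 19832, 14693, 12851, 21487, 11, 499, 690, 1184, 279, 2768, 14293]
text = detokenize(generated_ids)
print(text)  # To bake chocolate chip cookies, you will need the following ingredients
```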

Training

Where all the weights came from

16a. Pre-Training: Next-Token Prediction

Text corpus: Web pages, books, code
Tokenize: Break into token IDs
Predict next: Model outputs distribution
Compare: Check against actual token
Compute loss: Cross-entropy error
Update weights: Backpropagation + Adam
Repeat: trillions of times

The training objective

loss = -log(probability assigned to the correct next token)
High loss: P(correct) = 1%, loss = 4.605
Low loss: P(correct) = 80%, loss = 0.223

The model was initialized with random weights and trained on hundreds of billions to trillions of tokens of text. The training objective is simple: next-token prediction. Given tokens 1 through N, predict token N+1.

The model's predicted distribution is compared to the actual next token using cross-entropy loss. Backpropagation computes how every single weight contributed to the error. An optimizer (typically Adam) nudges each weight slightly in the direction that reduces the loss.
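
The two loss values in the example follow directly from the formula. This is a minimal sketch of the per-token loss and its average over positions; the function names and the tiny probability tables are illustrative.

```python
import math

def next_token_loss(probs, correct_token):
    # Cross-entropy for one position: -log of the probability given to the true token.
    return -math.log(probs[correct_token])

def sequence_loss(per_position_probs, correct_tokens):
    # Training averages this loss over every position being predicted.
    losses = [next_token_loss(p, c) for p, c in zip(per_position_probs, correct_tokens)]
    return sum(losses) / len(losses)

high = next_token_loss({"cookies": 0.01}, "cookies")
low = next_token_loss({"cookies": 0.80}, "cookies")
print(round(high, 3))  # 4.605
print(round(low, 3))   # 0.223
```

A confident wrong model is punished heavily, while probabilities near 100% on the correct token push the loss toward zero.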

This runs for weeks to months on clusters of thousands of specialized chips (GPUs or TPUs). LLaMA 3.1 405B trained for approximately 54 days on 16,384 H100 GPUs. The result: a model that has compressed the statistical structure of human language and knowledge into its weight matrices.

16b. What the Weights Encode

Conceptual view of a weight matrix

 0.023  -0.189   0.452  -0.003   0.781  -0.346
-0.568   0.901  -0.234   0.679  -0.012   0.346
 0.123  -0.789   0.234  -0.568   0.890  -0.123
-0.346   0.568  -0.901   0.234  -0.679   0.012
 0.789  -0.123   0.568  -0.890   0.123  -0.457
-0.234   0.679  -0.012   0.346  -0.789   0.234

... billions of parameters total

There is no database being queried at inference time. The training dataset is gone. What remains is a compression of its statistical structure baked into billions of static floating-point numbers.

Knowledge is distributed. Factual associations are patterns spread across millions of weights, not stored at discrete addresses. This is why the model can interpolate, generalize to novel contexts... and also hallucinate (generate a plausible-sounding pattern that has no factual basis).

The model is, at its most reduced level, a very large function that maps a sequence of token integers to a probability distribution over the next token. Everything else (the apparent reasoning, the knowledge, the style) is what emerges from scaling that objective.

16c. RLHF: Making It Actually Helpful

Raw pre-trained models are not yet useful as assistants. A multi-stage process converts the raw text predictor into a helpful assistant.

Stage 1

Supervised Fine-Tuning (SFT)

Human contractors write ideal (prompt, response) pairs

  • Example prompts paired with ideal responses
  • Model learns instruction-following format
  • Establishes baseline helpful behavior

Stage 2

Reward Modeling

Humans compare model outputs and choose the better one

  • Two model responses shown side by side
  • Human raters select the preferred output
  • Separate reward model trained on preferences

Stage 3

Reinforcement Learning (PPO)

Model optimizes for reward while staying near SFT baseline

  • Model generates candidate responses
  • Reward model scores each response
  • KL penalty prevents drift from SFT baseline

This pipeline is what converts a raw text predictor into something that behaves like a coherent, helpful, refusal-capable assistant. Reinforcement Learning from Human Feedback (RLHF) aligns the model with human preferences through supervised examples, reward modeling, and policy optimization.

Modern variants like Direct Preference Optimization (DPO) skip the separate reward model and optimize preferences directly, simplifying the pipeline while achieving comparable results.

The Complete Pipeline

Every step, from keypress to displayed response

  1. User types text
  2. Unicode encoding
  3. BPE Tokenization
  4. Token IDs
  5. Embedding lookup
  6. Positional encoding
  7. Transformer Block 1
  8. Transformer Block 2
  9. Transformer Block N
  10. LM Head projection
  11. Softmax
  12. Sampling
  13. Selected token
  14. Append and loop
  15. Detokenization
  16. Display response

Response delivered to user

All of this (every matrix multiplication, every attention computation, every layer) happens in a fraction of a second per token. A typical response of 500 tokens means the model runs its forward pass 500 times. On modern hardware, this takes just a few seconds.

The Complete Response

From keypress to answer

AI Assistant

How do I bake chocolate chip cookies?

To bake chocolate chip cookies, you will need the following ingredients...

Every AI response you have ever received was produced by this process — a sequence of mathematical operations, running on numbers, producing probabilities, selecting tokens one at a time.

The core model itself is a stateless mathematical function — it has no built-in memory between forward passes and no inherent internet connection. But in production, AI assistants are wrapped in systems that add persistent memory across conversations, real-time web search, tool use, code execution, and more. These capabilities are not part of the neural network itself — they are engineered around it. The transformer at the center is still just predicting the next token.

But scaled to billions of parameters and trillions of training tokens, something remarkable emerges: the ability to generate text that appears knowledgeable, coherent, and helpful.

Accuracy note

This visualization explains how large language models work using general principles of the decoder-only transformer architecture. Specific numbers (layer counts, dimensions, vocabulary sizes) are drawn from openly documented models like LLaMA 3 and DeepSeek-V3. The exact architecture details of proprietary models like Claude and GPT-4 have not been publicly disclosed by their creators, but they share the same fundamental building blocks shown here.

This visualization covers the text-processing pipeline, the core of how language models work. Modern AI models in 2026 are also multimodal: they can process images, audio, video, and code using the same transformer architecture with additional input encoders. The fundamental mechanism (attention, feed-forward layers, autoregressive generation) remains the same across modalities.

Built as an educational visualization