"The cat sat…" — predict next word

A complete walkthrough of the Transformer forward pass — every number computed, nothing fabricated.

INPUT: "The cat sat" (3 tokens)
    ↓
EMBEDDINGS: x₁ x₂ x₃ (raw token vectors)
    ↓
ATTENTION: W_Q, W_K learn relationships ("which words talk to which?"); W_V learns what to pass forward ("what is worth blending in?")
    ↓
CONTEXT VECTORS: z₁ z₂ z₃ (words with context)
    ↓
FEED-FORWARD: W₁, W₂ learn world knowledge ("given this context, what do I know?"); same backprop, different specialization
    ↓
ENRICHED VECTORS: h₁ h₂ h₃ (context + knowledge)
    ↓ (repeat ×3 layers)
PREDICTION: next word → "on" 42%

Summary — what each matrix learned via backpropagation:
W_Q, W_K → learned to identify relationships — "which words should talk to which?"
W_V → learned what information to pass forward — "what is worth blending into the output?"
W₁, W₂ → learned world knowledge — "given this context, what do I know?"
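The pipeline above can be sketched in a few lines of NumPy. This is a minimal toy: 3 tokens, model width 4, a single layer, and random weights standing in for the learned values (a trained model's W_Q, W_K, W_V, W₁, W₂ would of course contain the specializations described above, not noise).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 3, 4                       # "The cat sat" → 3 tokens, toy width 4

X = rng.standard_normal((n_tokens, d))   # embeddings x1, x2, x3 — shape (3, 4)

# Attention weights: W_Q/W_K score relationships, W_V carries content forward.
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # every token gets its own q, k, v
scores = Q @ K.T / np.sqrt(d)            # (3, 3): how much each word attends to each
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)  # softmax: each row of α sums to 1
Z = alpha @ V                            # context vectors z1, z2, z3 — shape (3, 4)

# Feed-forward: W1/W2 apply stored "knowledge" to each context vector independently.
W1 = rng.standard_normal((d, 4 * d))
W2 = rng.standard_normal((4 * d, d))
H = np.maximum(Z @ W1, 0) @ W2           # ReLU in between; h1, h2, h3 — shape (3, 4)

print(Z.shape, H.shape)                  # (3, 4) (3, 4)
```

Residual connections and layer norm are omitted here to keep the data flow readable; a real Transformer layer wraps both the attention and feed-forward steps with them.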
⚠ This walkthrough traces "sat" only — but all 3 tokens ("the", "cat", "sat") pass through every single step simultaneously. Each gets its own q, k, v vectors. Each gets its own α weights. Each gets its own pass through feed-forward. The output is a 3×4 matrix at every stage — not just one vector. "sat" is shown because it is the last token and the one used for predicting the next word.
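The final PREDICTION step can be sketched the same way: the enriched vector of the last token ("sat") is projected onto the vocabulary and softmaxed into a next-word distribution. The unembedding matrix name `W_U` and the toy vocabulary size are assumptions for illustration, and with random weights the probabilities will not match the 42% for "on" quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 4, 8                          # toy sizes; real models use thousands

h_sat = rng.standard_normal(d)           # enriched vector for "sat" (last token)
W_U = rng.standard_normal((d, vocab))    # unembedding / output projection (assumed name)

logits = h_sat @ W_U                     # one score per vocabulary word
probs = np.exp(logits - logits.max())    # subtract max for numerical stability
probs /= probs.sum()                     # softmax → next-word distribution

print(probs)                             # a valid probability distribution over the toy vocab
```

Only the last token's vector is used for the prediction, which is exactly why the walkthrough traces "sat" — but as the warning notes, all three tokens were carried through every stage to produce it.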