Summary — what each matrix learned via backpropagation
INPUT: "The cat sat" (3 tokens)
   →
EMBEDDINGS: x₁ x₂ x₃ — raw token vectors
   →
ATTENTION:
   W_Q, W_K — relationships ("which words talk to which?")
   W_V — what to pass forward ("what is worth blending in?")
   →
CONTEXT VECTORS: z₁ z₂ z₃ — words with context
   →
FEED-FORWARD:
   W₁, W₂ — world knowledge ("given this context, what do I know?")
   same backprop, different specialization
   →
ENRICHED VECTORS: h₁ h₂ h₃ — context + knowledge
   →
PREDICTION: next word → "on" (42%)
W_Q, W_K → learned to identify relationships — "which words should talk to which?"
W_V → learned what information to pass forward — "what is worth blending into the output?"
W₁, W₂ → learned world knowledge — "given this context, what do I know?"
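The feed-forward step can be sketched the same way. Again the weights are random stand-ins for learned values, and the 4× widening of the hidden layer is a hypothetical choice for illustration (real transformers typically expand and then contract like this, but the exact ratio varies).

```python
import numpy as np

rng = np.random.default_rng(1)

Z = rng.normal(size=(3, 4))      # context vectors from attention (toy values)

# W1, W2 — trained by the same backprop as W_Q/W_K/W_V, but they end up
# specializing in stored knowledge rather than token-to-token relationships.
W1 = rng.normal(size=(4, 16))    # expand (hypothetical 4x widening)
W2 = rng.normal(size=(16, 4))    # contract back to d_model

H = np.maximum(Z @ W1, 0) @ W2   # ReLU in between; enriched vectors h1, h2, h3
print(H.shape)                   # (3, 4)
```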
⚠ This walkthrough traces "sat" only — but all 3 tokens ("the", "cat", "sat") pass through every single step simultaneously.
Each gets its own q, k, v vectors. Each gets its own α weights. Each gets its own pass through feed-forward.
The output is a 3×4 matrix at every stage — not just one vector.
"sat" is shown because it is the last token and the one used for predicting the next word.