Summary — what each matrix learned via backpropagation
INPUT: "The cat sat" (3 tokens)
   →
EMBEDDINGS: x₁ x₂ x₃ — raw token vectors
   →
ATTENTION:
   W_Q, W_K — relationships ("which words talk to which?")
   W_V — what to pass forward ("what is worth blending in?")
   →
CONTEXT VECTORS: z₁ z₂ z₃ — words with context
   →
FEED-FORWARD:
   W₁, W₂ — world knowledge ("given this context, what do I know?")
   same backprop, different specialization
   →
ENRICHED VECTORS: h₁ h₂ h₃ — context + knowledge
   →
PREDICTION: next word → "on" (42%)
W_Q, W_K → learned to identify relationships — "which words should talk to which?"
W_V → learned what information to pass forward — "what is worth blending into the output?"
W₁, W₂ → learned world knowledge — "given this context, what do I know?"
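The feed-forward step can be sketched the same way. Again the weights are random stand-ins for learned values, and the 4× widening of the hidden layer is a hypothetical choice for illustration (real transformers typically expand and then contract like this, but the exact ratio varies).

```python
import numpy as np

rng = np.random.default_rng(1)

Z = rng.normal(size=(3, 4))      # context vectors from attention (toy values)

# W1, W2 — trained by the same backprop as W_Q/W_K/W_V, but they end up
# specializing in stored knowledge rather than token-to-token relationships.
W1 = rng.normal(size=(4, 16))    # expand (hypothetical 4x widening)
W2 = rng.normal(size=(16, 4))    # contract back to d_model

H = np.maximum(Z @ W1, 0) @ W2   # ReLU in between; enriched vectors h1, h2, h3
print(H.shape)                   # (3, 4)
```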
⚠ This walkthrough traces "sat" only — but all 3 tokens ("the", "cat", "sat") pass through every single step simultaneously.
Each gets its own q, k, v vectors. Each gets its own α weights. Each gets its own pass through feed-forward.
The output is a 3×4 matrix at every stage — not just one vector.
"sat" is shown because it is the last token and the one used for predicting the next word.