How AI Decides What to Pay Attention To

Quick Recap

So far we have built a solid picture of how Large Language Models work under the hood.

We started with the core idea: an LLM generates text by predicting the next piece of language. One token at a time, over and over.

Then we saw that language gets broken into tokens — small chunks of text that the model actually works with.

And finally, those tokens get converted into numbers called embeddings — vectors that capture meaning in a way a neural network can learn from.

But here is a question we haven’t answered yet.

When a model reads a sentence with dozens of words, how does it know which words actually matter?

The Problem: Not All Words Are Equal

Read this sentence:

The animal didn't cross the street because it was too tired.

Quick — what does “it” refer to?

You almost certainly said “the animal.” And you’re right.

But here’s what’s interesting: you didn’t have to think about it. Your brain automatically scanned the sentence, connected “it” back to “animal,” and moved on. You did that in milliseconds without any conscious effort.

Now imagine you’re a language model. You don’t have human intuition. All you have is a sequence of tokens. How do you figure out that “it” points to “animal” and not “street”?

This is the exact problem that attention solves.

The Intuition: Reading With a Highlighter

Here’s a simple way to think about it.

Imagine you’re reading a sentence with a highlighter in your hand. As you reach each word, you instinctively glance back at the rest of the sentence and highlight the words that feel most connected to the one you’re reading.

Some words light up brightly — they’re highly relevant.

Others barely register — they’re just structural glue.

A transformer model does something similar. As it processes each token, it looks at the other tokens in its context and assigns each one an attention weight: a number between 0 and 1 that says “how much should I care about this word right now?” For any one token being processed, these weights add up to 1.

High weight means “this word matters a lot for what I’m trying to predict.”

Low weight means “this word isn’t very relevant here.”

That’s the core idea. Attention is the model’s way of asking: “Given the word I’m looking at right now, which other words in this sentence should I focus on?”
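One common way to produce weights like these is the softmax function, which turns raw relevance scores into positive numbers that sum to 1. The sketch below is a minimal illustration of that step; the scores are made-up values, not output from a real model.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for four tokens; higher means "more relevant".
scores = [0.5, 3.0, 0.2, 2.0]
weights = softmax(scores)

for score, weight in zip(scores, weights):
    print(f"score {score:>4} -> weight {weight:.2f}")
```

The token with the highest score ends up with the largest share of attention, and no token ever gets a weight of exactly zero.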

Seeing It in Action

Let’s make this concrete. Say the model is processing this sentence and trying to predict the next word:

The capital of France is ____

You and I both know the answer is probably “Paris.” But how does the model figure that out?

It looks at every token in the sentence and assigns attention weights. Here’s roughly what that looks like (the numbers are illustrative, not taken from a real model):

Token	Attention Weight
The	0.02
capital	0.30
of	0.03
France	0.60
is	0.05

Notice what happened. The model put 60% of its attention on “France” and 30% on “capital.” Together, those two words carry 90% of the signal. The rest — “The,” “of,” “is” — are basically noise for this prediction.
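To check the arithmetic, here is a small sketch that uses the same illustrative weights as the table. The variable names are just for this example.

```python
tokens = ["The", "capital", "of", "France", "is"]
weights = [0.02, 0.30, 0.03, 0.60, 0.05]

# Attention weights for one prediction should add up to 1.
total = sum(weights)

# Rank tokens from most to least attended.
ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)

# Share of attention carried by the two highest-weighted tokens.
top_two_share = ranked[0][1] + ranked[1][1]

print(f"total weight: {total:.2f}")
print(f"top two tokens: {ranked[0][0]}, {ranked[1][0]}")
print(f"top two share: {top_two_share:.2f}")
```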

This makes intuitive sense. If someone asks you “The capital of France is ___?” your brain locks onto “capital” and “France” too. Everything else is structural filler.

The difference is that you do this instinctively. The model learned to do it by processing billions of sentences during training.

Key insight: Attention weights aren’t fixed. They change for every token the model processes. If the model were focused on a different word in the same sentence, the weights would look completely different. Attention is dynamic: it shifts based on what the model is currently trying to figure out.
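Here is a rough sketch of why the weights shift. In the standard Transformer formulation, the token being processed supplies a “query” vector, every token in the context supplies a “key” vector, and the weights come from how well the query matches each key (scaled dot-product attention). This lesson doesn’t cover queries and keys, so treat those names and the tiny 2-number vectors below as assumptions invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Score one query against every key (scaled dot product), then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Toy key vectors, made up for this example; real models learn these.
tokens = ["capital", "France", "is"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Two different hypothetical queries over the same sentence.
weights_a = attention_weights([2.0, 0.0], keys)  # query that matches "capital"
weights_b = attention_weights([0.0, 2.0], keys)  # query that matches "France"

for token, wa, wb in zip(tokens, weights_a, weights_b):
    print(f"{token:>8}: query A {wa:.2f} | query B {wb:.2f}")
```

The keys never change between the two calls. Only the query does, and that alone is enough to move the highest weight from one token to another.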

Why This Was a Breakthrough

Before attention came along, most language models processed text sequentially — one word after another, left to right. Think of it like reading through a keyhole. You could only see the word right in front of you, plus a fading memory of what came before.

This worked okay for short sentences. But for longer text, the model would “forget” important words that appeared much earlier. By the time it reached the end of a paragraph, the beginning was a blur.

Attention changed the game completely. Instead of reading through a keyhole, the model could now see the entire sentence at once and decide which parts to focus on.

The architecture that uses this mechanism is called a Transformer — and it’s the foundation of every modern LLM you’ve heard of. ChatGPT, Claude, Gemini — they’re all transformer models at their core.

Where it comes from: The Transformer architecture was introduced in a 2017 paper by Google researchers titled “Attention Is All You Need.” The title states the paper’s central claim: attention was effective enough on its own to replace the sequential processing that earlier models relied on.

Your Turn: Test Your Intuition

Before you move on, try this.

Read the following sentence:

The cat sat on the mat because it was soft.

Now answer these questions:

1. What does “it” refer to? The cat, or the mat?
2. If a model were processing the word “it,” which two tokens would likely get the highest attention weights?
3. Would the attention weights for “it” look the same or different if the sentence said “because it was tired” instead?

Think about your answers before reading on.

Answers

1. “It” most likely refers to the mat: being soft is a sensible reason to sit on something, so “soft” points to the surface rather than the animal.

2. The highest attention weights would likely go to “mat” and “soft” — these are the tokens most relevant to understanding what “it” means in this context.

3. The weights would be completely different. If the sentence said “because it was tired,” the model would shift its attention to “cat” instead of “mat” — because tiredness applies to animals, not surfaces. This is exactly what we mean by attention being dynamic. Same word, same position, totally different focus depending on context.

If you got these right, you already understand the core intuition behind attention. That’s not a small thing — this mechanism is the reason modern AI can write essays, hold conversations, and understand nuance.

Summary

Here’s the key takeaway from this lesson:

Language models don’t treat every word equally. The attention mechanism allows the model to look at the entire sentence and dynamically decide which words matter most for each prediction. Important words get high attention weights. Irrelevant words get low weights.

This ability to focus on what matters — flexibly, dynamically, for every single token — is what makes transformer models so powerful. It’s why they can handle long text, understand pronouns, and produce responses that actually make sense in context.

You now understand the four building blocks of how modern LLMs work: next-token prediction, tokenization, embeddings, and attention. These aren’t just academic concepts — they’re the machinery running inside every AI tool you use.

In the next part of the course, we’ll explore how scale — more data, larger models, and more compute — takes these building blocks and turns them into something that feels almost magical. That’s where things get really interesting.

Foundational Intelligence

Phase 1. Core mechanics
