Why Some Words Are Easy for AI and Others Are Hard

Quick Recap

So far, we have learned four important ideas.

A Large Language Model generates text by predicting what comes next. It does this using tokens, which are small pieces of language. Those tokens are turned into embeddings, and attention helps the model decide which parts of a sentence matter most.

That gives us a basic picture of how the machine works.

Now we begin the next phase of the course: understanding what shapes the model’s behavior.

A good place to start is with a question that sounds small but matters more than most people realize.

If a model works with tokens, does every word break apart in the same way?

The answer is no.

The Problem: Not All Words Break Cleanly

Look at these two words:

cat

internationalization

To you, both are just words.

But to a model, they may not be equally simple.

A short and common word like cat may be handled as one clean piece.

A longer and less common word like internationalization may be split into several smaller pieces.

That difference matters.

The model does not read words the way people do. It reads whatever token pieces the tokenizer gives it.

So even though two words may feel equally normal to a human reader, they may be very different for the model internally.
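
To make the difference concrete, here is a minimal sketch of a tokenizer that always takes the longest piece it knows. The vocabulary `TOY_VOCAB` and the function `tokenize` are made up for illustration. Real tokenizers have much larger vocabularies and may split these words differently.

```python
# Hypothetical vocabulary: these pieces are chosen only for illustration.
TOY_VOCAB = {"cat", "inter", "national", "ization"}

def tokenize(word, vocab=TOY_VOCAB):
    """Split a word greedily, always taking the longest known piece."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one letter at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here, so fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize("cat"))                   # ['cat']
print(tokenize("internationalization"))  # ['inter', 'national', 'ization']
```

In this toy setup, cat is one piece and internationalization is three. The model would have to work through three pieces for the second word.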

A Better Way to Picture It

Imagine two workers packing items into boxes.

The first worker gets one solid item and places it into one box.

The second worker gets the same amount of material, but it arrives broken into many smaller pieces. That worker must handle each piece one by one.

The second job takes more steps.

Tokenization works like that.

When a word is split into many token pieces, the model has to process more pieces to understand or generate it.

That can make the text more expensive and sometimes harder to handle well.

Why This Happens

Tokenizers are built from patterns found in huge amounts of training text.

Common words and common word fragments usually get tokens of their own, so they split into very few pieces.

Rare words, unusual spellings, long technical terms, local slang, mixed-language phrases, and names may be broken into many smaller pieces instead.

For example, a model may handle a common word like this:

house

as one token.

But a long, less common word might be split more like this:

electroencephalography
→ electro | encephalo | graphy

The exact split depends on the tokenizer, but the main idea stays the same.

Some language fits neatly into the model’s token system.

Some language does not.
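
One common way to build a tokenizer is to start from single characters and repeatedly merge the pair of neighboring pieces that appears most often. The sketch below is a simplified, character-level version of that idea, not the method of any particular model. The word counts in `TOY_COUNTS` are invented, and the function names are hypothetical.

```python
from collections import Counter

def count_pairs(entries):
    """Count neighboring piece pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for pieces, freq in entries.values():
        for left, right in zip(pieces, pieces[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(pieces, pair):
    """Join every occurrence of the given pair into one piece."""
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            out.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out

def train_toy_merges(word_counts, num_merges):
    """Start from single characters and merge the most frequent pair each round."""
    entries = {w: (list(w), c) for w, c in word_counts.items()}
    for _ in range(num_merges):
        pairs = count_pairs(entries)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        entries = {w: (apply_merge(p, best), c) for w, (p, c) in entries.items()}
    return {w: p for w, (p, _) in entries.items()}

# Invented counts: "house" is very common, "kuya" appears only once.
TOY_COUNTS = {"house": 100, "mouse": 40, "kuya": 1}
for word, pieces in train_toy_merges(TOY_COUNTS, num_merges=4).items():
    print(word, "->", " | ".join(pieces))
# house -> house
# mouse -> m | ouse
# kuya -> k | u | y | a
```

With only four merges available, the frequent word ends up as one piece, while the rare word never gets merged at all. Real tokenizers perform many thousands of merges, but frequent text still gets priority.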

Why This Matters in Practice

At first, this may sound like a small technical detail.

It is not.

Uneven tokenization affects real behavior in at least three ways.

1. It affects cost

Many AI systems are priced by tokens.

If a phrase breaks into more token pieces, it can cost more to send and more to generate.

2. It affects speed

More tokens usually mean more work.

The model has to process each token, and it generates its reply one token at a time, so longer token sequences take longer.

3. It can affect quality

When names, slang, rare terms, or mixed-language text are split awkwardly, the model may handle them less smoothly.

This does not always cause problems, but it helps explain why models sometimes do better with some kinds of text than others.
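
The cost and speed effects can be put into rough numbers. The sketch below assumes a made-up price and a made-up generation speed. Neither figure comes from a real service. It compares the same message when it splits cleanly and when it splits into 50 percent more tokens.

```python
# Both numbers are hypothetical, chosen only to make the arithmetic easy.
PRICE_PER_1000_TOKENS = 0.002  # assumed price in dollars
TOKENS_PER_SECOND = 50         # assumed generation speed

def estimate_cost(token_count):
    """Cost grows in direct proportion to the number of tokens."""
    return token_count / 1000 * PRICE_PER_1000_TOKENS

def estimate_seconds(token_count):
    """Generation time grows in direct proportion to the number of tokens."""
    return token_count / TOKENS_PER_SECOND

for label, token_count in [("splits cleanly", 1000),
                           ("splits into many pieces", 1500)]:
    print(f"{label}: {token_count} tokens, "
          f"${estimate_cost(token_count):.4f}, "
          f"{estimate_seconds(token_count):.0f} seconds")
# splits cleanly: 1000 tokens, $0.0020, 20 seconds
# splits into many pieces: 1500 tokens, $0.0030, 30 seconds
```

The message is the same in both cases. Only the number of pieces changed, and both the bill and the wait grew with it.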

A Global Example

This is especially important outside the English-speaking world.

A model may handle common American English very efficiently because both its tokenizer and the model itself were built from a huge amount of it.

But local names, Filipino slang, mixed English-Tagalog phrases, or region-specific spelling may break into less efficient pieces.

That does not mean the model cannot work with them.

It means the model starts from more and smaller pieces, so its internal representation of the text may be rougher.

And rougher representation can sometimes lead to rougher output.
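
One simple way to see this gap is to count how many pieces each word needs on average. The sketch below uses a made-up vocabulary, `TOY_VOCAB`, that knows a few English words well and only short fragments of anything else. The vocabulary and function names are assumptions for illustration, and a real tokenizer would give different numbers.

```python
# Hypothetical vocabulary: whole English words, plus a few short fragments.
TOY_VOCAB = {"where", "is", "the", "school", "na", "sa", "an"}

def split_word(word, vocab=TOY_VOCAB):
    """Greedy longest-match split with single-character fallback."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                break
        else:
            end = start + 1  # unknown character becomes its own piece
        pieces.append(word[start:end])
        start = end
    return pieces

def pieces_per_word(sentence):
    """Average number of token pieces needed per word."""
    words = sentence.lower().split()
    total = sum(len(split_word(w)) for w in words)
    return total / len(words)

print(pieces_per_word("where is the school"))  # 1.0
print(pieces_per_word("nasaan ang school"))    # 2.0
```

Both sentences ask the same question. In this toy setup, the mixed English-Tagalog version needs twice as many pieces per word.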

Your Turn

Think about these kinds of text:

a short common word like school
a long technical word like photosynthesis
a local nickname or slang phrase
a mixed-language sentence using English and Tagalog

Which of these do you think is most likely to break into many token pieces?

Which do you think the model will handle most easily?

Pause for a moment before reading on.

A Reasonable Answer

In general, the short and common word will be easiest.

The long technical term, unusual slang, nickname, or mixed-language phrase is more likely to split into more pieces.

That usually makes the text less efficient for the model to process.

The exact token split depends on the tokenizer, but the pattern is clear:

common and familiar text is often easier for the model than rare or unusual text.

The Key Idea

Tokenization is not neutral.

Some words fit the model’s token system neatly.

Others are chopped into many smaller parts.

That affects how efficiently the model can process language, how much the interaction costs, and sometimes how well the model performs.

Summary

AI does not see words the same way people do.

Before a model can process language, text must be broken into tokens. Some words split cleanly into a small number of tokens, while others break into many pieces.

This uneven tokenization affects cost, speed, and sometimes quality.

In the next lesson, we will look at something even more important: the quality of the training data itself. Even a powerful model can only learn from the material it was given.
