Why Some Words Are Easy for AI and Others Are Hard
AI does not see words the way humans do. Some words split into clean, simple pieces, while others break into many small parts. In this lesson, we explore why tokenization is uneven and why that affects how models process language.
Quick Recap
So far, we have learned four important ideas.
A Large Language Model generates text by predicting what comes next. It does this using tokens, which are small pieces of language. Those tokens are turned into embeddings, and attention helps the model decide which parts of a sentence matter most.
That gives us a basic picture of how the machine works.
Now we begin the next phase of the course: understanding what shapes the model’s behavior.
A good place to start is with a question that sounds small but matters more than most people realize.
If a model works with tokens, does every word break apart in the same way?
The answer is no.
The Problem: Not All Words Break Cleanly
Look at these two words:
cat
internationalization
To you, both are just words.
But to a model, they may not be equally simple.
A short and common word like cat may be handled as one clean piece.
A longer and less common word like internationalization may be split into several smaller pieces.
That difference matters.
The model does not read words the way people do. It reads whatever token pieces the tokenizer gives it.
So even though two words may feel equally normal to a human reader, they may be very different for the model internally.
A Better Way to Picture It
Imagine two workers packing items into boxes.
The first worker gets one solid item and places it into one box.
The second worker gets the same amount of material, but it arrives broken into many smaller pieces. That worker must handle each piece one by one.
The second job takes more steps.
Tokenization works like that.
When a word is split into many token pieces, the model has to process more pieces to understand or generate it.
That can make the text more expensive and sometimes harder to handle well.
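To make the idea concrete, here is a minimal sketch of a toy tokenizer. It is not how any real model's tokenizer works in detail. It simply takes the longest piece it recognizes at each step, and the vocabulary (TOY_VOCAB) and function name (toy_tokenize) are made up for this lesson.

```python
# Hypothetical vocabulary, invented for illustration only.
TOY_VOCAB = {"cat", "house", "inter", "national", "ization"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split a word by greedy longest match against the vocabulary.

    Characters that match nothing become single-character pieces.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here, so fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


for word in ["cat", "internationalization"]:
    pieces = toy_tokenize(word)
    print(f"{word}: {len(pieces)} piece(s) -> {' | '.join(pieces)}")
```

With this toy vocabulary, cat stays as one piece, while internationalization comes out as inter | national | ization. A real tokenizer will use a much larger vocabulary and may split the word differently.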
Why This Happens
Tokenizers are built from patterns found in huge amounts of training text.
Common words and common word fragments usually get tokens of their own, so they can be written with very few pieces.
Rare words, unusual spellings, long technical terms, local slang, mixed-language phrases, and names may be broken into many smaller pieces instead.
For example, a model may handle a common word like this:
house
as one token.
But a less common string might be handled more like this:
electroencephalography
→ electro | encephalo | graphy
The exact split depends on the tokenizer, but the main idea stays the same.
Some language fits neatly into the model’s token system.
Some language does not.
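One common way to build a tokenizer from text is byte pair encoding (BPE), which repeatedly merges the most frequent pair of neighboring pieces. The sketch below is a simplified version of that idea, not the tokenizer of any particular model. The word counts are invented, and the function name learn_merges is hypothetical.

```python
from collections import Counter


def learn_merges(word_counts, num_merges):
    """Learn BPE-style merges by fusing the most frequent adjacent pair."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair appears, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one fused piece.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Hypothetical word frequencies, invented for illustration only.
toy_counts = {"house": 1000, "electroencephalography": 1}
merges, corpus = learn_merges(toy_counts, num_merges=4)
for symbols in corpus:
    print("".join(symbols), "->", len(symbols), "piece(s)")
```

In this toy run, every merge goes to the frequent word, so house ends up as a single piece while the rare word is still split into individual characters. Real tokenizers learn tens of thousands of merges, so rare words usually end up as a handful of fragments rather than single letters.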
Why This Matters in Practice
At first, this may sound like a small technical detail.
It is not.
Uneven tokenization affects real behavior in at least three ways.
1. It affects cost
Many AI systems are priced by tokens.
If a phrase breaks into more token pieces, it can cost more to send and more to generate.
2. It affects speed
More tokens usually mean more work.
The model has to process every token, and it generates its output one token at a time, so longer token sequences take longer.
3. It can affect quality
When names, slang, rare terms, or mixed-language text are split awkwardly, the model may handle them less smoothly.
This does not always cause problems, but it helps explain why models sometimes do better with some kinds of text than others.
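The cost point can be shown with simple arithmetic. The sketch below assumes a flat price per 1,000 tokens. The price and the token counts are made-up numbers, and estimate_cost is a hypothetical helper, not part of any real pricing API.

```python
def estimate_cost(num_tokens, price_per_1k_tokens):
    """Return the cost of num_tokens at a flat price per 1,000 tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# All numbers below are hypothetical, chosen only to show the arithmetic.
PRICE_PER_1K = 0.50       # made-up price, in arbitrary currency units
efficient_tokens = 1200   # a message that tokenizes efficiently
fragmented_tokens = 1800  # a similar message that splits into more pieces

efficient_cost = estimate_cost(efficient_tokens, PRICE_PER_1K)
fragmented_cost = estimate_cost(fragmented_tokens, PRICE_PER_1K)

print(f"efficient:  {efficient_cost:.2f}")
print(f"fragmented: {fragmented_cost:.2f}")
print(f"ratio:      {fragmented_cost / efficient_cost:.2f}x")
```

Under these assumed numbers, the message that splits into 50% more tokens costs 50% more, even though a human reader would see two messages of about the same length.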
A Global Example
This is especially important outside the English-speaking world.
A model may handle common American English very efficiently because its tokenizer and its training data were built from a huge amount of it.
But local names, Filipino slang, mixed English-Tagalog phrases, or region-specific spelling may break into less efficient pieces.
That does not mean the model cannot work with them.
It means the text reaches the model in more, smaller pieces, so the internal representation may be rougher.
And rougher representation can sometimes lead to rougher output.
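One rough way to measure this gap is to count the average number of pieces per word. The sketch below does that with a tiny, deliberately English-only vocabulary. The vocabulary, the two sentences, and the function names are all invented for illustration, and the gap is exaggerated compared with what a real tokenizer would show.

```python
# Hypothetical English-only vocabulary, invented for illustration.
ENGLISH_ONLY_VOCAB = {"i", "will", "go", "to", "school", "tomorrow"}


def split_word(word, vocab):
    """Greedy longest-match split; unmatched characters become single pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces


def pieces_per_word(sentence, vocab=ENGLISH_ONLY_VOCAB):
    """Average number of token pieces per whitespace-separated word."""
    words = sentence.lower().split()
    total = sum(len(split_word(w, vocab)) for w in words)
    return total / len(words)


english = "I will go to school tomorrow"
mixed = "Pupunta ako sa school bukas"  # the same idea in mixed English-Tagalog

print(f"English: {pieces_per_word(english):.1f} pieces per word")
print(f"Mixed:   {pieces_per_word(mixed):.1f} pieces per word")
```

In this toy setup, the English sentence needs one piece per word, while the mixed sentence needs several pieces per word because only school is in the vocabulary.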
Your Turn
Think about these kinds of text:
- a short common word like school
- a long technical word like photosynthesis
- a local nickname or slang phrase
- a mixed-language sentence using English and Tagalog
Which of these do you think is most likely to break into many token pieces?
Which do you think the model will handle most easily?
Pause for a moment before reading on.
A Reasonable Answer
In general, the short and common word will be easiest.
The long technical term, unusual slang, nickname, or mixed-language phrase is more likely to split into more pieces.
That usually makes the text less efficient for the model to process.
The exact token split depends on the tokenizer, but the pattern is clear: common and familiar text is often easier for the model than rare or unusual text.
The Key Idea
Tokenization is not neutral.
Some words fit the model’s token system neatly.
Others are chopped into many smaller parts.
That affects how efficiently the model can process language, how much the interaction costs, and sometimes how well the model performs.
Summary
AI does not see words the same way people do.
Before a model can process language, text must be broken into tokens. Some words split cleanly into a small number of tokens, while others break into many pieces.
This uneven tokenization affects cost, speed, and sometimes quality.
In the next lesson, we will look at something even more important: the quality of the training data itself. Even a powerful model can only learn from the material it was given.