Intro to a “cat in litter box” journey
“Large Language Model” sounds mysterious because the name is doing a lot of work.
Large sounds expensive.
Language sounds human.
Model sounds like it should be walking down the runway or posing next to a new sports car.
But at its core, a language model is trained to do something surprisingly simple:
Predict the next token.
That’s it. That’s the trick.
A very large, very expensive, suspiciously useful autocomplete machine.
Of course, like most simple ideas in computer science, it becomes terrifyingly powerful once you add enough data, enough compute, and enough people saying, “Surely one more training run will fix it.”
Tokens: language chopped into tiny bits
Before the model can understand the language, it has to turn the language into numbers.
Computers do not naturally understand “human” words like
cat,
litter box,
forbidden chemistry,
production outage
They understand numbers. So text gets split into pieces called tokens.
A token can be a whole word:
cat
box
or part of a word:
chem
istry
or a punctuation mark:
!
So the sentence:
The cat is doing forbidden chemistry in the litter box.
might become something like this:
“The”, “ cat”, “ is”, “ doing”, “ forb”, “idden”, “ chem”, “istry”, “ in”, “ the”, “ litter”, “ box”, “.”
The exact split depends on the tokenizer, but the idea is the same: text goes in, small pieces come out, and each piece gets represented as a number.
Basically, language goes through a tiny bureaucratic office and comes back stamped, filed, and mildly embarrassed.
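To make the "text goes in, numbers come out" idea concrete, here is a minimal sketch of a toy tokenizer. It is not how any real tokenizer works internally: the tiny vocabulary, the ID numbers, and the `toy_tokenize` function are all invented for illustration, and real tokenizers learn their vocabulary from data.

```python
# Toy tokenizer: greedy longest-match against a tiny, hand-written vocabulary.
# The vocabulary and the IDs are invented for this example.
TOY_VOCAB = {
    "The": 0, " cat": 1, " is": 2, " doing": 3, " forb": 4, "idden": 5,
    " chem": 6, "istry": 7, " in": 8, " the": 9, " litter": 10, " box": 11,
    ".": 12,
}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    position = 0
    while position < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), position, -1):
            candidate = text[position:end]
            if candidate in vocab:
                pieces.append(candidate)
                position = end
                break
        else:
            raise ValueError(f"No vocabulary piece matches at position {position}")
    return pieces

sentence = "The cat is doing forbidden chemistry in the litter box."
tokens = toy_tokenize(sentence)
token_ids = [TOY_VOCAB[piece] for piece in tokens]
print(tokens)
print(token_ids)
```

The model never sees the letters, only the list of IDs. Joining the pieces back together gives the original sentence, so nothing is lost along the way.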
As this is a fun topic, I asked AI to help me with an infographic. Do you like it? Let me know!

Next-token prediction: the world’s most intense guessing game
The main training task is beautifully simple.
Given some text, predict what comes next.
For example:
The cat is doing forbidden chemistry in the litter _____
A decent model should guess:
box
A confused model might guess:
laboratory
A very dramatic model might guess:
dimension
And honestly, depending on the cat, we cannot fully rule that out (Matrix, anyone?).
During training, the model makes a prediction, compares it with the real next token from the training text, and adjusts itself slightly when it is wrong.
It does this again, and again, and again. Billions or trillions of times.
At first the model is terrible. It guesses nonsense. It has the confidence of a junior developer editing Kubernetes YAML at 2 in the morning. However, over time, after seeing huge amounts of text, it becomes better at predicting what token should come next. Eventually, it gets good enough that people start asking it to write emails, explain Docker, summarize contracts, and generate code.
But it is not just memorizing
A common misunderstanding is that language models simply memorize all their training data.
They can memorize some things, especially if they see the same text many times. But the useful part is that they learn patterns.
For example, from seeing many sentences like:
The cat sleeps on the sofa.
The cat scratches the chair.
The cat knocks over the glass.
The cat is doing forbidden chemistry in the litter box.
The model learns relationships between words, grammar, style, facts, and context.
It learns that after:
The cat is doing forbidden chemistry in the litter
“box” is much more likely than “spreadsheet.”
It learns that after:
The capital of France is
“Paris” is much more likely than “lasagna.”
It learns that after:
public static void
you are probably about to enter Java or C# territory.
It learns that:
Dear Hiring Manager
is probably followed by professional language, not pirate dialogue.
Unless requested.
Then: ahoy, your career growth.
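One way to see "learning patterns" in miniature is to count which word follows which in the four cat sentences above. This is a minimal sketch that uses simple counting, not a neural network, and it splits on whole words rather than real tokens. The function name is made up for this example.

```python
from collections import Counter, defaultdict

# The four example sentences, lowercased, with the final period dropped.
corpus = [
    "the cat sleeps on the sofa",
    "the cat scratches the chair",
    "the cat knocks over the glass",
    "the cat is doing forbidden chemistry in the litter box",
]

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1

def next_word_probabilities(word):
    """Turn raw follow-up counts into probabilities."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}

print(next_word_probabilities("litter"))
print(next_word_probabilities("the"))
```

After "litter", this toy counter has only ever seen "box". After "the", half of the probability goes to "cat" and the rest is shared between "sofa", "chair", "glass", and "litter". A real language model looks at the whole context rather than just the previous word, but the spirit is similar: likely continuations get high probability.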
Pretraining: reading everything before breakfast
The big learning phase is called pretraining.
This is where the model is trained on huge amounts of text: books, websites, articles, documentation, code, forums, tutorials, and many other written sources.
During pretraining, the model is not usually being taught one task directly.
Nobody sits there saying:
Monday: grammar.
Tuesday: sarcasm.
Wednesday: Kubernetes trauma.
Thursday: cat bathroom euphemisms.
Instead, the model learns by repeatedly trying to predict the next token across a massive range of examples.
That simple training task forces it to learn reusable patterns.
It sees recipes, legal text, poetry, product reviews, bug reports, API documentation, academic papers, forum arguments, and probably at least one person explaining why their printer has developed a personality.
From all of that, the model learns things like:
What word usually comes next?
What tone fits this context?
What structure does this type of answer usually have?
What facts are relevant?
Is this code, or is someone just suffering in YAML?
This is why pretrained language models become surprisingly general.
They were not trained only to answer one type of question. They learned broad statistical patterns from a huge amount of language.
Parameters: the model’s adjustable brain knobs
A model learns by adjusting internal values called parameters.
You can think of parameters as tiny knobs inside the model. During training, the model turns these knobs a little bit so it becomes better at predicting the next token.
Small models have fewer knobs. Large models have many more knobs.
More parameters usually mean the model has more capacity to learn complex patterns. But more parameters also mean more compute, more memory, more cost, and more engineers quietly whispering “please don’t crash” at a training dashboard.
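For a rough feel of how quickly the knobs add up, here is a back-of-the-envelope sketch. The model shape is hypothetical (an embedding table plus one output layer, which is far simpler than a real language model), the sizes are invented, and the memory estimate assumes every parameter is stored as a 4-byte float.

```python
def count_toy_parameters(vocab_size, hidden_size):
    """Parameters in a hypothetical two-part model:
    an embedding table and an output layer with a bias."""
    embedding = vocab_size * hidden_size        # one vector per token
    output_weights = hidden_size * vocab_size   # one score per token
    output_bias = vocab_size
    return embedding + output_weights + output_bias

def memory_in_megabytes(parameter_count, bytes_per_parameter=4):
    """Memory for the weights alone, assuming 4-byte floats by default."""
    return parameter_count * bytes_per_parameter / 1_000_000

small = count_toy_parameters(vocab_size=1_000, hidden_size=64)
large = count_toy_parameters(vocab_size=50_000, hidden_size=1_024)
print(small, memory_in_megabytes(small))
print(large, memory_in_megabytes(large))
```

Even this stripped-down shape goes from about a hundred thousand knobs to about a hundred million once the vocabulary and the vector size grow.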
A model with too few parameters may not capture enough complexity. A model with more parameters can learn richer patterns, but only if it gets enough training data.
This matters.
A huge model trained on too little data is like buying a giant library and reading only one shelf.
Impressive building. Questionable education.
Or, to keep our theme:
It is like giving the cat a full chemistry lab but only teaching it the word “meow.”
Dangerous and academically incomplete.
Loss: how wrong the model is
During training, the model does not just guess one answer. It assigns probabilities to many possible next tokens.
For example:
The cat is doing forbidden chemistry in the litter ___
The model might assign probabilities like:
box: 82%
tray: 8%
room: 3%
laboratory: 1%
blockchain: unfortunately not zero
The correct next token is probably:
box
If the model gives “box” a high probability, great. If it gives “blockchain” a high probability, however, we need to have a meeting.
The training process uses a measurement called loss to track how wrong the model is. Lower loss means the model is getting better at predicting the next token.
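A common choice of loss, assumed here for illustration, is cross-entropy: the negative logarithm of the probability the model gave to the correct token. The sketch below uses the probabilities from the example, with a made-up value for "blockchain".

```python
import math

# Probabilities from the example above; the "blockchain" value is invented.
predicted = {
    "box": 0.82,
    "tray": 0.08,
    "room": 0.03,
    "laboratory": 0.01,
    "blockchain": 0.001,
}

def cross_entropy_loss(probabilities, correct_token):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(probabilities[correct_token])

# A confident, correct model gets a small loss.
print(cross_entropy_loss(predicted, "box"))
# If the true next token had been "laboratory", the loss would be much larger.
print(cross_entropy_loss(predicted, "laboratory"))
```

The closer the probability of the correct token gets to 100%, the closer the loss gets to zero.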
Training is basically a long process of reducing loss.
Or, more emotionally:
The model is wrong, gets judged mathematically, adjusts itself, and tries again.
Relatable.
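The "gets judged, adjusts itself, tries again" loop can be sketched with a handful of numbers. This is a toy, assuming cross-entropy loss (the negative log of the probability given to the correct token) and plain gradient descent on one score per candidate token. The starting scores, the learning rate, and the function names are invented, and a real model adjusts billions of parameters rather than five scores.

```python
import math

candidates = ["box", "tray", "room", "laboratory", "blockchain"]
correct_index = candidates.index("box")

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exponentials = [math.exp(score - largest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]

def loss_for(scores):
    """Cross-entropy loss for the correct token."""
    return -math.log(softmax(scores)[correct_index])

# Start with all scores equal: the model has no idea yet.
scores = [0.0] * len(candidates)
learning_rate = 0.5
losses = []

for step in range(20):
    probabilities = softmax(scores)
    losses.append(loss_for(scores))
    # Gradient of the loss with respect to each score:
    # (probability - 1) for the correct token, probability for the others.
    for index, probability in enumerate(probabilities):
        target = 1.0 if index == correct_index else 0.0
        scores[index] -= learning_rate * (probability - target)

print(losses[0], losses[-1])
print(softmax(scores)[correct_index])
```

Each pass nudges the score for "box" up and the others down, so the recorded loss shrinks step by step.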
Why scale matters
For a long time, people knew that bigger models often performed better.
But then researchers showed something very important: the improvement was not random magic. It followed predictable patterns.
Kaplan et al.’s Scaling Laws for Neural Language Models showed that language model performance improves in predictable ways as you increase:
model size
dataset size
compute
In plain English:
Bigger model + more data + more compute = lower loss, in a surprisingly predictable way.
This was a big deal.
It suggested that if you had enough compute and data, you could estimate how much better a larger model might get before actually training it.
Not perfectly. This is still machine learning, not astrology with GPUs.
But predictably enough that scaling started to look less like wizardry and more like engineering.
Very expensive engineering.
But engineering.
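The "predictable pattern" is usually described as a power law: loss falls as a fixed power of model size, with similar curves for data and compute. Below is a minimal sketch of that shape. The constants `scale` and `exponent` are invented for illustration and are not the fitted values from the paper.

```python
def power_law_loss(parameter_count, scale=1e13, exponent=0.08):
    """Illustrative power law: loss = (scale / parameter_count) ** exponent.
    Both constants are made up for this sketch."""
    return (scale / parameter_count) ** exponent

for parameter_count in [1e8, 1e9, 1e10, 1e11]:
    loss = power_law_loss(parameter_count)
    print(f"{parameter_count:.0e} parameters -> loss {loss:.3f}")

# On a power law, every 10x increase in size multiplies the loss
# by the same factor, which is what makes extrapolation possible.
ratio_small = power_law_loss(1e9) / power_law_loss(1e8)
ratio_large = power_law_loss(1e11) / power_law_loss(1e10)
print(ratio_small, ratio_large)
```

Because the ratio is the same at every size, a few small training runs can hint at what a much larger one might achieve.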
The Chinchilla lesson: do not just make the model huge
Then came another important insight. Hoffmann et al.’s Chinchilla paper argued that many large language models were undertrained. That means they had lots of parameters, but they had not been trained on enough tokens.
Imagine giving someone a giant brain, a huge desk, three monitors, and then letting them read only half a Wikipedia article and a restaurant menu.
That is not compute-optimal.
The Chinchilla result suggested that for a fixed compute budget, you should balance:
model size
number of training tokens
Instead of spending all your compute on making the model bigger, it may be better to train a somewhat smaller model on much more data.
In other words:
More parameters are good.
More data is good.
But balance is better.
This changed how people thought about training large models.
It was not just:
Make it bigger.
It became:
Make it the right size for the amount of data and compute you have.
Less catchy. More correct.
Like most useful engineering advice.
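A popular back-of-the-envelope reading of this result uses two approximations: training compute is roughly 6 × parameters × tokens (in floating-point operations), and a compute-optimal model sees roughly 20 training tokens per parameter. Both are rules of thumb rather than exact laws, and the function below is a hypothetical helper built on them.

```python
import math

FLOPS_PER_PARAMETER_PER_TOKEN = 6   # common approximation for training cost
TOKENS_PER_PARAMETER = 20           # rough rule of thumb, not an exact law

def balanced_model_size(compute_budget_flops):
    """Split a compute budget between parameters and training tokens.

    Solves compute = 6 * parameters * tokens with tokens = 20 * parameters.
    """
    parameters = math.sqrt(
        compute_budget_flops
        / (FLOPS_PER_PARAMETER_PER_TOKEN * TOKENS_PER_PARAMETER)
    )
    tokens = TOKENS_PER_PARAMETER * parameters
    return parameters, tokens

parameters, tokens = balanced_model_size(5.88e23)
print(f"{parameters:.2e} parameters, {tokens:.2e} tokens")
```

With a budget of about 5.88e23 floating-point operations, this gives roughly 70 billion parameters and 1.4 trillion tokens, which is in line with the sizes reported for Chinchilla itself.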
Why “more data + more compute” changed everything
Modern language models did not become powerful because someone found one magic trick hidden under a GPU. The big change came from combining several things:
First, there was much more text data available.
Second, hardware improved, especially GPUs and large-scale training infrastructure.
Third, researchers learned how to train large neural networks more reliably.
Fourth, scaling laws gave people confidence that larger training runs would likely produce better models.
And finally, organizations became willing to spend truly impressive amounts of money on compute.
The result was a jump from models that could complete simple sentences to models that can write code, explain concepts, translate languages, summarize documents, help with debugging, draft blog posts, and politely pretend that your variable naming is fine.
So does the model understand language?
This is the spicy question.
A language model is trained to predict tokens. That sounds simple. Almost too simple. However, to predict language well, the model has to learn a lot of structure.
To predict:
The Eiffel Tower is located in ___
it helps to know geography.
To predict:
public async Task
it helps to know programming patterns.
To predict:
I’m sorry for your loss
it helps to know emotional and social context.
And to predict:
The cat is doing forbidden chemistry in the litter ___
it helps to know that cats use litter boxes, that “forbidden chemistry” is a joke, and that “box” is more likely than “compiler.”
The model is still optimizing a statistical objective: next-token prediction.
It does not learn like humans do.
It does not experience the world.
It does not have beliefs, intentions, or a tiny internal cat causing chaos.
But at massive scale, next-token prediction forces the model to learn rich internal representations of language, facts, style, and reasoning patterns.
That is why something trained on a simple prediction task can become useful for many different tasks.
Prediction becomes representation.
Representation becomes capability.
Capability becomes someone asking it to rewrite a message so it sounds “friendly but not too friendly.”
A tiny example
Suppose the training text contains:
The cat is doing forbidden chemistry in the litter box.
During training, the model might see partial sequences like:
The
The cat
The cat is
The cat is doing
The cat is doing forbidden
The cat is doing forbidden chemistry
The cat is doing forbidden chemistry in
The cat is doing forbidden chemistry in the
The cat is doing forbidden chemistry in the litter
At each step, it tries to predict the next token.
When it sees:
The cat is doing forbidden chemistry in the litter ___
it should learn that:
box
is a very likely continuation.
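The partial sequences above can be generated mechanically. Here is a minimal sketch that splits on whole words for readability (a real pipeline would work on tokenizer output) and pairs every prefix with the word that follows it. `make_training_pairs` is a name invented for this example.

```python
def make_training_pairs(words):
    """Pair each prefix of the sequence with the word that follows it."""
    pairs = []
    for split_point in range(1, len(words)):
        context = words[:split_point]
        target = words[split_point]
        pairs.append((context, target))
    return pairs

words = "The cat is doing forbidden chemistry in the litter box".split()
training_pairs = make_training_pairs(words)

for context, target in training_pairs:
    print(" ".join(context), "->", target)
```

One ten-word sentence yields nine small prediction exercises, and the last one is the familiar "litter" followed by "box".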
But the model is not only learning this exact sentence.
It is also learning smaller reusable patterns:
“cat” relates to pets
“litter box” is a common phrase
“doing forbidden chemistry” is humorous figurative language
sentence structure affects what comes next
context changes probability
Across billions or trillions of examples, these tiny lessons stack up.
One prediction at a time, the model gets better.
Not because it was explicitly taught every rule, but because predicting text at scale requires learning a surprising amount about how language works.
Why this became so powerful
The important part is not just that models predict the next token. The important part is that they do it at enormous scale. At small scale, next-token prediction gives you autocomplete.
At large scale, enough parameters + enough data + enough compute + the same basic training objective = much richer behavior.
The model learns grammar, facts, style, code patterns, reasoning templates, and social conventions. It learns that a recipe and a legal contract should not sound the same. It learns that an error message followed by a stack trace probably means debugging, and that an email starting with “Dear manager” (whoever starts emails like this, really) probably shouldn’t be written in slang.
This is why scale changed everything.
The task stayed simple.
The results did not.
Thank you for staying with us until the end. Make sure to check out our previous post, the series introduction, where we cover more topics at a higher level. And stay tuned for more by following me on LinkedIn!
More interesting reads
If you want to read more, I highly recommend:
Scaling Laws for Neural Language Models
Visualising machine learning
Training Compute-Optimal Large Language Models <- The Chinchilla lesson 🙂
Bonus: Cat gets a bill
Thank you, AI, for helping me visualise this process. I’m terrible at graphic design. This is a simplified version, heavily simplified actually. But let’s treat it as a small sneak peek of what’s coming later.
