
LLMs for Boomers - Pt. 1

Aug 24, 2025 · 7 min read

tl;dr: If you don't know anything about large language models, all you need to know is that transformers were invented by Google, words are now tokens, data preparation is half the battle, and context is everything. Keep reading if you want to know what made ChatGPT possible.

“When I was a kid…”

…in the summer of 2020, I had my first software engineering internship. I was tasked with using an LLM to try to predict events from Twitter data.

I hadn’t taken my ML/AI courses yet, so I opened up Google and searched, “LLMs for dummies.” This is what I found out…

The Road to LLMs: BERT and GPT

  • In 2013, Google researchers invented “word embeddings.” Suddenly, computers could understand relationships between words, like “Paris is to France what Rome is to Italy.”
  • In 2017, another Google team published “Attention Is All You Need.” This paper introduced the Transformer, a model architecture that used self-attention to read entire sentences at once instead of word by word.
    • Example: In “I love rocks because they are so cool,” the Transformer can figure out that “they” refers to rocks.

That breakthrough changed everything. And that was the paper I read Day 1 of my internship.

In 2018, two landmark models appeared, both built on the Transformer:

  • BERT (Google) – An encoder that learned by filling in missing words (like Mad Libs). Great at understanding text, not at writing it.
  • GPT (OpenAI) – A decoder that learned by predicting the next word. This small design choice made it a writer, not just a reader.

They were born the same year, from the same Transformer “parent,” but trained with different goals. BERT became the master of understanding; GPT, with scale and compute, became the master of generation.


Why GPT Surprised Everyone

At the time, most experts thought BERT was the future. After all, it was already powering Google Search. GPT, by contrast, looked like “just autocomplete.” But when OpenAI scaled it up — more data, more compute, bigger context windows — it didn’t just parrot words back. It could write essays, answer questions, even hold conversations.

Why this matters beyond the engineering: it reminds us that creativity isn’t purely human invention. We’re uncovering structures that God planted in language, logic, pattern. GPT’s surprise is a small glimpse into the order God built into the world and into our minds.

To summarize:
The Transformer unlocked the door in 2017.
BERT and GPT walked through it together in 2018 — but in different directions.
One became the master of understanding; the other, with scale and compute, became the master of generation.

At this point you must be thinking, how do computers read words in the first place?


From Words to Tokens

Why start with tokens?

Because everything complex is built from smaller pieces. That is how God designed creation itself — atoms form molecules, letters form words, habits form character. Nothing is random.

And at the heart of every LLM is a simple fact: computers don’t understand words; they understand numbers. Before any model can learn patterns in language, text has to be broken into tokens and mapped to IDs (numbers).

Think of it like teaching a child to read:

  • At first, you break down sentences into letters and words.
  • Then you assign each letter a place in the alphabet.
  • Finally, you make meaning from the sequence.

LLMs do something very similar, but they don’t use an alphabet—they use a vocabulary of tokens.


Step 1: Tokenization

Tokenization is just a fancy word for “chopping up text.”

  • A naïve way is splitting on spaces. That gives you words, but also problems. For example:
    • "Hello," vs "Hello" → two different tokens, even though they mean the same thing.
  • A better way is regex tokenization, where you explicitly separate punctuation, words, and symbols.
  • The modern solution is subword tokenization (like GPT-2 uses). This splits words into smaller, reusable chunks.
    • Example: unbelievable → ["un", "believ", "able"].
    • This way, the model never gets stuck on words it hasn’t seen before.

👉 Key takeaway: Tokenization ensures any text can be broken into manageable pieces the model knows how to handle.
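To make this concrete, here’s a minimal sketch of all three approaches. It assumes the tiktoken library for the GPT-2 subword tokenizer (my choice for illustration; nothing above requires that particular tool):

```python
import re

import tiktoken  # pip install tiktoken -- provides the GPT-2 BPE tokenizer (an assumed choice for this sketch)

text = 'Hello, world. Is this a test?'

# 1. Naive whitespace split: punctuation stays glued to words ("Hello," vs "Hello")
print(text.split())

# 2. Regex split: peel punctuation off into its own tokens
tokens = re.split(r'([,.:;?_!"()\']|\s)', text)
print([t for t in tokens if t.strip()])

# 3. Subword (BPE) tokenization, as used by GPT-2
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("unbelievable")
print(ids)                              # a few subword IDs
print([enc.decode([i]) for i in ids])   # the reusable subword pieces
```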


Step 2: From tokens to IDs

Once text is split into tokens, each token gets assigned a unique number (ID).

  • "the" might be 5, "dog" might be 317, and so on.
  • This is called a vocabulary: the lookup table that maps tokens ↔ IDs.

The model doesn’t see "dog"; it sees a number like 317, and a whole sentence becomes a sequence of IDs like [5, 317, 92]. Later, these IDs are turned into embeddings—vectors the model actually learns from.
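Here’s a toy sketch of that lookup table. The specific IDs (like 5 and 317) are only illustrative; a real vocabulary is built from a huge corpus and has tens of thousands of entries:

```python
# Build a toy vocabulary from a tiny "corpus"
corpus = "the quick brown fox jumps over the lazy dog".split()

vocab = {tok: i for i, tok in enumerate(sorted(set(corpus)))}   # token -> ID
inv_vocab = {i: tok for tok, i in vocab.items()}                # ID -> token

ids = [vocab[t] for t in ["the", "dog"]]
print(ids)                          # e.g. [7, 1] -- the exact numbers depend on the vocabulary
print([inv_vocab[i] for i in ids])  # back to ['the', 'dog']
```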


Step 3: Making training samples

Now that we have IDs, we need to build training examples. LLMs learn by predicting the next token in a sequence.

Imagine our text:

the quick brown fox jumps
  • Input (x): [the, quick, brown, fox]
  • Target (y): [quick, brown, fox, jumps]

The model sees x and tries to predict y. This simple shift is how GPT models learn to write, reason, and even code.
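In code, that shift is a single line. A minimal sketch using the same five-word example (plain words stand in for token IDs here):

```python
tokens = ["the", "quick", "brown", "fox", "jumps"]  # in practice these would be token IDs

# The target sequence is just the input shifted left by one position
x = tokens[:-1]   # ["the", "quick", "brown", "fox"]
y = tokens[1:]    # ["quick", "brown", "fox", "jumps"]

# Every prefix of x is a prompt, and the matching element of y is the "right answer"
for i in range(len(x)):
    print(x[: i + 1], "-->", y[i])
```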


Step 4: Context length (T)

Models can’t look at infinite text at once—they have a context window.

  • If T=8, the model only sees 8 tokens at a time.
  • More tokens = better understanding of context, but also more memory and compute.
  • Attention cost grows roughly like T². Double T, and the compute goes up 4×.

👉 That’s why context length is such a big deal when comparing models.
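You can see the quadratic growth with some back-of-the-envelope arithmetic (the token counts below are just examples):

```python
# Self-attention compares every token with every other token: roughly T * T comparisons
for T in (1_024, 2_048, 4_096):
    print(f"T = {T:>5}  ->  ~{T * T:,} pairwise comparisons")
# Each doubling of T roughly quadruples the attention work
```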

💡 Fun Fact: The Pro version of ChatGPT currently has a 128K-token context window. And at the time of writing, Gemini Pro has a context window of 1 million tokens… which means it can process up to 1,500 pages of text or 30K lines of code simultaneously.


Step 5: Stride

When preparing data, we slide a window across the token stream. Stride is how far we move the window each time.

  • Small stride (e.g., 1) → lots of overlap, many training examples, but more redundancy.
  • Large stride (e.g., 64) → fewer, more unique examples.

Stride is like setting the pace in a marching band—tight steps give more rhythm, big steps cover more ground.
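Here’s a hedged sketch of the sliding-window idea; make_windows is my own name for the helper, not something from a library:

```python
def make_windows(ids, max_length=4, stride=2):
    """Slide a window of max_length tokens across the ID stream, stepping by stride.
    Each window yields one (x, y) pair, with y shifted one token to the right."""
    pairs = []
    for start in range(0, len(ids) - max_length, stride):
        x = ids[start : start + max_length]
        y = ids[start + 1 : start + max_length + 1]
        pairs.append((x, y))
    return pairs

ids = list(range(12))  # stand-in for real token IDs
print(len(make_windows(ids, max_length=4, stride=1)))  # small stride: many overlapping pairs
print(len(make_windows(ids, max_length=4, stride=4)))  # large stride: fewer, more unique pairs
```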


Why this all matters

Everything in this chapter (and this post) is about feeding the model the right kind of data:

  • Text → tokens → IDs → (x, y) pairs → ready for embeddings.

Along the way, we solved real-world problems (punctuation, out-of-vocabulary words, batching). Without this foundation, the Transformer architecture behind these models wouldn’t have anything to chew on.
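Put end to end, the whole pipeline fits in a few lines. This sketch reuses tiktoken and the sliding-window idea from above (both are my illustrative choices, not requirements):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "the quick brown fox jumps over the lazy dog"
ids = enc.encode(text)  # text -> tokens -> IDs

max_length, stride = 4, 2
for start in range(0, len(ids) - max_length, stride):
    x = ids[start : start + max_length]           # input IDs
    y = ids[start + 1 : start + max_length + 1]   # targets, shifted by one
    print(x, "-->", y)                            # (x, y) pairs, ready for embeddings
```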


You now know:

  • Who BERT and GPT really are :)
  • What tokens and vocabularies are.
  • Why subword tokenization (WordPiece & BPE) solved a major problem.
  • How context length and stride shape training data.
  • How (x, y) shifting sets up the prediction task.

But more importantly, you see how much of what makes LLMs powerful is structure, order, and relationship — things God designed us to notice.

When we understand these ideas, we see not just what AI can do, but why it should lead us toward humility, gratitude, and wisdom, instead of towards greed or fear.

The fact that human engineers can map tokens and build predictions is possible only because God made a world with patterns that can be mapped. Our creativity is real, but it’s also reflective of our Creator.

This brilliance is a gift — coming from the same One who spoke galaxies into being and formed our minds so we could explore them.


Final Thoughts

Looking back, the early LLM days taught us that data preparation is half the battle. You can’t train a good model if your inputs are messy or incomplete. That’s why Chapter 2 of Build a Large Language Model (From Scratch) by Sebastian Raschka is all about tokenization and batching—it’s the invisible plumbing that makes LLMs possible.

In the next post, we’ll move past the building blocks and into the model itself — where token IDs come alive as embeddings and attention layers, the core ingredients that let machines begin to capture meaning in language.