PART 1
Welcome to the first blog in our series on Large Language Models (LLMs), the technology powering tools like ChatGPT. While these AI systems may seem magical, they’re built on complex pipelines and huge amounts of data. In this blog, we’ll break down the first stage in creating an LLM – pre-training – and look at how the data that fuels these systems is gathered and prepared.
What Is a Large Language Model, Really?
Before diving into the technical pipeline, let’s address a question many people have:
What exactly are you talking to when you interact with ChatGPT?
You’re interacting with a software model that has learned, by reading vast amounts of text, how to predict and generate human-like language. It doesn’t understand text the way humans do, but it has developed statistical associations and linguistic structures that allow it to respond in surprisingly coherent and useful ways.
But how does it get this knowledge?
Step 1: Pre-Training – The Foundation of Learning
LLMs like ChatGPT are built through a multi-stage process, and the first stage is pre-training. This is where the model learns language itself – grammar, facts, reasoning patterns, and writing styles – by processing massive amounts of text.
But where does this text come from?

Mining the Internet: Building a Training Dataset
To train a language model, companies begin by gathering enormous quantities of text data from the internet. Think of it like downloading a giant chunk of the readable internet. This includes:
- News articles
- Wikipedia entries
- Blogs
- Forums
- Books
- Code repositories
- Educational materials
However, not all internet content is high quality or appropriate. So the data must be filtered, cleaned, and curated before it can be used.
A good example of this is FineWeb, a curated dataset developed by the AI company Hugging Face. It’s similar to what major players like OpenAI, Google, and Anthropic use internally, though they each have their own proprietary processes and datasets.
FineWeb contains about 44 terabytes of cleaned and filtered text data. While that might sound like a lot, it fits on a handful of modern hard drives – a surprisingly small slice of the internet, carefully selected for quality and diversity.
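If you’re curious, you can peek at FineWeb yourself. Here’s a minimal sketch using the Hugging Face datasets library, assuming the dataset is still published as HuggingFaceFW/fineweb with a sample-10BT subset (true at the time of writing – check the dataset card for current names and fields):

```python
# A minimal sketch of browsing FineWeb with the Hugging Face `datasets` library.
# Repo name, subset name, and field names ("url", "text") are taken from the
# dataset card at the time of writing; verify before relying on them.
from datasets import load_dataset

# Stream the data so nothing close to 44 TB ever touches your disk.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, doc in enumerate(fineweb):
    print(doc["url"])           # where the page came from
    print(doc["text"][:200])    # first 200 characters of the cleaned text
    if i == 2:
        break
```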
Step 2: The Crawling Begins – Common Crawl
Before any filtering begins, most LLM training pipelines rely on massive-scale web snapshots provided by Common Crawl, a nonprofit that has been crawling the web since 2007.
- Common Crawl’s bots start from a set of seed websites and follow hyperlinks outward, indexing a broad swath of the public internet.
- A single recent crawl (2024) contains over 2.7 billion pages across domains and topics.
But this raw dump is messy and noisy: it includes everything from valuable articles to boilerplate HTML, spam, and broken pages. This initial crawl is only the starting point; it then passes through the rigorous filtering and preprocessing steps outlined below.
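To get a feel for what the raw crawl looks like before any cleaning, here’s a rough sketch of iterating over a single Common Crawl WARC file with the warcio library. The file name below is a placeholder, not a real crawl segment:

```python
# A minimal sketch of reading one Common Crawl WARC file with `warcio`.
# The path is a placeholder; real crawl segments are listed on commoncrawl.org.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":                       # actual page fetches
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()                # raw bytes of the page
            print(url, len(html), "bytes")
```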
Step 3: Filtering the Noise
Once the raw HTML pages are collected, they must be filtered. Here are some of the key steps:
1. URL Filtering
Before downloading content, curators apply rules to select only high-quality, content-rich URLs.
- Exclude: spammy domains, pornography, low-content pages (like image galleries, login portals), known SEO farms.
- Include: encyclopedias, news sites, open forums, research publications, code repositories, etc.
Techniques:
- Domain allowlists/blocklists
- URL pattern filtering (e.g., avoiding paths like /login, /cart, /comments) – see the sketch below
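Here’s a toy illustration of what URL filtering can look like in practice. The blocked domains and path patterns are invented for the example, not a real production list:

```python
# A toy URL filter: a domain blocklist plus simple path rules.
# The specific domains and patterns below are made-up examples.
import re
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam-example.biz", "seo-farm-example.net"}
BLOCKED_PATH_PATTERN = re.compile(r"/(login|cart|comments)\b")

def keep_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.netloc in BLOCKED_DOMAINS:
        return False                       # drop known low-quality domains
    if BLOCKED_PATH_PATTERN.search(parsed.path):
        return False                       # drop login pages, carts, comment feeds
    return True

print(keep_url("https://en.wikipedia.org/wiki/Language_model"))   # True
print(keep_url("https://shop-example.com/cart/checkout"))         # False
```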
2. Content Extraction
Once URLs are selected, the HTML content is downloaded — but HTML is messy. You need to extract just the human-readable text.
- Strip away:
- HTML tags
- JavaScript and CSS
- Navigation bars, footers, sidebars
- Tools used: Readability libraries (like Mozilla Readability or jusText), boilerplate removal tools, or custom HTML parsers.
Goal: Isolate meaningful, central content (e.g., the main article or answer).
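As a simplified stand-in for those tools, here’s a sketch that uses BeautifulSoup to strip scripts, styles, and layout elements and keep only the readable text. Real pipelines use purpose-built extractors like the ones named above:

```python
# A simplified content extractor: drop non-content tags, keep readable text.
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts, styles, and common layout elements.
    for tag in soup(["script", "style", "nav", "footer", "aside", "header"]):
        tag.decompose()
    # Collapse whatever remains into plain text, one block per line.
    text = soup.get_text(separator="\n")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

html = "<html><body><nav>Home | About</nav><p>LLMs learn from text.</p></body></html>"
print(extract_text(html))   # -> LLMs learn from text.
```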
3. Language Filtering
Many LLMs are trained to work in specific languages (e.g., English, Hindi, Tamil). After extraction, documents are passed through a language identification model.
- Keep only documents in supported languages
- Use classifiers like fastText, Compact Language Detector, or in-house models
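Here’s a minimal sketch of that step using fastText’s publicly available language-ID model (lid.176.bin). The confidence threshold is illustrative, not a value any particular pipeline is known to use:

```python
# A minimal language filter built on fastText's off-the-shelf language-ID
# model (lid.176.bin, downloadable from the fastText website).
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_document(text: str, wanted=("en",), min_confidence=0.65) -> bool:
    # fastText expects a single line of text for prediction.
    labels, probs = model.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    return lang in wanted and probs[0] >= min_confidence

print(keep_document("Large language models are trained on web text."))     # True
print(keep_document("Les modèles de langage sont entraînés sur le web."))  # False (detected as French)
```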
4. Deduplication
The internet is full of repeated content: mirror sites, reposts, templates, copied articles.
- Algorithms like MinHash or SimHash help identify exact and near-duplicate texts.
- Helps reduce noise and make learning more efficient
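Below is a small sketch of near-duplicate detection with MinHash, using the datasketch library. The word-level shingling and similarity threshold are kept deliberately simple for illustration:

```python
# Near-duplicate detection with MinHash + LSH via the `datasketch` library.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in set(text.lower().split()):   # word-level shingles, kept simple
        m.update(word.encode("utf8"))
    return m

docs = {
    "a": "Large language models are trained on huge amounts of web text.",
    "b": "Large language models are trained on huge amounts of internet text.",
    "c": "Bananas are an excellent source of potassium.",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
for key, text in docs.items():
    m = minhash(text)
    duplicates = lsh.query(m)             # any previously seen near-duplicates?
    if duplicates:
        print(f"{key} looks like a near-duplicate of {duplicates}")
    else:
        lsh.insert(key, m)                # keep this one, remember its signature
```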
5. Toxicity and Safety Filtering
To avoid training the model on harmful, hateful, or dangerous content, documents are filtered based on toxicity.
- Use classifiers trained to detect offensive language
- Remove or down-weight documents flagged as harmful, violent, or discriminatory
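Schematically, this step boils down to scoring each document and keeping only those below a toxicity threshold. The scorer and threshold below are placeholders standing in for a real trained classifier:

```python
# A schematic toxicity filter. `score_toxicity` stands in for whatever
# classifier a real pipeline uses; the threshold is purely illustrative.
from typing import Callable

def filter_toxic(docs: list[str],
                 score_toxicity: Callable[[str], float],
                 threshold: float = 0.8) -> list[str]:
    # Keep only documents whose toxicity score falls below the threshold.
    return [doc for doc in docs if score_toxicity(doc) < threshold]

# Toy stand-in scorer: flags documents containing words from a tiny blocklist.
BLOCKLIST = {"hateword1", "hateword2"}        # placeholder terms
def toy_scorer(doc: str) -> float:
    return 1.0 if BLOCKLIST & set(doc.lower().split()) else 0.0

print(filter_toxic(["a friendly sentence", "a sentence with hateword1"], toy_scorer))
# -> ['a friendly sentence']
```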
6. Quality Scoring
Not all clean text is good text. Some pages may be poorly written, nonsensical, or low-value.
- Apply quality classifiers to score documents on grammar, coherence, informativeness
- Discard or down-weight low-scoring entries
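In practice this can be a trained classifier or a stack of hand-written heuristics. Here’s a toy heuristic scorer in that spirit; the signals and cutoffs are purely illustrative:

```python
# A toy rule-based quality scorer. Real pipelines use different (and far more
# careful) signals; these numbers are invented for illustration.
def quality_score(text: str) -> float:
    words = text.split()
    if len(words) < 5:
        return 0.0                                        # too short to judge
    alpha_fraction = sum(c.isalpha() for c in text) / max(len(text), 1)
    mean_word_len = sum(len(w) for w in words) / len(words)
    ends_properly = text.strip().endswith((".", "!", "?"))
    score = 0.0
    score += 0.4 if alpha_fraction > 0.7 else 0.0         # mostly letters, not markup
    score += 0.4 if 3 <= mean_word_len <= 10 else 0.0     # plausible word lengths
    score += 0.2 if ends_properly else 0.0                # looks like real prose
    return score

print(quality_score("Pre-training teaches the model grammar, facts, and style."))  # 1.0
print(quality_score("$$$ >>> 12345 <<< $$$ >>> 67890 <<<"))                        # 0.4
```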
7. Formatting and Normalization
Cleaned documents undergo final normalization steps to prepare them for tokenization.
- Fix encoding issues (e.g., character corruption)
- Normalize quotes, dashes, punctuation
- Remove odd whitespace, line breaks, non-text artifacts
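Here’s a minimal sketch of this clean-up pass, using the ftfy library for encoding repair plus a few hand-rolled rules (the exact rules vary from pipeline to pipeline):

```python
# A minimal normalization pass: fix encoding corruption, canonicalize Unicode,
# straighten quotes, and tidy whitespace. Rules here are illustrative.
import re
import unicodedata
import ftfy

def normalize(text: str) -> str:
    text = ftfy.fix_text(text)                     # repair mojibake / encoding corruption
    text = unicodedata.normalize("NFC", text)      # canonical Unicode form
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # curly -> straight quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)         # cap consecutive blank lines
    return text.strip()

print(normalize("\u201cCurly quotes\u201d   and    odd   spacing\n\n\n\nfixed."))
# -> "Curly quotes" and odd spacing ... fixed. (straight quotes, tidy whitespace)
```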
8. Dataset Balancing and Metadata Tagging
Curators often aim for balanced representation across topics, domains, or languages.
- Adjust proportions: e.g., ensure technical papers aren’t overwhelmed by Reddit posts
- Tag documents with metadata like source, domain, language, quality score, etc.
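Schematically, balancing often comes down to re-weighting how often each source is sampled, with metadata attached so every document’s origin can be traced later. The sources, weights, and fields below are invented for illustration:

```python
# A schematic look at dataset balancing: re-weight how often each source is
# sampled. Sources and weights are invented; real mixtures are tuned empirically.
import random

corpus = {
    "wikipedia": ["doc_w1", "doc_w2"],                      # placeholders for real documents
    "reddit":    ["doc_r1", "doc_r2", "doc_r3", "doc_r4"],
    "arxiv":     ["doc_a1"],
}

# Target sampling weights per source (sum to 1.0).
weights = {"wikipedia": 0.4, "reddit": 0.3, "arxiv": 0.3}

def sample_document():
    source = random.choices(list(weights), weights=list(weights.values()))[0]
    doc = random.choice(corpus[source])
    # Attach metadata so later stages can trace where the text came from.
    return {"text": doc, "source": source, "language": "en"}

print(sample_document())
```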
Why All This Matters
You can think of this stage as building the “textual brain” of the AI. If you feed it junk, you get junk out. But if you carefully feed it diverse, well-structured, and meaningful content, the result is a model that can answer questions, write code, compose essays, and more.
Up Next
In the next blogs of the series, we’ll dive into how models tokenize language, process information through neural networks, and generate intelligent outputs. We’ll also explore post-training techniques, fine-tuning, reinforcement learning, and how models like GPT and Llama truly “think.”
Stay tuned!