Unit 3: Pretraining
In this unit, you will get an overview of issues related to the development of large language models, with a focus on the pretraining stage. In particular, the unit covers pretraining data, scaling laws, the systems perspective on training, and the environmental impact of LLMs.
Lectures
The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are pretrained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the systems perspective of training and the environmental cost of chatbot technology.
| Section | Title | Video | Slides | Quiz |
|---|---|---|---|---|
| 3.1 | Introduction to LLM development | video | slides | quiz |
| 3.2 | Training LLMs | video | slides | quiz |
| 3.3 | Data for LLM pretraining | video | slides | quiz |
| 3.4 | Scaling laws | video | slides | quiz |
| 3.5 | Emergent abilities of LLMs | video | slides | quiz |
| 3.6 | Environmental cost of chatbot technology | video | slides | quiz |
To earn a wildcard for this unit, you must complete the quizzes no later than the day before the online meeting.
Online meeting
The discussion at the online meeting will focus on some of the ethical aspects of training large language models. In particular, we will look at the working conditions of workers employed by AI companies to filter harmful content out of pretraining data.
The meeting will take place on 2026-04-01 between 18:00 and 20:00. A Zoom link will be sent out via the course mailing list.
Lab
Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.
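To give a flavour of what "working through the pretraining process" means, here is a minimal sketch of the next-token-prediction objective that pretraining optimises. It uses a toy character-level bigram model in NumPy rather than an actual GPT, and the corpus, vocabulary, and hyperparameters are all illustrative, not taken from the lab itself.

```python
import numpy as np

# Toy corpus and character-level vocabulary (an illustrative stand-in
# for the web-scale token data used in real pretraining).
corpus = "hello world hello model"
vocab = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(vocab)}
V = len(vocab)

# Training pairs: each character is trained to predict the next one,
# which is the next-token-prediction objective behind LLM pretraining.
xs = np.array([stoi[c] for c in corpus[:-1]])
ys = np.array([stoi[c] for c in corpus[1:]])

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))  # W[i, j]: logit for token j after token i

def loss_and_grad(W):
    """Mean cross-entropy of next-token predictions, plus its gradient."""
    logits = W[xs]                                   # (N, V), a copy of the rows
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over the vocabulary
    n = len(xs)
    loss = -np.log(probs[np.arange(n), ys]).mean()
    dlogits = probs                                  # d(loss)/d(logits) = probs - onehot(ys)
    dlogits[np.arange(n), ys] -= 1
    dlogits /= n
    dW = np.zeros_like(W)
    np.add.at(dW, xs, dlogits)                       # scatter gradients back to the rows used
    return loss, dW

initial_loss, _ = loss_and_grad(W)
for step in range(200):
    loss, dW = loss_and_grad(W)
    W -= 1.0 * dW    # plain gradient descent; real pretraining uses AdamW, schedules, etc.
final_loss, _ = loss_and_grad(W)
```

The loss starts near log V (a uniform guess over the vocabulary) and falls as the model absorbs the bigram statistics of the corpus; the lab replaces this toy model with a transformer and the string with a curated dataset, but the training loop has the same shape.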