Unit 3: Pretraining
In this unit, you will get an overview of issues related to the development of large language models, with a focus on the pretraining stage. In particular, the unit covers pretraining data, scaling laws, the systems perspective on training, and the environmental impact of LLMs.
Lectures
The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are pretrained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the systems perspective of training and the environmental cost of chatbot technology.
| Section | Title | Video | Slides | Quiz |
|---|---|---|---|---|
| 3.1 | Introduction to LLM development | video | slides | quiz |
| 3.2 | Training LLMs | video | slides | quiz |
| 3.3 | Data for LLM pretraining | video | slides | quiz |
| 3.4 | Scaling laws | video | slides | quiz |
| 3.5 | Emergent abilities of LLMs | video | slides | quiz |
| 3.6 | Environmental cost of chatbot technology | video | slides | quiz |
To earn a wildcard for this unit, you must complete the quizzes no later than the day before the online meeting.
Online meeting
The discussion at the online meeting will focus on some of the ethical aspects of training large language models. In particular, we will look at the working conditions of workers employed by AI companies to filter harmful content out of pretraining data.
The meeting will take place on 2026-04-01 between 18:00 and 20:00. A Zoom link will be sent out via the course mailing list.
Lab
Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.
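To give a flavour of what "working through the pretraining process" means, here is a minimal sketch of the next-token-prediction objective that pretraining optimises. It uses a toy character-level bigram model in NumPy rather than an actual GPT, and the corpus, vocabulary, and hyperparameters are all illustrative, not taken from the lab itself.

```python
import numpy as np

# Toy corpus and character-level vocabulary (an illustrative stand-in
# for the web-scale token data used in real pretraining).
corpus = "hello world hello model"
vocab = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(vocab)}
V = len(vocab)

# Training pairs: each character is trained to predict the next one,
# which is the next-token-prediction objective behind LLM pretraining.
xs = np.array([stoi[c] for c in corpus[:-1]])
ys = np.array([stoi[c] for c in corpus[1:]])

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))  # W[i, j]: logit for token j after token i

def loss_and_grad(W):
    """Mean cross-entropy of next-token predictions, plus its gradient."""
    logits = W[xs]                                   # (N, V), a copy of the rows
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over the vocabulary
    n = len(xs)
    loss = -np.log(probs[np.arange(n), ys]).mean()
    dlogits = probs                                  # d(loss)/d(logits) = probs - onehot(ys)
    dlogits[np.arange(n), ys] -= 1
    dlogits /= n
    dW = np.zeros_like(W)
    np.add.at(dW, xs, dlogits)                       # scatter gradients back to the rows used
    return loss, dW

initial_loss, _ = loss_and_grad(W)
for step in range(200):
    loss, dW = loss_and_grad(W)
    W -= 1.0 * dW    # plain gradient descent; real pretraining uses AdamW, schedules, etc.
final_loss, _ = loss_and_grad(W)
```

The loss starts near log V (a uniform guess over the vocabulary) and falls as the model absorbs the bigram statistics of the corpus; the lab replaces this toy model with a transformer and the string with a curated dataset, but the training loop has the same shape.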