Unit 3: Pretraining

Published: February 26, 2026

In this unit, you will get an overview of key issues in the development of large language models (LLMs), with a focus on the pretraining stage. In particular, the unit covers pretraining data, scaling laws, the systems perspective on training, and the environmental impact of LLMs.

Lectures

The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are pretrained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the systems perspective on training and the environmental cost of chatbot technology.

Section  Title                                     Video  Slides  Quiz
3.1      Introduction to LLM development           video  slides  quiz
3.2      Training LLMs                             video  slides  quiz
3.3      Data for LLM pretraining                  video  slides  quiz
3.4      Scaling laws                              video  slides  quiz
3.5      Emergent abilities of LLMs                video  slides  quiz
3.6      Environmental cost of chatbot technology  video  slides  quiz
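As a taste of the scaling-laws lecture (3.4), here is a back-of-the-envelope sizing calculation. The two constants are common heuristics from the literature, not figures from this course: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal token budget of roughly D ≈ 20·N (the "Chinchilla" rule of thumb).

```python
def compute_optimal(flops_budget, tokens_per_param=20):
    """Split a FLOPs budget into parameters N and training tokens D.

    Assumes C = 6 * N * D and the compute-optimal ratio D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n = (flops_budget / (6 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

n, d = compute_optimal(1e21)  # e.g. a 1e21-FLOP training budget
print(f"N = {n:.2e} params, D = {d:.2e} tokens")
```

Under these assumptions, a 1e21-FLOP budget points to a model of a few billion parameters trained on tens of billions of tokens; the lecture discusses where such fitted constants come from and how much to trust them.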
Quiz deadline

To earn a wildcard for this unit, you must complete the quizzes no later than the day before the online meeting.

Online meeting

The discussion at the online meeting will focus on some of the ethical aspects of training large language models. In particular, we will look at the working conditions of the workers employed by AI companies to filter harmful content out of pretraining data.

Meeting details

The meeting will take place on 2026-04-01 from 18:00 to 20:00. A Zoom link will be sent out via the course mailing list.

Lab

Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.
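At its core, the pretraining process you will work through in the lab minimises next-token cross-entropy with gradient descent. The following is a minimal sketch of that objective using a toy bigram model in plain Python; it is a hypothetical illustration, not the lab's actual GPT code.

```python
import math

# Toy next-token corpus over a 3-token vocabulary: "a" -> "b" -> "c" -> "a".
vocab = ["a", "b", "c"]
pairs = [(0, 1), (1, 2), (2, 0), (0, 1), (1, 2)]

V = len(vocab)
W = [[0.0] * V for _ in range(V)]  # one logit row per previous token

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def sgd_epoch(lr=0.5):
    """One SGD pass over the corpus; returns the mean cross-entropy."""
    total = 0.0
    for prev, nxt in pairs:
        probs = softmax(W[prev])
        total += -math.log(probs[nxt])
        # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(nxt)
        for j in range(V):
            W[prev][j] -= lr * (probs[j] - (1.0 if j == nxt else 0.0))
    return total / len(pairs)

losses = [sgd_epoch() for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A real GPT replaces the bigram logit table with a transformer over long contexts, but the loss, the gradient signal, and the update loop are conceptually the same; the lab's efficiency optimisations target exactly this loop at scale.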

View the lab on GitLab