Unit 3: Pretraining

Published: February 2, 2026

In this unit, you will get an overview of key issues in the development of large language models, with a focus on the pretraining stage. In particular, the unit covers pretraining data, scaling laws, the systems perspective on training, and the environmental impact of LLMs.

Lectures

The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are pretrained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the systems perspective on training and the environmental cost of chatbot technology.

| Section | Title | Video | Slides | Quiz |
|---------|-------|-------|--------|------|
| 3.1 | Introduction to LLM development | video | slides | quiz |
| 3.2 | Training LLMs | video | slides | quiz |
| 3.3 | Data for LLM pretraining | video | slides | quiz |
| 3.4 | Scaling laws | video | slides | quiz |
| 3.5 | Emergent abilities of LLMs | video | slides | quiz |
| 3.6 | Environmental cost of chatbot technology | video | slides | quiz |
Important: Quiz deadline

To earn a wildcard for this unit, you must complete the quizzes before the teaching session on Unit 3.

Additional materials

Lab

Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.

Link to the lab
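
To give a flavour of what the lab involves, below is a minimal sketch of next-token-prediction pretraining in PyTorch. The tiny `TinyLM` model, the random stand-in corpus, and all hyperparameters are illustrative assumptions for this page, not the lab's actual code or settings.

```python
# A minimal sketch of next-token-prediction pretraining, assuming PyTorch.
# The tiny model and the random "corpus" below are stand-ins for illustration,
# not the actual lab setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB_SIZE = 256    # e.g. byte-level tokens
CONTEXT_LEN = 64
BATCH_SIZE = 16

# Stand-in "corpus": random token ids; in the lab this would be curated text data.
corpus = torch.randint(0, VOCAB_SIZE, (100_000,))

def get_batch():
    """Sample random context windows and their next-token targets."""
    starts = torch.randint(0, len(corpus) - CONTEXT_LEN - 1, (BATCH_SIZE,))
    x = torch.stack([corpus[s : s + CONTEXT_LEN] for s in starts])
    y = torch.stack([corpus[s + 1 : s + CONTEXT_LEN + 1] for s in starts])
    return x, y

class TinyLM(nn.Module):
    """A deliberately small stand-in for a GPT-style decoder."""
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, idx):
        h = self.embed(idx)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1))
        h = self.block(h, src_mask=mask)
        return self.head(h)  # logits over the vocabulary

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(200):
    x, y = get_batch()
    logits = model(x)
    # Standard language-modelling objective: cross-entropy on next-token prediction.
    loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```

In the lab, a full GPT architecture and a curated text corpus take the place of these stand-ins, and the efficiency optimisations and data-curation choices you explore act on exactly this kind of training loop.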