Unit 3: Developing LLMs
In this unit, you will get an overview of key issues in the development of large language models. In particular, the unit covers training strategies, pretraining data, scaling laws, emergent abilities of LLMs, and LLM alignment.
Lectures
The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are trained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the emergent abilities of these models and the techniques used to align them with human goals and values.
| Section | Title | Video | Slides | Quiz |
|---|---|---|---|---|
| 3.1 | Introduction to LLM development | video | slides | quiz |
| 3.2 | Training LLMs | video | slides | quiz |
| 3.3 | Data for LLM pretraining | video | slides | quiz |
| 3.4 | Scaling laws | video | slides | quiz |
| 3.5 | Emergent abilities of LLMs | video | slides | quiz |
| 3.6 | LLM alignment | video | slides | quiz |
To earn a wildcard for this unit, you must complete the quizzes no later than 2025-11-03.
Online meeting
During the online meeting, we will examine the ethical and environmental implications of LLMs. In particular, we will discuss the working conditions of the human annotators who create alignment data, especially in countries of the Global South, as well as the environmental costs of training and deploying these models.
The meeting will take place on 2025-11-04 between 18:00 and 20:00. A Zoom link will be sent out via the course mailing list.
Note that the date of the online meeting was changed at short notice!
Additional materials
- Sample solutions to the quizzes
- The Environmental Impact of ChatGPT
- OpenAI Used Kenyan Workers to Make ChatGPT Less Toxic
- Save the AI (Satirical Information on AI’s Environmental Impact)
- Codecarbon (Track CO2 Emissions)
Lab
Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.
If you want a written review of this lab, you must submit it (via Lisam) no later than 2025-12-19.
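To give a concrete sense of what the lab involves, here is a minimal sketch of a single pretraining step for a small GPT-style model. It assumes PyTorch; the model class (`TinyGPT`), the hyperparameters, and the toy random batch are illustrative placeholders, not the lab's actual starter code or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGPT(nn.Module):
    """A deliberately small decoder-only model, for illustration only."""

    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier positions.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=idx.device), diagonal=1
        )
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)


model = TinyGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy batch of random token ids, standing in for a tokenised pretraining corpus.
batch = torch.randint(0, 256, (8, 64))
inputs, targets = batch[:, :-1], batch[:, 1:]  # next-token prediction objective

logits = model(inputs)  # shape: (batch, sequence length, vocabulary size)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```

The sketch deliberately omits everything the lab asks you to explore on top of this basic loop, such as the choice of training settings, efficiency optimisations, and the curation of the pretraining data.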