Unit 3: Developing LLMs
In this unit, you will get an overview of key issues in the development of large language models. In particular, the unit covers training strategies, pretraining data, scaling laws, emergent abilities of LLMs, and LLM alignment.
Lectures
The lectures begin by introducing the key stages in LLM development. Next, you will learn how LLMs are trained and how large-scale datasets and scaling laws shape their performance. Finally, the lectures explore the emergent abilities of these models and the techniques used to align them with human goals and values.
| Section | Title | Video | Slides | Quiz |
|---|---|---|---|---|
| 3.1 | Introduction to LLM development | video | slides | quiz |
| 3.2 | Training LLMs | video | slides | quiz |
| 3.3 | Data for LLM pretraining | video | slides | quiz |
| 3.4 | Scaling laws | video | slides | quiz |
| 3.5 | Emergent abilities of LLMs | video | slides | quiz |
| 3.6 | LLM alignment | video | slides | quiz |
To earn a wildcard for this unit, you must complete the quizzes no later than 2025-11-03.
Online meeting
During the online meeting, we will examine the ethical and environmental implications of LLMs. In particular, we will discuss the working conditions of the human annotators who create alignment data, especially in countries of the Global South, as well as the environmental costs of training and deploying these models.
The meeting will take place on 2025-11-04 between 18:00 and 20:00. A Zoom link will be sent out via the course mailing list.
Note that the date of the online meeting was changed at short notice!
Additional materials
- Sample solutions to the quizzes
- The Environmental Impact of ChatGPT
- OpenAI Used Kenyan Workers to Make ChatGPT Less Toxic
- Save the AI (Satirical Information on AI’s Environmental Impact)
- Codecarbon (Track CO2 Emissions)
Lab
Lab 3 is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model.
If you want a written review of this lab, you must submit it (via Lisam) no later than 2025-12-19.
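To give a concrete sense of what the lab involves, here is a minimal sketch of a single pretraining step for a small GPT-style model. It assumes PyTorch; the model class (`TinyGPT`), the hyperparameters, and the toy random batch are illustrative placeholders, not the lab's actual starter code or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGPT(nn.Module):
    """A deliberately small decoder-only model, for illustration only."""

    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier positions.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=idx.device), diagonal=1
        )
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)


model = TinyGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy batch of random token ids, standing in for a tokenised pretraining corpus.
batch = torch.randint(0, 256, (8, 64))
inputs, targets = batch[:, :-1], batch[:, 1:]  # next-token prediction objective

logits = model(inputs)  # shape: (batch, sequence length, vocabulary size)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```

The sketch deliberately omits everything the lab asks you to explore on top of this basic loop, such as the choice of training settings, efficiency optimisations, and the curation of the pretraining data.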