Unit 1: Tokenisation and embeddings

Published

September 4, 2025

This unit covers tokenisation and embeddings, two fundamental concepts of modern NLP. Tokenisers split text into smaller units such as words, subwords, or characters. Embeddings are fixed-size vector representations of tokens that can be learned from data and optimised for different tasks.
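
To make these two ideas concrete, here is a minimal Python sketch (not course code) that tokenises a string at the word and character level and then looks up a token in a toy embedding table; the tiny vocabulary and the 4-dimensional vectors are illustrative assumptions.

```python
import numpy as np

text = "low lower lowest"

# Word-level tokenisation: split on whitespace.
word_tokens = text.split()        # ['low', 'lower', 'lowest']

# Character-level tokenisation: every character is a token.
char_tokens = list(text)          # ['l', 'o', 'w', ' ', 'l', ...]

# An embedding table maps each token id to a fixed-size vector.
vocab = {tok: i for i, tok in enumerate(word_tokens)}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # one 4-dimensional vector per token

# Looking up an embedding is just a row lookup; in practice the rows
# are learned from data rather than drawn at random.
print(embeddings[vocab["lower"]])
```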

Lectures

The lectures start with traditional word-based tokenisation and then present the Byte Pair Encoding (BPE) algorithm, which is used by most current language models. In the second half of the lectures, you will turn to embeddings and several methods for learning them.
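
As a preview of the BPE material, here is a minimal sketch of the core training loop: repeatedly count adjacent symbol pairs and merge the most frequent one. The toy corpus, the end-of-word marker, and the number of merges are illustrative assumptions, not the lecture or lab implementation.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    # The lookarounds ensure we only match whole symbols, not parts of them.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated symbol sequence with an end marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    vocab = apply_merge(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {''.join(best)}")
```

Running this on the toy corpus merges "e" and "s" first (it occurs 9 times), then builds up "est", "est</w>", and "low", illustrating how BPE grows a subword vocabulary from character-level symbols.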

Each section comes with a video, slides, and a quiz:

1.1 Introduction to tokenisation
1.2 The Byte Pair Encoding algorithm
1.3 Introduction to embeddings
1.4 Word embeddings
1.5 Learning word embeddings: Matrix decomposition
1.6 Learning word embeddings: The skip-gram model
Quiz deadline

You must complete the quizzes no later than 2025-09-16.

Online meeting

In the online meeting for this unit, we will discuss the issue of bias in word embeddings. We will examine how biases arise, how to detect and measure them, and what mitigation strategies exist – along with their trade-offs and limits. These issues have broad consequences for real-world uses of NLP systems.
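
To give a flavour of what "measuring" bias can mean, here is one simple probe (among many possible measures): comparing a word's cosine similarity to two contrasting anchor words. The 3-dimensional vectors below are hand-crafted hypothetical values for illustration only; real studies use trained embeddings and more careful test batteries.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, hand-crafted so that the probe has something to show.
vectors = {
    "he":       np.array([ 1.0, 0.2, 0.1]),
    "she":      np.array([-1.0, 0.2, 0.1]),
    "engineer": np.array([ 0.7, 0.9, 0.3]),
    "nurse":    np.array([-0.6, 0.8, 0.4]),
}

# Bias score: similarity to 'he' minus similarity to 'she'.
for word in ("engineer", "nurse"):
    bias = cosine(vectors[word], vectors["he"]) - cosine(vectors[word], vectors["she"])
    print(f"{word}: bias score {bias:+.2f}")  # positive means closer to 'he'
```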

Meeting details

The meeting will take place on 2025-09-17 from 18:00 to 20:00. A Zoom link will be sent out via the course mailing list.

Lab

In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will code and analyse a tokeniser based on the Byte Pair Encoding (BPE) algorithm, and then explore embeddings in the context of a simple text classifier architecture.
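
For orientation, here is a minimal sketch of one common architecture of this kind: mean-pooled token embeddings followed by a linear classifier. This is not the lab's actual code; the vocabulary size, dimensions, and random (untrained) parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, num_classes = 1000, 16, 2

E = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # embedding table
W = rng.normal(scale=0.1, size=(embedding_dim, num_classes))  # classifier weights
b = np.zeros(num_classes)

def classify(token_ids):
    """Mean-pool the token embeddings, then apply a linear layer and softmax."""
    pooled = E[token_ids].mean(axis=0)   # one fixed-size vector per text
    logits = pooled @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

print(classify([3, 17, 42]))  # class probabilities for a toy token sequence
```

In a real classifier, E, W, and b would be trained on labelled data; mean pooling is what turns a variable-length token sequence into the fixed-size vector the linear layer needs.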

View the lab on GitLab

Review deadline

If you want a written review of this lab, you must submit your solution via Lisam no later than 2025-10-31.