Unit 1: Tokenisation and embeddings
This unit covers tokenisation and embeddings, two fundamental concepts in modern NLP. We start with traditional word-based tokenisation and then present the byte pair encoding (BPE) algorithm, which is used by most current language models. In the second half of the unit, you will learn about embeddings, in particular word embeddings.
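To make the core idea concrete ahead of the lectures, here is a minimal, illustrative sketch of the BPE training loop: repeatedly find the most frequent pair of adjacent symbols in the corpus and merge it into a single new symbol. This is a toy version under simplified assumptions (character-level start symbols, a hand-made word-frequency corpus, hypothetical helper names), not the reference implementation used in section 1.2.

```python
# Toy BPE training loop: find the most frequent adjacent pair and merge it.
# Illustrative only; real tokenisers add byte-level handling, special tokens,
# and efficiency tricks. All names and the corpus below are made up.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over the corpus, return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters, mapped to its frequency.
words = {tuple("low"): 7, tuple("lower"): 5, tuple("newest"): 3}
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)
```

Running the sketch prints the merges in the order they are learned; a trained tokeniser then applies this merge list, in order, to segment new text.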
Lectures
Deadline for the quizzes: 2025-03-05
| Section | Title | Video | Slides | Quiz |
|---|---|---|---|---|
| 1.1 | Introduction to tokenisation | video | slides | quiz |
| 1.2 | The Byte Pair Encoding algorithm | video | slides | quiz |
| 1.3 | Introduction to embeddings | video | slides | quiz |
| 1.4 | Word embeddings | video | slides | quiz |
| 1.5 | Learning word embeddings: Matrix decomposition | video | slides | quiz |
| 1.6 | Learning word embeddings: The skip-gram model | video | slides | quiz |
Lab
Deadline for the lab: 2025-03-26
In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: tokenisation and embeddings. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.
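As a minimal preview of these two steps, the sketch below maps a sentence to tokens, tokens to integer ids, and ids to dense vectors. It assumes naive whitespace tokenisation and a randomly initialised embedding matrix; in the lab itself you will work with a proper subword tokeniser and learned embeddings.

```python
# Sketch: text -> tokens -> integer ids -> dense fixed-size vectors.
# Whitespace splitting and random vectors are simplifying assumptions.
import numpy as np

text = "tokenisation splits text into tokens"
tokens = text.split()                             # naive word-level tokenisation
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

rng = np.random.default_rng(0)
dim = 8                                           # embedding dimensionality
embeddings = rng.normal(size=(len(vocab), dim))   # one row per vocabulary entry

ids = [vocab[tok] for tok in tokens]              # token -> integer id
vectors = embeddings[ids]                         # id -> dense vector (row lookup)
print(tokens)
print(ids)
print(vectors.shape)                              # (5, 8): one vector per token
```

The key point the sketch illustrates is that an embedding layer is just a lookup table: each token id selects one row of a matrix, and it is those rows that a model later learns to adjust.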