Unit 1: Tokenisation and embeddings

Published: January 19, 2026

This unit covers tokenisation and embeddings, two fundamental concepts of modern NLP. Tokenisers split text into smaller units such as words, subwords, or characters. Embeddings are fixed-size vector representations of tokens that can be learned from data and optimised for different tasks.
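
To make these two ideas concrete, here is a minimal sketch (not the course's implementation): a toy greedy subword tokeniser over an invented vocabulary, paired with a randomly initialised embedding table. The vocabulary entries and the embedding dimension are assumptions for illustration only.

```python
import random

# Hypothetical subword vocabulary, invented for illustration.
vocab = {"un": 0, "happi": 1, "ness": 2, "<unk>": 3}

def tokenise(word: str) -> list[int]:
    """Greedy longest-match subword tokenisation (a simplification)."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                ids.append(vocab[word[i:j]])
                i = j
                break
        else:                               # no vocabulary entry matches here
            ids.append(vocab["<unk>"])
            i += 1
    return ids

# One fixed-size vector per vocabulary entry; in practice these are learned.
random.seed(0)
dim = 4
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

ids = tokenise("unhappiness")               # -> [0, 1, 2]
vectors = [embeddings[i] for i in ids]      # one 4-dimensional vector per token
```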

Lectures

The lectures start with traditional word-based tokenisation and then present the Byte Pair Encoding (BPE) algorithm, which is used by most current language models. In the second half of the lectures, you will learn about embeddings and different methods for learning them.
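
As a preview of the BPE lecture, the sketch below shows the core of BPE training under simplifying assumptions (real implementations also use word frequencies and end-of-word markers): repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus of words."""
    words = [list(w) for w in corpus]          # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                      # count adjacent symbol pairs
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        for w in words:                        # replace the pair everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

merges = bpe_merges(["low", "lower", "lowest", "newest", "widest"], num_merges=4)
# -> [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')] on this toy corpus
```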

| Section | Title | Video | Slides | Quiz |
|---------|-------|-------|--------|------|
| 1.1 | Introduction to tokenisation | video | slides | quiz |
| 1.2 | The Byte Pair Encoding algorithm | video | slides | quiz |
| 1.3 | Tokenisation fairness | TBD | TBD | TBD |
| 1.4 | Introduction to embeddings | video | slides | quiz |
| 1.5 | Word embeddings | video | slides | quiz |
| 1.6 | The skip-gram model | video | slides | quiz |
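
Section 1.6 covers the skip-gram model, which learns embeddings by training centre-word vectors to predict the words around them. As a preview, here is a minimal sketch of how its (centre, context) training pairs can be extracted, assuming a symmetric context window; pairs like these are the model's training data.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Extract (centre, context) training pairs with a symmetric window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                         # skip the centre word itself
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ...]
```
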
Important: Quiz deadline

To earn a wildcard for this unit, you must complete the quizzes before the teaching session on Unit 1.

Lab

In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will implement and analyse a tokeniser based on the BPE algorithm, and then explore embeddings in the context of a simple text classifier architecture.
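
The lab's actual architecture is specified on GitLab; as a rough sketch of the general idea, a simple classifier can average the embeddings of a text's tokens and score each class with a linear layer. All sizes below are invented, and the parameters are random rather than trained.

```python
import random

random.seed(0)
vocab_size, dim, num_classes = 100, 16, 2      # hypothetical sizes

# Randomly initialised parameters; training would adjust both tables.
embeddings = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
weights = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_classes)]

def classify(token_ids: list[int]) -> list[float]:
    """Average the token embeddings, then score each class with a linear layer."""
    avg = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
           for d in range(dim)]
    return [sum(w[d] * avg[d] for d in range(dim)) for w in weights]

scores = classify([5, 17, 42])                 # one unnormalised score per class
```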

View the lab on GitLab