Unit 1: Tokenisation and embeddings

Published

March 30, 2026

This unit covers tokenisation and embeddings, two fundamental concepts in modern NLP. Tokenisers split text into smaller units such as words, subwords, or characters. Embeddings are fixed-size vector representations of tokens (or other discrete entities) that are learned from data and can be optimised for different tasks.
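To make the two ideas concrete, here is a minimal sketch in plain Python. It is not taken from the course materials: the example sentence, the dimension `EMBED_DIM`, and the helper `embed` are illustrative assumptions. The vectors are random here, whereas in practice they would be learned from data.

```python
import random

text = "tokenisers split text"

# Word-level tokenisation: split on whitespace.
word_tokens = text.split()

# Character-level tokenisation: every character, including spaces, is a token.
char_tokens = list(text)

# Toy vocabulary mapping each distinct word token to an integer id.
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}

# Embedding table: one fixed-size vector per vocabulary entry.
# Values are random placeholders (assumption); a real model learns them.
EMBED_DIM = 4
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]


def embed(tokens):
    """Look up the vector for each token (hypothetical helper)."""
    return [embedding_table[vocab[tok]] for tok in tokens]


vectors = embed(word_tokens)
print(word_tokens)
print(len(char_tokens), "character tokens")
print(len(vectors), "vectors of size", len(vectors[0]))
```

Whatever the granularity of the tokens, each one maps to a vector of the same size, which is what "fixed-size" means above.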

Lectures

The lectures start with traditional word-based tokenisation and then present the Byte Pair Encoding (BPE) algorithm, variants of which are used by many current language models. In the second half of the lectures, you will learn about embeddings and different methods for learning them.

Section	Title	Video	Slides
1.1	Introduction to tokenisation	video	slides
1.2	The Byte Pair Encoding algorithm	video	slides
1.3	Tokenisation fairness	video	slides
1.4	Introduction to embeddings	video	slides
1.5	Word embeddings	video	slides
1.6	Contextualised word embeddings	video	slides
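
As a rough companion to Section 1.2, the following sketch shows the core idea of BPE training: repeatedly merge the most frequent adjacent pair of symbols. It is a simplified illustration, not the version presented in the lectures. The function names `learn_bpe`, `merge_pair`, and `apply_bpe` and the toy word counts are assumptions made for this example, and details such as end-of-word markers and byte-level fallback are omitted.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a word-frequency dictionary."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by first occurrence (an arbitrary choice here).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = {}
        for symbols, count in corpus.items():
            merged = merge_pair(symbols, best)
            new_corpus[merged] = new_corpus.get(merged, 0) + count
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenise a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy word counts (assumed for illustration).
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=4)
print(merges)
print(apply_bpe("lowest", merges))
```

With these toy counts and four merges, the unseen word "lowest" is segmented into the subwords "low" and "est", which illustrates how BPE can handle words that never appeared in the training data.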

Additional materials

Article on word embeddings and stereotypes by Garg et al. (2018)

Assignment

Link to the assignment