Unit 1: Tokenisation and embeddings

Published: January 27, 2025

This unit covers tokenisation and embeddings, two fundamental concepts in modern NLP. We start with traditional word-based tokenisation and then present the byte pair encoding (BPE) algorithm, which is used by most current language models. In the second half of the unit, you will learn about embeddings, with a particular focus on word embeddings.
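
To make BPE concrete before the lectures, here is a minimal sketch of its training loop on a toy corpus (the corpus and the number of merges are illustrative): every word starts as a sequence of characters, and the most frequent adjacent pair of symbols is repeatedly merged into a new symbol. Real tokenisers add details such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus (a list of words)."""
    # Represent every word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges(["low", "low", "lower", "newest", "newest"], num_merges=3))
```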

Lectures

Deadline for the quizzes: 2025-03-05

| Section | Title | Video | Slides | Quiz |
|---------|-------|-------|--------|------|
| 1.1 | Introduction to tokenisation | video | slides | quiz |
| 1.2 | The Byte Pair Encoding algorithm | video | slides | quiz |
| 1.3 | Introduction to embeddings | video | slides | quiz |
| 1.4 | Word embeddings | video | slides | quiz |
| 1.5 | Learning word embeddings: Matrix decomposition | video | slides | quiz |
| 1.6 | Learning word embeddings: The skip-gram model | video | slides | quiz |

Lab

Deadline for the lab: 2025-03-26

In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: tokenisation and embeddings. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.
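
As a preview, here is a minimal sketch of both ideas, assuming a toy whitespace tokeniser and a randomly initialised embedding table (the lab's actual tokeniser, vocabulary, and dimensions will differ): text is split into tokens, tokens are mapped to integer ids, and each id selects one dense row of an embedding matrix.

```python
import numpy as np

# Toy vocabulary mapping tokens to integer ids (illustrative only).
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}
embedding_dim = 4

# The embedding table: one dense, fixed-size vector per token.
# Random here; in the lab such vectors are learned from data.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(text):
    # Whitespace tokenisation keeps the example short; a subword
    # tokeniser such as BPE would split rarer words further.
    tokens = text.lower().split()
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return embeddings[ids]  # shape: (num_tokens, embedding_dim)

print(embed("The cat sat").shape)  # (3, 4)
```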

Link to the lab