Unit 1: Tokenisation and embeddings

Published: January 20, 2025

Lectures

The lectures in this unit cover tokenisation and embeddings, two fundamental concepts in modern NLP. We start with traditional word-based tokenisation and then present the Byte-Pair Encoding (BPE) algorithm, which is used by most current language models. In the second half of the unit, you will learn about embeddings, with a focus on word embeddings.
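
To make the core idea of BPE concrete before the lectures, here is a minimal sketch of the training loop in Python. It is a simplified illustration, not the lecture's reference implementation: it omits details such as end-of-word markers and pre-tokenisation, and simply repeats the basic step of merging the most frequent adjacent pair of symbols.

```python
from collections import Counter

def get_pair_counts(words):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing occurrences of `pair` by one merged symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Represent each word as a tuple of characters, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

corpus = "low low low low lower lower newest newest newest widest widest"
print(train_bpe(corpus, 4))
```

Each learned merge rule becomes part of the tokeniser's vocabulary; applying the rules in order segments new text into subword units.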

| Section | Title | Video | Slides | Quiz |
| --- | --- | --- | --- | --- |
| 1.1 | Introduction to tokenisation | video | slides | quiz |
| 1.2 | The Byte Pair Encoding algorithm | video | slides | quiz |
| 1.3 | Introduction to embeddings | video | slides | quiz |
| 1.4 | Word embeddings | video | slides | quiz |
| 1.5 | Learning word embeddings: Matrix decomposition | video | slides | quiz |
| 1.6 | Learning word embeddings: The skip-gram model | video | slides | quiz |

Lab

In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: tokenisation and embeddings. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.
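
As a small illustration of these two concepts (not the lab code itself), the sketch below tokenises a sentence into integer ids and looks up a dense vector for each id. The vocabulary, embedding dimension, and random vectors are made up for the example; in the lab, and in real systems, the embedding table is learned from data.

```python
import numpy as np

# Toy vocabulary; any word outside it maps to the <unk> token.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenise(text):
    """Word-level tokenisation: lowercase, split on whitespace, map to ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Embedding table: one dense, fixed-size vector per token id.
# Random here for illustration; normally these vectors are learned.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

token_ids = tokenise("The cat sat on the mat")
vectors = embedding_table[token_ids]  # one row per token

print(token_ids)      # [1, 2, 3, 4, 1, 5]
print(vectors.shape)  # (6, 4)
```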

Link to the lab