Unit 1: Tokenisation and embeddings
Lectures
The lectures of this unit cover tokenisation and embeddings, two fundamental concepts of modern NLP. We start with traditional word-based tokenisation and then present the Byte Pair Encoding (BPE) algorithm, which is used by most current language models; a minimal code sketch of BPE follows the lecture table below. In the second half of the unit, you will learn about embeddings, in particular word embeddings.
Section | Title | Video | Slides | Quiz |
---|---|---|---|---|
1.1 | Introduction to tokenisation | video | slides | quiz |
1.2 | The Byte Pair Encoding algorithm | video | slides | quiz |
1.3 | Introduction to embeddings | video | slides | quiz |
1.4 | Word embeddings | video | slides | quiz |
1.5 | Learning word embeddings: Matrix decomposition | video | slides | quiz |
1.6 | Learning word embeddings: The skip-gram model | video | slides | quiz |
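As a preview of Section 1.2, here is a minimal sketch of the BPE training loop: repeatedly find the most frequent adjacent symbol pair in the vocabulary and merge it into a new symbol. The toy word frequencies and function names are illustrative assumptions, not taken from the lecture materials.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the two symbols
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word frequencies, with each word split into characters.
vocab = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(5):  # learn five merge rules
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print("merged", pair)
```

Each learned merge rule becomes part of the tokeniser: applying the rules in order to new text reproduces the same subword segmentation.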
Lab
In Lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: tokenisation and embeddings. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.
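To make these two concepts concrete before the lab, here is a minimal sketch of an embedding lookup table. The vocabulary, dimensionality, and vectors are illustrative assumptions; in practice the vectors are learned from data (e.g. via matrix decomposition or skip-gram, Sections 1.5 and 1.6) rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary mapping tokens to row indices in the embedding matrix.
token_to_id = {"the": 0, "cat": 1, "dog": 2, "sat": 3}
embedding_dim = 8  # every token gets a dense vector of this fixed size

# Random placeholder vectors, purely to show the data structure.
embeddings = rng.normal(size=(len(token_to_id), embedding_dim))

def embed(tokens):
    """Look up the dense vector for each token."""
    return embeddings[[token_to_id[t] for t in tokens]]

def cosine(u, v):
    """Cosine similarity, a standard measure of closeness between embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = embed(["cat", "dog"])
print(vectors.shape)                    # (2, 8): two tokens, eight dimensions
print(cosine(vectors[0], vectors[1]))   # meaningful only once vectors are learned
```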