Week 2
I hope you are settling into the course! I am currently travelling to attend a thesis defence, so my newsletter for this week is slightly delayed – but here it comes.
This week: Tokenisation and embeddings
This unit covers tokenisation and embeddings, two fundamental concepts of modern NLP. Tokenisers split text into smaller units such as words, subwords, or characters. Embeddings are fixed-size vector representations of tokens that can be learned from data and optimised for different tasks.
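To make the second idea concrete, here is a minimal sketch (a toy example of my own, not lab code) of how token ids are mapped to learnable vectors using PyTorch's nn.Embedding; the vocabulary and the embedding size are made up for illustration:

    import torch
    import torch.nn as nn

    # Toy vocabulary: each token is assigned an integer id.
    vocab = {"the": 0, "cat": 1, "sat": 2, "##s": 3}

    # An embedding layer is a learnable lookup table:
    # one row (here of size 8) per token in the vocabulary.
    embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

    # Map a token sequence to ids and look up their vectors.
    token_ids = torch.tensor([vocab["the"], vocab["cat"], vocab["sat"]])
    vectors = embedding(token_ids)
    print(vectors.shape)  # torch.Size([3, 8])

During training, the rows of this table are updated by gradient descent just like any other model parameters, which is what "learned from data and optimised for different tasks" means in practice.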
In lab 1, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will code and analyse a tokeniser based on the Byte Pair Encoding (BPE) algorithm and then explore embeddings in the context of a simple text classifier architecture.
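If you would like a taste of the lab before it opens, the sketch below shows the core idea behind BPE – repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. The corpus and the helper names (most_frequent_pair, merge_pair) are my own illustrations; the lab's actual implementation and interface will look different:

    from collections import Counter

    # Toy corpus: each word is a tuple of symbols with a frequency count.
    corpus = {("l", "o", "w"): 5,
              ("l", "o", "w", "e", "r"): 2,
              ("n", "e", "w", "e", "s", "t"): 6}

    def most_frequent_pair(corpus):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for word, count in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        return pairs.most_common(1)[0][0]

    def merge_pair(corpus, pair):
        # Replace every occurrence of the pair with a single merged symbol.
        merged = {}
        for word, count in corpus.items():
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = count
        return merged

    pair = most_frequent_pair(corpus)   # ('w', 'e') in this toy corpus
    corpus = merge_pair(corpus, pair)
    print(corpus)

A full BPE tokeniser simply repeats this merge step for a fixed number of iterations and records the merges, which are later replayed to tokenise new text.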
To-do this week
Here is a list of to-do items for this week:
Todo 1: Register for the course
LiU requires you to register for your courses in Ladok no later than one week after they begin. Please do this as soon as possible to keep access to the mailing list.
Todo 2: Find a lab partner
In addition to the Ladok registration, you also need to register your lab group in Webreg. The deadline for this is 30 January. I hope that most of you have found a lab partner by now; if not, do not worry: register as a one-person group, and we will pair you up with someone else in the same situation.
Next week: LLM architectures
In the next unit, you will explore the Transformer architecture, which forms the foundation of today’s large language models. You will also learn about the two main types of language models built on this architecture: decoder-based models (such as GPT) and encoder-based models (such as BERT).
In lab 2, you will do a deep dive into the inner workings of the GPT architecture. You will walk through a complete implementation of the architecture in PyTorch, instantiate this implementation with pre-trained weights, and put the resulting model to the test by generating text.
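If you are curious already now, the snippet below shows roughly what "generating text from a pre-trained GPT-style model" looks like when using the Hugging Face transformers library (my choice for this preview, not part of the lab); in the lab itself you will instead work with a plain PyTorch implementation of the architecture:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small pre-trained GPT-2 model and its tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Encode a prompt, let the model continue it, and decode the result.
    inputs = tokenizer("Natural language processing is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

See you next week!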