Assignment 1: Introduction to language modeling
Language modeling is the foundation that recent advances in NLP technologies build on. In essence, language modeling means that we learn how to imitate the language that we observe in the wild. More formally, we want to train a system that models the statistical distribution of natural language. Solving this task is exactly what the famous commercial large language models do (with some additional post-hoc tweaking to make the systems more interactive and avoid generating provocative outputs).
In the course, we will cover a variety of technical solutions to this fundamental task (in most cases, various types of Transformers). In this first assignment of the course, we are going to build a neural network-based language model that uses recurrent neural networks (RNNs) to model the interaction between words.
However, setting up the neural network itself is a small part of this assignment, and the main focus is on all the other steps we have to carry out in order to train a language model. That is: we need to process the text files, manage the vocabulary, run the training loop, and evaluate the trained models.
About this document
The work for your submission is described in Part 1–Part 5 below.
Your work is divided into a set of numbered Tasks. The tasks that have the academic cap symbol (🎓) are those that you can select for the final oral exam, while tasks that have the gear symbol (⚙) are pure implementation tasks that are less interesting to discuss.
There are Hints at various places in the instructions. You can click on these Hints to expand them to get some additional advice.
Pedagogical purposes of this assignment
- Introducing the task of language modeling,
- Getting experience of preprocessing text,
- Understanding the concept of word embeddings,
- Refreshing basic skills in how to set up and train a neural network,
- Introducing some parts of the HuggingFace ecosystem.
Prerequisites
We expect that you can program in Python and that you have some knowledge of basic object-oriented programming. We will use terms such as “classes”, “methods”, “attributes”, “functions” and so on.
On the theoretical side, you will need to remember fundamental concepts related to neural networks such as forward and backward passes, batches, initialization, optimization.
On the practical side, you will need to understand the basics of PyTorch such as tensors, models, optimizers, loss functions and how to write the training loop. (If you need a refresher, there are plenty of tutorials available, for instance on the PyTorch website.) In particular, the Optimizing Model Parameters tutorial contains more or less everything you need to know for this assignment about PyTorch training loops.
Submission requirements
Submission of this assignment for feedback is optional. If you want feedback, please submit your solution in Canvas.
Submission deadline: April 20.
You can submit a link to a GitHub repository, or alternatively a set of Python files or notebooks containing your solution to the programming tasks described below. In addition, include a document indicating which of the assignment tasks you would like to receive feedback for.
This is a pure programming assignment and you do not have to write a technical report or explain details of your solution: at the end of the course, there will be a separate individual oral exam where you will discuss a subset of the assignment tasks.
Part 0: Preliminaries
⚙ Task 0.1: Setting up the environment
You can in principle solve this assignment on a regular laptop, but it will be boring to train the full language model on a machine that does not have a GPU available. For this reason, we recommend using a compute cluster such as Alvis or Berzelius.
Make sure that the following libraries are installed:
- NLTK or SpaCy for word splitting,
- PyTorch for building and training the models,
- Transformers and Datasets from HuggingFace,
- Optional: Matplotlib and scikit-learn for the embedding visualization in the last step.
If you are using a Colab notebook, these libraries are already installed.
Then download and extract this archive. It contains the text files and a code skeleton to help get you started.
Part 1: Tokenization
Terminological note: It can be useful to keep in mind that people in NLP use the word tokenization in a couple of different ways. Traditionally, tokenization referred to the process of splitting texts into separate words. More recently, tokenization typically tends to mean all preprocessing steps we carry out to convert text into a numerical format suitable for neural networks. To avoid confusion, in this assignment we will use the term tokenization in the modern sense, and use the term word splitting otherwise.
⚙ Task 1.1: Using NLTK or SpaCy for word splitting
In this assignment, you will just use an existing library to split texts into words. Popular NLP libraries such as SpaCy and NLTK come with built-in functions for this purpose. We recommend NLTK in this assignment since it is somewhat faster than SpaCy and somewhat easier to use.
Hint: How to use NLTK’s English word splitter.
Import the function word_tokenize from the nltk library. If you are running this on your own machine, you will first need to install NLTK with pip or conda. In Colab, NLTK is already installed.
For instance, word_tokenize("Let's test!!") should give the result ["Let", "'s", "test", "!", "!"].
🎓 Task 1.2: Building the vocabulary
Each nonempty line in the text files corresponds to one paragraph in Wikipedia. Apply the tokenizer to all paragraphs in the training and validation datasets. Convert all words into lowercase.
Create a function that goes through the training text and creates a vocabulary: a mapping from token strings to integers.
In addition, the vocabulary should contain 4 special symbols:
- a symbol for previously unseen or low-frequency tokens,
- a symbol we will put at the beginning of each paragraph,
- a symbol we will put at the end of each paragraph.
- a symbol we will use for padding so that we can make input tensors rectangular.
The total size of the vocabulary (including the 4 symbols) should be at most max_voc_size, which is a user-specified hyperparameter. If the number of unique tokens in the text is greater than max_voc_size, then use the most frequent ones.
Hint: A Counter can be convenient when computing the frequencies.
Counter is like a regular Python dictionary, with some additional functionality for computing frequencies. For instance, you can go through each paragraph and call update. After building the Counter on your dataset, most_common gives the most frequent items.
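As a rough illustration of the Counter-based approach, the following sketch counts tokens over a few toy paragraphs and keeps the most frequent ones. The function name build_vocab, the strings used for the special symbols, and the toy data are assumptions made for this example; they are not taken from the skeleton code.

```python
from collections import Counter

# Hypothetical strings for the four special symbols (choose your own).
SPECIALS = ['<PAD>', '<BOS>', '<EOS>', '<UNK>']

def build_vocab(tokenized_paragraphs, max_voc_size):
    """Map the most frequent tokens to integers, reserving room for the specials."""
    counts = Counter()
    for tokens in tokenized_paragraphs:
        counts.update(tokens)                  # adds 1 for each token occurrence
    n_words = max_voc_size - len(SPECIALS)     # slots left for real words
    words = [w for w, _ in counts.most_common(n_words)]
    str_to_int = {s: i for i, s in enumerate(SPECIALS + words)}
    int_to_str = {i: s for s, i in str_to_int.items()}
    return str_to_int, int_to_str

# Toy input: paragraphs that have already been split into lowercase words.
toy = [['the', 'cat', 'sat'], ['the', 'dog', 'and', 'the', 'cat']]
str_to_int, int_to_str = build_vocab(toy, max_voc_size=6)
print(str_to_int)
```

With max_voc_size=6, only the two most frequent words fit next to the four special symbols, so the remaining words would have to be mapped to the unknown-token symbol at encoding time.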
Also create some utility that allows you to go back from the integer to the original word token. This will only be used in the final part of the assignment, where we look at model outputs and word embedding neighbors.
Example: you might end up with something like this:
str_to_int = { 'BEGINNING':0, 'END':1, 'UNKNOWN':2, 'PAD': 3, 'the':4, 'and':5, ... }
int_to_str = { 0:'BEGINNING', 1:'END', 2:'UNKNOWN', 3:'PAD', 4:'the', 5:'and', ... }
Sanity check: after creating the vocabulary, make sure that
- the size of your vocabulary is not greater than the max vocabulary size you specified,
- the 4 special symbols exist in the vocabulary and that they don’t coincide with any real words,
- some highly frequent example words (e.g. “the”, “and”) are included in the vocabulary but that some rare words (e.g. “cuboidal”, “epiglottis”) are not.
- if you take some test word, you can map it to an integer and then back to the original test word using the inverse mapping.
⚙ Task 1.3: Implementing a HuggingFace-like Tokenizer
Now, we turn to the task of implementing the utility that will turn a text into a numerical format that can be provided to neural networks as an input. Our implementation will be functionally similar to the tokenizers provided by the HuggingFace library.
Write code for the missing parts in the A1Tokenizer in the skeleton Python file. You will need to implement the three methods __init__, __call__, and __len__. Most of the work will be done in __call__: __init__ is simply where you pass the information you need to set up the tokenizer, and __len__ should just return the size of the vocabulary.
Hint: The weird-looking method __call__ is a special method that allows an object to be called like a function.
That is, the following two calls are equivalent:
tokenizer(some_texts)
tokenizer.__call__(some_texts)
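To make the role of these special methods concrete, here is a minimal sketch of a class that defines __call__ and __len__. The class ToyTokenizer and its whitespace-based splitting are hypothetical simplifications for illustration only; they do not reproduce the A1Tokenizer interface from the skeleton.

```python
class ToyTokenizer:
    """Toy stand-in that shows the special methods only, not real tokenization."""

    def __init__(self, vocab):
        # Store whatever is needed later; here, just a string-to-integer mapping.
        self.vocab = vocab

    def __call__(self, texts):
        # Invoked when the object is used like a function: tok(texts).
        unk = self.vocab['<UNK>']
        return [[self.vocab.get(w, unk) for w in t.lower().split()] for t in texts]

    def __len__(self):
        # Invoked by len(tok).
        return len(self.vocab)

tok = ToyTokenizer({'<UNK>': 0, 'a': 1, 'test': 2})
print(tok(['A test', 'Another test']))            # calls __call__ implicitly
print(tok.__call__(['A test', 'Another test']))   # the same call, written explicitly
print(len(tok))
```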
It can be useful to create a function that first builds the vocabulary and then creates the tokenizer object, so that you can build the tokenizer in one step. The skeleton includes a function build_tokenizer exemplifying the interface of such a function.
Sanity check: Apply your tokenizer to an input consisting of a few texts and make sure that it seems to work. In particular, verify that the tokenizer can create a tensor output in a situation where the input texts do not contain the same number of words: in these cases, the shorter texts should be “padded” on the right side. For instance:
tokenizer = (... create your tokenizer...)
test_texts = ['This is a test.', 'Another test.']
tokenizer(test_texts, return_tensors='pt', padding=True,
          truncation=True)
The result should be something similar to the following example output (assuming that the integer 0 corresponds to the padding dummy token):
{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0]]),
 'input_ids': tensor([[2, 35, 14, 11, 965, 6, 3],
                      [2, 153, 965, 6, 3, 0, 0]])}
Verify that at least the input_ids tensor corresponds to what you expect. (As mentioned in the skeleton code, the attention_mask is optional for this assignment.)
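The right-padding behavior described above can be sketched in plain Python as follows. The helper name pad_batch and its max_length parameter are assumptions for this illustration, and the integer ids are simply copied from the example output, with 0 assumed to be the padding id.

```python
def pad_batch(encoded, pad_id, max_length=None):
    """Right-pad lists of token ids to a common length and build an attention mask."""
    if max_length is not None:
        # Naive truncation: simply cut off everything beyond max_length.
        encoded = [ids[:max_length] for ids in encoded]
    width = max(len(ids) for ids in encoded)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in encoded]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[2, 35, 14, 11, 965, 6, 3], [2, 153, 965, 6, 3]], pad_id=0)
print(batch['input_ids'])
print(batch['attention_mask'])
```

Because all rows now have the same length, the nested lists can be converted directly with torch.tensor when the caller asks for tensor output.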
When you are confident that your tokenizer works correctly, save it to a file (your_tokenizer.save('some_file_name')) so that you do not have to re-create it every time you run your program. You load the saved tokenizer by calling A1Tokenizer.from_file('some_file_name').
Part 2: Loading the text files and creating batches
(This part just introduces some functionalities you may find useful when processing the data: it functions as a stepping stone for what you will do in Part 4. You do not have to include solutions to this part in your submission.)
⚙ Task 2.1: Loading the texts
We will use the HuggingFace Datasets library to load the texts from the training and validation text files. (You may feel that we are overdoing it, since these are simple text files, but once again we want to introduce you to the standard ecosystem used in NLP.)
from datasets import load_dataset
dataset = load_dataset('text', data_files={'train': TRAIN_FILE, 'val': VAL_FILE})
The training and validation sections can now be accessed as dataset['train'] and dataset['val'] respectively. The datasets internally use the Arrow format for efficiency; in practice, they can be accessed as if they were regular Python lists. That is: you can write dataset['train'][8] to access the instance at index 8 (the ninth text, since indexing starts at 0).
Each instance in the training and validation sets corresponds to a Wikipedia paragraph. Now, remove empty lines from the data:
dataset = dataset.filter(lambda x: x['text'].strip() != '')
Sanity check: after loading the datasets and removing empty lines, you should have around 147,000 training and 18,000 validation instances.
Optionally, it can be useful in the development phase to work with smaller datasets. The following is one way of achieving that:
from torch.utils.data import Subset
for sec in ['train', 'val']:
    dataset[sec] = Subset(dataset[sec], range(1000))
⚙ Task 2.2: Iterating through the datasets.
When training and running neural networks, we typically use batching: that is, to improve computational efficiency, we process several instances in parallel. We will use the DataLoader utility from PyTorch. Data loaders help users iterate through a dataset and create batches.
Hint: More information about DataLoader.
PyTorch provides a utility called DataLoader to help us create batches. It can work on a variety of underlying data structures, but in this assignment, we’ll just apply it to the datasets you prepared previously.
from torch.utils.data import DataLoader
dl = DataLoader(your_dataset, batch_size=..., shuffle=...)
The arguments here are as follows:
- batch_size: the number of instances in each batch.
- shuffle: whether or not we rearrange the instances randomly. It is common to shuffle instances while training.
Once you have created a DataLoader, you can iterate through the dataset batch by batch:
for batch in dl:
    ... do something with each batch ...
Sanity check: create a DataLoader, look at the first batch, and confirm that it corresponds to your expectations.
for batch in dl:
    print(batch)
    break
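As a small illustration of what a batch can look like, the sketch below applies a DataLoader to a toy list of dictionaries that mimics the structure of the text datasets (one 'text' field per instance). The toy data is an assumption for this example; your real batches will contain Wikipedia paragraphs.

```python
from torch.utils.data import DataLoader

# Toy stand-in for dataset['train']: a list of dictionaries with a 'text' field.
toy_dataset = [{'text': f'paragraph {i}'} for i in range(5)]

dl = DataLoader(toy_dataset, batch_size=2, shuffle=False)
first_batch = next(iter(dl))
print(first_batch)   # a dictionary holding a list of 2 strings
print(len(dl))       # number of batches: 3, where the last batch has 1 instance
```

With the default collation, the strings of a batch are gathered into a list under the same key, which is a convenient input format for a tokenizer that accepts a list of texts.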
Optional task: we are keeping it a bit simple here. If you want to be even more closely aligned with the HuggingFace standard API, you should also 1) use tokenized texts in the Datasets instead of raw text, and 2) apply a collator, such as DataCollatorForLanguageModeling.
Part 3: Defining the language model neural network and computing the loss
🎓 Task 3.1: Setting up the network
Define a neural network that implements an RNN-based autoregressive language model. Use the skeleton provided in the class A1RNNModel. It should include the following layers:
- an embedding layer that maps token integers to floating-point vectors,
- a recurrent layer implementing some RNN variant (we suggest nn.LSTM or nn.GRU, and it is best to avoid the “basic” nn.RNN),
- an output layer (or unembedding layer) that computes (the logits of) a probability distribution over the vocabulary.
Once again, we base our implementation on the HuggingFace Transformers library, to exemplify how models are defined when we use this library. Specifically, note that
- The model hyperparameters are stored in a configuration object A1RNNModelConfig that inherits from HuggingFace’s PretrainedConfig;
- The neural network class inherits from HuggingFace’s PreTrainedModel rather than PyTorch’s nn.Module.
When you set up your model, you should use the hyperparameters stored in the A1RNNModelConfig.
Hint: If you are doing the batching as recommended above, you should set batch_first=True when declaring the RNN.
The input to an RNN is a 3-dimensional tensor. If we set batch_first=True, then we assume that the input tensor is arranged as (B, N, E) where B is the batch size, N is the sequence length, and E the embedding dimensionality. In this case, the RNN “walks” along the second dimension: that is, over the sequence of tokens.
If on the other hand you set batch_first=False, then the RNN walks along the first dimension of the input tensor and it is assumed to be arranged as (N, B, E).
Hint: How to apply RNNs in PyTorch.
Take a look at the documentation of one of the RNN types in PyTorch. For instance, here is the documentation of nn.LSTM. In particular, look at the section called Outputs. It is important to note here that all types of RNNs return two outputs when you call them in the forward pass. In this assignment, you will need the first of these outputs, which corresponds to the RNN’s output for each token. (The second output holds the final hidden state of each layer; for nn.LSTM, it is a tuple that also includes the final cell state.)
class MyRNNBasedLanguageModel(nn.Module):
    def __init__(self, ... ):
        super().__init__()
        ... initialize model components here ...

    def forward(self, input_ids, labels=None):
        embedded = ... apply the embedding layer ...
        # The first output holds one vector per token; the second (final states) is not needed here.
        rnn_out, _ = self.rnn(embedded)
        ... do the rest ...
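To check your understanding of the tensor shapes, the following sketch pushes a random batch through an embedding layer, an LSTM declared with batch_first=True, and a linear output layer. All sizes are arbitrary toy values and the variable names are assumptions for this example; in your model, the sizes should come from the configuration object.

```python
import torch
from torch import nn

# Toy sizes: batch, sequence length, vocabulary, embedding and hidden dimensions.
B, N, V, E, H = 2, 5, 100, 8, 16

embedding = nn.Embedding(V, E)
rnn = nn.LSTM(input_size=E, hidden_size=H, batch_first=True)
unembedding = nn.Linear(H, V)

input_ids = torch.randint(0, V, (B, N))   # random token ids of shape (B, N)
embedded = embedding(input_ids)           # (B, N, E)
rnn_out, (h_n, c_n) = rnn(embedded)       # rnn_out: (B, N, H); h_n and c_n: (1, B, H)
logits = unembedding(rnn_out)             # (B, N, V)
print(embedded.shape, rnn_out.shape, h_n.shape, logits.shape)
```

Note that the final states h_n and c_n keep the layer dimension first even when batch_first=True, which is one more reason to work with the first output here.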
🎓 Task 3.2: Computing the loss
If forward in your model has been called with the optional labels argument, then you should also compute a loss. For language modeling, this is simply the categorical cross-entropy loss, applied to all the tokens in the batch. There are a few non-trivial things to keep in mind in this step. We will discuss these points in the following three hints:
Hint: exclude the last position of the logits tensor and the first position of the labels tensor.
In HuggingFace language models, labels is typically identical to input_ids, except that padding tokens in the input are replaced by the dummy token id -100 (we will discuss this in Part 4).
For instance, let’s say our training text is Good stuff ! (in practice, the words will be integer-coded). That means that at the first word (Good), we want the model to predict the second word (stuff). At the second word, the goal is to predict !.
So when you compute the loss, you will exclude the last position of the logits tensor because we do not observe what happens after !, and similarly you will exclude the first position of labels because we are not observing anything before Good.
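The alignment between positions and prediction targets can be illustrated with plain Python lists. The sketch below uses strings instead of integers for readability, and the names BEGINNING and END for the special symbols are just one possible choice.

```python
tokens = ['BEGINNING', 'Good', 'stuff', '!', 'END']

inputs = tokens[:-1]    # drop the last position: nothing is observed after it
targets = tokens[1:]    # drop the first position: nothing precedes it

# Each position is paired with the token that follows it.
for context_word, next_word in zip(inputs, targets):
    print(f'at {context_word!r}, predict {next_word!r}')
```

The same slicing idea applies along the sequence dimension of the logits and labels tensors.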
Hint: how to apply the loss function when training a language model.
The loss function (CrossEntropyLoss) expects two input tensors:
- the logits (that is: the unnormalized log probabilities) of the predictions,
- the targets, that is the true output values we want the model to predict, referred to as labels in HuggingFace.
Here, the targets tensor is expected to be one-dimensional (of length B, where B is the batch size) and the logits tensor to be two-dimensional (of shape (B, V) where V is the number of choices).
In our case, the loss function’s expected input format requires a small trick, since our targets/labels tensor is two-dimensional (B, N) where N is the maximal text length in the batch. Analogously, the logits tensor is three-dimensional (B, N, V). To deal with this, you need to reshape the tensors before applying the loss function. If view raises an error because the tensor is not contiguous (for example, after slicing), use reshape instead.
labels = labels.view(-1)                      # 2-dimensional -> 1-dimensional
logits = logits.view(-1, logits.shape[-1])    # 3-dimensional -> 2-dimensional
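Putting the shifting and the reshaping together, a loss computation might look roughly like the sketch below. The random tensors stand in for real model outputs and labels, and the sizes are toy values. The sketch uses reshape rather than view because the sliced tensors are not contiguous in memory.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, N, V = 2, 4, 10                        # toy batch size, sequence length, vocabulary size
logits = torch.randn(B, N, V)             # stand-in for the model output
labels = torch.randint(0, V, (B, N))      # stand-in for the labels

shifted_logits = logits[:, :-1, :]        # (B, N-1, V): predictions made at positions 0..N-2
shifted_labels = labels[:, 1:]            # (B, N-1): the tokens observed at positions 1..N-1

loss_func = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_func(shifted_logits.reshape(-1, V), shifted_labels.reshape(-1))
print(loss.item())
```

The result is a single scalar: the mean cross-entropy over all positions whose label is not equal to ignore_index.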
Sanity check: carry out the following steps:
- Create an integer tensor of shape 1xN where N is the length of the sequence. It doesn’t matter what the integers are except that they should be less than the vocabulary size. (Alternatively, take one instance from your training set.)
- Apply the model to this input tensor. It shouldn’t crash here.
- Make sure that the shape of the returned output tensor is 1xNxV where V is the size of the vocabulary. This output corresponds to the logits of the next-token probability distribution, but it is useless at this point because we haven’t yet trained the model.
Part 4: Training the model
We will now put all the pieces together and implement the code to train the language model.
Similarly to Part 1, we will mimic the functionality of the HuggingFace Transformers library. The Trainer is the main utility the Transformers library provides to handle model training, and it provides a variety of complex functionality including multi-GPU training and many other bells and whistles. In our case, we will just implement a basic training loop.
🎓 Task 4.1: Implementing the trainer
Starting from the skeleton Python code, your task now is to complete the missing parts in the method train in the class A1Trainer.
The missing parts you need to provide are
- Setting up the optimizer, which is the PyTorch utility that updates model parameters during the training loop. The optimizer typically implements some variant of stochastic gradient descent. We recommend AdamW, which is used to train most LLMs.
- Setting up the DataLoaders for the training and validation sets. The datasets are provided as inputs, and you can simply create the DataLoaders as in Part 2.
- The training loop itself, which is where most of your work will be done.
Hyperparameters that control the training should be stored in a TrainingArguments object. HuggingFace defines a large number of such hyperparameters but you only need to consider a few of them. The skeleton code includes a hint that lists the relevant hyperparameters.
The training loop should look more or less like a regular PyTorch training loop (see the hint in the code).
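As a reminder of the general structure of such a loop, here is a minimal sketch on a toy regression problem. It is not the language model training loop itself: the data, the linear model, the learning rate, and the batch size are all assumptions chosen to keep the example short, but the sequence of steps inside the loop is the same.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(64, 3)
y = x @ torch.tensor([1.0, -2.0, 0.5])        # toy regression targets

model = nn.Linear(3, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)
loss_func = nn.MSELoss()

losses = []
for epoch in range(20):
    model.train()
    for i in range(0, len(x), 16):            # mini-batches of 16 instances
        xb, yb = x[i:i+16], y[i:i+16]
        optimizer.zero_grad()                 # reset gradients from the previous step
        loss = loss_func(model(xb).squeeze(-1), yb)   # forward pass and loss
        loss.backward()                       # backward pass
        optimizer.step()                      # parameter update
    losses.append(loss.item())                # loss of the last batch in this epoch

print(losses[0], losses[-1])
```

In the language model setting, the batches would come from a DataLoader and the loss would be returned by the model when it is called with labels.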
Hint: masking out the padding tokens.
When the loss is computed, we don’t want to include the positions where we have inserted the dummy padding tokens. CrossEntropyLoss has a parameter ignore_index that you can set to the integer you use to represent the padding tokens. HuggingFace expects ignore_index to be the magic number of -100.
So when you call the model in the training loop, the labels tensor should be identical to the input_ids tensor, except that all occurrences of the padding token id should be replaced with -100.
If you used the DataCollatorForLanguageModeling in Part 2, this is done automatically.
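The replacement of padding ids by -100 can be sketched as follows. The padding id 0 and the example ids are assumptions taken from the tokenizer example output in Part 1.

```python
import torch

pad_id = 0                                    # assumed padding token id
input_ids = torch.tensor([[2, 35, 14, 11, 965, 6, 3],
                          [2, 153, 965, 6, 3, 0, 0]])

labels = input_ids.clone()                    # copy, so that input_ids stays unchanged
labels[labels == pad_id] = -100               # these positions are ignored by the loss
print(labels)
```

Cloning matters here: modifying input_ids in place would feed -100 to the embedding layer, which is not a valid index.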
While developing the code, we advise you to work with very small datasets until you know it doesn’t crash, and then use the full training set. Monitor the cross-entropy loss (and/or the perplexity) over the training: if the loss does not decrease while you are training, there is probably an error. For instance, if the learning rate is set to a value that is too large, the loss values may be unstable or increase.
If your solution is implemented correctly and you are using the full training set, training the model for one epoch with GPUs on Minerva should take a few minutes.
Part 5: Evaluation and analysis
Note: the skeleton implementation of train ends with the call self.model.save_pretrained. If you did not modify args.output_dir, then your trained model will be stored in the directory trainer_output. If you want to reuse a trained model without having to run the whole training loop again, then you can load it by calling A1RNNModel.from_pretrained('trainer_output'). In addition, you will probably want to load your saved tokenizer (A1Tokenizer.from_file('your_file_name')).
⚙ Task 5.1: Predicting the next word
Take some example text and use the model to predict the next word. For instance, if we apply the model to the text She lives in San, what word do you think will come next?
- Apply the model to the integer-encoded text. As usual, this gives you (the logits of) a probability distribution over your vocabulary. (Make sure that you consider the right position here: if your tokenized input includes an end-of-sentence dummy, you should take the logits at the second-to-last position.)
- Use argmax to find the index of the highest-scoring item, or topk to find the indices and scores of the k highest-scoring items.
- Apply the inverse vocabulary encoder (that you created in Part 1) so that you can understand what words the model thinks are the most likely in this context.
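The last two steps can be sketched as follows. The logits and the tiny inverse vocabulary below are made-up values for illustration, not outputs of a trained model; in your code, the logits would be the model output at the relevant position.

```python
import torch

# Toy inverse vocabulary and made-up logits over its five entries.
int_to_str = {0: 'PAD', 1: 'francisco', 2: 'diego', 3: 'the', 4: 'jose'}
logits = torch.tensor([-2.0, 3.1, 2.4, 0.3, 1.9])

probs = torch.softmax(logits, dim=-1)         # convert logits to probabilities
top = probs.topk(3)                           # values and indices of the 3 best items
predictions = [(int_to_str[i.item()], p.item()) for i, p in zip(top.indices, top.values)]
print(predictions)
print(int_to_str[logits.argmax().item()])     # the single most likely word
```

Applying softmax is optional if you only need the ranking, since it does not change the order of the scores.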
🎓 Task 5.2: Computing the perplexity
The most common way to evaluate language models quantitatively is the perplexity score on a test dataset. The better the model is at predicting the actually occurring words, the lower the perplexity. This quantity is formally defined as follows:
\[\text{perplexity} = 2^{-\frac{1}{m}\sum_{i=1}^m \log_2 P(w_i | c_i)}\]
In this formula, m is the number of words in the dataset, P is the probability assigned by our model, and \(w_i\) and \(c_i\) are the word and context window at each position.
Compute the perplexity of your model on the validation set. The exact value will depend on various implementation choices you have made, how much of the training data you have been able to use, etc. Roughly speaking, if you get perplexity scores around 700 or more, there are probably problems. Carefully implemented and well-trained models will probably have perplexity scores in the range of 200–300.
Hint: An easy way to compute the perplexity in PyTorch.
As you can see in the formula, the perplexity is an exponential function applied to the mean of the negative log probability of each token. You are probably already computing the cross-entropy loss as part of your training loop, and this actually computes what you need here.
The perplexity is traditionally defined in terms of logarithms of base 2. However, we will get the same result regardless of what logarithmic base we use. So it is OK to use the natural logarithms and exponential functions, as long as we are consistent: this means that we can compute the perplexity by applying exp to the mean of the cross-entropy loss over your batches in the validation set.
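The claim that the logarithmic base does not matter can be checked numerically. The probabilities in the sketch below are made-up values standing in for the probabilities a model assigns to four observed words.

```python
import math

# Made-up probabilities that a model assigns to four observed words.
probs = [0.1, 0.25, 0.5, 0.05]
m = len(probs)

# Definition with base-2 logarithms.
ppl_base2 = 2 ** (-sum(math.log2(p) for p in probs) / m)

# Same quantity via natural logarithms: exp of the mean cross-entropy.
mean_nll = -sum(math.log(p) for p in probs) / m
ppl_natural = math.exp(mean_nll)

print(ppl_base2, ppl_natural)
```

Both expressions equal the inverse of the geometric mean of the probabilities, which is why they agree.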
If you have time for exploration, investigate the effect of model hyperparameters and training settings on the model’s perplexity.
🎓 Task 5.3: Inspecting the learned word embeddings
It is common to say that neural networks are “black boxes” and that we cannot fully understand their internal mechanics, especially as they grow larger and structurally more complex. The research area of model interpretability aims to develop methods to help us reason about the high-level functions the models implement.
We will briefly investigate the embeddings that your model learned while you trained it. If we have successfully trained a word embedding model, an embedding vector stores a crude representation of “word meaning”, so we can reason about the learned meaning representations by investigating the geometry of the vector space of word embeddings. The most common way to do this is to look at nearest neighbors in the vector space: intuitively, if we look at some example word, its neighbors should correspond to words that have a similar meaning.
Select some example words (e.g. "sweden") and look at their nearest neighbors in the vector space of word embeddings. Does it seem that the nearest neighbors make sense?
Hint: Example code for computing nearest neighbors.
In the code below, emb is the nn.Embedding module of your language model, while voc and inv_voc are the string-to-integer and integer-to-string mappings you created in Part 1.
from torch import nn

def nearest_neighbors(emb, voc, inv_voc, word, n_neighbors=5):
    # Look up the embedding for the test word.
    test_emb = emb.weight[voc[word]]
    # We'll use a cosine similarity function to find the most similar words.
    sim_func = nn.CosineSimilarity(dim=1)
    cosine_scores = sim_func(test_emb, emb.weight)
    # Find the positions of the highest cosine values.
    near_nbr = cosine_scores.topk(n_neighbors+1)
    topk_cos = near_nbr.values[1:]
    topk_indices = near_nbr.indices[1:]
    # NB: the first word in the top-k list is the query word itself!
    # That's why we skip the first position in the code above.
    # Finally, map word indices back to strings, and put the result in a list.
    return [ (inv_voc[ix.item()], cos.item()) for ix, cos in zip(topk_indices, topk_cos) ]
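Optionally, you can visualize the embeddings of some example words in a two-dimensional scatterplot. (If you are working on a machine without a display, you can save the figure to a file with plt.savefig.)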
Hint: Example code for PCA-based embedding scatterplot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

def plot_embeddings_pca(emb, voc, words):
    # Look up the embedding of each word (voc is the string-to-integer mapping).
    vectors = np.vstack([emb.weight[voc[w]].cpu().detach().numpy() for w in words])
    # Center the vectors: SVD applied to centered data corresponds to PCA.
    vectors -= vectors.mean(axis=0)
    twodim = TruncatedSVD(n_components=2).fit_transform(vectors)
    plt.figure(figsize=(5,5))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.02, y, word)
    plt.axis('off')

# Here, emb is the nn.Embedding module of your model and voc the string-to-integer mapping.
plot_embeddings_pca(emb, voc, ['sweden', 'denmark', 'europe', 'africa', 'london', 'stockholm', 'large', 'small', 'great', 'black', '3', '7', '10', 'seven', 'three', 'ten', '1984', '2005', '2010'])