Text Preprocessing Example

Author: Marcel Bollmann
Published: November 5, 2024

For this example, let’s take a page from the Dwarf Fortress wiki as a source of our text. We use trafilatura to quickly download this wiki page and extract its text:

from trafilatura import fetch_url, extract
downloaded = fetch_url("http://dwarffortresswiki.org/index.php/Dwarf")
text = extract(downloaded, include_tables=False)  # extract main text, skipping tables
print(text[:990])
- A short, sturdy creature fond of drink and industry.
This is a masterfully-designed engraving of a Dwarf and a battle axe.
Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode. They are well known for their stout physique and prominent beards (on the males), which begin to grow from birth; dwarves are stronger, shorter, stockier, and hairier than the average human, have a heightened sense of their surroundings and possess perfect darkvision. Dwarves live both in elaborate underground fortresses carved from the mountainside and above-ground hillocks, are naturally gifted miners, metalsmiths, and stone crafters, and value the acquisition of wealth and rare metals above all else.
Dwarven civilizations typically form (mostly) peaceful, trade-based relationships with humans and elves, but are bitter enemies with goblins, and consider kobolds a petty annoyance. 

Let’s take one example sentence from the Dwarf Fortress wiki:

sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'

Let’s assume that we’re trying to retrieve this sentence using the search term “dwarf”.
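
Note that naive exact matching fails here: none of the whitespace-separated tokens equals the query. A quick sanity check, reusing the sentence variable from above:

print("dwarf" in sentence.split())
False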

Tokenization

“Whitespace tokenization” just splits the input on whitespace:

for token in sentence.split():
    print(token)
Dwarves
(singular,
Dwarf)
are
"intelligent",
alcohol-dependent,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode,
as
well
as
being
playable
in
adventurer
mode.
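
Notice how punctuation stays attached to the neighboring words (“(singular,”, “Dwarf)”, “mode.”). A slightly better rule-based approach separates word characters from punctuation with a regular expression; the pattern below is just an ad-hoc sketch, not a standard tokenizer:

import re
# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
print(re.findall(r"\w+|[^\w\s]", "Dwarves (singular, Dwarf) are"))
['Dwarves', '(', 'singular', ',', 'Dwarf', ')', 'are']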

Let’s see how spaCy performs tokenization:

import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(sentence):
    print(token)
Dwarves
(
singular
,
Dwarf
)
are
"
intelligent
"
,
alcohol
-
dependent
,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode
,
as
well
as
being
playable
in
adventurer
mode
.

Stop words

A stop word is a frequent word that does not contribute (much) to a given task. Function words such as “a”, “the”, or “of” are often considered stop words.
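
spaCy comes with its own stop word list, which we can inspect directly (the exact size and contents depend on the spaCy version and language model):

# the built-in English stop word list is a plain set of strings
print(len(nlp.Defaults.stop_words))
print(sorted(nlp.Defaults.stop_words)[:10])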

Now let’s check which words in our example sentence spaCy considers to be “stop words”:

import pandas as pd

data = [(token, token.is_stop) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?"])

Token Is stop word?
0 Dwarves False
1 ( False
2 singular False
3 , False
4 Dwarf False
5 ) False
6 are True
7 " False
8 intelligent False
9 " False
10 , False
11 alcohol False
12 - False
13 dependent False
14 , False
15 humanoid False
16 creatures False
17 that True
18 are True
19 the True
20 featured False
21 race False
22 of True
23 fortress False
24 mode False
25 , False
26 as True
27 well True
28 as True
29 being True
30 playable False
31 in True
32 adventurer False
33 mode False
34 . False

Do you see a “stop word” here that might be undesirable to remove? Consider the word “well” in a different context:

data = [(token, token.is_stop) for token in nlp("The well of ascension")]
pd.DataFrame(data, columns=["Token", "Is stop word?"])

Token Is stop word?
0 The True
1 well True
2 of True
3 ascension False

In a title like “The well of ascension”, “well” is a content-bearing noun rather than the adverb in “as well as”, yet it is still flagged as a stop word; removing stop words here would leave only “ascension”.

Punctuation

Punctuation marks are not considered “stop words”, but there’s a separate attribute, is_punct, to check for them:

data = [(token, token.is_stop, token.is_punct) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?", "Is punctuation?"])

Token Is stop word? Is punctuation?
0 Dwarves False False
1 ( False True
2 singular False False
3 , False True
4 Dwarf False False
5 ) False True
6 are True False
7 " False True
8 intelligent False False
9 " False True
10 , False True
11 alcohol False False
12 - False True
13 dependent False False
14 , False True
15 humanoid False False
16 creatures False False
17 that True False
18 are True False
19 the True False
20 featured False False
21 race False False
22 of True False
23 fortress False False
24 mode False False
25 , False True
26 as True False
27 well True False
28 as True False
29 being True False
30 playable False False
31 in True False
32 adventurer False False
33 mode False False
34 . False True

Let’s see how our text looks after we remove stop words and punctuation:

tokens = [token for token in nlp(sentence) if not (token.is_stop or token.is_punct)]
tokens

[
    Dwarves,
    singular,
    Dwarf,
    intelligent,
    alcohol,
    dependent,
    humanoid,
    creatures,
    featured,
    race,
    fortress,
    mode,
    playable,
    adventurer,
    mode
]

Lemmatization

  • A lexeme is a set of word forms sharing the “same fundamental meaning”.
    • for example: Dwarf and Dwarves ↔ lexeme Dwarf
  • A lemma is a word form representing a given lexeme; it is sometimes called the “dictionary form”, because it’s what you would look up in a dictionary or lexicon.
    • for example: dwarf
  • Mapping all word forms to their lemmas (or: lemmata) is called lemmatization; at its simplest, this is a lookup from word form to lemma, as sketched below.
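
As a minimal sketch (a hand-written toy lookup for illustration, not how spaCy implements it):

# toy lemma lookup with a few hand-picked entries
lemma_of = {"dwarves": "dwarf", "are": "be", "creatures": "creature"}
print([lemma_of.get(word.lower(), word) for word in ["Dwarves", "are", "creatures"]])
['dwarf', 'be', 'creature']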

Tokens in spaCy have a lemma_ attribute that tells us their lemma:

lemmas = [token.lemma_ for token in tokens]
lemmas

[
    'dwarf',
    'singular',
    'Dwarf',
    'intelligent',
    'alcohol',
    'dependent',
    'humanoid',
    'creature',
    'feature',
    'race',
    'fortress',
    'mode',
    'playable',
    'adventurer',
    'mode'
]

Note that “Dwarves” was mapped to the lowercase lemma “dwarf”, while “Dwarf” (presumably analyzed as a proper noun) kept its capitalization, so the two forms still don’t match exactly.

Lowercasing

For some applications, the distinction between Dwarf and dwarf doesn’t matter (much). In those cases, we might consider lowercasing the entire input. This may make it easier for some models to learn patterns from the data.

[lemma.lower() for lemma in lemmas]

[
    'dwarf',
    'singular',
    'dwarf',
    'intelligent',
    'alcohol',
    'dependent',
    'humanoid',
    'creature',
    'feature',
    'race',
    'fortress',
    'mode',
    'playable',
    'adventurer',
    'mode'
]
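
With all steps combined, our search scenario from the beginning now works out; reusing the lemmas list from above:

processed = [lemma.lower() for lemma in lemmas]
print("dwarf" in processed)
True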

Don’t apply these steps blindly!

You need to consider both the task that you’re trying to solve and the model that you’re working with.

  • Tokenization is always performed, but some models (such as LLMs) perform their own tokenization internally, and therefore expect their input to be untokenized (see the sketch after this list).
  • Other preprocessing steps always lose information, so you need to weigh up the benefits and downsides.
    • For example, lowercasing everything loses the distinction between “Apple” (the company) and “apple” (the fruit).
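
To illustrate the first point: LLMs typically rely on subword tokenization. A minimal sketch using Hugging Face’s transformers library (an extra dependency not used elsewhere in this example; “gpt2” is just a convenient small model):

from transformers import AutoTokenizer

# The tokenizer expects raw, untokenized text and splits rare words
# like "Dwarves" into subword pieces itself.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Dwarves are alcohol-dependent creatures."))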