Text Preprocessing Example

Author: Marcel Bollmann
Published: November 5, 2024

For this example, let’s take a page from the Dwarf Fortress wiki as a source of our text. We use trafilatura to quickly download this wiki page and extract its text:

from trafilatura import fetch_url, extract
downloaded = fetch_url("http://dwarffortresswiki.org/index.php/Dwarf")
text = extract(downloaded, include_tables=False)  # extract main text, skipping tables
print(text[:990])
- A short, sturdy creature fond of drink and industry.
This is a masterfully-designed engraving of a Dwarf and a battle axe.
Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode. They are well known for their stout physique and prominent beards (on the males), which begin to grow from birth; dwarves are stronger, shorter, stockier, and hairier than the average human, have a heightened sense of their surroundings and possess perfect darkvision. Dwarves live both in elaborate underground fortresses carved from the mountainside and above-ground hillocks, are naturally gifted miners, metalsmiths, and stone crafters, and value the acquisition of wealth and rare metals above all else.
Dwarven civilizations typically form (mostly) peaceful, trade-based relationships with humans and elves, but are bitter enemies with goblins, and consider kobolds a petty annoyance. 

Let’s take one example sentence from the Dwarf Fortress wiki:

sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'

Let’s assume that we’re trying to retrieve this sentence using the search term “dwarf”.
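
Note that naive exact matching fails here: none of the whitespace-separated tokens equals the query. A quick sanity check, reusing the sentence variable from above:

print("dwarf" in sentence.split())
False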

Tokenization

“Whitespace tokenization” just splits the input on whitespace:

for token in sentence.split():
    print(token)
Dwarves
(singular,
Dwarf)
are
"intelligent",
alcohol-dependent,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode,
as
well
as
being
playable
in
adventurer
mode.
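
Notice how punctuation stays attached to the neighboring words (“(singular,”, “Dwarf)”, “mode.”). A slightly better rule-based approach separates word characters from punctuation with a regular expression; the pattern below is just an ad-hoc sketch, not a standard tokenizer:

import re
# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
print(re.findall(r"\w+|[^\w\s]", "Dwarves (singular, Dwarf) are"))
['Dwarves', '(', 'singular', ',', 'Dwarf', ')', 'are']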

Let’s see how spaCy performs tokenization:

import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(sentence):
    print(token)
Dwarves
(
singular
,
Dwarf
)
are
"
intelligent
"
,
alcohol
-
dependent
,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode
,
as
well
as
being
playable
in
adventurer
mode
.

Stop words

A stop word is a frequent word that does not contribute (much) to a given task. Function words such as “a”, “the”, or “of” are often considered stop words.
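
spaCy comes with its own stop word list, which we can inspect directly (the exact size and contents depend on the spaCy version and language model):

# the built-in English stop word list is a plain set of strings
print(len(nlp.Defaults.stop_words))
print(sorted(nlp.Defaults.stop_words)[:10])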

Now let’s check which words in our example sentence spaCy considers to be “stop words”:

import pandas as pd

data = [(token, token.is_stop) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?"])

Token Is stop word?
0 Dwarves False
1 ( False
2 singular False
3 , False
4 Dwarf False
5 ) False
6 are True
7 " False
8 intelligent False
9 " False
10 , False
11 alcohol False
12 - False
13 dependent False
14 , False
15 humanoid False
16 creatures False
17 that True
18 are True
19 the True
20 featured False
21 race False
22 of True
23 fortress False
24 mode False
25 , False
26 as True
27 well True
28 as True
29 being True
30 playable False
31 in True
32 adventurer False
33 mode False
34 . False

Do you see a “stop word” here that might be undesirable to remove? Consider the word “well” in a different context:

data = [(token, token.is_stop) for token in nlp("The well of ascension")]
pd.DataFrame(data, columns=["Token", "Is stop word?"])

Token Is stop word?
0 The True
1 well True
2 of True
3 ascension False

In a title like “The well of ascension”, “well” is a content-bearing noun rather than the adverb in “as well as”, yet it is still flagged as a stop word; removing stop words here would leave only “ascension”.

Punctuation

Punctuation marks are not considered “stop words”, but there’s a separate attribute, is_punct, to check for them:

data = [(token, token.is_stop, token.is_punct) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?", "Is punctuation?"])

Token Is stop word? Is punctuation?
0 Dwarves False False
1 ( False True
2 singular False False
3 , False True
4 Dwarf False False
5 ) False True
6 are True False
7 " False True
8 intelligent False False
9 " False True
10 , False True
11 alcohol False False
12 - False True
13 dependent False False
14 , False True
15 humanoid False False
16 creatures False False
17 that True False
18 are True False
19 the True False
20 featured False False
21 race False False
22 of True False
23 fortress False False
24 mode False False
25 , False True
26 as True False
27 well True False
28 as True False
29 being True False
30 playable False False
31 in True False
32 adventurer False False
33 mode False False
34 . False True

Let’s see how our text looks after we remove stop words and punctuation:

tokens = [token for token in nlp(sentence) if not (token.is_stop or token.is_punct)]
tokens

[
    Dwarves,
    singular,
    Dwarf,
    intelligent,
    alcohol,
    dependent,
    humanoid,
    creatures,
    featured,
    race,
    fortress,
    mode,
    playable,
    adventurer,
    mode
]

Lemmatization

  • A lexeme is a set of word forms sharing the “same fundamental meaning”.
    • for example: Dwarf and Dwarves ↔ lexeme Dwarf
  • A lemma is a word form representing a given lexeme; it is sometimes called the “dictionary form”, because it’s what you would look up in a dictionary or lexicon.
    • for example: dwarf
  • Mapping all word forms to their lemmas (or: lemmata) is called lemmatization; at its simplest, this is a lookup from word form to lemma, as sketched below.
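
As a minimal sketch (a hand-written toy lookup for illustration, not how spaCy implements it):

# toy lemma lookup with a few hand-picked entries
lemma_of = {"dwarves": "dwarf", "are": "be", "creatures": "creature"}
print([lemma_of.get(word.lower(), word) for word in ["Dwarves", "are", "creatures"]])
['dwarf', 'be', 'creature']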

Tokens in spaCy have a lemma_ attribute that tells us their lemma:

lemmas = [token.lemma_ for token in tokens]
lemmas

[
    'dwarf',
    'singular',
    'Dwarf',
    'intelligent',
    'alcohol',
    'dependent',
    'humanoid',
    'creature',
    'feature',
    'race',
    'fortress',
    'mode',
    'playable',
    'adventurer',
    'mode'
]

Note that “Dwarves” was mapped to the lowercase lemma “dwarf”, while “Dwarf” (presumably analyzed as a proper noun) kept its capitalization, so the two forms still don’t match exactly.

Lowercasing

For some applications, the distinction between Dwarf and dwarf doesn’t matter (much). In those cases, we might consider lowercasing the entire input. This may make it easier for some models to learn patterns from the data.

[lemma.lower() for lemma in lemmas]

[
    'dwarf',
    'singular',
    'dwarf',
    'intelligent',
    'alcohol',
    'dependent',
    'humanoid',
    'creature',
    'feature',
    'race',
    'fortress',
    'mode',
    'playable',
    'adventurer',
    'mode'
]
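
With all steps combined, our search scenario from the beginning now works out; reusing the lemmas list from above:

processed = [lemma.lower() for lemma in lemmas]
print("dwarf" in processed)
True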

Don’t apply these steps blindly!

You need to consider both the task that you’re trying to solve and the model that you’re working with.

  • Tokenization is always performed, but some models (such as LLMs) perform their own tokenization internally, and therefore expect their input to be untokenized (see the sketch after this list).
  • Other preprocessing steps always lose information, so you need to weigh up the benefits and downsides.
    • For example, lowercasing everything loses the distinction between “Apple” (the company) and “apple” (the fruit).
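
To illustrate the first point: LLMs typically rely on subword tokenization. A minimal sketch using Hugging Face’s transformers library (an extra dependency not used elsewhere in this example; “gpt2” is just a convenient small model):

from transformers import AutoTokenizer

# The tokenizer expects raw, untokenized text and splits rare words
# like "Dwarves" into subword pieces itself.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Dwarves are alcohol-dependent creatures."))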