Text Preprocessing Example

For this example, let’s take a page from the Dwarf Fortress wiki as a source of our text. We use trafilatura to quickly download the page and extract its main text:

from trafilatura import fetch_url, extract

downloaded = fetch_url("http://dwarffortresswiki.org/index.php/Dwarf")
text = extract(downloaded, include_tables=False)

print(text[:990])
- A short, sturdy creature fond of drink and industry.
This is a masterfully-designed engraving of a Dwarf and a battle axe.
Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode. They are well known for their stout physique and prominent beards (on the males), which begin to grow from birth; dwarves are stronger, shorter, stockier, and hairier than the average human, have a heightened sense of their surroundings and possess perfect darkvision. Dwarves live both in elaborate underground fortresses carved from the mountainside and above-ground hillocks, are naturally gifted miners, metalsmiths, and stone crafters, and value the acquisition of wealth and rare metals above all else.
Dwarven civilizations typically form (mostly) peaceful, trade-based relationships with humans and elves, but are bitter enemies with goblins, and consider kobolds a petty annoyance.
Let’s take one example sentence from the Dwarf Fortress wiki:
sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'
Let’s assume that we’re trying to retrieve this sentence using the search term “dwarf”.
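As a quick illustration of why preprocessing matters for this, note that a plain substring search fails on the raw sentence:

# The raw sentence only contains "Dwarves" and "Dwarf", so a plain
# substring search for the lowercase query misses them:
print("dwarf" in sentence)          # False

# Lowercasing the sentence lets "Dwarf" match...
print("dwarf" in sentence.lower())  # True

# ...but the plural form alone still would not: "dwarves" does not
# contain the substring "dwarf".
print("dwarf" in "dwarves")         # False

The steps below (tokenization, stop word removal, lemmatization, lowercasing) are what eventually let the query “dwarf” match surface forms such as “Dwarves”.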
Tokenization
“Whitespace tokenization” just splits the input on whitespace:
for token in sentence.split():
    print(token)
Dwarves
(singular,
Dwarf)
are
"intelligent",
alcohol-dependent,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode,
as
well
as
being
playable
in
adventurer
mode.
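Note how punctuation stays glued to the neighboring words: “(singular,”, “Dwarf)”, “mode.”. A quick check shows that even a case-insensitive token-level search for our term fails on these tokens:

# "Dwarf)" keeps its closing parenthesis under whitespace tokenization,
# so an exact (case-insensitive) token match for "dwarf" fails:
print("dwarf" in [t.lower() for t in sentence.split()])  # False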
Let’s see how spaCy performs tokenization:
import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(sentence):
    print(token)
Dwarves
(
singular
,
Dwarf
)
are
"
intelligent
"
,
alcohol
-
dependent
,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode
,
as
well
as
being
playable
in
adventurer
mode
.
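With punctuation split off into separate tokens, the same case-insensitive token match now succeeds:

# spaCy separates "Dwarf" from the surrounding parenthesis and comma, so
# an exact (case-insensitive) token match for "dwarf" works:
print("dwarf" in [token.text.lower() for token in nlp(sentence)])  # True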
Stop words
A stop word is a frequent word that does not contribute (much) to a given task. Often, function words such as “a”, “the”, “of”, etc. are considered stop words.
spaCy comes with its own stop word list, exposed as the set nlp.Defaults.stop_words.
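As a quick peek at it (a sketch; the exact size and contents vary by spaCy version):

# The built-in English list contains a few hundred entries
# (326 in recent spaCy versions):
print(len(nlp.Defaults.stop_words))
print(sorted(nlp.Defaults.stop_words)[:5])

Let’s check which words of our example sentence spaCy considers to be “stop words”: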
import pandas as pd
data = [(token, token.is_stop) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?"])
 | Token | Is stop word? |
---|---|---|
0 | Dwarves | False |
1 | ( | False |
2 | singular | False |
3 | , | False |
4 | Dwarf | False |
5 | ) | False |
6 | are | True |
7 | " | False |
8 | intelligent | False |
9 | " | False |
10 | , | False |
11 | alcohol | False |
12 | - | False |
13 | dependent | False |
14 | , | False |
15 | humanoid | False |
16 | creatures | False |
17 | that | True |
18 | are | True |
19 | the | True |
20 | featured | False |
21 | race | False |
22 | of | True |
23 | fortress | False |
24 | mode | False |
25 | , | False |
26 | as | True |
27 | well | True |
28 | as | True |
29 | being | True |
30 | playable | False |
31 | in | True |
32 | adventurer | False |
33 | mode | False |
34 | . | False |
Do you see a “stop word” here that might be undesirable to remove?
data = [(token, token.is_stop) for token in nlp("The well of ascension")]
pd.DataFrame(data, columns=["Token", "Is stop word?"])
 | Token | Is stop word? |
---|---|---|
0 | The | True |
1 | well | True |
2 | of | True |
3 | ascension | False |

In “The well of ascension” (a book title), “well” is a content-bearing noun, yet spaCy still flags it as a stop word; removing stop words blindly would mangle the title.
Punctuation
Punctuation marks are not considered “stop words”, but tokens have a separate attribute, is_punct, to check for them:
data = [(token, token.is_stop, token.is_punct) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?", "Is punctuation?"])
 | Token | Is stop word? | Is punctuation? |
---|---|---|---|
0 | Dwarves | False | False |
1 | ( | False | True |
2 | singular | False | False |
3 | , | False | True |
4 | Dwarf | False | False |
5 | ) | False | True |
6 | are | True | False |
7 | " | False | True |
8 | intelligent | False | False |
9 | " | False | True |
10 | , | False | True |
11 | alcohol | False | False |
12 | - | False | True |
13 | dependent | False | False |
14 | , | False | True |
15 | humanoid | False | False |
16 | creatures | False | False |
17 | that | True | False |
18 | are | True | False |
19 | the | True | False |
20 | featured | False | False |
21 | race | False | False |
22 | of | True | False |
23 | fortress | False | False |
24 | mode | False | False |
25 | , | False | True |
26 | as | True | False |
27 | well | True | False |
28 | as | True | False |
29 | being | True | False |
30 | playable | False | False |
31 | in | True | False |
32 | adventurer | False | False |
33 | mode | False | False |
34 | . | False | True |
Let’s see how our text looks after we remove stop words and punctuation:
tokens = [token for token in nlp(sentence) if not (token.is_stop or token.is_punct)]
tokens
[ Dwarves, singular, Dwarf, intelligent, alcohol, dependent, humanoid, creatures, featured, race, fortress, mode, playable, adventurer, mode ]
Lemmatization
- A lexeme is a set of word forms sharing the “same fundamental meaning”
- for example: Dwarf and Dwarves ↔ lexeme Dwarf
- A lemma is a word form chosen to represent a given lexeme; it is sometimes called the “dictionary form”, because it’s what you would look up in a dictionary or lexicon.
- for example: dwarf
- Mapping all word forms to their lemmas (or: lemmata) is called lemmatization.
Tokens in spaCy have an attribute that tells us their lemma:
lemmas = [token.lemma_ for token in tokens]
lemmas
[ 'dwarf', 'singular', 'Dwarf', 'intelligent', 'alcohol', 'dependent', 'humanoid', 'creature', 'feature', 'race', 'fortress', 'mode', 'playable', 'adventurer', 'mode' ]
Lowercasing
For some applications, the distinction between Dwarf and dwarf doesn’t matter (much). In those cases, we might consider lowercasing the entire input. This may make it easier for some models to learn patterns from the data.
[lemma.lower() for lemma in lemmas]
[ 'dwarf', 'singular', 'dwarf', 'intelligent', 'alcohol', 'dependent', 'humanoid', 'creature', 'feature', 'race', 'fortress', 'mode', 'playable', 'adventurer', 'mode' ]
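Coming back to our search scenario, the fully preprocessed tokens now match the query:

# After tokenization, stop word/punctuation removal, lemmatization, and
# lowercasing, both "Dwarves" and "Dwarf" map to the same form "dwarf":
processed = [lemma.lower() for lemma in lemmas]
print("dwarf" in processed)      # True
print(processed.count("dwarf"))  # 2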
Don’t apply these steps blindly!
You need to consider both the task that you’re trying to solve and the model that you’re working with.
- Tokenization is always performed at some point, but some models (such as LLMs) perform their own (subword) tokenization internally, and therefore expect raw, untokenized input.
- Other preprocessing steps always lose information, so you need to weigh up the benefits and downsides.
- For example, lowercasing everything loses the distinction between “Apple” (the company) and “apple” (the fruit), as the sketch below shows.
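A minimal sketch of that last point, reusing the same small spaCy model (exact predictions may vary across model versions):

# The named-entity recognizer typically tags "Apple" in its original
# casing as an organization...
doc = nlp("Apple is looking at buying a startup.")
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. [('Apple', 'ORG')]

# ...but after lowercasing, that signal is usually gone:
doc = nlp("apple is looking at buying a startup.")
print([(ent.text, ent.label_) for ent in doc.ents])  # often []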