# Text Preprocessing Example

Marcel Bollmann  
2024-11-05

For this example, let’s take a [page from the Dwarf Fortress
wiki](http://dwarffortresswiki.org/index.php/Dwarf) as the source of our
text. We use
[trafilatura](https://trafilatura.readthedocs.io/en/latest/) to quickly
download the page and extract its main text:

In [1]:
from trafilatura import fetch_url, extract

# Download the raw HTML, then extract the main text (tables excluded)
downloaded = fetch_url("http://dwarffortresswiki.org/index.php/Dwarf")
text = extract(downloaded, include_tables=False)

In [2]:
print(text[:990])

- A short, sturdy creature fond of drink and industry.
This is a masterfully-designed engraving of a Dwarf and a battle axe.
Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode. They are well known for their stout physique and prominent beards (on the males), which begin to grow from birth; dwarves are stronger, shorter, stockier, and hairier than the average human, have a heightened sense of their surroundings and possess perfect darkvision. Dwarves live both in elaborate underground fortresses carved from the mountainside and above-ground hillocks, are naturally gifted miners, metalsmiths, and stone crafters, and value the acquisition of wealth and rare metals above all else.
Dwarven civilizations typically form (mostly) peaceful, trade-based relationships with humans and elves, but are bitter enemies with goblins, and consider kobolds a petty annoyance. 

Let’s take one example sentence from the Dwarf Fortress wiki:

In [3]:
sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'

Let’s assume that we’re trying to *retrieve* this sentence using the
search term “dwarf”.

## Tokenization

“Whitespace tokenization” just splits the input on whitespace:

In [4]:
for token in sentence.split():
    print(token)

Dwarves
(singular,
Dwarf)
are
"intelligent",
alcohol-dependent,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode,
as
well
as
being
playable
in
adventurer
mode.
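Recall that the goal is to retrieve this sentence with the search term “dwarf”. The following is a minimal sketch of why whitespace tokens are a poor fit for that, assuming that “retrieval” means an exact match between the search term and a token. That matching rule and the variable names (`search_term`, `whitespace_tokens`, `near_misses`) are assumptions made for illustration only.

```python
sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'
search_term = "dwarf"

whitespace_tokens = sentence.split()

# Exact match against the raw whitespace tokens.
exact_match = search_term in whitespace_tokens

# Lowercasing alone does not help: "dwarf)" still carries the bracket.
lowercased_match = search_term in [token.lower() for token in whitespace_tokens]

# Tokens that at least contain the search term as a substring.
# Note that "dwarves" does not contain "dwarf", so only one token qualifies.
near_misses = [token for token in whitespace_tokens if search_term in token.lower()]

print(exact_match, lowercased_match, near_misses)
```

Neither check finds the search term, and the only near miss is a token with punctuation still attached, which is what motivates a finer-grained tokenizer.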

Let’s see how [spaCy](https://spacy.io/) performs tokenization:

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [6]:
for token in nlp(sentence):
    print(token)

Dwarves
(
singular
,
Dwarf
)
are
"
intelligent
"
,
alcohol
-
dependent
,
humanoid
creatures
that
are
the
featured
race
of
fortress
mode
,
as
well
as
being
playable
in
adventurer
mode
.
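To see what separating punctuation buys us, here is a rough approximation using a regular expression. This is not how spaCy tokenizes; it is only a sketch that happens to yield the same tokens as the spaCy output above on this particular sentence, and it would diverge on inputs such as contractions, URLs, or abbreviations. The name `regex_tokens` is introduced here for illustration.

```python
import re

sentence = 'Dwarves (singular, Dwarf) are "intelligent", alcohol-dependent, humanoid creatures that are the featured race of fortress mode, as well as being playable in adventurer mode.'

# \w+      matches a run of word characters (letters, digits, underscore)
# [^\w\s]  matches one character that is neither a word character nor
#          whitespace, i.e. a single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(sentence.split()), len(regex_tokens))

# With punctuation split off, a lowercased exact match finds "dwarf".
print("dwarf" in [token.lower() for token in regex_tokens])
```

Splitting off punctuation produces more tokens than whitespace splitting, but each one is now a clean word form or a single punctuation mark.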

## Stop words

A **stop word** is a frequent word that does not contribute (much) to a
given task. Often, function words such as “a”, “the”, “of”, etc. are
considered stop words.

spaCy comes with its own stop word list. Let’s check which words spaCy
considers to be “stop words”:

In [7]:
import pandas as pd

# Store token.text (a plain string) rather than the spaCy Token object
data = [(token.text, token.is_stop) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?"])

Do you see a “stop word” here that we might not want to remove?

In [8]:
data = [(token.text, token.is_stop) for token in nlp("The well of ascension")]
pd.DataFrame(data, columns=["Token", "Is stop word?"])
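The risk of list-based stop word removal can also be sketched without spaCy. The stop word list below is hand-picked and hypothetical (it is not spaCy’s list), and `remove_stop_words` is a helper name introduced for this example. Because the filter only looks at the word form, it cannot tell the function word in “as well as” apart from the noun in “The well of ascension”.

```python
# Hypothetical, hand-picked stop word list for illustration only.
TOY_STOP_WORDS = {"the", "of", "as", "well", "are", "that", "in", "being"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only tokens whose lowercased form is not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


phrase_result = remove_stop_words("as well as being playable".split())
title_result = remove_stop_words("The well of ascension".split())

print(phrase_result)
print(title_result)
```

In the second call, the content word “well” is dropped along with the function words, so a search for it would no longer find this title.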

## Punctuation

**Punctuation marks** are not considered “stop words”, but there’s a
separate property to check for them:

In [9]:
data = [(token.text, token.is_stop, token.is_punct) for token in nlp(sentence)]
pd.DataFrame(data, columns=["Token", "Is stop word?", "Is punctuation?"])

Let’s see how our text looks after we remove stop words and punctuation:

In [10]:
tokens = [token for token in nlp(sentence) if not (token.is_stop or token.is_punct)]
tokens

## Lemmatization

-   A **lexeme** is a set of word forms sharing the “same fundamental
    meaning”
    -   for example: *Dwarf* and *Dwarves* $\longleftrightarrow$ lexeme
        <span class="smallcaps">Dwarf</span>

-   A **lemma** is a word form representing a given lexeme; sometimes
    called “dictionary form”, because it’s what you would look up in a
    dictionary or lexicon.
    -   for example: *dwarf*

-   Mapping all word forms to their lemmas (*or:* lemmata) is called
    **lemmatization**.

Tokens in spaCy have an attribute that tells us their lemma:

In [11]:
lemmas = [token.lemma_ for token in tokens]
lemmas
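Conceptually, lemmatization is a mapping from word forms to lemmas. The sketch below shows that idea with a tiny lookup table; it is not how spaCy computes `lemma_`, and both the table `TOY_LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical names made up for this example. Real lemmatizers rely on much larger lexicons, rules, or statistical models.

```python
# Hypothetical lookup table mapping word forms to their lemmas.
TOY_LEMMA_TABLE = {
    "dwarves": "dwarf",
    "creatures": "creature",
    "are": "be",
}


def toy_lemmatize(word_form):
    """Return the lemma from the table, or the word form itself if unknown."""
    return TOY_LEMMA_TABLE.get(word_form, word_form)


word_forms = ["dwarves", "dwarf", "are", "creatures", "humanoid"]
toy_lemmas = [toy_lemmatize(word_form) for word_form in word_forms]

print(toy_lemmas)
```

After this mapping, “dwarves” and “dwarf” share one representation, so a search for either form can match both.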

## Lowercasing

For some applications, the distinction between *Dwarf* and *dwarf*
doesn’t matter (much). In those cases, we might consider **lowercasing**
the entire input. This may make it easier for some models to learn
patterns from the data.

In [12]:
[lemma.lower() for lemma in lemmas]

## Don’t apply these steps blindly!

You need to consider both the **task** that you’re trying to solve and
the **model** that you’re working with.

-   Tokenization is always performed, but some models (such as LLMs)
    perform their own tokenization internally, and therefore expect
    their input to be untokenized.
-   Other preprocessing steps always lose information, so you need to
    weigh the benefits against the downsides.
    -   For example, lowercasing everything loses the distinction
        between “Apple” *(the company)* and “apple” *(the fruit)*.
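The lowercasing trade-off can be made concrete with a very small vocabulary. This is only an illustrative sketch; the token list and variable names are made up for the example.

```python
tokens = ["Dwarf", "dwarf", "Apple", "apple"]

# Distinct entries when case is preserved.
cased_vocabulary = set(tokens)

# Distinct entries after lowercasing every token.
lowercased_vocabulary = {token.lower() for token in tokens}

print(sorted(cased_vocabulary))
print(sorted(lowercased_vocabulary))
```

Lowercasing halves the vocabulary here. Merging “Dwarf” and “dwarf” is probably what we want for retrieval, but the same step also merges “Apple” and “apple”, and that distinction cannot be recovered afterwards.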