Removing NLTK Stopwords with Python

Chris Latimer

Stopwords are often filtered out in natural language processing (NLP) to improve text analysis and computational efficiency. By foregrounding the more significant content words, stopword removal can increase the accuracy and relevance of NLP tasks. This post provides a solid overview of stopwords and how to remove them in Python.

To work with stopwords in Python, you first need to import the NLTK library with import nltk.
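If NLTK is not installed yet, a minimal setup (assuming pip is available) looks like this:

# Install the library first if needed:
#   pip install nltk
import nltk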

Certain words, such as "the," "and," and "is," are generally considered to carry little information on their own. While the exact list of stopwords varies, the goal of removing words that contribute little to the meaning of the text is to speed up processing and sharpen the analysis.

What Are Stop Words in Natural Language Processing?

Stop words are terms that search engines typically ignore when building an index and skip over during retrieval. What exactly are stop words? Common examples include:

  • “the”
  • “a”
  • “an”
  • “in”

These terms are often excluded from databases and indexes simply because storing and processing them wastes time. You can also build your own stop word list and include terms beyond the ones listed above.

Rather than letting such words occupy space in a database or consume unnecessary processing time, you simply filter them out against whatever list of words you deem to be stop words. NLTK (the Natural Language Toolkit) ships predefined stopword lists for a number of languages, which can be used to remove stopwords efficiently during text preprocessing.
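To see which languages your installation covers, you can list the files in the stopwords corpus (output abbreviated here; the exact set depends on the NLTK version):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Each file in the stopwords corpus corresponds to one language
print(stopwords.fileids())
# e.g. ['arabic', 'azerbaijani', ..., 'english', ..., 'turkish']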

The Need to Remove Stopwords in Python

That said, there is no one-size-fits-all approach to eliminating stop words; whether to remove them depends on the nature of the task. For tasks such as text classification or sentiment analysis, removing stop words is often critical, because it shifts the model's attention to the words that carry the most weight in the text.

The aim is to focus attention on the words that actually capture the substance of the writing. Words like "there," "book," and "table" contribute significantly to the meaning of a text, in contrast to less informative terms like "is" and "on."

There is a caveat, however. Removing stopwords may be counterproductive for tasks such as machine translation or text summarization, which need those words to preserve the content's original meaning and the context the user may be looking for.
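The same caution applies to negation in sentiment analysis: words such as "not" appear on most stopword lists, yet removing them can flip a sentence's meaning. A common compromise, sketched below, is to subtract the negation words from the stopword set before filtering:

from nltk.corpus import stopwords

# Keep negation words so that "not good" is not reduced to "good"
stop_words = set(stopwords.words('english')) - {'not', 'no', 'nor'}

tokens = ['The', 'movie', 'was', 'not', 'good']
print([t for t in tokens if t.lower() not in stop_words])
# ['movie', 'not', 'good']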

Types of Stopwords

Stopwords are words that occur frequently in a language but are usually excluded from natural language processing (NLP) tasks because they contribute little to the meaning of a text. The specific stopword list varies with the context and language under study. The main categories of stopwords are listed below:

Common Stopwords: These are the most frequent words in a language and are usually eliminated during text preprocessing. Examples include "the," "is," "in," "for," "where," "when," "to," and "at."

Custom Stopwords: Depending on the particular task or domain, additional words may be treated as stopwords. These are often domain-specific terms that add little to the overall meaning: in a medical setting, for instance, words like "patient" or "treatment" might be filtered out, just as "food" and related words might be in some domains (see the sketch after this list).

Numerical Stopwords: In some situations, numbers and numeric characters may be regarded as stopwords, particularly if the analysis is centered on the text’s meaning rather than on particular numerical values.

Single-Character Stopwords: Characters like "a," "I," "s," or "x" may be treated as stopwords when they carry little meaning on their own.

Contextual Stopwords: Words that are meaningful in one context can be stopwords in another. In general language processing, for example, "will" may be treated as a stopword, yet it is crucial in text that makes predictions about the future.
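A minimal sketch combining several of these categories, using "patient" and "treatment" as hypothetical custom stopwords for a medical corpus:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Hypothetical domain-specific terms treated as custom stopwords
custom_stopwords = {'patient', 'treatment'}
stop_words = set(stopwords.words('english')) | custom_stopwords

tokens = ['The', 'patient', 'responded', 'to', 'treatment', 'in', '3', 'days', 'x']

# Drop common/custom stopwords, numeric tokens, and single characters
filtered = [
    t for t in tokens
    if t.lower() not in stop_words and not t.isdigit() and len(t) > 1
]

print(filtered)  # ['responded', 'days']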

Checking the English Stopwords List with nltk.corpus

An English stopwords list contains common words with minimal semantic content that are frequently left out of text analysis, such as "the," "and," "is," "in," "for," and "it." When preparing text data for NLP tasks like text classification or sentiment analysis, these stopwords are often removed so the analysis can concentrate on more significant terms.

The NLTK stop words list is crucial in text preprocessing as it helps in filtering out common words, allowing the focus to be on more meaningful terms that contribute to the document’s central theme.

To check the list of stopwords, you can type the following commands in the Python shell.

import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (only needed once)
nltk.download('stopwords')

# Print the predefined English stopword list
print(stopwords.words('english'))

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", 
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 
'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 
'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', 
"couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', 
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', 
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] 

Note: You can add new words to the english file in the stopwords directory of your NLTK data (e.g. nltk_data/corpora/stopwords/english) to change the list, although extending the list in code is usually the safer option.

Removing Stop Words with NLTK

The following program removes stop words from a piece of text by first tokenizing it into words:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required NLTK data (only needed once)
nltk.download('stopwords')
nltk.download('punkt')

example_sent = """This is a sample sentence,
showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Lowercase each token before checking it against stop_words,
# so capitalized words such as "This" are filtered out too
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

# The same filter without lowercase conversion keeps "This",
# because only the lowercase form "this" is in the stopword list
filtered_case_sensitive = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
print(filtered_case_sensitive)

Output:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

This code demonstrates stopword removal with the Natural Language Toolkit (NLTK). The sample sentence "This is a sample sentence, showing off the stop words filtration" is first split into tokens with word_tokenize. Each token is then checked against the set of English stopwords from NLTK, once with lowercase conversion and once without, and both filtered lists are printed after the original tokens. Note that the case-sensitive variant keeps "This," because only the lowercase form "this" appears in the stopword list.

Removing Stop Words with spaCy

import spacy

# Load the spaCy English model
# (download it first if needed: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "There is a pen on the table"

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)

Output:

Original Text: There is a pen on the table
Text after Stopword Removal: pen table

The provided Python code eliminates stopwords from an example text using the spaCy library. After loading the spaCy English model, it processes the sample text "There is a pen on the table." Every spaCy token carries an is_stop flag, so filtering the processed tokens on that flag removes the stopwords, and the remaining tokens are joined to produce a clean copy of the text.
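spaCy's stopword set can also be extended at runtime. A minimal sketch, using "pen" purely as an example of a word you might want to treat as a stopword:

import spacy

nlp = spacy.load("en_core_web_sm")

# Register an extra stopword and flag the lexeme accordingly
nlp.Defaults.stop_words.add("pen")
nlp.vocab["pen"].is_stop = True

doc = nlp("There is a pen on the table")
print([token.text for token in doc if not token.is_stop])
# ['table']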

Removing Stop Words with Gensim

from gensim.parsing.preprocessing import remove_stopwords

# Another sample text
new_text = "The majestic mountains provide a breathtaking view."

# Remove stopwords using Gensim
new_filtered_text = remove_stopwords(new_text)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_filtered_text)

Output:

Original Text: The majestic mountains provide a breathtaking view.
Text after Stopword Removal: The majestic mountains provide breathtaking view.

The provided Python code preprocesses a sample text with Gensim's remove_stopwords function. The original text in this instance is "The majestic mountains provide a breathtaking view." The remove_stopwords function strips common English stopwords and prints the filtered version next to the original. Notice that the capitalized "The" survives: the function is case-sensitive and only matches the lowercase forms in Gensim's stopword list.
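The stopword list Gensim uses is exposed as a frozen set, which you can inspect directly. A quick check (the exact size depends on the Gensim version):

from gensim.parsing.preprocessing import STOPWORDS

# STOPWORDS is a frozenset of lowercase English words
print('the' in STOPWORDS)   # True
print('The' in STOPWORDS)   # False: membership checks are case-sensitive
print(len(STOPWORDS))       # a few hundred words, depending on the version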

Removing Stop Words with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Another sample text
new_text = "The quick brown fox jumps over the lazy dog."

# Tokenize the new text using NLTK
new_words = word_tokenize(new_text)

# Remove stopwords using NLTK
new_filtered_words = [
    word for word in new_words if word.lower() not in stopwords.words('english')]

# Join the filtered words to form a clean text
new_clean_text = ' '.join(new_filtered_words)

# The cleaned text can now be handed to scikit-learn, for example to
# build a bag-of-words representation with CountVectorizer
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform([new_clean_text])

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_clean_text)

Output:

Original Text: The quick brown fox jumps over the lazy dog.
Text after Stopword Removal: quick brown fox jumps lazy dog .

The provided Python code combines NLTK and scikit-learn for text processing and stopword removal. The sample text "The quick brown fox jumps over the lazy dog" is first tokenized into words with NLTK's word_tokenize function. Common English stop words are then eliminated by checking each token, lowercased, against the NLTK stopword list.

The final step joins the non-stopword tokens into a clean version of the text, which can then be passed to scikit-learn's CountVectorizer for further analysis, such as building a bag-of-words representation.
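scikit-learn can also drop stop words on its own, without NLTK, through the stop_words parameter of CountVectorizer. A minimal sketch of that approach (note that scikit-learn's built-in English list differs slightly from NLTK's, so the results may not match the NLTK-filtered text exactly):

from sklearn.feature_extraction.text import CountVectorizer

text = "The quick brown fox jumps over the lazy dog."

# 'english' selects scikit-learn's built-in English stop word list
vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform([text])

# The learned vocabulary keeps only the non-stopword terms
print(vectorizer.get_feature_names_out())
# e.g. ['brown' 'dog' 'fox' 'jumps' 'lazy' 'quick']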