How do you remove stopwords and punctuation in python?

How To Remove Stopwords From A String In Python With Code Examples

In this article, we’ll look at some examples of how to remove stopwords from a string in Python.

from gensim.parsing.preprocessing import remove_stopwords

text = "Nick likes to play football, however he is not too fond of tennis."

# remove_stopwords filters the string against gensim's built-in English stopword list
filtered_sentence = remove_stopwords(text)

print(filtered_sentence)

The gensim snippet above is the quickest one-liner; the sections below cover the same task with other libraries and approaches.

Using translate(): translate() is another method that can be used to remove a character from a string in Python. translate() returns a new string after removing the values mapped in the translation table. Also, remember that to remove a character from a string using translate(), you have to map it to None and not to "".
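
A minimal sketch of that idea, removing a single character (here a comma) by mapping its ordinal to None:

text = "Nick likes to play football, however, he is not too fond of tennis."

# Map the comma's Unicode code point to None so translate() deletes it
print(text.translate({ord(","): None}))
# Nick likes to play football however he is not too fond of tennis.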

How do you remove Stopwords and punctuation in Python?

In order to remove stopwords and punctuation using NLTK, we first have to download the stop word list with nltk.download('stopwords'). We then specify the language for which we want to remove the stopwords by calling stopwords.words('english') and saving the result to a variable.
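
A short sketch of that workflow (the sample sentence reuses the gensim example above):

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

text = "Nick likes to play football, however he is not too fond of tennis."

# Keep only tokens that are neither stopwords nor punctuation
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
print([t for t in tokens if t.lower() not in stop_words and t not in string.punctuation])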

What is Stopword removal?

Stop word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus. Articles and pronouns are typically classified as stop words.

How do you remove meaningless words in Python?


import nltk
nltk.download('words', quiet=True)

words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."

" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my'

How do I remove certain words from a string?

Using the replace() function: we can use the replace() function to remove a word from a string in Python. This function replaces a given substring with another substring (here, an empty string).
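
A quick illustration (the sample sentence is an arbitrary example; note that replace() matches substrings, not whole words):

sentence = "Nick likes to play football and tennis"

# Replace the word with an empty string, then strip the leftover whitespace
print(sentence.replace("tennis", "").strip())
# Nick likes to play football and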

How do you trim a string in Python?

Python Trim String

  • strip(): returns a new string after removing any leading and trailing whitespace, including tabs (\t).
  • rstrip(): returns a new string with trailing whitespace removed, i.e. whitespace on the “right” side of the string.
  • lstrip(): returns a new string with leading whitespace removed, i.e. whitespace on the “left” side of the string.
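
A quick illustration of all three, assuming a whitespace-padded sample string:

padded = "\t  hello world  \n"

print(repr(padded.strip()))   # 'hello world'
print(repr(padded.rstrip()))  # '\t  hello world'
print(repr(padded.lstrip()))  # 'hello world  \n'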

How do you remove stop words from a text file in Python with NLTK?

We specifically consider the stop words from the English language. Now let us pass a string as input and write the code to remove the stop words: we import stopwords from nltk.corpus and word_tokenize from nltk.tokenize, define an example string such as example = "Hello there, my name is Bob.", tokenize it, and filter out every token that appears in the stop word list.
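
Since the question asks about a text file, here is a minimal sketch along those lines; 'document.txt' is a placeholder filename for whatever file you want to clean:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Read the whole file into a string ('document.txt' is a placeholder)
with open('document.txt') as f:
    text = f.read()

stop_words = set(stopwords.words('english'))
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)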

What does NLTK function word_tokenize () do?

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words). It splits text based on white space and punctuation; for example, commas and periods are taken as separate tokens.
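
A small demonstration, using a line adapted from the lyrics later in this article:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

print(word_tokenize("Somewhere a queen is weeping, somewhere a king has no wife."))
# ['Somewhere', 'a', 'queen', 'is', 'weeping', ',', 'somewhere', 'a', 'king', 'has', 'no', 'wife', '.']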

What are Stopwords in Python?

Stopwords are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, for example words like the, he, and have. Such words are already captured in NLTK's built-in stopwords corpus.

Why do we remove Stopwords?

Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information.


Unstructured text data requires unique steps to preprocess in order to prepare it for machine learning. This article walks through some of those steps including tokenization, stopwords, removing punctuation, lemmatization, stemming, and vectorization.

Dataset Overview

To demonstrate some natural language processing text cleaning methods, we’ll be using song lyrics from one of my favorite musicians - Jimi Hendrix. The raw song lyric data can be found here. For the purposes of this demo, we’ll be using a few lines from his famous song The Wind Cries Mary:

A broom is drearily sweeping
Up the broken pieces of yesterdays life
Somewhere a queen is weeping
Somewhere a king has no wife
And the wind it cries Mary 

Tokenization

One of the first steps in most natural language processing workflows is to tokenize your text. There are a few different varieties, but in the most basic sense this involves splitting a text string into individual words.

We’ll first review NLTK (used to demo most of the concepts in this article) and quickly see tokenization applied in a couple other frameworks.

NLTK

We’ve read the dataset into a list of strings and can use the word_tokenize function from the NLTK Python library. You’ll see that looping through each line and applying word_tokenize splits the line into individual words and special characters.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

# Loop through each line of text and tokenize
# (sample_lines is the list of lyric lines read in above)
sample_lines_tokenized = [word_tokenize(line) for line in sample_lines]

(Images: the sample lines before and after tokenization)

Tokenizers In Other Libraries

There are many different ways to accomplish tokenization. The NLTK library has some great functions in this realm, but others include spaCy and many of the deep learning frameworks. Some examples of tokenization in those libraries are below.

Torch

# Pytorch tokenization
from torchtext.data import get_tokenizer

# Initialize object and tokenize each line
pytorch_tokenizer = get_tokenizer("basic_english")
pytorch_tokens = [pytorch_tokenizer(line) for line in sample_lines]

spaCy

#spaCy tokenization
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

# Initialize object and tokenize each line
nlp = English()
spacy_tokenizer = Tokenizer(nlp.vocab)
spacy_tokens = [spacy_tokenizer(line) for line in sample_lines]

There can be slight differences from one tokenizer to another, but the above more or less do the same. The spaCy library has its own objects that incorporate the framework’s features, for example returning a doc object instead of a list of tokens.

Stopwords

There may be some instances where removing stopwords improves the understanding or accuracy of a natural language processing model. Stopwords are commonly used words that carry little information on their own and can often be removed with minimal information loss. You can get a list of stopwords from NLTK with the following Python commands.

Loading and Viewing Stopwords

# Import and download stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# View stopwords
print(stopwords.words('english'))

(Image: the NLTK list of English stopwords)

Removing Stopwords

We can create a simple function for removing stopwords and returning an updated list.

def remove_stopwords(input_text):
    return [token for token in input_text if token.lower() not in stopwords.words('english')]

# Apply stopword function
tokens_without_stopwords = [remove_stopwords(line) for line in sample_lines_tokenized]

(Image: text before and after stopword removal)

Punctuation

Similar to stopwords, since our text is already split into sentences, removing punctuation can be performed without much information loss and cleans the text down to just words. One approach is to simply use the string module's built-in list of punctuation characters.

import string

def remove_punctuation(input_text):
    return [token for token in input_text if token not in set(string.punctuation)]

# Apply punctuation function
tokens_without_punctuation = [remove_punctuation(line) for line in tokens_without_stopwords]

We had one extra comma that is now removed after applying this function:

(Image: text after punctuation removal)

Lemmatization

We can further standardize our text through lemmatization. This boils a word down to just its root, which can be useful in minimizing the number of unique words used. This is an optional step: in some cases, such as text generation, the original word forms may be important, while in others, such as classification, they may matter less.

Testing Lemmatizers

To lemmatize our tokens, we’ll use the NLTK WordNetLemmatizer. One example applying the lemmatizer to the word “cries” yields the root word “cry”.

# Required library imports and downloads
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)

# Instantiate and test on one word
lem = WordNetLemmatizer()
lem.lemmatize('cries')

Apply to All Tokens/Parts of Speech

The NLTK function runs on specific parts of speech, so we’ll loop through these in a generalized function to lemmatize tokens.

def lemmatize(input_text):
    # Instantiate class
    lem = WordNetLemmatizer()
    # Lemmatized text becomes input inside all loop runs
    lemmatized_text = input_text
    # Lemmatize each part of speech
    for part_of_speech in ['n', 'v', 'a', 'r', 's']:
        lemmatized_text = [lem.lemmatize(token, part_of_speech).lower() for token in lemmatized_text]
    return lemmatized_text

# Apply lemmatize function
tokens_lemmatized = [lemmatize(line) for line in tokens_without_punctuation]

(Image: text before and after lemmatization)

Stemming

Stemming is similar to lemmatization, but rather than converting to a root word it chops off suffixes and prefixes. I prefer lemmatization since it is less aggressive and the resulting words are still valid; however, stemming is still sometimes used, so I show it here.

Snowball Stemmer

There are many different flavors of stemming algorithms; for this example we use the SnowballStemmer from NLTK. Applying stemming to “sweeping” removes the suffix and yields the word “sweep”.

# Required imports
from nltk.stem import SnowballStemmer

# Instantiate and test on one word
stemmer = SnowballStemmer('english')
stemmer.stem('sweeping')

Apply to All Tokens

Similar to past steps, we can create a more generic function and apply this to each line.

def stem(input_text):
    stemmer = SnowballStemmer('english')
    return [stemmer.stem(token) for token in input_text]

# Apply stemming function
tokens_stemmed = [stem(line) for line in tokens_without_punctuation]

(Image: text before and after stemming)

As you can see, some of these are not words. For this reason, I prefer lemmatization in almost all cases so that word lookups during embedding are more successful.

Put it all together

We’ve gone through a number of possible steps to clean our text and created functions for each along the way. One final step is combining these into a single, generalized function to run on the text. The combined function below lets you enable any desired step and runs each enabled operation sequentially on the lines of text. I used a functional approach here, but a class could certainly be built using similar principles.

def clean_list_of_text(
        input_text, 
        enable_stopword_removal=True,
        enable_punctuation_removal=True,
        enable_lemmatization=True,
        enable_stemming=False
    ):
    # Get list of operations
    enabled_operations = [word_tokenize]
    if enable_stopword_removal:
        enabled_operations.append(remove_stopwords)
    if enable_punctuation_removal:
        enabled_operations.append(remove_punctuation)
    if enable_lemmatization:
        enabled_operations.append(lemmatize)
    if enable_stemming:
        enabled_operations.append(stem)
    print(f'Enabled Operations: {len(enabled_operations)}')
    

    # Run all operations
    cleaned_text_lines = input_text
    for operation in enabled_operations:
        # Run for all lines
        cleaned_text_lines = [operation(line) for line in cleaned_text_lines]
    
    return cleaned_text_lines

# Example of applying the function
clean_list_of_text(sample_lines, enable_stopword_removal=True, enable_punctuation_removal=True, enable_lemmatization=True)

Vector Embedding

Now that we finally have our text cleaned, is it ready for machine learning? Not quite. Most models require numeric inputs rather than strings. To get there, embeddings, which convert strings into vectors, are often used. You can think of this as capturing the information and meaning of text in a fixed-length numerical vector.

We’ll walk through an example of using gensim; however, many of the deep learning frameworks may have ways to quickly load pre-trained embeddings as well.

Gensim Pre-Trained Model Overview

The library that we’ll be using to look up pre-trained embedding vectors for our cleaned tokens is gensim. It has multiple pre-trained embeddings available for download; you can review these in the word2vec module's inline documentation.

import gensim.downloader

# Load pretrained gensim model
glove_model = gensim.downloader.load("glove-wiki-gigaword-100")

Most Similar Words

Gensim provides multiple functionalities to use with the pre-trained embeddings. One is viewing which words are most similar. To get an idea of how this works, let’s try the word “queen” which is contained in our sample Jimi Hendrix lyrics.

# Show default most similar words given a word
glove_model.most_similar('queen')

(Image: words most similar to “queen”)

Retrieving Vector Embedding Example

To convert a word to embedding vector we simply use the pre-trained model like a dictionary. Let’s see what the embedding vector looks like for the word “broom”.

# Sample of embedding vector for a word
glove_model['broom']

(Image: sample embedding vector for “broom”)

Apply to All Tokens

Similar to past steps, we can simply loop through the cleaned tokens and build out a list of vectors instead. In reality, you would likely add error handling for words that fail the lookup (they raise a KeyError since they are missing from the dictionary), but I've omitted that in this simple example.

# Convert all lines and tokens to vectors using our glove_model object
# (text_to_convert holds the cleaned token lists, e.g. the output of clean_list_of_text)
vectors = [[glove_model[token] for token in line] for line in text_to_convert]
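
One simple way to guard against out-of-vocabulary tokens, as a sketch rather than the article's original code, is to check membership before the lookup:

# Skip any token that isn't in the pre-trained vocabulary
vectors = [
    [glove_model[token] for token in line if token in glove_model]
    for line in text_to_convert
]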

Padding Vectors

Many natural language processing models require the same number of words as inputs. However, text length is often ragged, with each line not conforming to the exact same number of words. To fix this, one approach often taken is padding sequences. We can add dummy vectors at the end of the shorter sentences to get everything aligned.

Padding Sequences in Pytorch

Many libraries have helper methods for this type of workflow. For example, torch allows us to pad sequences in this manner as follows.

import torch

# Example of padding those embeddings and converting to torch tensor (num_examples, sequence_length, embed_dim)
torch_padded_tensor = torch.nn.utils.rnn.pad_sequence([torch.FloatTensor(vector) for vector in vectors], batch_first=True)
torch_padded_tensor.shape

Output: torch.Size([5, 4, 100])

After padding our sequences, you can now see that the 5 lines of text are each of length 4 with an embedding dimension of 100 as expected. But what happened to our first line which only had three words (tokens) after cleaning?

Viewing Padded Vectors

Torch by default just creates zero value vectors for anything that needs to be padded.

# What is the 4th word in the first line shown as, since it didn't exist
torch_padded_tensor[0][3]

(Image: the zero-padded vector for the missing fourth token)

Summary

Text data often requires unique steps when preparing data for machine learning. Cleaning text is important to standardize words to allow for embeddings and lookups, while losing the least amount of information possible for a given task. Once you’ve cleaned and prepared text data, it can be used for more advanced machine learning workflows like text generation or classification.

All examples and files available on Github.



How do I remove punctuation from text in Python?

One of the easiest ways to remove punctuation from a string in Python is to use the str.translate() method. The translate() method takes a translation table, which we can build with the str.maketrans() method.
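
A short sketch of that approach, mapping every punctuation character to nothing:

import string

text = "Hello there, my name is Bob."

# Build a table that deletes every character in string.punctuation, then apply it
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))
# Hello there my name is Bob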

How do I remove punctuation from a DataFrame in Python?

To remove punctuation with Python Pandas, we can use the DataFrame's str.replace method. We call replace with a regex that matches all punctuation characters and replaces them with empty strings. replace returns a new column, and we assign that back to df['text'].
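
A minimal sketch; the DataFrame, its contents, and the 'text' column name are made up for illustration:

import pandas as pd

df = pd.DataFrame({'text': ["Hello there, my name is Bob.", "Nick likes football!"]})

# Replace any character that is not a word character or whitespace with an empty string
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
print(df)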

How do you remove punctuation from Python using NLTK?

The workflow assumed by NLTK is that you first tokenize text into sentences and then every sentence into words. word_tokenize() itself does not strip punctuation; commas and periods come back as their own tokens. To get rid of the punctuation, you can use a regular expression or Python's isalnum() function.
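
A small sketch filtering tokenized output with isalnum():

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

tokens = word_tokenize("Hello there, my name is Bob.")

# Keep only tokens made up of letters and digits, dropping pure punctuation
print([t for t in tokens if t.isalnum()])
# ['Hello', 'there', 'my', 'name', 'is', 'Bob']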