Text Preprocessing: Strategies for Cleaning Text Data

In the field of natural language processing (NLP), one of the essential steps in preparing text data for model training is text preprocessing. This crucial process involves cleaning up the text, removing irrelevant information, and transforming it into a format that is more suitable for further analysis and modeling. In this article, we will walk through a comprehensive text cleaning pipeline using a Jupyter Notebook. We’ll explore each step and demonstrate the implementation of various techniques to preprocess text data effectively. So let’s dive in and learn how to preprocess text data like a pro!

Create the Dataset

Let’s import the necessary libraries and create a sample dataset for demonstration purposes. Throughout the article, the quoted line after each code snippet shows the resulting value of df["text"][0].

import pandas as pd
from bs4 import BeautifulSoup
import string

# Sample text containing mixed casing, a URL, HTML tags, and a number
text = """Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! \n <html> Amazon2022 © <br/> <br /> </html>"""
df = pd.DataFrame([text], columns=["text"])
“Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! \n <html> Amazon2022 © <br/> <br /> </html>”

Text Preprocessing Pipeline

Text preprocessing involves a series of steps that transform raw text into a cleaner and more structured format. Let’s explore each step in detail:

Text to Lowercase

To achieve consistency and remove any case-sensitive discrepancies, converting the entire text to lowercase is a common practice. This step ensures that words are treated equally, regardless of their original casing.

df["text"] = df["text"].str.lower()
“hey amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix it! \n <html> amazon2022 © <br/> <br /> </html>”

Clean HTML Texts

In scenarios where the text data contains HTML tags or entities, it’s crucial to clean them before further processing. The Beautiful Soup library provides powerful tools to parse and remove HTML elements, leaving behind only the relevant text.

# Cleaning HTML Characters
df["text"] = df["text"].apply(lambda x: BeautifulSoup(x, "html.parser").text)
“hey amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix it! \n amazon2022 © ”
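
If you would rather not pull in Beautiful Soup for simple cases, a lightweight regular expression can strip tags as well; this is a rough sketch, not a full HTML parser, and it will not decode entities or cope with malformed markup:

# Alternative: strip anything that looks like an HTML tag (not a full parser)
df["text"] = df["text"].str.replace(r"<[^>]+>", "", regex=True)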

Remove URLs

When dealing with text from sources such as web scraping or social media, it’s common to encounter URLs. These URLs do not contribute to the textual content’s semantic meaning and can be safely removed using appropriate techniques.

# Remove URLs
df["text"] = df["text"].str.replace(r"https?:\/\/.\S+", "", regex=True)
“hey amazon - my package never arrived but it shows it's delivered please fix it! \n amazon2022 © ”
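
Text scraped from social media often contains links without a scheme (for example www.example.com). A slightly broader pattern, sketched here under the assumption that such links should also be dropped:

# Also drop links that start with www. (assumed broader pattern)
df["text"] = df["text"].str.replace(r"(https?://\S+|www\.\S+)", "", regex=True)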

Split on Numbers

Text data often contains numbers that might not hold significant meaning in certain NLP tasks. By splitting the text on numbers, we can separate numeric sequences from the words they are attached to (for example, “amazon2022” becomes “amazon 2022”), enabling better analysis and model training.

# Split Numbers from Words
# Splitting on a capture group keeps the numbers as separate tokens to rejoin with spaces
df["text"] = df["text"].str.split(r"(\d+)", regex=True)
df["text"] = df["text"].apply(" ".join)
“hey amazon - my package never arrived but it shows it's delivered please fix it! \n amazon 2022 © ”
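
The same split-and-rejoin can also be done with a single substitution that pads each run of digits with spaces; any doubled spaces are cleaned up later by the shrink-extra-spaces step. A small alternative sketch:

# Alternative: pad each run of digits with spaces in one substitution
df["text"] = df["text"].str.replace(r"(\d+)", r" \1 ", regex=True)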

Expand Contractions

Contractions like “can’t” or “wouldn’t” can pose challenges during text analysis. Expanding contractions involves converting such shortened forms into their complete counterparts. This step ensures that the full meaning of words is preserved.

# Expand Contractions
contractions = {"'s": " is", "n't": " not", "'m": " am", "'ll": " will", "'d": " would", "'ve": " have", "'re": " are"}

# Replace each contracted suffix with its expanded form (literal match, not regex)
for key, value in contractions.items():
    df["text"] = df["text"].str.replace(key, value, regex=False)
“hey amazon - my package never arrived but it shows it is delivered please fix it! \n amazon 2022 © ”
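
Note that a suffix dictionary like the one above misses some forms; “n't”, for example, turns “can't” into “ca not”. The third-party contractions package covers far more cases. A minimal sketch, assuming the package is installed (pip install contractions); it is aliased here to avoid clashing with the contractions dictionary defined above:

# Alternative: expand contracted forms with the contractions package
import contractions as contractions_lib
df["text"] = df["text"].apply(contractions_lib.fix)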

Remove Punctuation

Punctuation marks, such as periods, commas, or exclamation marks, are usually not crucial for many NLP tasks. Removing these punctuation marks helps streamline the text and reduces noise during subsequent analysis and modeling.

# Extra Punctuation List
extra_punct = [",", ".", '"', ":", ")", "(", "!", "?", "|", ";", "'", "&", "/", "[", "]", ">", "%", "=", "#", "*", "+", "\\", "•", "~", "@", "·", "_", "{", "}", "©", "^", "®", "`", "<", "→", "°", "€", "™", "›", "♥", "←", "×", "§", "″", "′", "Â", "█", "½", "à", "…", "“", "★", "”", "–", "●", "â", "►", "−", "¢", "²", "¬", "░", "¶", "↑", "±", "¿", "▾", "═", "¦", "║", "―", "¥", "▓", "—", "‹", "─", "▒", ":", "¼", "⊕", "▼", "▪", "†", "■", "’", "▀", "¨", "▄", "♫", "☆", "é", "¯", "♦", "¤", "▲", "è", "¸", "¾", "Ã", "⋅", "‘", "∞", "∙", ")", "↓", "、", "│", "(", "»", ",", "♪", "╩", "╚", "³", "・", "╦", "╣", "╔", "╗", "▬", "❤", "ï", "Ø", "¹", "≤", "‡", "√", "«", "»", "´", "º", "¾", "¡", "§"]

punctuation = list(string.punctuation) + extra_punct
# Use regex=False so characters such as ".", "(" and "*" are treated literally
for punct in punctuation:
    df["text"] = df["text"].str.replace(punct, "", regex=False)
“hey amazon my package never arrived but it shows it is delivered please fix it \n amazon 2022 ”
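
Looping over the column once per punctuation character triggers a full pass over the data for every character; str.translate removes all listed characters in a single pass. A sketch of the same removal using a translation table:

# Alternative: map every punctuation character to None and translate in one pass
table = str.maketrans("", "", "".join(punctuation))
df["text"] = df["text"].str.translate(table)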

Remove Stopwords

Stopwords are commonly occurring words in a language, such as “the,” “and,” or “is.” These words typically do not carry much information and can be safely removed to focus on more meaningful words. Libraries like NLTK or spaCy provide pre-defined sets of stopwords for different languages.

# Stopwords Removal
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "she's", "her", "hers", "herself", "it", "it's", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "that'll", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",     "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "can", "will", "just", "don", "don't", "should", "should've", "now", "ll", "re", "ve", "ain", "aren", "aren't", "couldn", "couldn't", "didn", "didn't", "doesn", "doesn't", "hadn", "hadn't", "hasn", "hasn't", "haven", "haven't", "isn", "isn't", "ma", "mightn", "mightn't", "mustn", "mustn't", "needn", "needn't", "shan", "shan't", "shouldn", "shouldn't", "wasn", "wasn't", "weren", "weren't", "won", "won't", "wouldn", "wouldn't"]

pattern = r'\b(?:{})\b'.format('|'.join(stopwords))
df['text'] = df['text'].str.replace(pattern, '', regex=True)
“hey amazon package never arrived shows delivered please fix \n amazon 2022 ”
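
Rather than hard-coding the list, you can pull it from NLTK; a minimal sketch, assuming nltk is installed and the stopwords corpus has been downloaded:

# Load English stopwords from NLTK instead of hard-coding them
import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download("stopwords")  # one-time download of the stopwords corpus
stopwords = nltk_stopwords.words("english")
pattern = r"\b(?:{})\b".format("|".join(stopwords))
df["text"] = df["text"].str.replace(pattern, "", regex=True)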

Remove Numbers

Similar to removing punctuation marks, removing standalone numbers that do not contribute to the semantic meaning of the text can improve the quality of the processed data. This step eliminates distractions and helps the model focus on relevant textual features.

# Remove Numbers
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)
“hey amazon package never arrived shows delivered please fix \n amazon ”

Shrink Extra Spaces

Text data often contains unnecessary extra spaces between words or at the beginning or end of the text. Removing these extra spaces not only makes the text visually cleaner but also ensures that the model treats words correctly during analysis and training.

# Shrink Extra Spaces
df["text"] = df["text"].str.strip().replace(r"\s+", " ", regex=True)
“hey amazon package never arrived shows delivered please fix amazon”

Now that we have completed the text preprocessing pipeline, let’s compare the starting text with the final preprocessed text,

Hey Amazon - my package never arrived but it shows it's delivered https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX IT! <html> Amazon2022 © <br/> <br /> </html>

to

hey amazon package never arrived shows delivered please fix amazon

Pretty good, isn’t it? :)
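
If you want to run the whole pipeline on new text in a single call, the steps can be wrapped in one helper function. The sketch below simply chains the steps shown above, in the same order, for a plain Python string, reusing the contractions dictionary, punctuation list, and stopwords list defined earlier:

import re
import string
from bs4 import BeautifulSoup

def preprocess(text: str) -> str:
    text = text.lower()                                                   # text to lowercase
    text = BeautifulSoup(text, "html.parser").text                        # clean HTML texts
    text = re.sub(r"https?://\S+", "", text)                              # remove URLs
    text = " ".join(re.split(r"(\d+)", text))                             # split on numbers
    for key, value in contractions.items():                               # expand contractions
        text = text.replace(key, value)
    text = text.translate(str.maketrans("", "", "".join(punctuation)))    # remove punctuation
    text = re.sub(r"\b(?:{})\b".format("|".join(stopwords)), "", text)    # remove stopwords
    text = re.sub(r"\d+", "", text)                                       # remove numbers
    return re.sub(r"\s+", " ", text).strip()                              # shrink extra spaces

print(preprocess(text))  # hey amazon package never arrived shows delivered please fix amazon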

Libraries Used

Throughout the text preprocessing pipeline, we rely on several libraries to perform the necessary operations. The key libraries used in this notebook are:

  1. Pandas: Pandas is a powerful data manipulation library that provides efficient and flexible tools for handling structured data, including text data. It offers convenient functions and data structures to process and transform text data effectively.
  2. BeautifulSoup: BeautifulSoup is a Python library that specializes in parsing and manipulating HTML and XML documents. In the text preprocessing pipeline, we utilize BeautifulSoup to clean HTML texts by removing tags and extracting the relevant textual content.
  3. string: Python’s built-in string module provides useful string constants and helper functions. We use its string.punctuation constant as the base set of punctuation characters to strip from the text.

Conclusion

In this article, we have explored a comprehensive text preprocessing pipeline using a Jupyter Notebook. Each step of the pipeline plays a crucial role in cleaning and preparing text data for model training in natural language processing tasks. By following this pipeline, you can ensure that your text data is transformed into a cleaner and more structured format, reducing noise and improving the quality of subsequent analysis and modeling.

Text preprocessing is an essential step in NLP, as it helps improve the accuracy and effectiveness of models that rely on textual data. By implementing techniques such as converting text to lowercase, cleaning HTML texts, removing URLs, splitting on numbers, expanding contractions, removing punctuation and stopwords, eliminating standalone numbers, and shrinking extra spaces, you can preprocess your text data like a pro.

With the help of libraries like Pandas, BeautifulSoup, and String, the implementation of the text preprocessing pipeline becomes more efficient and streamlined. These libraries provide powerful functions and tools that simplify the text cleaning process, making it easier to handle large volumes of text data.

By incorporating these text preprocessing techniques into your NLP projects, you can enhance the quality of your text analysis, sentiment analysis, text classification, or any other task that involves processing textual data. Clean and well-preprocessed text data sets the foundation for building robust and accurate NLP models.

You can find the code implementation and further details in the Jupyter Notebook accompanying this article.

Remember, effective text preprocessing is just the first step in the journey of analyzing and understanding text data. From here, you can proceed with feature engineering, model selection, and fine-tuning to extract valuable insights from your text data. So, dive into the world of text preprocessing and unlock the potential hidden within your textual datasets.