Remove duplicate words in string python

Question

Harsh Jain

Table of Contents

What is regex?
Implementing regex
Remove duplicate words from a string in Python
How do I remove repeating words from a string in Python?
How do you remove duplicate text in Python?
How do I remove repetitive words from a string?
How do you find duplicate words in a string in Python?

In this shot, we will use regular expressions (regex) to remove consecutive duplicate words from text.

What is regex?

A regular expression, or regex, is a pattern used to search for something in textual data. A single regex can often replace a dozen lines of hand-written string handling. Regex syntax is dense and takes some effort to learn, but it becomes very convenient with practice. These expressions are mainly used in text processing, or whenever you are dealing with text data.

Implementing regex

We are going to use the following regex, written here as a Python raw string:

regex = r'\b(\w+)(?:\W+\1\b)+'

Let’s break down the sections:

\b: This is a word boundary. It is needed because if you have a text like “My thesis is great” and you want to find the word “is”, the pattern should not match inside “thesis”, which also contains “is”. A word boundary prevents that.
(\w+): \w denotes a word character, i.e., [a-zA-Z0-9_] for ASCII text. The + makes it one or more, and the parentheses capture the matched word as group 1.
(?:...): This is a non-capturing group. It groups \W+\1\b so the final + can repeat it, without creating a second numbered group.
\W+: This means one or more non-word characters, such as the spaces or punctuation between two words.
\1: This matches the same text that was captured by the first group of parentheses, which in our case is (\w+).
+: This matches whatever is placed before it one or more times.
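To make the breakdown above concrete, here is a small sketch that checks the word-boundary behavior on the “thesis” example and shows what the capturing group holds for a repeated word. The sample strings and variable names are only illustrative.

```python
import re

text = "My thesis is great"

# Without word boundaries, "is" also matches inside "thesis".
print(re.findall(r'is', text))      # ['is', 'is']

# With \b on both sides, only the standalone word "is" matches.
print(re.findall(r'\bis\b', text))  # ['is']

# The backreference \1 must repeat the text captured by (\w+).
pattern = r'\b(\w+)(?:\W+\1\b)+'
match = re.search(pattern, "this is is a test")
print(match.group(0))  # is is  (the whole repeated run)
print(match.group(1))  # is     (the captured word)
```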

Now, let’s take a look at the code:

import re

def removeDuplicatesFromText(text):
    regex = r'\b(\w+)(?:\W+\1\b)+'
    return re.sub(regex, r'\1', text, flags=re.IGNORECASE)

str1 = "How are are you"
print(removeDuplicatesFromText(str1))

str2 = "Edpresso is the the best platform to learn"
print(removeDuplicatesFromText(str2))

str3 = "Programming is fun fun"
print(removeDuplicatesFromText(str3))

Remove duplicate words from text using Regex

Explanation:

In line 1, we import the re package, which will allow us to use regex.
In line 3, we define a function that will return text after removing the duplicate words.
In line 4, we define our regex pattern.
In line 5, we use the sub() function of the re module, which returns a new string with every match of the pattern replaced. Here, we pass the regex pattern, the replacement \1 (the first captured word, so each run of repeats is replaced by a single copy), the input text, and the re.IGNORECASE flag so that matching ignores letter case.
From lines 7 to 14, we pass some text data containing duplicate words. The output shows that the adjacent duplicate words are removed from each string.
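Two properties of this function are easy to miss: it only collapses repeats that sit next to each other, and with re.IGNORECASE the first spelling of the word is the one that is kept. The sketch below redefines the function so it runs on its own; the test strings are made up for illustration.

```python
import re

def removeDuplicatesFromText(text):
    regex = r'\b(\w+)(?:\W+\1\b)+'
    return re.sub(regex, r'\1', text, flags=re.IGNORECASE)

# Adjacent repeats are collapsed, even when there are more than two.
print(removeDuplicatesFromText("very very very good"))  # very good

# With IGNORECASE, "The the" counts as a repeat; the first spelling is kept.
print(removeDuplicatesFromText("The the cat"))          # The cat

# Repeats separated by other words are left untouched.
print(removeDuplicatesFromText("one two one two"))      # one two one two
```

If repeats anywhere in the string should be removed, the dictionary-based approach is a better fit than this regex.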

In this way, regex makes this kind of text preprocessing fairly straightforward.

Remove duplicate words from a string in Python

To remove the duplicate words from a string:

Use the OrderedDict class to get an ordered dictionary without any duplicates.
Use the join() method to join the keys of the dictionary into a string.

from collections import OrderedDict

my_str = 'one two three one two four'

result = ' '.join(OrderedDict.fromkeys(my_str.split()))

print(result)  # 👉️ 'one two three four'

We used the OrderedDict class to remove the duplicate words from a string.

The OrderedDict class is a subclass of dict that remembers the order in which keys were inserted.

from collections import OrderedDict

my_str = 'one two three one two four'

# 👇️ OrderedDict([('one', None), ('two', None), ('three', None), ('four', None)])
print(OrderedDict.fromkeys(my_str.split()))

We used an ordered dictionary because dictionary keys are unique.

We used the str.split() method to split the string into words on whitespace.

my_str = 'one two three one two four'

# 👇️ ['one', 'two', 'three', 'one', 'two', 'four']
print(my_str.split())

The str.split() method splits the string into a list of substrings using a delimiter.

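If no delimiter is provided, the method splits on runs of consecutive whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace, so the result contains no empty strings.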
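The difference between calling split() with no argument and passing an explicit space matters when the input has irregular spacing. A short sketch, using made-up strings:

```python
messy = '  one   two\tthree\none  '

# No argument: any run of whitespace acts as one separator,
# and leading/trailing whitespace is dropped.
words = messy.split()
print(words)  # ['one', 'two', 'three', 'one']

# Explicit ' ' separator: every single space splits,
# so consecutive spaces produce empty strings.
pieces = 'one  two'.split(' ')
print(pieces)  # ['one', '', 'two']
```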

The dict.fromkeys method takes an iterable and a value and creates a new dictionary with keys from the iterable and values set to the provided value.

# 👇️ {'a': None, 'b': None, 'c': None}
print(dict.fromkeys(['a', 'b', 'c']))

# 👇️ {'a': 100, 'b': 100, 'c': 100}
print(dict.fromkeys(['a', 'b', 'c'], 100))

We only need the keys, so we didn't specify a value in the example.

The last step is to join the keys of the OrderedDict into a string.

from collections import OrderedDict

my_str = 'one two three one two four'

result = ' '.join(OrderedDict.fromkeys(my_str.split()))

print(result)  # 👉️ 'one two three four'

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

We joined the collection of strings with a space separator.

Note that as of Python 3.7, the standard dict class is guaranteed to preserve insertion order as well.

We could replace the OrderedDict class with the dict class to achieve the same result.

my_str = 'one two three one two four'

result = ' '.join(dict.fromkeys(my_str.split()))

print(result)  # 👉️ 'one two three four'

This also allows us to remove the import statement.

Which approach you pick is a matter of personal preference.

The OrderedDict class makes the code a little more readable but requires an extra import statement.

How do I remove repeating words from a string in Python?

1) Split the input sentence on spaces to get its words.
2) Gather the resulting words into a single list of strings.
3) Create a dictionary from that list using Counter, with the words as keys and their frequencies as values.
4) Join the keys, which are unique, to form a single string.
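The steps above can be sketched as follows. The function name dedupe_with_counter is hypothetical, and the sketch relies on Counter being a dict subclass, so on Python 3.7+ its keys keep first-seen order.

```python
from collections import Counter

def dedupe_with_counter(sentence):
    # Steps 1 and 2: split the sentence into a list of words.
    words = sentence.split()
    # Step 3: each distinct word becomes a key, its frequency the value.
    counts = Counter(words)
    # Step 4: iterating a Counter yields its keys, each word exactly once.
    return ' '.join(counts)

print(dedupe_with_counter('one two three one two four'))  # one two three four
```

The frequencies are not needed for deduplication itself, so dict.fromkeys() would work equally well; Counter is only useful if you also want to know how often each word occurred.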

How do you remove duplicate text in Python?

Explanation:

First, save the paths of the input and output files in two variables.
Create one set variable.
Open the output file in write mode.
Start a for loop to read the input file line by line.
Find the hash value of the current line.
Check whether this hash value is already in the set variable.
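A minimal sketch of this procedure is shown below. Several details are assumptions rather than part of the steps above: the function name remove_duplicate_lines, the use of hashlib.sha256 as the hash, the temporary file paths, and the final action of writing a line only when its hash has not been seen before.

```python
import hashlib
import os
import tempfile

def remove_duplicate_lines(input_path, output_path):
    seen_hashes = set()  # hashes of the lines already written
    with open(input_path, encoding='utf-8') as src, \
         open(output_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # Strip the newline so a final line without one still
            # compares equal to earlier copies of the same text.
            content = line.rstrip('\n')
            digest = hashlib.sha256(content.encode('utf-8')).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                dst.write(content + '\n')

# Demo with throwaway files (these paths are hypothetical).
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, 'input.txt')
output_path = os.path.join(tmp_dir, 'output.txt')

with open(input_path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\nfirst line\nthird line\nsecond line')

remove_duplicate_lines(input_path, output_path)

with open(output_path, encoding='utf-8') as f:
    result = f.read()
print(result)
```

Storing hashes instead of the lines themselves keeps memory use roughly constant per line, which can help when individual lines are long.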

How do I remove repetitive words from a string?

We create an empty hash table, then split the given string on spaces. For every word, we first check whether it is already in the hash table. If it is not, we print it and store it in the hash table.
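In Python, the built-in set is a hash table, so the idea above might look like the sketch below. The function name dedupe_with_set is hypothetical, and the words are collected into a list and joined rather than printed one by one, so the result can be reused.

```python
def dedupe_with_set(text):
    seen = set()        # hash table of the words encountered so far
    unique_words = []   # first occurrences, in their original order
    for word in text.split():
        if word not in seen:
            seen.add(word)
            unique_words.append(word)
    return ' '.join(unique_words)

print(dedupe_with_set('one two three one two four'))  # one two three four
```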