In this post I will write a project in Python to apply Zipf's Law to analysing word frequencies in a piece of text. Zipf's Law describes a probability distribution in which each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3, and so on. This is best illustrated with a graph. The Zipfian Distribution can be applied to many different areas including populations, incomes and company revenues, as well as words in a piece of text as I mentioned above. The word frequencies of a single piece of text are unlikely to be a good fit - for that you really need a large and varied range of texts. However, applying it to a single piece of text is a simple way to demonstrate the distribution, and if we implement it in Python we can also see a few useful language techniques in action along the way.

The Problem

For this project I want to implement the following features:

- Count the frequencies of the individual words in a piece of text and pick out the most common ones
- For each of these words, calculate the frequency predicted by Zipf's Law from its rank and the highest actual frequency
- Compile the results into a data structure holding, for each word: the word itself, its actual frequency, its rank as a fraction (1/1, 1/2, 1/3 etc.), its Zipfian frequency, and the difference between the actual and Zipfian frequencies as both an absolute value and a percentage
- Print the results as a neat table
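The reciprocal relationship described above can be shown with a quick calculation. The highest frequency of 120 here is just an arbitrary example value:

```python
# Expected Zipfian frequencies for ranks 1 to 6, assuming an
# arbitrary highest frequency of 120: each is 120 * (1/rank).
top_frequency = 120

for rank in range(1, 7):
    zipf_frequency = top_frequency * (1 / rank)
    print("rank {}: {:.1f}".format(rank, zipf_frequency))
```

This prints 120.0, 60.0, 40.0, 30.0, 24.0 and 20.0 - each frequency is the highest one multiplied by the reciprocal of its rank.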
In this project I'll use a handful of slightly lesser known Python features, which I'll list here just to give a sneak preview, and then describe in more detail later on:

- The three-argument version of str.maketrans, together with str.translate, for stripping unwanted characters from a string
- collections.Counter and its most_common method for counting word frequencies
- enumerate with its optional start argument for looping with a rank starting at 1
The project consists of the following two Python files, as well as dracula.txt which contains the full text of Bram Stoker's novel, which we'll use as the input. The files can be downloaded as a zip from the Downloads page, or you can clone/download the GitHub repository.
This is the full code of zipfslaw.py.

zipfslaw.py

```python
import collections
import string


def generate_zipf_table(text, top):
    text = text.lower()
    text = _remove_punctuation(text)
    top_word_frequencies = _top_word_frequencies(text, top)
    zipf_table = _create_zipf_table(top_word_frequencies)
    return zipf_table


def _remove_punctuation(text):
    chars_to_remove = string.punctuation + string.digits
    tr = str.maketrans("", "", chars_to_remove)
    return text.translate(tr)


def _top_word_frequencies(text, top):
    # split the text into a list of words
    words = text.split()
    # a Counter maps each word to its number of occurrences
    word_frequencies = collections.Counter(words)
    # most_common returns a list of (word, count) tuples
    # sorted by count in descending order
    top_word_frequencies = word_frequencies.most_common(top)
    return top_word_frequencies


def _create_zipf_table(frequencies):
    zipf_table = []
    top_frequency = frequencies[0][1]
    for index, item in enumerate(frequencies, start=1):
        relative_frequency = "1/{}".format(index)
        zipf_frequency = top_frequency * (1 / index)
        difference_actual = item[1] - zipf_frequency
        difference_percent = (item[1] / zipf_frequency) * 100
        zipf_table.append({"word": item[0],
                           "actual_frequency": item[1],
                           "relative_frequency": relative_frequency,
                           "zipf_frequency": zipf_frequency,
                           "difference_actual": difference_actual,
                           "difference_percent": difference_percent})
    return zipf_table


def print_zipf_table(zipf_table):
    width = 80
    print("-" * width)
    format_string = "|{:4}|{:12}|{:12.0f}|{:>12}|{:12.2f}|{:12.2f}|{:7.2f}%|"
    for index, item in enumerate(zipf_table, start=1):
        print(format_string.format(index,
                                   item["word"],
                                   item["actual_frequency"],
                                   item["relative_frequency"],
                                   item["zipf_frequency"],
                                   item["difference_actual"],
                                   item["difference_percent"]))
    print("-" * width)
```

generate_zipf_table

This is the core function, which takes a string and a top argument specifying the maximum number of items to return the frequencies of. (Data sets following the Zipfian distribution often have a long tail of very low frequencies which aren't worth considering or trying to fit to the reciprocal of the rank.) It then generates and returns the data structure described in the bullet points above.

_remove_punctuation

This is a short and simple but quite interesting function. It first creates a string containing all the characters we want to remove, basically punctuation plus the numbers 0 to 9. It's not very sophisticated and I have to admit that this whole project is really only suitable for use with ASCII-only text, but it works with Dracula. Next we use the three-argument version of str.maketrans.
This is a static string method which creates a table of character replacement mappings; in this case no characters will be replaced (the first two arguments therefore being empty strings) but the characters in the third argument are to be removed. Finally we return the result of calling translate on the text with the translation table created in the previous line.

_top_word_frequencies

I have gone a bit over the top with comments in this function just to make it clear how the rather less well known bits of code actually work. Firstly we split the text into a list of words and then use that list to construct a Counter. We then call the Counter object's most_common method to get a sorted list of the top words.

_create_zipf_table

We then take the list generated by the previous function and use it as the basis of a new list containing all the additional Zipfian Distribution stuff. Within a loop through the list we calculate all the various extra values before adding them as a dictionary to a new list.

print_zipf_table

This final function simply takes the data structure from _create_zipf_table and prints it in a neat table. The formatting string is long and unwieldy so I have assigned it to a separate variable, not something I have to do often! Now we can move on to trying out the module.

zipfslawtest.py

```python
import zipfslaw


def main():
    print("-----------------")
    try:
        f = open("dracula.txt", "r")
        text = f.read()
        f.close()
        zipf_table = zipfslaw.generate_zipf_table(text, 135)
        zipfslaw.print_zipf_table(zipf_table)
    except IOError as e:
        print(e)


main()
```

After some very mundane code to read a text file we pass its contents to generate_zipf_table and then pass the result to print_zipf_table. Now we can run the code with this command:

python3.7 zipfslawtest.py

Program Output (partial)

(The full 135-row table is too long to reproduce here; each row shows the rank, word, actual frequency, Zipf fraction, Zipf frequency, and the actual/Zipfian differences.)

As you can see Dracula isn't a good fit for the Zipfian Distribution - few individual texts are.
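Before moving on, here is a quick standalone demonstration of the three lesser known features used above - str.maketrans/translate, Counter.most_common, and enumerate with a start argument - applied to a short made-up sentence rather than the novel:

```python
import collections
import string

text = "the cat sat on the mat, and the dog sat on the cat!"

# three-argument maketrans: no replacements, just deletion of
# punctuation and digits
tr = str.maketrans("", "", string.punctuation + string.digits)
cleaned = text.translate(tr)

# count word occurrences and take the three most frequent;
# ties are ordered by first occurrence in the text
counts = collections.Counter(cleaned.split())
top_three = counts.most_common(3)
print(top_three)  # [('the', 4), ('cat', 2), ('sat', 2)]

# enumerate with start=1 gives human-friendly ranks
for rank, (word, frequency) in enumerate(top_three, start=1):
    print(rank, word, frequency)
```

The same three techniques do all the real work in zipfslaw.py.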
The code in this project uses a rather naive series of reciprocals of ranks, but more sophisticated methods of calculating the Zipfian probabilities might provide a better fit. However, when counting words these formulae require a value for the number of words in the language of the text. This is of course such a vague concept that it is pretty much impossible to arrive at a suitable value. This project has been tailored to calculating the Zipfian distribution for words in a piece of text. However, as I stated above, the distribution can be applied to many types of data, often with a better fit. The code in this post could be enhanced to be more general-purpose, creating a probability distribution from a list of any data type.
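As a sketch of what such a refinement might look like, the general form of Zipf's Law divides each reciprocal by the harmonic number of the vocabulary size, and optionally raises ranks to an exponent, so that the values form a proper probability distribution summing to 1. The function name and the choice of the vocabulary size n and exponent s below are my own illustrative assumptions; as noted above, picking a realistic n for a whole language is the hard part:

```python
def zipf_probability(rank, n, s=1.0):
    """Probability of the word at the given rank under the
    generalised Zipf distribution with vocabulary size n and
    exponent s: (1 / rank**s) divided by the generalised
    harmonic number H(n, s). (Illustrative sketch, not part
    of the project code.)"""
    harmonic = sum(1 / (k ** s) for k in range(1, n + 1))
    return (1 / rank ** s) / harmonic


# With n=5 and s=1 the probabilities are the reciprocals
# 1, 1/2, 1/3, 1/4, 1/5 scaled so that they sum to 1.
probabilities = [zipf_probability(rank, 5) for rank in range(1, 6)]
print(probabilities)
print(sum(probabilities))  # approximately 1.0
```

Unlike the raw reciprocal-of-rank frequencies used in zipfslaw.py, these probabilities could be compared directly against relative word frequencies, but only if a sensible vocabulary size can be chosen.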