Lab 7: Frequency Counting
Bard College – Computer Science – Data Structures


In this lab we will explore Zipf's Law, and compare two different methods for frequency counting. We will compare Symbol Tables based on Sequential Search vs. Self-Organizing Search. Self-organizing search uses a heuristic: recently accessed items are likely to be accessed again soon, and thus, they should be moved to the front of the list. This approach takes advantage of locality of reference: an important notion from computer systems.

Part 1: Zipf's Law

Validate Zipf's law which as wikipedia, "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc." 

A fun video on Zipf’s law:

Find the most frequent 50 words from a large text corpus. The words should be presented in order of frequency. Plot the results on a log-log plot (a spreadsheet works just fine); if Zipf's law holds, you should see a line. 

HINTS: The FrequencyCounter class from page 372 is a good place to start. Consider using Quick.select to find the top 50 words. The SymbolTable API is on page 363.

Project Gutenberg is an excellent source for large plain text files (e.g., Melville's Moby Dick or the King James Bible). 

java Zipf < bible.txt 
the        62252
and        38641
of        34548
to        13457
And        12735
that        12464
in        12221
shall        9762
he        9510
unto        8932
I        8708
his        8362
a        7999
for        7160
they        6895
be        6736
is        6720
with        5995
not        5859
all        5256
thou        4629
it        4469
thy        4451
was        4448
which        4275
my        4135
LORD        3930
their        3873
have        3825
will        3757
from        3592