Lab 6: Frequency Counting
Bard College – Computer Science – Data Structures


In this lab we will compare two different methods for frequency counting, specifically within the context of Zipf's Law. We will compare Symbol Tables based on Sequential Search vs. what our book calls a Self-Organizing Search. Self-organizing search uses a heuristic: recently accessed items are likely to be accessed again, and thus, should be moved to the front of the list. This approach takes advantage of locality of reference: an important notion from from computer systems architecture.

Part 1: Zipf's Law

Validate Zipf's law which wikipedia states "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc." 

Find the most frequent 50 words from a large text corpus. The data should be presented in order of word frequency. Plot the results on a log-log plot; if Zipf's law holds, you should see a line. 

HINTS: The FrequencyCounter class from page 372 is a good place to start. Consider using Quick.select to find the top 50 words. The SymbolTable API is on page 363.

Project Gutenberg is an excellent source for large plain text files (e.g., Melville's Moby Dick or the King James Bible). 

java Zipf < bible.txt 
the        62252
and        38641
of        34548
to        13457
And        12735
that        12464
in        12221
shall        9762
he        9510
unto        8932
I        8708
his        8362
a        7999
for        7160
they        6895
be        6736
is        6720
with        5995
not        5859
all        5256
thou        4629
it        4469
thy        4451
was        4448
which        4275
my        4135
LORD        3930
their        3873
have        3825
will        3757
from        3592
ye        3566
them        3503
as        3265