Bard College – Computer Science – Data Structures
In this lab we will explore Zipf's Law, and compare two different methods for frequency counting. We will compare Symbol Tables based on Sequential Search vs. Self-Organizing Search. Self-organizing search uses a heuristic: recently accessed items are likely to be accessed again soon, and thus, they should be moved to the front of the list.This approach takes advantage of locality of reference: an important notion from computer systems.
Validate Zipf's law which as wikipedia,"given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."
A fun video on Zipf’s law:
Find the most frequent 50 words from a large text corpus. The words should be presented in order of frequency. Plot the results on a log-log plot(a spreadsheet works just fine); if Zipf's law holds, you should see a line.
HINTS: The FrequencyCounter class from page 372 is a good place to start. Consider using Quick.select to find the top 50 words. The SymbolTable API is on page 363.
Part 1: Zipf's Law
java Zipf < bible.txt
the 62252
and 38641
of 34548
to 13457
And 12735
that 12464
in 12221
shall 9762
he 9510
unto 8932
I 8708
his 8362
a 7999
for 7160
they 6895
be 6736
is 6720
with 5995
not 5859
all 5256
thou 4629
it 4469
thy 4451
was 4448
which 4275
my 4135
LORD 3930
their 3873
have 3825
will 3757
from 3592