Lab 8: Frequency Counting
Bard College – Computer Science – Data Structures


In this lab we will compare two different methods for frequency counting, specifically within the context of Zipf's Law. We will compare Symbol Tables based on Sequential Search vs. what the book calls a Self-Organizing Search. Self-Organizing search uses a heuristic that recently accessed items are likely to be accessed again, and should be moved to the front of the list. This approach takes advantage of locality of reference (a notion from from computer systems architecture).

Self-Organizing Search

Starting with Algorithm 3.1: SequentialSearchST modify that data structure such that it moves the most recently accessed item (via get  or put ) to the beginning (or end) of the symbol table to facilitate fast access. In other words, frequently accessed items should be quick to find. (More on page 391 exercise 3.1.22)

java SelfOrgaanizingST < tiny.txt
S ->
E -> S ->
A -> E -> S ->
R -> A -> E -> S ->
C -> R -> A -> E -> S ->
H -> C -> R -> A -> E -> S ->
E -> H -> C -> R -> A -> S ->
X -> E -> H -> C -> R -> A -> S ->
A -> X -> E -> H -> C -> R -> S ->
M -> A -> X -> E -> H -> C -> R -> S ->
P -> M -> A -> X -> E -> H -> C -> R -> S ->
L -> P -> M -> A -> X -> E -> H -> C -> R -> S ->
E -> L -> P -> M -> A -> X -> H -> C -> R -> S ->
E 12
L 11
P 10
M 9
A 8
X 7
H 5
C 4
R 3
S 0

Evaluation & Zipf's Law

Validate Zipf's law which wikipedia states "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc." 

Present the frequency of the most frequent 50 words from a large text corpus; a Priority Queue could be useful for this bit. The data should be presented in order of word frequency (not by the word itself). Also, plot the results on a log-log plot; if Zipf's law holds, you should see a line. The Modify the for loop to print the top 50 words and their frequencies. Put that data into a spreadsheet and plot it on a graph. The FrequencyCounter class from page 372 is a good place to start.

Project Gutenberg is an excellent source for large plain text files (e.g., Melville's Moby Dick or the King James Bible). 

Compare the performance of both methods for frequency counting on your text corpus.

EXTRA
Find a specific problem or application where a symbol table based on:
  1. binary-search is better than self-organizing search
  1. self-organizing search is better than binary search

Submission

Submit via moodle a zip file that expands to a folder with the following files: