Bard College – Computer Science – Data Structures
In this lab we will compare two different methods for frequency counting, specifically within the context of Zipf's Law. We will compare Symbol Tables based on Sequential Search vs. what the book calls a Self-Organizing Search. Self-Organizing search uses a heuristic that recently accessed items are likely to be accessed again, and should be moved to the front of the list.This approach takes advantage of locality of reference(a notion from from computer systems architecture).
Self-Organizing Search
Starting with Algorithm 3.1: SequentialSearchST modify that data structure suchthat it moves the most recently accessed item(viaget or put ) to the beginning(or end) of the symbol table to facilitate fast access. In other words, frequently accessed items should be quick to find.(More on page 391 exercise 3.1.22)
javaSelfOrgaanizingST < tiny.txt
S ->
E -> S ->
A -> E -> S ->
R -> A -> E -> S ->
C -> R -> A -> E -> S ->
H -> C -> R -> A -> E -> S ->
E -> H -> C -> R -> A -> S ->
X -> E -> H -> C -> R -> A -> S ->
A -> X -> E -> H -> C -> R -> S ->
M -> A -> X -> E -> H -> C -> R -> S ->
P -> M -> A -> X -> E -> H -> C -> R -> S ->
L -> P -> M -> A -> X -> E -> H -> C -> R -> S ->
E -> L -> P -> M -> A -> X -> H -> C -> R -> S ->
Validate Zipf's law which wikipedia states"given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."
Present the frequency of the most frequent 50 words from a large text corpus; a Priority Queue could be useful for this bit. The data should be presented in order of word frequency(not by the word itself). Also, plot the results on a log-log plot; if Zipf's law holds, you should see a line. The Modify the for loop to print the top 50 words and their frequencies. Put that data into a spreadsheet and plot it on a graph. The FrequencyCounter class from page 372 is a good place to start.
Self-Organizing Search
java SelfOrgaanizingST < tiny.txt
S ->
E -> S ->
A -> E -> S ->
R -> A -> E -> S ->
C -> R -> A -> E -> S ->
H -> C -> R -> A -> E -> S ->
E -> H -> C -> R -> A -> S ->
X -> E -> H -> C -> R -> A -> S ->
A -> X -> E -> H -> C -> R -> S ->
M -> A -> X -> E -> H -> C -> R -> S ->
P -> M -> A -> X -> E -> H -> C -> R -> S ->
L -> P -> M -> A -> X -> E -> H -> C -> R -> S ->
E -> L -> P -> M -> A -> X -> H -> C -> R -> S ->
E 12
L 11
P 10
M 9
A 8
X 7
H 5
C 4
R 3
S 0
Evaluation & Zipf's Law
Submission