Thank you all for your help, I appreciate it very much. I come from the world of OO programming and I am bending my head backwards to try to think in the functional paradigm.
Bellow you will find my version of Peter Norvig's spell checker (http://norvig.com/spell-correct.html) in clojure, followed by the java version. I coded the later first and then I attempted to "translate" it. In a few months time I will try to implement it again from scratch (in clojure) and I will be able to check my functional mindset evolution. Note: The reason for initializing the values of the dictionary to one is explained in Norvig's essay ===== C L O J U R E ===== (ns SpellChecker) ;; Version of http://norvig.com/spell-correct.html by 'miguel.arre...@gmail.com' ;; 7th January 2011 (import (java.io BufferedReader FileReader)) (defn build-dictionary [file-name] "Returns a \"dictionary{'word'} = frequency\" mapping each of the words in the file pointed to by 'file-name' to the number of its occurrences. Thank you 'Jason Wolfe' for the hints on how to produce the dictionary (http://groups.google.com/group/clojure/browse_thread/thread/ a1265d60229b8cd5)" (with-open [rdr (BufferedReader. (FileReader. file-name))] (reduce #(assoc %1 %2 (inc (get %1 %2 1))) {} (mapcat #(re-seq #"[a-z]+" (.toLowerCase %)) (line-seq rdr))))) (defn- alpha [] "Explicit alphabet a-z (English)" "abcdefghijklmnopqrstuvwxyz") (defn- deletes [word] "Words derivated from word which are missing a letter" (for [i (range (count word))] (str (subs word 0 i) (subs word (inc i))))) (defn- transposes [word] "Words derivated from word in which two adjacent letters are swapped" (for [i (range (dec (count word)))] (let [i1 (inc i), i2 (inc i1)] (str (subs word 0 i) (subs word i1 i2) (subs word i i1) (subs word i2))))) (defn- replaces [word] "Words derivated from word which are miss-spelled by one letter" (for [i (range (count word)), j (alpha)] (str (subs word 0 i) j (subs word (inc i))))) (defn- inserts [word] "Words derivated from word which contain an additional letter" (for [i (range (count word)), j (alpha)] (str (subs word 0 i) j (subs word i)))) (defn variations [word] "Returns all the words at edit distance 1 from word (includes: deletes, transposes, replaces and inserts)" (concat (replaces word) (inserts word) (deletes word) (transposes word))) (defn check [dict word] "Checks the word agains the dictionary, returning the variation at edit distances 1 and 2 that has the highest frequency (may be the word itself if not found in the dictionary)" (if (dict word) word (let [edit1 (distinct (filter #(dict %) (variations word))), edit2 (distinct (filter #(dict %) (map #(variations %) edit1))), candidates (reduce #(assoc %1 (dict %2 0) %2) (sorted-map-by >) (into edit1 edit2))] (if (empty? candidates) word (val (first candidates)))))) ;; Main (def dictionary (build-dictionary "big.txt")) (check dictionary "amunition") ;; End ===== J A V A ===== package power; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.io.InputStreamReader; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.regex.Matcher; import java.util.regex.Pattern; public class SpellChecker { private Map<String, Integer> dictionary; public SpellChecker (String fileName) throws IOException { File fileHndl = new File(fileName); System.err.println(String.format( "Loading dictionary '%s' (%d bytes)", fileName, fileHndl.length() )); long startTime = System.currentTimeMillis(); dictionary = new HashMap<String, Integer>(); BufferedReader in = new BufferedReader(new FileReader(fileName)); Pattern pattern = Pattern.compile("\\w+"); for (String line=null; null != (line=in.readLine()); ) { Matcher matcher = pattern.matcher(line.toLowerCase()); while (matcher.find()) { String word = matcher.group(); Integer frequency = dictionary.get(word); if (null == frequency) { frequency = Integer.valueOf(1); } else { frequency = Integer.valueOf(frequency.intValue() + 1); } dictionary.put(word, frequency); } } in.close(); System.err.println(String.format( "Loaded, %d entries, %d millis", dictionary.size(), (System.currentTimeMillis() - startTime) )); } private final List<String> variations (String word) { // Words at distance == 1 from the original word // List<String> variations = new ArrayList<String>(); StringBuilder sb = new StringBuilder(); for (int i=0; i < word.length(); i++) { // Deletes sb.setLength(0); sb.append(word.substring(0, i)); sb.append(word.substring(i + 1)); variations.add(sb.toString()); } for (int i=0; i < word.length() - 1; i++) { // Transposes sb.setLength(0); sb.append(word.substring(0, i)); sb.append(word.substring(i + 1, i + 2)); sb.append(word.substring(i, i + 1)); sb.append(word.substring(i + 2)); variations.add(sb.toString()); } for (int i=0; i < word.length(); i++) { for (char c='a'; c <= 'z'; c++) { // Replaces sb.setLength(0); sb.append(word.substring(0, i)); sb.append(String.valueOf(c)); sb.append(word.substring(i + 1)); variations.add(sb.toString()); } } for (int i=0; i <= word.length(); i++) { for (char c='a'; c <= 'z'; c++) { // Inserts sb.setLength(0); sb.append(word.substring(0, i)); sb.append(String.valueOf(c)); sb.append(word.substring(i)); variations.add(sb.toString()); } } return variations; } public final String check (String word) { if (dictionary.containsKey(word)) { return word; } Map<Integer, String> candidates = new HashMap<Integer, String>(); List<String> variationsOfWord = variations(word); for (String var: variationsOfWord) { if (dictionary.containsKey(var)) { candidates.put(dictionary.get(var), var); } } if (candidates.size() > 0) { return candidates.get(Collections.max(candidates.keySet())); } for (String var: variationsOfWord) { for (String varvar: variations(var)) { // Edit distance of 2, only keep real words if (dictionary.containsKey(varvar)) { candidates.put(dictionary.get(varvar), varvar); } } } return candidates.size() > 0? candidates.get(Collections.max(candidates.keySet())) : word; } public static void main (String args[]) throws IOException { // Roger Mitton's Birkbeck spelling error corpus from the Oxford Text Archive. SpellChecker spellChecker = new SpellChecker("big.txt"); BufferedReader brin = new BufferedReader(new InputStreamReader(System.in)); for ( ; ; ) { System.out.print("Spell [ctrl^C to end]: "); String word = brin.readLine(); if (word != null && word.length() > 0) { String nword = spellChecker.check(word); if (nword.equals(word)) { System.out.println(" > Correct!"); } else { System.out.println(" > Did you mean? " + nword); } } } } } On Jan 6, 10:59 pm, Sean Corfield <seancorfi...@gmail.com> wrote: > On Thu, Jan 6, 2011 at 10:49 AM, new2clojure <miguel.arre...@gmail.com> wrote: > > I am trying to store into a map the frequency of each [a-z]+ word in a > > file. When I run this code in the repl the resulting dictionary is > > empty. How should I adapt my code to get this functionality right?. > > Other people have offered up alternate solutions that should get you > pointed in the right direction but I wanted to highlight an important > issue which may be tripping you up, depending on your background: > > > (def dictionary {}) > > This defines dictionary as an empty map. dictionary is immutable. > > > (defn process-file [file-name] > > (with-open > > [rdr (BufferedReader. (FileReader. file-name))] > > (doseq > > [line (line-seq rdr)] > > (reduce #(assoc %1 %2 (inc (get %1 %2 1))) dictionary (re-seq > > #"[a-z]+" (.toLowerCase line)))))) > > This does not change dictionary. The (assoc) call returns a new map. > The (reduce) call will ultimately return a new map (the result of > calling assoc several times). > > The lack of side-effects / assignment can be quite hard to get your > head around if you're not used to functional programming. > > As Ken noted, your (get) call needs to use 0 not one (since you always > (inc) it): > > user=> (reduce #(assoc %1 %2 (inc (get %1 %2 1))) {} ["these" "are" > "some" "words" "with" "some" "words" "repeated"]) > {"repeated" 2, "with" 2, "words" 3, "some" 3, "are" 2, "these" 2} > > With (get %1 %2 0) we get the correct result: > > user=> (reduce #(assoc %1 %2 (inc (get %1 %2 0))) {} ["these" "are" > "some" "words" "with" "some" "words" "repeated"]) > {"repeated" 1, "with" 1, "words" 2, "some" 2, "are" 1, "these" 1} > -- > Sean A Corfield -- (904) 302-SEAN > Railo Technologies, Inc. --http://getrailo.com/ > An Architect's View --http://corfield.org/ > > "If you're not annoying somebody, you're not really alive." > -- Margaret Atwood -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en