Re: What am I not getting here?

new2clojure Fri, 07 Jan 2011 05:54:29 -0800

Thank you all for your help, I appreciate it very much. I come from
the world of OO programming and I am bending my head backwards to try
to think in the functional paradigm.


Bellow you will find my version of Peter Norvig's spell checker
(http://norvig.com/spell-correct.html) in clojure, followed by the
java version. I coded the later first and then I attempted to
"translate" it. In a few months time I will try to implement it again
from scratch (in clojure) and I will be able to check my functional
mindset evolution.

Note: The reason for initializing the values of the dictionary to one
is explained in Norvig's essay

===== C L O J U R E =====

(ns SpellChecker)

;; Version of http://norvig.com/spell-correct.html by
'miguel.arre...@gmail.com'
;; 7th January 2011

(import (java.io BufferedReader FileReader))

(defn build-dictionary [file-name]
  "Returns a \"dictionary{'word'} = frequency\" mapping
   each of the words in the file pointed to by 'file-name' to the
   number of its occurrences.
   Thank you 'Jason Wolfe' for the hints on how to produce the
dictionary
   (http://groups.google.com/group/clojure/browse_thread/thread/
a1265d60229b8cd5)"
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce #(assoc %1 %2 (inc (get %1 %2 1)))
      {}
      (mapcat
        #(re-seq #"[a-z]+" (.toLowerCase %))
        (line-seq rdr)))))

(defn- alpha []
  "Explicit alphabet a-z (English)"
  "abcdefghijklmnopqrstuvwxyz")

(defn- deletes [word]
  "Words derivated from word which are missing a letter"
  (for [i (range (count word))] (str (subs word 0 i) (subs word (inc
i)))))

(defn- transposes [word]
  "Words derivated from word in which two adjacent letters are
swapped"
  (for [i (range (dec (count word)))]
    (let [i1 (inc i), i2 (inc i1)]
        (str (subs word 0 i) (subs word i1 i2) (subs word i i1) (subs
word i2)))))

(defn- replaces [word]
  "Words derivated from word which are miss-spelled by one letter"
  (for [i (range (count word)), j (alpha)]
    (str (subs word 0 i) j (subs word (inc i)))))

(defn- inserts [word]
  "Words derivated from word which contain an additional letter"
  (for [i (range (count word)), j (alpha)]
    (str (subs word 0 i) j (subs word i))))

(defn variations [word]
  "Returns all the words at edit distance 1 from word
   (includes: deletes, transposes, replaces and inserts)"
  (concat (replaces word) (inserts word) (deletes word) (transposes
word)))

(defn check [dict word]
  "Checks the word agains the dictionary, returning the variation
   at edit distances 1 and 2 that has the highest frequency (may be
   the word itself if not found in the dictionary)"
  (if (dict word)
    word
    (let [edit1 (distinct (filter #(dict %) (variations word))),
          edit2 (distinct (filter #(dict %) (map #(variations %)
edit1))),
          candidates (reduce #(assoc %1 (dict %2 0) %2) (sorted-map-by
>) (into edit1 edit2))]
       (if (empty? candidates) word (val (first candidates))))))

;; Main

(def dictionary (build-dictionary "big.txt"))
(check dictionary "amunition")

;; End


===== J A V A =====

package power;


import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class SpellChecker {

        private Map<String, Integer> dictionary;


        public SpellChecker (String fileName) throws IOException {

                File fileHndl = new File(fileName);
                System.err.println(String.format(
                    "Loading dictionary '%s' (%d bytes)",
                    fileName, fileHndl.length()
                ));
                long startTime = System.currentTimeMillis();
                dictionary = new HashMap<String, Integer>();
                BufferedReader in = new BufferedReader(new 
FileReader(fileName));
                Pattern pattern = Pattern.compile("\\w+");
                for (String line=null; null != (line=in.readLine()); ) {

                        Matcher matcher = pattern.matcher(line.toLowerCase());
                        while (matcher.find()) {

                                String word = matcher.group();
                                Integer frequency = dictionary.get(word);
                                if (null == frequency) {

                                        frequency = Integer.valueOf(1);
                                }
                                else {

                                        frequency = 
Integer.valueOf(frequency.intValue() + 1);
                                }
                                dictionary.put(word, frequency);
                        }
                }
                in.close();
                System.err.println(String.format(
                    "Loaded, %d entries, %d millis",
                    dictionary.size(), (System.currentTimeMillis() - startTime)
            ));
        }


        private final List<String> variations (String word) {

                // Words at distance == 1 from the original word
                //
                List<String> variations = new ArrayList<String>();
                StringBuilder sb = new StringBuilder();
                for (int i=0; i < word.length(); i++) {

                        // Deletes
                        sb.setLength(0);
                        sb.append(word.substring(0, i));
                        sb.append(word.substring(i + 1));
                        variations.add(sb.toString());
                }
                for (int i=0; i < word.length() - 1; i++) {

                        // Transposes
                        sb.setLength(0);
                        sb.append(word.substring(0, i));
                        sb.append(word.substring(i + 1, i + 2));
                        sb.append(word.substring(i, i + 1));
                        sb.append(word.substring(i + 2));
                        variations.add(sb.toString());
                }
                for (int i=0; i < word.length(); i++) {

                        for (char c='a'; c <= 'z'; c++) {

                                // Replaces
                                sb.setLength(0);
                                sb.append(word.substring(0, i));
                                sb.append(String.valueOf(c));
                                sb.append(word.substring(i + 1));
                                variations.add(sb.toString());
                        }
                }
                for (int i=0; i <= word.length(); i++) {

                        for (char c='a'; c <= 'z'; c++) {

                                // Inserts
                                sb.setLength(0);
                                sb.append(word.substring(0, i));
                                sb.append(String.valueOf(c));
                                sb.append(word.substring(i));
                                variations.add(sb.toString());
                        }
                }
                return variations;
        }


        public final String check (String word) {

                if (dictionary.containsKey(word)) {

                        return word;
                }

                Map<Integer, String> candidates = new HashMap<Integer, 
String>();
                List<String> variationsOfWord = variations(word);
                for (String var: variationsOfWord) {

                        if (dictionary.containsKey(var)) {

                                candidates.put(dictionary.get(var), var);
                        }
                }
                if (candidates.size() > 0) {

                        return 
candidates.get(Collections.max(candidates.keySet()));
                }

                for (String var: variationsOfWord) {

                        for (String varvar: variations(var)) {

                                // Edit distance of 2, only keep real words
                                if (dictionary.containsKey(varvar)) {

                                        candidates.put(dictionary.get(varvar), 
varvar);
                                }
                        }
                }

                return candidates.size() > 0?
candidates.get(Collections.max(candidates.keySet())) : word;
        }


        public static void main (String args[]) throws IOException {

                // Roger Mitton's Birkbeck spelling error corpus from the Oxford
Text Archive.
                SpellChecker spellChecker = new SpellChecker("big.txt");
                BufferedReader brin = new BufferedReader(new
InputStreamReader(System.in));
                for ( ; ; ) {

                        System.out.print("Spell [ctrl^C to end]: ");
                        String word = brin.readLine();
                        if (word != null && word.length() > 0) {

                                String nword = spellChecker.check(word);
                                if (nword.equals(word)) {

                                        System.out.println("  > Correct!");
                                }
                                else {

                                        System.out.println("  > Did you mean? " 
+ nword);
                                }
                        }
                }
        }
}


On Jan 6, 10:59 pm, Sean Corfield <seancorfi...@gmail.com> wrote:
> On Thu, Jan 6, 2011 at 10:49 AM, new2clojure <miguel.arre...@gmail.com> wrote:
> > I am trying to store into a map the frequency of each [a-z]+ word in a
> > file. When I run this code in the repl the resulting dictionary is
> > empty. How should I adapt my code to get this functionality right?.
>
> Other people have offered up alternate solutions that should get you
> pointed in the right direction but I wanted to highlight an important
> issue which may be tripping you up, depending on your background:
>
> > (def dictionary {})
>
> This defines dictionary as an empty map. dictionary is immutable.
>
> > (defn process-file [file-name]
> >  (with-open
> >    [rdr (BufferedReader. (FileReader. file-name))]
> >    (doseq
> >      [line (line-seq rdr)]
> >      (reduce #(assoc %1 %2 (inc (get %1 %2 1))) dictionary (re-seq
> > #"[a-z]+" (.toLowerCase line))))))
>
> This does not change dictionary. The (assoc) call returns a new map.
> The (reduce) call will ultimately return a new map (the result of
> calling assoc several times).
>
> The lack of side-effects / assignment can be quite hard to get your
> head around if you're not used to functional programming.
>
> As Ken noted, your (get) call needs to use 0 not one (since you always
> (inc) it):
>
> user=> (reduce #(assoc %1 %2 (inc (get %1 %2 1))) {} ["these" "are"
> "some" "words" "with" "some" "words" "repeated"])
> {"repeated" 2, "with" 2, "words" 3, "some" 3, "are" 2, "these" 2}
>
> With (get %1 %2 0) we get the correct result:
>
> user=> (reduce #(assoc %1 %2 (inc (get %1 %2 0))) {} ["these" "are"
> "some" "words" "with" "some" "words" "repeated"])
> {"repeated" 1, "with" 1, "words" 2, "some" 2, "are" 1, "these" 1}
> --
> Sean A Corfield -- (904) 302-SEAN
> Railo Technologies, Inc. --http://getrailo.com/
> An Architect's View --http://corfield.org/
>
> "If you're not annoying somebody, you're not really alive."
> -- Margaret Atwood

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: What am I not getting here?

Reply via email to