You could consider using a StreamTokenizer:

(import '(java.io StreamTokenizer BufferedReader FileReader))

(defn wordfreq [filename]
  (with-local-vars [words {}]
    ;; with-open closes the reader once counting is done
    (with-open [rdr (BufferedReader. (FileReader. filename))]
      (let [st (StreamTokenizer. rdr)]
        (loop [tt (.nextToken st)]
          (when (not= tt StreamTokenizer/TT_EOF)
            (when (= tt StreamTokenizer/TT_WORD)
              (let [w (.toLowerCase (.sval st))]
                (var-set words (assoc @words w (inc (@words w 0))))))
            (recur (.nextToken st))))))
    ;; print [count word] pairs, most frequent first
    (println (reverse (sort (map (fn [[k v]] [v k]) @words))))))


For me it was faster (even ignoring output):
user=> (time (wordfreq "wordfreq.txt"))
"Elapsed time: 444.171796 msecs"
user=> (time (top-words "wordfreq.txt" "out.txt"))
"Elapsed time: 618.196978 msecs"

Obviously, if you wanted to take this approach, you could rework it to
use your existing printer for a better comparison.
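
As a rough sketch of that rework (the names word-counts and
sorted-counts are made up for illustration, they are not from the
thread), the counting can return data instead of printing, so any
printer can be applied to the result:

```clojure
(import '(java.io StreamTokenizer BufferedReader FileReader))

(defn word-counts
  "Returns a map of lower-cased word -> count, read from rdr (any java.io.Reader)."
  [rdr]
  (let [st (StreamTokenizer. rdr)]
    ;; Carry the counts through the loop rather than mutating a local var.
    (loop [counts {}
           tt (.nextToken st)]
      (cond
        (= tt StreamTokenizer/TT_EOF)
        counts

        (= tt StreamTokenizer/TT_WORD)
        (let [w (.toLowerCase (.sval st))]
          (recur (assoc counts w (inc (get counts w 0))) (.nextToken st)))

        ;; Numbers, quoted strings and single characters are skipped.
        :else
        (recur counts (.nextToken st))))))

(defn sorted-counts
  "Returns [count word] pairs for the file, most frequent first."
  [filename]
  (with-open [rdr (BufferedReader. (FileReader. filename))]
    (reverse (sort (map (fn [[k v]] [v k]) (word-counts rdr))))))
```

Something like (your-printer (sorted-counts "wordfreq.txt")) would then
time only the counting plus whatever printer you already have.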

Interestingly when I compared 3 implementations:

1) by Chouser here:
http://groups.google.com/group/clojure/browse_thread/thread/d03e75812de6c6e2/5c47c243474c999d?lnk=gst&q=sort+by+value#5c47c243474c999d
2) top-words as described
3) Using a StreamTokenizer

I get 3 different histograms using a test file! All very similar but
slightly different. It is probably largely related to my test file
having line endings from a different platform... which shows that word
counting is not necessarily a cut-and-dried thing! Hahahaha, so just
how many words are in this file??? :)
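
Line endings aside, the tokenizing rule itself may also contribute. A
default StreamTokenizer treats ' and " as quote characters, / as a
comment character, and parses numbers, so it will not agree with a
plain regex split on text that contains those. The sketch below uses a
made-up sample string and hypothetical helper names, and is only meant
to show the effect, not to explain the actual differences in my file:

```clojure
(import '(java.io StreamTokenizer StringReader))
(require '[clojure.string :as str])

(defn tokenizer-words
  "Lower-cased TT_WORD tokens of s, using StreamTokenizer's default syntax table."
  [s]
  (let [st (StreamTokenizer. (StringReader. s))]
    (loop [acc []
           tt (.nextToken st)]
      (cond
        (= tt StreamTokenizer/TT_EOF) acc
        (= tt StreamTokenizer/TT_WORD)
        (recur (conj acc (.toLowerCase (.sval st))) (.nextToken st))
        :else (recur acc (.nextToken st))))))

(defn regex-words
  "Lower-cased runs of ASCII letters in s."
  [s]
  (map str/lower-case (re-seq #"[A-Za-z]+" s)))

;; Hypothetical sample containing an apostrophe and a slash.
(def sample "The cat's hat and/or the dog")

;; The apostrophe opens a quoted string that swallows the rest of the
;; line, so the tokenizer sees far fewer words than the regex does.
(println (frequencies (tokenizer-words sample)))
(println (frequencies (regex-words sample)))
```

Neither count is "the" right one; they just answer slightly different
questions about what a word is.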

Regards,
Tim.

