You could consider using a StreamTokenizer:

    (import '(java.io StreamTokenizer BufferedReader FileReader))

    (defn wordfreq [filename]
      (with-local-vars [words {}]
        (let [st (StreamTokenizer. (BufferedReader. (FileReader. filename)))]
          (loop [tt (.nextToken st)]
            (when (not= tt StreamTokenizer/TT_EOF)
              (if (= tt StreamTokenizer/TT_WORD)
                (let [w (.toLowerCase (.sval st))]
                  (var-set words (assoc @words w (inc (@words w 0))))))
              (recur (.nextToken st)))))
        (println (reverse (sort (map (fn [[k v]] [v k]) @words))))))

For me it was faster (even ignoring output):

    user=> (time (wordfreq "wordfreq.txt"))
    "Elapsed time: 444.171796 msecs"
    user=> (time (top-words "wordfreq.txt" "out.txt"))
    "Elapsed time: 618.196978 msecs"

Obviously, if you wanted to take this approach, you could rework it to use your existing printer for a better comparison.

Interestingly, when I compared 3 implementations:

1) by Chouser here: http://groups.google.com/group/clojure/browse_thread/thread/d03e75812de6c6e2/5c47c243474c999d?lnk=gst&q=sort+by+value#5c47c243474c999d
2) top-words as described
3) using a StreamTokenizer

I get 3 different histograms using a test file! All very similar, but slightly different. It is probably largely related to my test file having the opposite architecture's newlines... which shows that word counting is not necessarily a cut-and-dried thing! Hahahaha, so just how many words are in this file??? :)

Regards,
Tim.
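
P.S. To illustrate how the definition of a "word" alone can change a histogram, here is a minimal sketch. It does not reproduce any of the three implementations above: the sample string and the two helpers (freq-by-whitespace, freq-by-word-chars) are made up for illustration, and I have not checked which rule each real implementation follows.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample text, not taken from the test file.
(def sample "It's a dog-eat-dog world. It's true!")

(defn freq-by-whitespace
  "Counts lower-cased tokens separated by runs of whitespace.
  Punctuation stays attached to the token."
  [s]
  (frequencies (str/split (str/lower-case s) #"\s+")))

(defn freq-by-word-chars
  "Counts lower-cased runs of letters, digits and underscores.
  Apostrophes and hyphens split a token."
  [s]
  (frequencies (re-seq #"\w+" (str/lower-case s))))

;; Same input, two different totals.
(println (reduce + (vals (freq-by-whitespace sample))))  ; prints 6
(println (reduce + (vals (freq-by-word-chars sample))))  ; prints 10
```

Under the first rule "it's" is one word seen twice; under the second it becomes "it" and "s", and "dog-eat-dog" becomes three tokens. Neither is wrong, they just answer slightly different questions.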