Neat blog by the way.

On Monday, February 10, 2014 8:41:51 PM UTC-5, Rob Buhler wrote:
>
> Hi,
>
> I'm learning Clojure and I wrote a word-frequencies function that relies 
> heavily on clojure.core/frequencies (plus a little filtering)
>
> (ns topwords.core
>  (:require [clojure.java.io :as io]
>           [clojure.string :as str]))
> (def stop-words #{"other" "still" "again" "where" "could" "there"
>                   "their" "these" "those" "after" "while" "almost" "before"
>                   "through" "every" "being" "never" "should" "might" "thing"
>                   "among" "which" "would" "though" "about"})
> (defn get-words [line]
>   (re-seq #"\p{Alpha}+" line))
> (defn min-length [word]
>  (< 4 (count word)))
> (defn ignore-words [word]
>  (if-not (contains? stop-words word) word))
> (defn word-frequencies [filename]
>   (with-open [rdr (io/reader filename)]
>      (let [lines (line-seq rdr)
>            words (comp get-words str/lower-case)
>            preds (every-pred min-length ignore-words)]
>        (frequencies (filter preds (mapcat words lines))))))
>
>
> It works (you can see some output from it on my blog if you want - 
> http://robbuhler.blogspot.com/2014/02/word-frequencies-from-file.html)
>
> Anyway, my questions are:
>
>
> 1) Why do I not need a doall on the line-seq? What is forcing the evaluation 
> here?
>
>
> 2) I'm assuming this is still reading the entire file into memory at once? If
> so, how would I count the frequencies of a really large file without
> consuming so much memory?
>
>    I've thought about using doseq and, for each line, updating an atom that
>    holds a map, but I'm not sure if I'm on the right track here.
>
>       I'm just thinking of something like this (in Python):
>
>       d = {}
>       for i in xrange(100):
>           key = i % 10
>           if key in d:
>               d[key] += 1
>           else:
>               d[key] = 1
>
>    Can I somehow count all of the frequencies line by line and not use an 
> atom (or another ref type)?
>
>    I'm not looking for the ultimate performance code, just something that 
> would be considered idiomatic Clojure
>
>
>  Thanks,
>
>  Rob
>
>
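
On the two questions: frequencies is eager (it is built on reduce), so it walks the whole line-seq before with-open closes the reader, which is why no doall is needed. For counting line by line without an atom, here is a minimal sketch, assuming a plain reduce over the lines is acceptable. The names topwords.sketch, count-line and word-frequencies-reduce are made up for illustration, and the stop-word and length filtering is left out to keep it short.

```clojure
(ns topwords.sketch
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(defn count-line
  "Folds the words of a single line into the running map of counts."
  [counts line]
  (reduce (fn [m w] (update-in m [w] (fnil inc 0)))
          counts
          (re-seq #"\p{Alpha}+" (str/lower-case line))))

(defn word-frequencies-reduce
  "Counts words one line at a time; `source` is anything io/reader accepts."
  [source]
  (with-open [rdr (io/reader source)]
    ;; reduce is eager, so every line is consumed while the reader is open,
    ;; and only the counts map is carried from one line to the next.
    (reduce count-line {} (line-seq rdr))))

;; Example with an in-memory reader instead of a file:
;; (word-frequencies-reduce (java.io.StringReader. "the cat\nThe dog"))
;; => {"the" 2, "cat" 1, "dog" 1}
```

The accumulator passed through reduce plays the role of the dict in the Python loop, so no atom or other reference type is involved.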

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en