On Dec 2, 10:50 am, Johann Hibschman <joha...@gmail.com> wrote:
> I don't understand Clojure's space requirements when processing lazy
> sequences. Are there some rules-of-thumb that I could use to better
> predict what will use a lot of space?
>
> I have a 5.5 GB pipe-delimited data file, containing mostly floats (14
> M rows of 40 cols). I'd like to stream over that file, processing
> columns as I go, without holding the whole thing in RAM. As a first
> test, I'm trying to just split each row and count the total number of
> fields.
>
> Why does reduce seem to load in the whole file, yet test-split-4 not?
> Why does the if-let in test-split-3 vs test-split-3b make such a
> difference? And finally, is there any way I can parallelize this to
> use multiple cores without slurping in the whole file?
>
> If it matters, I'm using a snapshot of 1.1.0-alpha; the jar included
> with incanter.
>
> Here's the code:
>
> (def afile "/path/to/big/file")
>
> ;; Count the lines in the file.
> ;; 12.8 s, light memory use (0.8 GB).
> (defn test-count []
>   (with-open [rdr (duck-streams/reader afile)]
>     (count (line-seq rdr))))
>
> ;; Split and count.
> ;; 183.2 s, heavy memory use (8.6 GB).
> (defn test-split []
>   (with-open [rdr (duck-streams/reader afile)]
>     (reduce + (map #(count (.split %1 "\\|")) (line-seq rdr)))))
>
> ;; 190.8 s, heavy memory use (8.8 GB).
> (defn test-split-2 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (seq (map #(count (.split %1 "\\|")) (line-seq rdr)))
>            cnt 0]
>       (if counts
>         (recur (next counts) (+ cnt (first counts)))
>         cnt))))
>
> ;; Use rest instead, if-let (following http://clojure.org/lazy).
> ;; 166.1 s, light memory use (1.4 GB)
> (defn test-split-3 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if-let [s (seq counts)]
>         (recur (rest s) (+ cnt (first s)))
>         cnt))))
>
> ;; Try without the if-let.
> ;; 211.6 s, heavy memory use (8.7 GB). Surprise!
> (defn test-split-3b []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if (seq counts)
>         (recur (rest counts) (+ cnt (first counts)))
>         cnt))))
>
> ;; 160 s, light memory use. (1.5 GB)
> (defn test-split-4 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [lines (line-seq rdr)
>            cnt 0]
>       (if lines
>         (recur (next lines)
>                (+ cnt (count (.split (first lines) "\\|"))))
>         cnt))))
>
> ;; Parallel split and count.
> ;; Based on test-split-3, but using pmap.
> ;; 95.1 s, heavy memory use (8.7 GB)
> (defn test-psplit-1 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (pmap #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if-let [s (seq counts)]
>         (recur (rest s) (+ cnt (first s)))
>         cnt))))

After looking over the code, I'm inclined not to trust those numbers.
Every call to .split generates a lot of intermediate String instances,
which is likely where the memory is going. My guess is that the wildly
varying memory numbers reflect when the GC happened to kick in, rather
than how much data each version actually retains (note that the times
are all in the same ballpark).
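
One way to check this guess is to sample the heap right after a
requested collection while the loop runs, instead of watching the
process size. Below is a minimal sketch, not a definitive measurement:
used-heap-mb and count-fields are hypothetical names introduced here,
System/gc is only a hint to the JVM, and the loop mirrors test-split-4
rather than any of the versions that showed heavy use.

```clojure
(defn used-heap-mb
  "Estimated heap in use, in MB, after asking the JVM to collect.
  System/gc is a request, not a guarantee, so treat this as approximate."
  []
  (let [rt (Runtime/getRuntime)]
    (System/gc)
    (/ (- (.totalMemory rt) (.freeMemory rt)) 1048576.0)))

(defn count-fields
  "Sums the number of pipe-delimited fields over lines, printing the
  estimated live heap every sample-every lines (hypothetical helper)."
  [lines sample-every]
  (loop [s (seq lines), n 0, total 0]
    (if s
      (do
        ;; Sample only occasionally: a forced GC is expensive.
        (when (and (pos? n) (zero? (rem n sample-every)))
          (println n "lines," (used-heap-mb) "MB in use after GC"))
        (recur (next s)
               (inc n)
               (+ total (count (.split ^String (first s) "\\|")))))
      total)))
```

Called as (count-fields (line-seq rdr) 1000000) inside the with-open,
a figure that stays roughly flat would suggest the lines are being
collected as you go, while one that climbs steadily would suggest
something is holding on to the head of the sequence.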
