I don't understand Clojure's space requirements when processing lazy
sequences. Are there some rules-of-thumb that I could use to better
predict what will use a lot of space?

I have a 5.5 GB pipe-delimited data file, containing mostly floats (14
million rows of 40 columns). I'd like to stream over that file,
processing columns as I go, without holding the whole thing in RAM. As
a first test, I'm just splitting each row and counting the total
number of fields.
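
To make that operation concrete, here is a minimal sketch of it on a
made-up three-row sample held in memory. The names sample-rows,
count-fields and total-fields are invented for illustration and do not
appear in the real code, and the data is hypothetical. It has the same
shape as test-split further down, just reading from a string instead
of the big file.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical three-row sample standing in for the real 5.5 GB file.
(def sample-rows "1.0|2.5|3.75\n4.0|5.5|6.25\n7.0|8.5|9.75")

(defn count-fields
  "Number of pipe-delimited fields in one line."
  [line]
  (count (.split line "\\|")))

(defn total-fields
  "Sum of the field counts over every line readable from rdr."
  [rdr]
  (reduce + (map count-fields (line-seq rdr))))

;; Three rows of three fields each, so this evaluates to 9.
(with-open [rdr (BufferedReader. (StringReader. sample-rows))]
  (total-fields rdr))
```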

Why does the reduce in test-split seem to load the whole file, while
the loop in test-split-4 does not? Why does the if-let in test-split-3
vs. the plain if in test-split-3b make such a difference? And finally,
is there any way I can parallelize this to use multiple cores without
slurping in the whole file?

If it matters, I'm using a snapshot of Clojure 1.1.0-alpha (the jar
included with Incanter).

Here's the code:

(require '[clojure.contrib.duck-streams :as duck-streams])

(def afile "/path/to/big/file")

;; Count the lines in the file.
;; 12.8 s, light memory use (0.8 GB).
(defn test-count []
  (with-open [rdr (duck-streams/reader afile)]
    (count (line-seq rdr))))

;; Split and count.
;; 183.2 s, heavy memory use (8.6 GB).
(defn test-split []
  (with-open [rdr (duck-streams/reader afile)]
    (reduce + (map #(count (.split %1 "\\|")) (line-seq rdr)))))

;; 190.8 s, heavy memory use (8.8 GB).
(defn test-split-2 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (seq (map #(count (.split %1 "\\|"))
                            (line-seq rdr)))
           cnt 0]
      (if counts
        (recur (next counts) (+ cnt (first counts)))
        cnt))))

;; Use rest and if-let instead (following http://clojure.org/lazy).
;; 166.1 s, light memory use (1.4 GB)
(defn test-split-3 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))

;; Try without the if-let.
;; 211.6 s, heavy memory use (8.7 GB). Surprise!
(defn test-split-3b []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if (seq counts)
        (recur (rest counts) (+ cnt (first counts)))
        cnt))))

;; 160 s, light memory use. (1.5 GB)
(defn test-split-4 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [lines (line-seq rdr)
           cnt 0]
      (if lines
        (recur (next lines)
               (+ cnt (count (.split (first lines) "\\|"))))
        cnt))))

;; Parallel split and count.
;; Based on test-split-3, but using pmap.
;; 95.1 s, heavy memory use (8.7 GB)
(defn test-psplit-1 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (pmap #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))
