I don't understand Clojure's space requirements when processing lazy sequences. Are there some rules-of-thumb that I could use to better predict what will use a lot of space?
I have a 5.5 GB pipe-delimited data file containing mostly floats (14 M rows of 40 columns). I'd like to stream over that file, processing columns as I go, without holding the whole thing in RAM. As a first test, I'm just trying to split each row and count the total number of fields.

Why does reduce seem to load the whole file, while test-split-4 does not? Why does the if-let in test-split-3 (compared with the plain if in test-split-3b) make such a difference? And finally, is there any way I can parallelize this to use multiple cores without slurping in the whole file?

If it matters, I'm using a snapshot of 1.1.0-alpha, the jar included with Incanter.

Here's the code:

(require '[clojure.contrib.duck-streams :as duck-streams])

(def afile "/path/to/big/file")

;; Count the lines in the file.
;; 12.8 s, light memory use (0.8 GB).
(defn test-count []
  (with-open [rdr (duck-streams/reader afile)]
    (count (line-seq rdr))))

;; Split and count.
;; 183.2 s, heavy memory use (8.6 GB).
(defn test-split []
  (with-open [rdr (duck-streams/reader afile)]
    (reduce + (map #(count (.split %1 "\\|")) (line-seq rdr)))))

;; Explicit loop over the mapped seq, using next.
;; 190.8 s, heavy memory use (8.8 GB).
(defn test-split-2 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (seq (map #(count (.split %1 "\\|")) (line-seq rdr)))
           cnt 0]
      (if counts
        (recur (next counts) (+ cnt (first counts)))
        cnt))))

;; Use rest and if-let instead (following http://clojure.org/lazy).
;; 166.1 s, light memory use (1.4 GB).
(defn test-split-3 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))

;; Try without the if-let.
;; 211.6 s, heavy memory use (8.7 GB). Surprise!
(defn test-split-3b []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if (seq counts)
        (recur (rest counts) (+ cnt (first counts)))
        cnt))))

;; Loop over the lines directly and split inside the loop.
;; 160 s, light memory use (1.5 GB).
(defn test-split-4 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [lines (line-seq rdr)
           cnt 0]
      (if lines
        (recur (next lines)
               (+ cnt (count (.split (first lines) "\\|"))))
        cnt))))

;; Parallel split and count.
;; Based on test-split-3, but using pmap.
;; 95.1 s, heavy memory use (8.7 GB).
(defn test-psplit-1 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (pmap #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))
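
For anyone who wants to experiment without a multi-gigabyte file, here is a minimal, self-contained sketch of the test-split-4 pattern. The names fake-lines and count-fields are hypothetical and not part of the code above; the generated rows only stand in for (line-seq rdr), and no timing or memory figures are claimed for it.

```clojure
;; Hypothetical stand-in for (line-seq rdr): a lazy seq of n pipe-delimited
;; rows, each with `cols` float-like fields.
(defn fake-lines [n cols]
  (take n (repeatedly #(apply str (interpose "|" (repeat cols "1.5"))))))

;; Same shape as test-split-4: walk the lines with next and split each
;; row inside the loop, accumulating the total field count.
(defn count-fields [lines]
  (loop [lines (seq lines)
         cnt 0]
    (if lines
      (recur (next lines)
             (+ cnt (count (.split (first lines) "\\|"))))
      cnt)))

;; Total number of fields in 1000 rows of 40 columns.
(count-fields (fake-lines 1000 40))
```

Raising the row count makes it possible to compare this loop against the reduce and pmap variants on generated data rather than on the real file.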