On Dec 2, 10:50 am, Johann Hibschman <joha...@gmail.com> wrote:
> I don't understand Clojure's space requirements when processing lazy
> sequences. Are there some rules-of-thumb that I could use to better
> predict what will use a lot of space?
>
> I have a 5.5 GB pipe-delimited data file, containing mostly floats (14
> M rows of 40 cols). I'd like to stream over that file, processing
> columns as I go, without holding the whole thing in RAM. As a first
> test, I'm trying to just split each row and count the total number of
> fields.
>
> Why does reduce seem to load in the whole file, yet test-split-4 not?
> Why does the if-let in test-split-3 vs test-split-3b make such a
> difference? And finally, is there any way I can parallelize this to
> use multiple cores without slurping in the whole file?
>
> If it matters, I'm using a snapshot of 1.1.0-alpha; the jar included
> with incanter.
>
> Here's the code:
>
> (defn afile "/path/to/big/file")
>
> ;; Count the lines in the file.
> ;; 12.8 s, light memory use (0.8 GB).
> (defn test-count []
>   (with-open [rdr (duck-streams/reader afile)]
>     (count (line-seq rdr))))
>
> ;; Split and count.
> ;; 183.2 s, heavy memory use (8.6 GB).
> (defn test-split []
>   (with-open [rdr (duck-streams/reader afile)]
>     (reduce + (map #(count (.split %1 "\\|")) (line-seq rdr)))))
>
> ;; 190.8 s, heavy memory use (8.8 GB).
> (defn test-split-2 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (seq (map #(count (.split %1 "\\|")) (line-seq rdr)))
>            cnt 0]
>       (if counts
>         (recur (next counts) (+ cnt (first counts)))
>         cnt))))
>
> ;; Use rest instead, if-let (following http://clojure.org/lazy.)
> ;; 166.1 s, light memory use (1.4 GB)
> (defn test-split-3 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if-let [s (seq counts)]
>         (recur (rest s) (+ cnt (first s)))
>         cnt))))
>
> ;; Try without the if-let.
> ;; 211.6 s, heavy memory use (8.7 GB). Surprise!
> (defn test-split-3b []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if (seq counts)
>         (recur (rest counts) (+ cnt (first counts)))
>         cnt))))
>
> ;; 160 s, light memory use. (1.5 GB)
> (defn test-split-4 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [lines (line-seq rdr)
>            cnt 0]
>       (if lines
>         (recur (next lines)
>                (+ cnt (count (.split (first lines) "\\|"))))
>         cnt))))
>
> ;; Parallel split and count.
> ;; Based on test-split-3, but using pmap.
> ;; 95.1 s, heavy memory use (8.7 GB)
> (defn test-psplit-1 []
>   (with-open [rdr (duck-streams/reader afile)]
>     (loop [counts (pmap #(count (.split %1 "\\|")) (line-seq rdr))
>            cnt 0]
>       (if-let [s (seq counts)]
>         (recur (rest s) (+ cnt (first s)))
>         cnt))))
After reading the code, I'm inclined to not trust those numbers. Note that the time metrics for test-split* are all in the same ballpark, creating the same number of superfluous, intermediate String instances, but the memory numbers you list are wildly different. How are you collecting these numbers? Have you controlled for the GC kicking in?
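To make the GC question concrete, here is a minimal sketch of one way the numbers could be collected with the collector taken into account. The names used-heap-mb and measure are hypothetical helpers, not part of Clojure or duck-streams. The sketch assumes that asking for a collection before each reading is good enough for a rough comparison; System/gc is only a request, so the JVM may ignore it, and the readings are live heap before and after the run, not the peak seen by the OS.

```clojure
;; Hypothetical helpers for illustration only.

(defn used-heap-mb
  "Requests a GC, then returns the heap currently in use, in megabytes.
  System/gc is a hint to the JVM, so treat the result as approximate."
  []
  (let [rt (Runtime/getRuntime)]
    (System/gc)
    (/ (- (.totalMemory rt) (.freeMemory rt)) 1048576.0)))

(defn measure
  "Calls the zero-argument function f and returns its result together with
  the elapsed wall-clock time and the used heap before and after the call."
  [f]
  (let [before  (used-heap-mb)
        start   (System/nanoTime)
        result  (f)
        seconds (/ (- (System/nanoTime) start) 1e9)
        after   (used-heap-mb)]
    {:result         result
     :seconds        seconds
     :heap-before-mb before
     :heap-after-mb  after}))
```

With the functions from the quoted post loaded, (measure test-split-3) and (measure test-split-3b) could then be compared on the same footing. A process size reported by the OS mostly shows how far the heap was allowed to grow, so running each test under a small fixed -Xmx and checking whether it finishes or throws OutOfMemoryError is probably a more direct check of whether the whole sequence is really being held.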