Hello all, I have a question about processing big XML files with lazy-xml. I'm trying to analyze the StackOverflow dumps with Clojure, and when analyzing the 1.6 GB XML file with posts, I get a Java StackOverflowError, although I give Java plenty of memory (1 GB of heap).
My code looks like this:

    (ns stackoverflow
      (:import java.io.File)
      (:use clojure.contrib.lazy-xml))

    (def so-base "..../data-sets/stack-overflow/2009-12/122009 SO")
    (def posts-file (File. (str so-base "/posts.xml")))

    (defn count-post-entries [xml]
      (loop [counter 0
             lst xml]
        ;; note: (nil? lst) would never fire here, because rest returns ()
        ;; rather than nil at the end of a seq
        (if (empty? lst)
          counter
          (let [elem (first lst)
                rst (rest lst)]
            (if (and (= (:type elem) :start-element)
                     (= (:name elem) :row))
              (recur (inc counter) rst)
              (recur counter rst))))))

and I run it with:

    (stackoverflow/count-post-entries
      (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

I don't collect any real data here, so I expect Clojure to discard the already-processed elements. The same stack overflow happens when I use reduce:

    (reduce (fn [counter elem]
              (if (and (= (:type elem) :start-element)
                       (= (:name elem) :row))
                (inc counter)
                counter))
            0
            (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

So the question remains open: how can I process big XML files in constant space (given that I won't accumulate much data during processing)?

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/ http://alexott-ru.blogspot.com/
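As a point of comparison, the same count can be done in genuinely constant space by bypassing lazy sequences entirely and driving the JDK's built-in StAX pull parser (javax.xml.stream) through interop. This is only a sketch of an alternative technique, not the lazy-xml API; the function name `count-rows-stax` is my own invention:

```clojure
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants))

(defn count-rows-stax
  "Counts start-elements named \"row\" in the XML read from `in`,
  pulling one StAX event at a time so memory use stays constant."
  [^java.io.InputStream in]
  (let [reader (.createXMLStreamReader (XMLInputFactory/newInstance) in)]
    (try
      (loop [n 0]
        (if (.hasNext reader)
          ;; .next advances the cursor and returns the event type as an int;
          ;; .getLocalName is only valid while positioned on a START_ELEMENT
          (recur (if (and (= (.next reader) XMLStreamConstants/START_ELEMENT)
                          (= "row" (.getLocalName reader)))
                   (inc n)
                   n))
          n))
      (finally (.close reader)))))
```

For the posts dump it would be called as, e.g., `(count-rows-stax (java.io.FileInputStream. posts-file))`. The trade-off is that this is imperative cursor-style code rather than a seq, but nothing is ever retained between events.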
-- 
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en