Hi Alex,

On Wed, Jan 6, 2010 at 9:06 AM, Alex Ott <alex...@gmail.com> wrote:
> Hello all
>
> I have question about processing big XML files with lazy-xml. I'm trying to
> analyze StackOverflow dumps with Clojure, and when analyzing 1.6Gb XML file
> with posts, i get java stack overflow, although i provide enough memory for
> java (1Gb of heap).
Someone asked this question a while back, and a suggestion given was to use
Mark Triggs' XOM wrapper:

http://github.com/marktriggs/xml-picker-seq

Thread:
http://groups.google.com/group/clojure/browse_thread/thread/365ca7aaaf8d55b7?pli=1

Cheers,
Graham

>
> My code looks following way
>
>
> (ns stackoverflow
>   (:import java.io.File)
>   (:use clojure.contrib.lazy-xml))
>
> (def so-base "..../data-sets/stack-overflow/2009-12/122009 SO")
>
> (def posts-file (File. (str so-base "/posts.xml")))
>
> (defn count-post-entries [xml]
>   (loop [counter 0
>          lst xml]
>     (if (nil? lst)
>       counter
>       (let [elem (first lst)
>             rst (rest lst)]
>         (if (and (= (:type elem) :start-element) (= (:name elem) :row))
>           (recur (+ 1 counter) rst)
>           (recur counter rst))))))
>
> and run it with
>
> (stackoverflow/count-post-entries (clojure.contrib.lazy-xml/parse-seq
> stackoverflow/posts-file))
>
> I don't collect real data here, so i expect, that clojure will discard
> already processed data.
>
> The same problem with stack overflow happens, when i use reduce:
>
> (reduce (fn [counter elem]
>           (if (and (= (:type elem) :start-element) (= (:name elem) :row))
>             (+ 1 counter)
>             counter))
>         0 (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))
>
> So, question is open - how to process big xml files in constant space? (if
> I won't collect much data during processing)
>
> --
> With best wishes, Alex Ott, MBA
> http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/
> http://alexott-ru.blogspot.com/
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>
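If pulling in a wrapper library is not an option, another way to count the
rows in roughly constant space is to drive the JDK's own StAX pull parser
(javax.xml.stream) directly through Java interop. The sketch below is only an
illustration under some assumptions: it is neither lazy-xml nor
xml-picker-seq, it assumes a JVM that ships StAX (Java 6 or later) and the
current `^` type-hint syntax, and the names `so-count` and `count-elements`
are made up for this example. As a side note, `rest` never returns nil, so a
loop that walks a seq with `rest` needs `(seq lst)` or `next` rather than
`(nil? lst)` to terminate.

```clojure
(ns so-count
  (:import (javax.xml.stream XMLInputFactory XMLStreamConstants)))

(defn count-elements
  "Counts the start tags named tag-name in the XML read from reader.
  Only the parser's current event is held at any time, so memory use
  does not grow with the size of the document."
  [^java.io.Reader reader tag-name]
  (let [xml (.createXMLStreamReader (XMLInputFactory/newInstance) reader)]
    (try
      (loop [counter 0]
        (if (.hasNext xml)
          ;; .next advances the parser and returns the event type as an int
          (if (and (= (.next xml) XMLStreamConstants/START_ELEMENT)
                   (= (.getLocalName xml) tag-name))
            (recur (inc counter))
            (recur counter))
          counter))
      (finally
        (.close xml)))))

;; Small in-memory check; for a real dump, pass a reader opened on the file,
;; e.g. (java.io.FileReader. posts-file), and close it afterwards.
(count-elements
 (java.io.StringReader.
  "<posts><row Id=\"1\"/><row Id=\"2\"/><other/></posts>")
 "row")
```

Because the loop only keeps a counter and the parser's current position,
nothing retains the events that have already been processed.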