Just a quick comment on a similar, more general issue. I need to parse gigabyte-sized files of multi-line JSON (a similar problem to parsing gigabytes of XML), where the record delimiter is not a newline. My strategy is to find record boundaries (e.g. by counting nesting levels), split the input into chunks at those boundaries, and then parse the JSON in each chunk. This should scale better, because the files are quite large, and it would let me use Hadoop or Cascalog eventually.
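For what it's worth, the nesting-count idea could be sketched roughly like this. It is only a sketch under assumptions: every record is a top-level JSON object or array, top-level scalars are ignored, an unterminated trailing record is dropped, and the names `step`, `char-seq`, `json-chunks` and the path `big.json` are made up for illustration. It also tracks string and escape state, so braces inside string values are not counted.

```clojure
(ns example.json-chunks
  (:require [clojure.java.io :as io])
  (:import (java.io Reader)))

(def ^:private initial-state
  {:depth 0 :in-string? false :escaped? false})

(defn- step
  "Advance the scanner state by one character."
  [{:keys [in-string? escaped?] :as state} ch]
  (cond
    escaped?                       (assoc state :escaped? false)
    (and in-string? (= ch \\))     (assoc state :escaped? true)
    (= ch \")                      (update state :in-string? not)
    in-string?                     state
    (or (= ch \{) (= ch \[))       (update state :depth inc)
    (or (= ch \}) (= ch \]))       (update state :depth dec)
    :else                          state))

(defn char-seq
  "Lazy seq of the characters read from rdr."
  [^Reader rdr]
  (lazy-seq
    (let [c (.read rdr)]
      (when-not (neg? c)
        (cons (char c) (char-seq rdr))))))

(defn json-chunks
  "Lazy seq of strings, one per top-level JSON object or array found in
  chars. Text between records (whitespace, commas) is skipped."
  [chars]
  (lazy-seq
    (loop [cs    (seq chars)
           state initial-state
           sb    (StringBuilder.)]
      (when cs
        (let [ch         (first cs)
              next-state (step state ch)
              ;; inside a record if the depth is non-zero before or after ch
              in-record? (or (pos? (:depth state))
                             (pos? (:depth next-state)))]
          (when in-record?
            (.append sb (char ch)))
          (if (and in-record? (zero? (:depth next-state)))
            ;; the record just closed: emit it, continue lazily with the rest
            (cons (str sb) (json-chunks (rest cs)))
            (recur (next cs) next-state sb)))))))

(comment
  ;; Hypothetical usage: hand each chunk to whichever JSON parser you use.
  (with-open [rdr (io/reader "big.json")]
    (doseq [chunk (json-chunks (char-seq rdr))]
      (println (count chunk)))))
```

Each chunk is an independent JSON document, so the parsing step could later be farmed out per chunk. The scan itself is still sequential, since the nesting depth at any offset depends on everything before it.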
Has anyone attempted anything similar to achieve the "chunking"?

Thanks,
Avram

On May 31, 1:33 pm, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
> Forgot to mention some things:
>
> - https://github.com/alamar/clojure-xml-stream on github.
> - I'm yet to figure out that Lenin thing, so ant.
> - The two-step handler system (there's a function that takes a method
> and returns a handler, and handler accepts item being constructed and
> stream-reader) seems suboptimal, maybe I'll figure it out later.
>
> On May 31, 23:25, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
> > Hi *! I've tried a few searches on parsing XML files larger than
> > memory, didn't find anything and wrote a simple framework for parsing
> > XML via STAX to lazy sequence of defrecords. It is therefore capable
> > of reading several GB of xml without much problems. It is quite
> > declarative but also quite ugly.
> >
> > Take a peek:
> > (technical babble after the fold)
> >
> > $ git clone git://github.com/alamar/clojure-xml-stream.git
> > $ ant
> >
> > It turns this completely-invented XML:
> >
> > <ground>
> >   <tree-species>
> >     <tree id="1"><name>Pine</name></tree>
> >     <tree id="2"><name>Birch</name></tree>
> >     <tree id="4"><name>Palmtree</name></tree>
> >   </tree-species>
> >   <forests>
> >     <forest id="1">
> >       <name>Red Forest</name>
> >       <trees>
> >         <tree refid="1"><branch direction="left"/><branch direction="south"/></tree>
> >         <tree refid="2"><branch direction="right"/><branch direction="south"/><branch direction="west"/></tree>
> >         <tree refid="1"><branch direction="southwest"/></tree>
> >       </trees>
> >     </forest>
> >     <forest id="2">
> >       <name>Dark Forest</name>
> >       <trees>
> >         <tree refid="2"><branch direction="right"/><branch direction="south"/><branch direction="west"/></tree>
> >         <tree refid="4"><branch direction="northwest"/></tree>
> >       </trees>
> >     </forest>
> >   </forests>
> > </ground>
> >
> > into a lazy sequence of:
> >
> > #:example.TreeSpecies{:id 1, :name Pine}
> > #:example.TreeSpecies{:id 2, :name Birch}
> > #:example.TreeSpecies{:id 4, :name Palmtree}
> > #:example.Forest{:id 1, :trees [#:example.Tree{:species-id
> > 1, :branches (:left :south)} #:example.Tree{:species-id 2, :branches
> > (:right :south :west)} #:example.Tree{:species-id 1, :branches
> > (:southwest)}], :name Red Forest}
> > #:example.Forest{:id 2, :trees [#:example.Tree{:species-id
> > 2, :branches (:right :south :west)} #:example.Tree{:species-id
> > 4, :branches (:northwest)}], :name Dark Forest}
> >
> > using this code:
> >
> > (defrecord TreeSpecies [id name])
> > (defrecord Forest [id trees name])
> > (defrecord Tree [species-id branches])
> >
> > (defmulti ground-element first-arg)
> >
> > (defmulti tree-element first-arg)
> >
> > (defmethod ground-element :tree [_ stream-reader]
> >   (TreeSpecies. (attribute-value stream-reader "id") nil))
> >
> > (defmethod ground-element [:TreeSpecies :name] [_ stream-reader tree]
> >   (assoc tree :name (element-text stream-reader)))
> >
> > (defmethod ground-element :forest [_ stream-reader]
> >   (Forest. (attribute-value stream-reader "id") [] nil))
> >
> > (defmethod ground-element [:Forest :name] [_ stream-reader forest]
> >   (assoc forest :name (element-text stream-reader)))
> >
> > (defmethod ground-element [:Forest :tree] [_ stream-reader forest]
> >   (assoc forest :trees
> >     (conj (:trees forest)
> >       (Tree. (attribute-value stream-reader "refid")
> >         (dispatch-partial stream-reader
> >           (element-struct-handler tree-element))))))
> >
> > (defmethod tree-element :branch [_ stream-reader]
> >   (keyword (attribute-value stream-reader "direction")))
> >
> > (defmethod ground-element :default [token & whatever]
> >   (comment println token))
> > (defmethod tree-element :default [token & whatever]
> >   (comment println token))
> >
> > (defn run [path]
> >   (with-open [input-stream (FileInputStream. path)]
> >     (let [handler (element-struct-handler ground-element)
> >           objects (parse-dispatch input-stream handler)]
> >       (doseq [object objects] (println object)))))
> >
> > How it works: it reads elements and calls a method with
> > the :ElementName
> > If the method returns a record, it stuffs anything found in that
> > element into this record.
> > It can handle nested structures because it can parse subtrees (there
> > is an example in code).
> > The handler have to read events from stax (to get text nodes, for
> > example), the only limitation is that handler should never iterate
> > past END_ELEMENT of the element it was called on (or the parser would
> > become confused).
> >
> > The syntax and the way I call assoc seem ugly to me, so all
> > suggestions are welcome.
> > Suggestions on naming and general architecture are welcome too.
> > Maybe this can grow into something generally usable.
> >
> > Feel free to fork, use and complain.