Thank you everyone for your advice. I found it useful and think that I am
part-way to a solution using clojure.data.xml/source-seq, as suggested by
danneu.

I'll post what I have done so far in the hope it might help someone else... 
comments on style welcome.

*Solution*:

Given the following XML,

<head>
  <title>This is some text</title>
  <body>
     <h1>This is a header</h1>
  </body>
</head>

data.xml/source-seq will return a lazy seq of data.xml.Event items 

#clojure.data.xml.Event{:type :start-element, :name :head, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is some text}
#clojure.data.xml.Event{:type :end-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is a header}
#clojure.data.xml.Event{:type :end-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :head, :attrs nil, :str nil}

This is perfect for finding elements with a particular name, but on its own
it does not let me find an element based on its location, because an event
carries no information about its ancestors. So I maintain a stack: each
:start-element pushes the element name and each :end-element pops it.

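(require 'clojure.string)

;; xml is assumed to be the lazy seq of events returned by data.xml/source-seq.
;; doseq returns nil; matches are printed as a side effect.
(let [stack (atom []) ; names of the currently open elements, outermost first
      search-pattern "vmware/collectionHost/Object/Property/Property"]
  (doseq [x (take 100 xml)] ; just test with the first 100 events in the seq
    (case (:type x)
      :start-element (swap! stack conj (name (:name x)))
      :end-element   (swap! stack pop)
      nil) ; other events (e.g. :characters) leave the stack unchanged
    ;; Prints once for every event seen while the stack matches the pattern.
    (when (= search-pattern (clojure.string/join "/" @stack))
      (println (clojure.string/join "/" @stack)))))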

This is a work in progress and does not take account of attributes on the 
elements, but I would appreciate any comments.
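
One way the same stack might be threaded through the sequence without an
atom is sketched below. This is a minimal sketch, not something tested
against the 500MB file: step, with-paths and text-at are hypothetical helper
names, and sample-events is a hand-written stand-in for the events listed
above (plain maps rather than Event records, with the whitespace-only event
left out). It assumes only that each event has :type, :name and :str keys.

```clojure
(require '[clojure.string :as string])

(defn step
  "Returns the stack of open element names after seeing event."
  [stack event]
  (case (:type event)
    :start-element (conj stack (name (:name event)))
    :end-element   (pop stack)
    stack))

(defn with-paths
  "Lazily adds a :path key to each event: the slash-joined names of the
  elements that are open once that event has been processed."
  [events]
  (map (fn [event stack] (assoc event :path (string/join "/" stack)))
       events
       ;; reductions yields the initial [] first, so drop it to line up
       ;; each stack with the event that produced it.
       (rest (reductions step [] events))))

(defn text-at
  "Lazy seq of the :characters strings found directly under path."
  [path events]
  (->> (with-paths events)
       (filter #(and (= :characters (:type %)) (= path (:path %))))
       (map :str)))

;; Hand-written stand-ins for the events listed above (an assumption made
;; for illustration; real input would come from data.xml/source-seq).
(def sample-events
  [{:type :start-element :name :head}
   {:type :start-element :name :title}
   {:type :characters    :str "This is some text"}
   {:type :end-element   :name :title}
   {:type :start-element :name :body}
   {:type :start-element :name :h1}
   {:type :characters    :str "This is a header"}
   {:type :end-element   :name :h1}
   {:type :end-element   :name :body}
   {:type :end-element   :name :head}])

(text-at "head/body/h1" sample-events)
;; => ("This is a header")
```

Since map and reductions are both lazy, this should only realise events as
the result is consumed, provided nothing else holds on to the head of the
event seq.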

Thanks

Pete



On Wednesday, December 18, 2013 7:23:21 AM UTC, danneu wrote:
>
> Good question. Every lib that came to mind when I saw
> clojure.data.xml/parse's tree of Elements {:tag _, :attrs _, :content _}
> only works on zippers, which apparently sit in memory.
>
> One option is to use `clojure.data.xml/source-seq` to get back a lazy
> sequence of Events {:type _, :name _, :attrs _, :str _} where the event
> :type is either :start-element, :end-element, or :characters.
>
> For example, "<strong>Hello</strong>" would parse into the events
> [:start-element "strong"], [:characters "Hello"], [:end-element "strong"].
> You could use loop/recur to manage state as you consume the sequence.
>
> That's actually how I'm used to working with SAX parsers anyway. Here are
> some naive Ruby examples if it's new to you:
> https://gist.github.com/danneu/3977120.
>
> Of course, I imagine the ideal solution would involve some way to express
> selectors on the Element tree like I'm used to doing with raynes/laser on
> zippers:
> https://github.com/Raynes/laser/blob/master/docs/guide.md#screen-scraping.
>
>
> On Tuesday, December 17, 2013 4:57:32 AM UTC-6, Peter Ullah wrote:
>>
>>
>> Hi all, 
>>
>> I'm attempting to parse a large (500MB) XML file; specifically, I am trying
>> to extract various parts using XPath. I've been using the examples presented
>> here:
>> http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html
>> and all was going well when tested against small files. However, now that I
>> am using the larger file, Fireplace/Vim just hangs, my laptop gets hot, and
>> then I get a memory exception.
>>
>> I've been playing around with various other libraries such as
>> clojure.data.xml and found that the following works perfectly well for
>> parsing... but when I come to search inside root, things start to snarl up
>> again.
>>
>> (ns example.core
>>   (:require [clojure.java.io :as java.io]
>>             [clojure.data.xml :as data.xml]))
>>
>> (def large-file "/path-to-large-file")
>>
>> ;; using clojure.data.xml returns quickly with no problems whereas
>> ;; clojure.xml/parse from the link above causes problems.
>> (def root
>>   (-> large-file
>>       java.io/input-stream
>>       data.xml/parse))
>>
>> (class root) ;clojure.data.xml.Element
>>
>> Does anyone know a way of searching within root that won't consume the 
>> heap?
>>
>> Forgive me, I'm new to Clojure and these forums. I've searched through
>> previous posts but have not managed to answer my own question.
>>
>> Thanks in advance.
>>
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en