Re: Parsing SGML

Tassilo Horn Tue, 10 Jul 2012 08:17:01 -0700

Wolodja Wentland <babi...@gmail.com> writes:

Hi Wolodja,


> valid XML apart from the fact that they lack a root tag. The whole issue is
> complicated by the fact that the input files are pretty large (i.e. 9.1 GB
> gzipped files) and that I therefore cannot read them completely into
> memory.
>
> I am, however, unsure how to proceed. I tried wrapping the input
> stream in "<XML> ... </XML>" [10] but that requires me to read the
> entire file into memory and I get OutOfMemory errors when working on
> the complete corpus. So in short my questions are:
>
> * Do you know a parser that I can use to parse this data?
> * Lacking that: How can I wrap the GZIPInputStream in opening and closing
>   tags?
> * Do you think that I should just write a parser myself? (seems a lot of work
>   just because the enclosing tags are missing)
> * Are there other feasible approaches?

I think I'd simply copy the clojure.xml code and adapt it to my needs.
Basically, you only have to adapt the handler methods in the proxied
org.xml.sax.helpers.DefaultHandler to create the structure you want.  Since it's
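a SAX API, it just scans the document sequentially, reporting elements
and attributes as they appear, so it doesn't need to keep everything in
memory.  The docs do still have to be well-formed, though: the parser
stops with a fatal error at a second top-level element, so the missing
root tag has to be supplied on the stream itself.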

Not too pretty, but should be adequate for getting the job done.
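
For what it's worth, here is a rough sketch of the idea in plain Java,
using the same SAX classes that clojure.xml proxies.  It is not the
clojure.xml code, just an illustration.  The class name RootlessSax, the
element name "doc" and the synthetic <root> tag are all made up, and I'm
assuming the data is UTF-8 compatible and carries no XML declaration or
DOCTYPE.  As far as I know the stock JDK parser rejects a second
top-level element, so the sketch also supplies the missing root tag by
chaining streams with java.io.SequenceInputStream, which never buffers
the whole file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class RootlessSax {

    /** Counts <doc> start tags; a stand-in for building whatever structure you need. */
    static final class CountingHandler extends DefaultHandler {
        long docs = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // "doc" is a hypothetical element name, not taken from the corpus.
            if ("doc".equals(qName)) {
                docs++;
            }
        }
    }

    /** Lazily concatenates "<root>", the body stream, and "</root>". */
    static InputStream withSyntheticRoot(InputStream body) {
        InputStream open = new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, body, close)));
    }

    /** Streams a rootless document through SAX and returns the number of <doc> elements. */
    static long countDocs(InputStream rootless) throws Exception {
        CountingHandler handler = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(withSyntheticRoot(rootless), handler);
        return handler.docs;
    }

    /** Same as countDocs, but decompresses a gzip stream on the fly first. */
    static long countDocsInGzip(InputStream gzipped) throws Exception {
        return countDocs(new GZIPInputStream(gzipped));
    }
}
```

You would call it as RootlessSax.countDocsInGzip(new FileInputStream(path))
and replace the counting in startElement with whatever you want to
collect.  The same wrapping should work from Clojure by passing the
SequenceInputStream to the parser instead of the bare GZIPInputStream.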

Bye,
Tassilo
