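Just create a Reader over the file, and do something like (take-while identity (repeatedly #(read-one-wellformed-xml-tag the-reader))), where read-one-wellformed-xml-tag is a function you write yourself that reads the next complete <DOC> element from the reader and returns nil at end of input. Since the resulting seq is lazy, you only need one document in memory at a time, as long as you don't hold on to the head. It needs some fleshing out for boundary conditions, but I hope you get the general idea.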
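To make that a bit more concrete, here is a rough sketch. It is line-based rather than a real XML tokenizer, and it assumes every closing </DOC> sits on its own line, as in the sample data in [1]. The names read-one-doc and doc-strings are made up for illustration; they are not from any library.

```clojure
(require '[clojure.string :as str])

(defn read-one-doc
  "Reads lines from rdr up to and including the next </DOC> line and returns
  them joined into one string. Returns nil when the reader is exhausted.
  Assumption: each </DOC> is on a line of its own."
  [^java.io.BufferedReader rdr]
  (loop [acc []]
    (when-let [line (.readLine rdr)]
      (cond
        ;; skip blank lines between documents
        (and (empty? acc) (str/blank? line)) (recur acc)
        ;; end of the current document: hand it back
        (.startsWith (str/trim line) "</DOC>") (str/join "\n" (conj acc line))
        :else (recur (conj acc line))))))

(defn doc-strings
  "Lazy seq of one string per <DOC>...</DOC> block read from rdr."
  [rdr]
  (take-while identity (repeatedly #(read-one-doc rdr))))

;; Possible usage (consume the seq inside with-open, e.g. with doseq):
;; (with-open [rdr (-> in-file io/input-stream GZIPInputStream. io/reader)]
;;   (doseq [d (doc-strings rdr)]
;;     (index-document d)))   ; index-document is hypothetical
```

Each string is then a small, self-contained XML document with <DOC> as its root, so it can be handed to whichever parser you already use, one document at a time. Anything left over after the last </DOC> is silently dropped, which is one of the boundary conditions you may want to handle differently.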
On Tuesday, July 10, 2012 6:04:23 AM UTC-7, Wolodja Wentland wrote:
> Dear all,
>
> I am running into problems when I try to parse SGML documents [0] that
> are valid XML apart from the fact that they lack a root tag. The whole
> issue is complicated by the fact that the input files are pretty large
> (i.e. 9.1 GB gzipped files) and that I therefore cannot read them
> completely into memory. My goal is to extract each "document" and index
> it using Lucene, so I need access to the data at one point, but can
> throw it away immediately after processing.
>
> The input data looks something like [1] and my main problem is that none
> of the parsers I tried cope with the missing root tag. If the SGML is
> parsed with either clj-tagsoup's [3] parse-xml or data.xml's [4] parse
> function I get a broken representation and can't extract all data using
> either zippers (e.g. like in [5]) or by working on the parsed data
> directly (as in [6]).
>
> The cause is that the *first* <DOC> is (wrongly) assumed to be the root
> tag for the entire document and that the result of the parse looks
> something like [7] (for tagsoup) or [8] (for data.xml). As you can
> imagine I want output such as this from my (hypothetical) processing
> function:
>
>     (documents-from-gigaword-file (-> in-file
>                                       (io/input-stream)
>                                       (GZIPInputStream.)))
>     ({:id "AFP_ENG_20101220.0219"
>       :type "story"
>       :headline "Headline 1"
>       :paragraphs ("Paragraph 1" "Paragraph 2")}
>
>      {:id "AFP_ENG_20101220.0235"
>       :type "story"
>       :headline "Headline 2"
>       :text ("Paragraph 3")})
>
> But I get the following right now: (no wonder!)
>
>     user=> (clojure.pprint/pprint
>             (gw-file->documents (io/file "/home/babilen/foo.gz")))
>     ({:id "AFP_ENG_20101206.0235",
>       :type "story",
>       :headline " Headline 2 ",
>       :paragraphs (" Paragraph 3 ")})
>
> I am, however, unsure how to proceed. I tried wrapping the input stream
> in "<XML> ... </XML>" [10] but that requires me to read the entire file
> into memory and I get OutOfMemory errors when working on the complete
> corpus. So in short my questions are:
>
> * Do you know a parser that I can use to parse this data?
> * Lacking that: How can I wrap the GZIPInputStream in opening and
>   closing tags?
> * Do you think that I should just write a parser myself? (seems a lot of
>   work just because the enclosing tags are missing)
> * Are there other feasible approaches?
>
> Any input would be most appreciated!
>
> References
> ----------
>
> [0] The input data is the English gigaword corpus from
>     http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07
>
> [1] Example data:
>
>     <DOC id="AFP_ENG_20101220.0219" type="story" >
>     <HEADLINE>
>     Headline 1
>     </HEADLINE>
>     <DATELINE>
>     Location, Dec 20, 2010 (AFP)
>     </DATELINE>
>     <TEXT>
>     <P>
>     Paragraph 1
>     </P>
>     <P>
>     Paragraph 2
>     </P>
>     </TEXT>
>     </DOC>
>     <DOC id="AFP_ENG_20101206.0235" type="story" >
>     <HEADLINE>
>     Headline 2
>     </HEADLINE>
>     <DATELINE>
>     Location, Dec 6, 2010 (AFP)
>     </DATELINE>
>     <TEXT>
>     <P>
>     Paragraph 3
>     </P>
>     </TEXT>
>     </DOC>
>
> [3] https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj
>
> [4] https://github.com/clojure/data.xml/
>
> [5] Extraction using a zipper:
>
>     (defn gw-file->documents
>       [in-file]
>       (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
>         (map (fn [doc]
>                {:id (dzx/xml1-> doc (dzx/attr :id))
>                 :type (dzx/xml1-> doc (dzx/attr :type))
>                 :headline (dzx/xml1-> doc :HEADLINE dzx/text)
>                 :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
>              (dzx/xml-> xml-zipper :DOC))))
>
> [6] Example extraction of data on the output of parse(-xml) directly:
>     I use filter-tag to search for all :DOC's and call process-document
>     for each.
>
>     (defn- filter-tag
>       [tag xmls]
>       (filter identity
>               (for [x xmls
>                     :when (= tag (:tag x))]
>                 x)))
>
>     (defn process-document
>       [doc]
>       {:id (:id (:attrs doc))
>        :type (:type (:attrs doc))
>        :headline (filter-tag :HEADLINE (xml-seq doc))})
>
> [7] Parsing with tagsoup
>
>     user=> (clojure.pprint/pprint
>             (tagsoup/parse-xml
>              (-> "/home/babilen/foo.gz" (io/file) (io/input-stream)
>                  (GZIPInputStream.))))
>     {:tag :DOC,
>      :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
>      :content
>      [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]}
>       {:tag :DATELINE,
>        :attrs nil,
>        :content ["\nLocation, Dec 20, 2010 (AFP)\n"]}
>       {:tag :TEXT,
>        :attrs nil,
>        :content
>        [{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]}
>         {:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]}
>       {:tag :DOC,
>        :attrs {:id "AFP_ENG_20101206.0235", :type "story"},
>        :content
>        [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]}
>         {:tag :DATELINE,
>          :attrs nil,
>          :content ["\nLocation, Dec 6, 2010 (AFP)\n"]}
>         {:tag :TEXT,
>          :attrs nil,
>          :content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]}
>
> [8] Parsing with clojure.data.xml/parse
>
>     user=> (clojure.pprint/pprint
>             (clojure.data.xml/parse
>              (-> "/home/babilen/foo.gz" (io/file) (io/input-stream)
>                  (GZIPInputStream.))))
>     {:tag :DOC,
>      :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
>      :content
>      ({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")}
>       {:tag :DATELINE,
>        :attrs {},
>        :content ("\nLocation, Dec 20, 2010 (AFP)\n")}
>       {:tag :TEXT,
>        :attrs {},
>        :content
>        ({:tag :P, :attrs {}, :content ("\nParagraph 1\n")}
>         {:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})}
>
> [9] My actual code:
>
>     (defn- parse-gw-file
>       [in-file]
>       (->> in-file
>            (io/input-stream)
>            (GZIPInputStream.)
>            (ts/parse-xml)))
>
>     (defn gw-file->documents
>       [in-file]
>       (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
>         (map (fn [doc]
>                {:id (dzx/xml1-> doc (dzx/attr :id))
>                 :type (dzx/xml1-> doc (dzx/attr :type))
>                 :headline (dzx/xml1-> doc :HEADLINE dzx/text)
>                 :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
>              (dzx/xml-> xml-zipper :DOC))))
>
> [10] Wrapping the stream:
>
>     (defn- parse-gw-file
>       [in-file]
>       (let [unzipped-file (->> in-file
>                                (io/input-stream)
>                                (GZIPInputStream.))
>             ;; slurp reads the whole file into memory (hence the OOM)
>             wrapped-file (str "<XML>" (slurp unzipped-file) "</XML>")]
>         (ts/parse-xml (ByteArrayInputStream.
>                        (.getBytes wrapped-file "UTF-8")))))
>
> --
> Wolodja <babi...@gmail.com>
>
> 4096R/CAF14EFC
> 081C B7CD FF04 2BA9 94EA 36B2 8B7F 7D30 CAF1 4EFC