Dear all,

I am running into problems when I try to parse SGML documents [0] that are valid XML apart from the fact that they lack a root tag. The issue is complicated by the size of the input files (9.1 GB gzipped files), which means that I cannot read them completely into memory. My goal is to extract each "document" and index it with Lucene, so I need access to the data at some point, but can throw it away immediately after processing.
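To make the intended usage concrete, this is roughly the consumption pattern I am after; documents-from-gigaword-file is the (hypothetical) extraction function I describe below, and index-document! merely stands in for the actual Lucene indexing call:

(require '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream))

;; Each document should be indexed as soon as it is produced and can then
;; be garbage collected, so the whole corpus never has to fit into memory.
(with-open [in (-> "/home/babilen/foo.gz"
                   (io/file)
                   (io/input-stream)
                   (GZIPInputStream.))]
  (doseq [doc (documents-from-gigaword-file in)]
    (index-document! doc)))   ; placeholder for the Lucene indexing call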
The input data looks something like [1] and my main problem is that none of the parsers I have tried copes with the missing root tag: if the SGML is parsed with either clj-tagsoup's [3] parse-xml or data.xml's [4] parse function, I get a broken representation and cannot extract all the data, neither with zippers (e.g. as in [5]) nor by working on the parsed data directly (as in [6]). The reason is that the *first* <DOC> is (wrongly) taken to be the root tag of the entire document, so the result of the parse looks something like [7] (for tagsoup) or [8] (for data.xml).

As you can imagine, I would like my (hypothetical) processing function to produce output such as:

(documents-from-gigaword-file (-> in-file
                                  (io/input-stream)
                                  (GZIPInputStream.)))

({:id "AFP_ENG_20101220.0219"
  :type "story"
  :headline "Headline 1"
  :paragraphs ("Paragraph 1" "Paragraph 2")}
 {:id "AFP_ENG_20101206.0235"
  :type "story"
  :headline "Headline 2"
  :paragraphs ("Paragraph 3")})

But this is what I get right now (no wonder!):

user=> (clojure.pprint/pprint
         (gw-file->documents (io/file "/home/babilen/foo.gz")))
({:id "AFP_ENG_20101206.0235",
  :type "story",
  :headline " Headline 2 ",
  :paragraphs (" Paragraph 3 ")})

I am, however, unsure how to proceed. I tried wrapping the input stream in "<XML> ... </XML>" [10], but that requires me to read the entire file into memory, and I get OutOfMemory errors when working on the complete corpus.

So, in short, my questions are:

* Do you know a parser that I can use to parse this data?
* Lacking that: How can I wrap the GZIPInputStream in opening and closing
  tags without reading everything into memory? (one idea is sketched below)
* Do you think that I should just write a parser myself? (it seems like a
  lot of work just because the enclosing tags are missing)
* Are there other feasible approaches?
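To illustrate what I have in mind for the second question: one idea I have not been able to test yet is to avoid building the wrapped string altogether and instead concatenate three streams with java.io.SequenceInputStream, so that "<XML>" and "</XML>" are prepended and appended without ever holding the file contents in memory. Since data.xml parses lazily, the :DOC elements could then (in theory) be consumed one at a time. A rough, untested sketch of how documents-from-gigaword-file might look:

(ns gigaword.sketch
  (:require [clojure.java.io :as io]
            [clojure.data.xml :as xml])
  (:import (java.io ByteArrayInputStream SequenceInputStream)
           (java.util Collections)
           (java.util.zip GZIPInputStream)))

(defn- wrap-stream
  "Wraps an (already unzipped) InputStream in <XML> ... </XML> without
  materialising its contents in memory."
  [in]
  (SequenceInputStream.
    (Collections/enumeration
      [(ByteArrayInputStream. (.getBytes "<XML>" "UTF-8"))
       in
       (ByteArrayInputStream. (.getBytes "</XML>" "UTF-8"))])))

(defn documents-from-gigaword-file
  "Lazily returns the parsed :DOC elements from the unzipped stream."
  [in]
  (->> (xml/parse (wrap-stream in))
       :content
       (filter #(= :DOC (:tag %)))))

;; intended usage, matching the hypothetical call above:
;; (documents-from-gigaword-file (-> in-file (io/input-stream) (GZIPInputStream.)))

Turning each :DOC element into the desired {:id ... :headline ... :paragraphs ...} map would then be a per-document job (e.g. with a zipper as in [5]). I am not sure whether this actually stays within memory bounds (one would presumably have to avoid holding on to the head of the lazy :content seq), which is why I would appreciate your thoughts.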
Any input would be most appreciated!

References
----------

[0] The input data is the English gigaword corpus from
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07

[1] Example data:

<DOC id="AFP_ENG_20101220.0219" type="story" >
<HEADLINE>
Headline 1
</HEADLINE>
<DATELINE>
Location, Dec 20, 2010 (AFP)
</DATELINE>
<TEXT>
<P>
Paragraph 1
</P>
<P>
Paragraph 2
</P>
</TEXT>
</DOC>
<DOC id="AFP_ENG_20101206.0235" type="story" >
<HEADLINE>
Headline 2
</HEADLINE>
<DATELINE>
Location, Dec 6, 2010 (AFP)
</DATELINE>
<TEXT>
<P>
Paragraph 3
</P>
</TEXT>
</DOC>

[3] https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj

[4] https://github.com/clojure/data.xml/

[5] Extraction using a zipper:

(defn gw-file->documents [in-file]
  (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
    (map (fn [doc]
           {:id         (dzx/xml1-> doc (dzx/attr :id))
            :type       (dzx/xml1-> doc (dzx/attr :type))
            :headline   (dzx/xml1-> doc :HEADLINE dzx/text)
            :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
         (dzx/xml-> xml-zipper :DOC))))

[6] Example extraction working on the output of parse(-xml) directly; I use filter-tag to search for all :DOC elements and call process-document on each.

(defn- filter-tag [tag xmls]
  (filter identity
          (for [x xmls :when (= tag (:tag x))] x)))

(defn process-document [doc]
  {:id       (:id (:attrs doc))
   :type     (:type (:attrs doc))
   :headline (filter-tag :HEADLINE (xml-seq doc))})

[7] Parsing with tagsoup:

user=> (clojure.pprint/pprint
         (tagsoup/parse-xml (-> "/home/babilen/foo.gz"
                                (io/file)
                                (io/input-stream)
                                (GZIPInputStream.))))
{:tag :DOC,
 :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
 :content
 [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]}
  {:tag :DATELINE, :attrs nil, :content ["\nLocation, Dec 20, 2010 (AFP)\n"]}
  {:tag :TEXT,
   :attrs nil,
   :content
   [{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]}
    {:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]}
  {:tag :DOC,
   :attrs {:id "AFP_ENG_20101206.0235", :type "story"},
   :content
   [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]}
    {:tag :DATELINE, :attrs nil, :content ["\nLocation, Dec 6, 2010 (AFP)\n"]}
    {:tag :TEXT,
     :attrs nil,
     :content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]}

[8] Parsing with clojure.data.xml/parse:

user=> (clojure.pprint/pprint
         (clojure.data.xml/parse (-> "/home/babilen/foo.gz"
                                     (io/file)
                                     (io/input-stream)
                                     (GZIPInputStream.))))
{:tag :DOC,
 :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
 :content
 ({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")}
  {:tag :DATELINE, :attrs {}, :content ("\nLocation, Dec 20, 2010 (AFP)\n")}
  {:tag :TEXT,
   :attrs {},
   :content
   ({:tag :P, :attrs {}, :content ("\nParagraph 1\n")}
    {:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})}

[9] My actual code:

(defn- parse-gw-file [in-file]
  (->> in-file
       (io/input-stream)
       (GZIPInputStream.)
       (ts/parse-xml)))

(defn gw-file->documents [in-file]
  (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
    (map (fn [doc]
           {:id         (dzx/xml1-> doc (dzx/attr :id))
            :type       (dzx/xml1-> doc (dzx/attr :type))
            :headline   (dzx/xml1-> doc :HEADLINE dzx/text)
            :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
         (dzx/xml-> xml-zipper :DOC))))

[10] Wrapping the stream:

(defn- parse-gw-file [in-file]
  (let [unzipped-file (->> in-file
                           (io/input-stream)
                           (GZIPInputStream.))
        wrapped-file  (str "<XML>" (slurp unzipped-file) "</XML>")]
    (ts/parse-xml
      (ByteArrayInputStream. (.getBytes wrapped-file "UTF-8")))))

-- 
Wolodja <babi...@gmail.com>
4096R/CAF14EFC 081C B7CD FF04 2BA9 94EA 36B2 8B7F 7D30 CAF1 4EFC