Re: Parsing SGML

Alan Malloy Tue, 10 Jul 2012 12:06:12 -0700

Just create a Reader over the file, and do something like (take-while 
identity (repeatedly #(read-one-wellformed-xml-tag the-reader))). It needs 
some fleshing out for boundary conditions, but I hope you get the general 
idea.
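
To make that a bit more concrete, here is a minimal sketch of the idea, not a finished implementation. It assumes each record ends with a line containing only </DOC> (as in your sample [1]) and that each record on its own is well-formed XML. The names read-one-doc, parse-doc, doc-seq and doc->map are made up for illustration; clojure.xml/parse is used only because it ships with Clojure, and any parser that accepts an InputStream should work the same way.

```clojure
(require '[clojure.string :as str]
         '[clojure.xml :as xml])
(import '(java.io BufferedReader ByteArrayInputStream))

(defn read-one-doc
  "Reads lines from rdr up to and including the next </DOC> line and
  returns them as one string. Returns nil at end of input (an incomplete
  trailing record is silently dropped)."
  [^BufferedReader rdr]
  (loop [lines []]
    (let [line (.readLine rdr)]
      (cond
        (nil? line)                  nil
        (= "</DOC>" (str/trim line)) (str/join "\n" (conj lines line))
        :else                        (recur (conj lines line))))))

(defn parse-doc
  "Parses the text of a single <DOC>...</DOC> record into an element map."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn doc-seq
  "Lazy seq of parsed DOC elements; only one record is read at a time."
  [rdr]
  (->> (repeatedly #(read-one-doc rdr))
       (take-while some?)
       (map parse-doc)))

(defn- child
  "First child element of el with the given tag, or nil."
  [el tag]
  (first (filter #(= tag (:tag %)) (:content el))))

(defn- text-of
  "Concatenated, trimmed text content of el."
  [el]
  (str/trim (apply str (filter string? (:content el)))))

(defn doc->map
  "Turns a parsed DOC element into the map shape from the question."
  [doc]
  {:id         (get-in doc [:attrs :id])
   :type       (get-in doc [:attrs :type])
   :headline   (some-> (child doc :HEADLINE) text-of)
   :paragraphs (mapv text-of
                     (filter #(= :P (:tag %))
                             (:content (child doc :TEXT))))})

;; Hypothetical usage on the gzipped corpus (index! is a placeholder):
;; (with-open [rdr (-> in-file io/input-stream GZIPInputStream. io/reader)]
;;   (doseq [doc (doc-seq rdr)]
;;     (index! (doc->map doc))))
```

Because the sequence is consumed inside with-open and nothing holds on to its head, only the current record needs to be in memory. Records that are not well-formed on their own would still need separate handling.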


On Tuesday, July 10, 2012 6:04:23 AM UTC-7, Wolodja Wentland wrote:
>
> Dear all, 
>
> I am running into problems when I try to parse SGML documents [0] that are
> valid XML apart from the fact that they lack a root tag. The whole issue is
> complicated by the fact that the input files are pretty large (i.e. 9.1 GB
> gzipped files) and that I therefore cannot read them completely into memory.
> My goal is to extract each "document" and index it using Lucene, so I need
> access to the data at one point, but can throw it away immediately after
> processing.
>
> The input data looks something like [1], and none of the parsers I tried
> cope with the missing root tag. If the SGML is parsed with either
> clj-tagsoup's [3] parse-xml or data.xml's [4] parse function, I get a broken
> representation and can't extract all data using either zippers (e.g. as in
> [5]) or by working on the parsed data directly (as in [6]).
>
> The cause is that the *first* <DOC> is (wrongly) assumed to be the root tag
> for the entire document, so the result of the parse looks something like [7]
> (for tagsoup) or [8] (for data.xml). What I want is output such as the
> following from my (hypothetical) processing function:
>
> (documents-from-gigaword-file (-> in-file
>                                   (io/input-stream)
>                                   (GZIPInputStream.)))
> ({:id "AFP_ENG_20101220.0219"
>   :type "story"
>   :headline "Headline 1"
>   :paragraphs ("Paragraph 1" "Paragraph 2")}
>
>  {:id "AFP_ENG_20101206.0235"
>   :type "story"
>   :headline "Headline 2"
>   :paragraphs ("Paragraph 3")})
>
> But I get the following right now: (no wonder!)
>
> user=> (clojure.pprint/pprint
>          (gw-file->documents (io/file "/home/babilen/foo.gz")))
> ({:id "AFP_ENG_20101206.0235", 
>   :type "story", 
>   :headline " Headline 2 ", 
>   :paragraphs (" Paragraph 3 ")}) 
>
>
> I am, however, unsure how to proceed. I tried wrapping the input stream in
> "<XML> ... </XML>" [10], but that requires me to read the entire file into
> memory, and I get OutOfMemory errors when working on the complete corpus. So,
> in short, my questions are:
>
> * Do you know a parser that I can use to parse this data? 
> * Lacking that: How can I wrap the GZIPInputStream in opening and closing 
>   tags? 
> * Do you think that I should just write a parser myself? (seems a lot of
>   work just because the enclosing tags are missing)
> * Are there other feasible approaches? 
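
On the second question: one way to add a root element without slurping the file is to chain three streams with java.io.SequenceInputStream. The sketch below is only an illustration; wrap-in-root is a made-up name and the <XML> root tag is the one from your attempt [10]. Note that this by itself only fixes the missing root. Memory use still depends on the parser being lazy or streaming (data.xml's parse returns a lazy tree) and on your code not holding on to the head of the result.

```clojure
(import '(java.io ByteArrayInputStream InputStream SequenceInputStream)
        '(java.util Collections))

(defn wrap-in-root
  "Returns an InputStream that yields <XML>, then the bytes of in,
  then </XML>. Nothing is read from in until the caller reads."
  [^InputStream in]
  (SequenceInputStream.
    (Collections/enumeration
      [(ByteArrayInputStream. (.getBytes "<XML>" "UTF-8"))
       in
       (ByteArrayInputStream. (.getBytes "</XML>" "UTF-8"))])))

;; Hypothetical usage with the gzipped corpus:
;; (-> in-file io/input-stream GZIPInputStream. wrap-in-root
;;     clojure.data.xml/parse)
```

With the wrapper in place, the parsed root is the :XML element and every <DOC> is one of its children, so (dzx/xml-> xml-zipper :DOC) should find all of them.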
>
> Any input would be most appreciated! 
>
> References 
> ---------- 
>
> [0] The input data is the English gigaword corpus from 
>     http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07 
>
> [1] Example data: 
>
>     <DOC id="AFP_ENG_20101220.0219" type="story" > 
>     <HEADLINE> 
>     Headline 1 
>     </HEADLINE> 
>     <DATELINE> 
>     Location, Dec 20, 2010 (AFP) 
>     </DATELINE> 
>     <TEXT> 
>     <P> 
>     Paragraph 1 
>     </P> 
>     <P> 
>     Paragraph 2 
>     </P> 
>     </TEXT> 
>     </DOC> 
>     <DOC id="AFP_ENG_20101206.0235" type="story" > 
>     <HEADLINE> 
>     Headline 2 
>     </HEADLINE> 
>     <DATELINE> 
>     Location, Dec 6, 2010 (AFP) 
>     </DATELINE> 
>     <TEXT> 
>     <P> 
>     Paragraph 3 
>     </P> 
>     </TEXT> 
>     </DOC> 
>
> [3] https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj
> [4] https://github.com/clojure/data.xml/ 
> [5] Extraction using a zipper: 
>     (defn gw-file->documents 
>      [in-file] 
>      (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))] 
>       (map (fn [doc] 
>             {:id (dzx/xml1-> doc (dzx/attr :id)) 
>             :type (dzx/xml1-> doc (dzx/attr :type)) 
>             :headline (dzx/xml1-> doc :HEADLINE dzx/text) 
>             :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)}) 
>        (dzx/xml-> xml-zipper :DOC)))) 
> [6] Example extraction of data on the output of parse(-xml) directly:
>     I use filter-tag to search for all :DOC's and call process-document for
>     each.
>
>     (defn- filter-tag
>       "Returns the elements of xmls whose :tag equals tag."
>       [tag xmls]
>       (filter #(= tag (:tag %)) xmls))
>
>     (defn process-document 
>       [doc] 
>       {:id   (:id (:attrs doc)) 
>        :type (:type (:attrs doc)) 
>        :headline (filter-tag :HEADLINE (xml-seq doc))}) 
> [7] Parsing with tagsoup 
>
> user=> (clojure.pprint/pprint
>   (tagsoup/parse-xml
>    (-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
> {:tag :DOC, 
>  :attrs {:id "AFP_ENG_20101220.0219", :type "story"}, 
>  :content 
>  [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]} 
>   {:tag :DATELINE, 
>    :attrs nil, 
>    :content ["\nLocation, Dec 20, 2010 (AFP)\n"]} 
>   {:tag :TEXT, 
>    :attrs nil, 
>    :content 
>    [{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]} 
>     {:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]} 
>   {:tag :DOC, 
>    :attrs {:id "AFP_ENG_20101206.0235", :type "story"}, 
>    :content 
>    [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]} 
>     {:tag :DATELINE, 
>      :attrs nil, 
>      :content ["\nLocation, Dec 6, 2010 (AFP)\n"]} 
>     {:tag :TEXT, 
>      :attrs nil, 
>      :content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]} 
>
> [8] Parsing with clojure.data.xml/parse 
> user=> (clojure.pprint/pprint
>   (clojure.data.xml/parse
>     (-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
> {:tag :DOC, 
>  :attrs {:id "AFP_ENG_20101220.0219", :type "story"}, 
>  :content 
>  ({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")} 
>   {:tag :DATELINE, 
>    :attrs {}, 
>    :content ("\nLocation, Dec 20, 2010 (AFP)\n")} 
>   {:tag :TEXT, 
>    :attrs {}, 
>    :content 
>    ({:tag :P, :attrs {}, :content ("\nParagraph 1\n")} 
>     {:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})} 
>
> [9] My actual code: 
>
> (defn- parse-gw-file 
>   [in-file] 
>   (->> in-file 
>     (io/input-stream) 
>     (GZIPInputStream.) 
>     (ts/parse-xml))) 
>
> (defn gw-file->documents 
>   [in-file] 
>   (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))] 
>     (map (fn [doc] 
>            {:id (dzx/xml1-> doc (dzx/attr :id)) 
>             :type (dzx/xml1-> doc (dzx/attr :type)) 
>             :headline (dzx/xml1-> doc :HEADLINE dzx/text) 
>             :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)}) 
>          (dzx/xml-> xml-zipper :DOC)))) 
>
> [10] Wrapping the stream: 
> (defn- parse-gw-file
>   [in-file]
>   (let [unzipped-file (->> in-file
>                         (io/input-stream)
>                         (GZIPInputStream.))
>         ;; slurp reads the whole file into memory, hence the OutOfMemory errors
>         wrapped-file (str "<XML>" (slurp unzipped-file) "</XML>")]
>     (ts/parse-xml (ByteArrayInputStream.
>                     (.getBytes wrapped-file "UTF-8")))))
> -- 
> Wolodja <babi...@gmail.com> 
>
> 4096R/CAF14EFC 
> 081C B7CD FF04 2BA9 94EA  36B2 8B7F 7D30 CAF1 4EFC 
>

