Re: Help Parsing an HTML File

Stefan Behnel Fri, 15 Feb 2008 23:46:17 -0800

[EMAIL PROTECTED] wrote:
> I have a single Unicode file that has descriptions of hundreds of
> objects. The file closely resembles the HTML-EXAMPLE pasted below.
> 
> I need to parse the file, extract the data from the HTML, and
> produce a tab-separated file that looks like the OUTPUT-FILE
> below.
> 
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1    H2      DIV     Segment1        Segment2        Segment3
> RoséH1-1      RoséH2-1        RoséDIV-1       RoséSegmentDIV1-1       RoséSegmentDIV2-1
> ------Tab Separated Output File End------
> 
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
> 
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
> 
> </html>
> ------HTML Example End------


Now, what ugly markup is that? You will never get an HTML-compliant
parser to return the "segmentX" stuff in there. I think your best bet is
really to go for pyparsing or regular expressions (and I actually recommend
pyparsing here).
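
To illustrate the regular-expression route, here is a minimal sketch using
only the standard library. It rests on assumptions that are not confirmed by
the original post: each object sits in its own <html>...</html> block, and
the tags always look exactly as in the HTML-EXAMPLE above. The names SAMPLE,
COLUMNS, PATTERNS, split_objects, extract_row and to_tsv are made up for this
example.

```python
import re

# Sample input copied from the HTML-EXAMPLE quoted above.
SAMPLE = """<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

</html>"""

# Column order of the tab-separated output file.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# One pattern per column; group 1 captures the text between the tags.
PATTERNS = {
    "H1": re.compile(r"<h1>(.*?)</h1>", re.S),
    "H2": re.compile(r"<h2>(.*?)</h2>", re.S),
    # Matches only the plain <div>, not the <div "segmentN"> variants.
    "DIV": re.compile(r"<div>(.*?)</div>", re.S),
}
for n in (1, 2, 3):
    PATTERNS["Segment%d" % n] = re.compile(
        r'<div\s+"segment%d">(.*?)</div>' % n, re.S
    )


def split_objects(text):
    """Assumption: every object is wrapped in its own <html>...</html>."""
    return re.findall(r"<html>.*?</html>", text, re.S)


def extract_row(block):
    """Return one value per column; a missing field becomes ''."""
    row = []
    for name in COLUMNS:
        match = PATTERNS[name].search(block)
        row.append(match.group(1).strip() if match else "")
    return row


def to_tsv(blocks):
    """Build the tab-separated text, with the column headers first."""
    lines = ["\t".join(COLUMNS)]
    for block in blocks:
        lines.append("\t".join(extract_row(block)))
    return "\n".join(lines)


print(to_tsv(split_objects(SAMPLE)))
```

For the real file you would read the text with the right encoding, for
example open(path, encoding="utf-8").read(), and pass it to split_objects.
If the objects are delimited differently, only split_objects needs to change.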

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
