[EMAIL PROTECTED] wrote:
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1
> RoséSegmentDIV2-1
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> </html>
> ------HTML Example End------

Now, what ugly markup is that? A bare quoted string like "segment1" inside a tag is not a valid attribute, so you will never manage to get any HTML-compliant parser to return the "segmentX" stuff in there.

I think your best bet is really to go for pyparsing or regular expressions (and I actually recommend pyparsing here).

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
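
For what it's worth, here is a minimal sketch of the regular-expression route, using only the standard `re` module (it is not the pyparsing approach recommended above). It assumes each object sits in its own <html>...</html> block and that every field appears at most once per block; the names `extract_rows`, `to_tsv`, `COLUMNS` and `PATTERNS` are made up for illustration, and the sample text is just the HTML example from the quoted message.

```python
import re

# Sample input copied from the quoted message, including the
# malformed <div "segmentN"> tags.
SAMPLE = '''<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

</html>
'''

# Column headers, in the order requested for the output file.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# One pattern per column; each captures the text between the tags.
PATTERNS = {
    "H1": re.compile(r"<h1>(.*?)</h1>", re.S),
    "H2": re.compile(r"<h2>(.*?)</h2>", re.S),
    # A plain <div> with no junk inside the tag.
    "DIV": re.compile(r"<div>(.*?)</div>", re.S),
}
for n in (1, 2, 3):
    # The non-standard <div "segmentN"> form from the example.
    PATTERNS["Segment%d" % n] = re.compile(
        r'<div\s+"segment%d">(.*?)</div>' % n, re.S
    )

# Assumption: one <html>...</html> block per object.
RECORD = re.compile(r"<html>(.*?)</html>", re.S)


def extract_rows(text):
    """Return one list of column values per <html> block."""
    rows = []
    for record in RECORD.finditer(text):
        body = record.group(1)
        row = []
        for name in COLUMNS:
            match = PATTERNS[name].search(body)
            # Missing fields become empty cells.
            row.append(match.group(1).strip() if match else "")
        rows.append(row)
    return rows


def to_tsv(text):
    """Render the header line plus one tab-separated line per object."""
    lines = ["\t".join(COLUMNS)]
    lines.extend("\t".join(row) for row in extract_rows(text))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_tsv(SAMPLE), end="")
```

If the real file delimits objects differently, only the `RECORD` pattern should need changing. When reading and writing the actual file, open it with an explicit encoding so the non-ASCII characters survive the round trip.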