On Feb 15, 3:28 pm, [EMAIL PROTECTED] wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like
> OUTPUT-FILE below.
>
> Any tips, advice and guidance is greatly appreciated.
>
> Thanks,
>
> Egon
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1 H2 DIV Segment1 Segment2 Segment3
> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1 RoséSegmentDIV3-1
> PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
> BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
> YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4 YellowSegmentDIV2-4 No-Value
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> <h1>PinkH1-2</h1>
> <h2>PinkH2-2</h2>
> <div>PinkDIV2-2</div>
> <div "segment1">PinkSegmentDIV1-2</div><br>
> <br>
> <comment></comment>
>
> <h1>BlackH1-3</h1>
> <h2>BlackH2-3</h2>
> <div>BlackDIV2-3</div>
> <div "segment1">BlackSegmentDIV1-3</div><br>
>
> <h1>YellowH1-4</h1>
> <h2>YellowH2-4</h2>
> <div>YellowDIV2-4</div>
> <div "segment1">YellowSegmentDIV1-4</div><br>
> <div "segment2">YellowSegmentDIV2-4</div><br>
>
> </html>
> ------HTML Example End------
Pyparsing, ElementTree and lxml are all good candidates as well. BeautifulSoup, though, also copes with malformed HTML such as the <div "segment1"> tags in your sample.

http://pyparsing.wikispaces.com/
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

Mike
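
For what it's worth, here is a minimal sketch of the extraction. It uses the standard library's html.parser rather than any of the packages named above, simply so that it runs without extra installs. It assumes the real file follows the sample closely: every <h1> starts a new record, and segment divs carry a bare quoted token such as "segment1". The names RecordParser and html_to_tsv are made up for this illustration.

```python
import csv
import io
from html.parser import HTMLParser

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]
SEGMENTS = COLUMNS[3:]
MISSING = "No-Value"


class RecordParser(HTMLParser):
    """Collect one record per <h1> (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []   # one {column: text} dict per object
        self._field = None  # column currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.records.append({})  # assumption: each <h1> opens a record
            self._field = "H1"
        elif tag == "h2":
            self._field = "H2"
        elif tag == "div":
            # <div "segment1"> is malformed; the parser reports the quoted
            # token as an attribute name, quotes included, so strip them.
            names = [name.strip("\"'").capitalize() for name, _ in attrs]
            segment = next((n for n in names if n in SEGMENTS), None)
            self._field = segment or "DIV"

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "div"):
            self._field = None  # stop collecting text outside these tags

    def handle_data(self, data):
        if self._field and self.records:
            record = self.records[-1]
            record[self._field] = record.get(self._field, "") + data


def html_to_tsv(html_text):
    """Return the tab-separated table, header row first, as one string."""
    parser = RecordParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for record in parser.records:
        # Columns with no text in this record get the placeholder value.
        writer.writerow(
            [record.get(col, "").strip() or MISSING for col in COLUMNS]
        )
    return out.getvalue()
```

To use it, read the input with an explicit encoding, e.g. open(path, encoding="utf-8").read(), pass the text to html_to_tsv(), and write the result back out with the same encoding so that characters like "é" survive. If the real data is messier than the sample, BeautifulSoup is probably the safer choice.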