Hello Python Community,

It'd be great if someone could provide guidance or sample code for accomplishing the following:
I have a single Unicode file that contains descriptions of hundreds of objects. The file closely resembles the HTML-EXAMPLE pasted below. I need to parse it, extract the data from the HTML, and produce a tab-separated file that looks like the OUTPUT-FILE below.

Any tips, advice, and guidance are greatly appreciated.

Thanks,
Egon

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/

------Tab Separated Output File Begin------
H1	H2	DIV	Segment1	Segment2	Segment3
RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
PinkH1-2	PinkH2-2	PinkDIV2-2	PinkSegmentDIV1-2	No-Value	No-Value
BlackH1-3	BlackH2-3	BlackDIV2-3	BlackSegmentDIV1-3	No-Value	No-Value
YellowH1-4	YellowH2-4	YellowDIV2-4	YellowSegmentDIV1-4	YellowSegmentDIV2-4	No-Value
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>
<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>
<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>
</html>
------HTML Example End------
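
One possible way to approach the task described above, offered only as a rough sketch: walk the file with the standard library's html.parser, start a new row at every <h1>, and fill the remaining columns from the <h2> and <div> tags that follow. The names ObjectParser, html_to_tsv, COLUMNS and MISSING are made up for illustration, and the sketch assumes that every object begins with an <h1> and that segment divs can be recognised by the text "segmentN" somewhere inside the opening tag (the real file may differ).

```python
import csv
import io
import re
from html.parser import HTMLParser

# Column order and placeholder taken from the desired output file.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]
MISSING = "No-Value"


class ObjectParser(HTMLParser):
    """Collect one row per <h1>; later <h2>/<div> tags fill its columns."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per object, in document order
        self._column = None  # column the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.records.append({})  # assumption: <h1> starts a new object
            self._column = "H1"
        elif tag == "h2":
            self._column = "H2"
        elif tag == "div":
            # Inspect the raw tag text so that both <div "segment1"> and
            # <div class="segment1"> would be recognised (an assumption).
            match = re.search(r"segment(\d+)", self.get_starttag_text())
            self._column = "Segment" + match.group(1) if match else "DIV"
        else:
            self._column = None  # <br>, <comment>, etc. carry no data

    def handle_endtag(self, tag):
        self._column = None

    def handle_data(self, data):
        # Ignore text outside tracked tags and text before the first <h1>.
        if self._column and self.records and data.strip():
            self.records[-1][self._column] = data.strip()


def html_to_tsv(html_text):
    """Return tab-separated text: a header row, then one row per object."""
    parser = ObjectParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in parser.records:
        # Columns that never appeared for this object get the placeholder.
        writer.writerow([row.get(col, MISSING) for col in COLUMNS])
    return out.getvalue()


if __name__ == "__main__":
    # Tiny inline sample; for a real file, read it with an explicit
    # encoding, e.g. open(path, encoding="utf-8") (encoding is a guess).
    sample = (
        "<html><h1>PinkH1-2</h1><h2>PinkH2-2</h2><div>PinkDIV2-2</div>"
        '<div "segment1">PinkSegmentDIV1-2</div><br></html>'
    )
    print(html_to_tsv(sample), end="")
```

The raw opening tag is checked instead of the parsed attributes because <div "segment1"> is not regular attribute syntax. Segment numbers that are not listed in COLUMNS are silently dropped, so the list would need extending if the real file has more than three segments.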