Stefan Behnel wrote:
> [EMAIL PROTECTED] wrote:
>> I have a single unicode file that has descriptions of hundreds of
>> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>>
>> I need to parse the file in such a way to extract data out of the html
>> and to come up with a tab separated file that would look like
>> OUTPUT-FILE below.
>>
>> =====OUTPUT-FILE=====
>> /please note that the first line of the file contains column headers/
>> ------Tab Separated Output File Begin------
>> H1 H2 DIV Segment1 Segment2 Segment3
>> RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1
>> RoséSegmentDIV2-1
>> ------Tab Separated Output File End------
>>
>> =====HTML-EXAMPLE=====
>> ------HTML Example Begin------
>> <html>
>>
>> <h1>RoséH1-1</h1>
>> <h2>RoséH2-1</h2>
>> <div>RoséDIV-1</div>
>> <div "segment1">RoséSegmentDIV1-1</div><br>
>> <div "segment2">RoséSegmentDIV2-1</div><br>
>> <div "segment3">RoséSegmentDIV3-1</div><br>
>> <br>
>> <br>
>>
>> </html>
>> ------HTML Example End------
>
> Now, what ugly markup is that? You will never manage to get any HTML
> compliant parser return the "segmentX" stuff in there. I think your best
> bet is really going for pyparsing or regular expressions (and I actually
> recommend pyparsing here).
>
> Stefan

In practice the following might be sufficient:

from BeautifulSoup import BeautifulSoup

def chunks(bs):
    chunk = []
    for tag in bs.findAll(["h1", "h2", "div"]):
        if tag.name == "h1":
            if chunk:
                yield chunk
            chunk = []
        chunk.append(tag)
    if chunk:
        yield chunk

def process(filename):
    bs = BeautifulSoup(open(filename))
    for chunk in chunks(bs):
        columns = [tag.string for tag in chunk]
        columns += ["No Value"] * (6 - len(columns))
        print "\t".join(columns)

if __name__ == "__main__":
    process("example.html")

The biggest caveat is that only columns at the end of a row may be left out.

Peter
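
If columns in the middle of a row can be missing too, one way around the caveat above is to key each value on its tag (and on the quoted "segmentN" token) rather than on its position, along the lines of the regular-expression route Stefan mentions. What follows is only a rough sketch in Python 3 using the standard re module. The names COLUMNS, TAG_RE and rows are made up for illustration, and the sketch assumes the markup is flat (no nested divs) and always looks like the example in the original question.

```python
import re

# Column order taken from the header row in the original question.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# Matches <h1>..</h1>, <h2>..</h2>, <div>..</div> and the non-standard
# <div "segmentN">..</div> form from the example markup.
# Assumption: tags are flat, i.e. no div is nested inside another div.
TAG_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"(segment\d+)")?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

def rows(text, missing="No Value"):
    """Yield one list of column values per record; an <h1> starts a record."""
    record = None
    for name, segment, value in TAG_RE.findall(text):
        # "segment2" -> "Segment2"; "h1" -> "H1"; plain "div" -> "DIV"
        key = segment.capitalize() if segment else name.upper()
        if key == "H1":
            if record is not None:
                yield [record.get(col, missing) for col in COLUMNS]
            record = {}
        # Tags before the first <h1> and unknown segments are ignored.
        if record is not None and key in COLUMNS:
            record[key] = value.strip()
    if record is not None:
        yield [record.get(col, missing) for col in COLUMNS]

if __name__ == "__main__":
    sample = (
        '<html>\n<h1>RoséH1-1</h1>\n<h2>RoséH2-1</h2>\n'
        '<div "segment2">RoséSegmentDIV2-1</div><br>\n</html>'
    )
    print("\t".join(COLUMNS))
    for row in rows(sample):
        print("\t".join(row))
```

The function works on already-decoded text, so the file would need to be read with the right encoding first, for example open(filename, encoding="utf-8").read(), where utf-8 is a guess about the file. Because every value lands in the column named by its tag, a record that lacks the plain div or segment1 still lines up with the header.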