On Feb 15, 3:28 pm, [EMAIL PROTECTED] wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
Pyparsing was mentioned earlier, here is a sample with some annotating
comments.

I'm a little worried when you say the file "fairly resembles
HTML-EXAMPLE." With parsers, the devil is in the details, and if you
have scrambled this format - the HTML attributes are especially
suspicious - then the parser will need to be cleaned up to match the
real input. If the file being parsed really has proper HTML attributes
(of the form <tag attrname="attrvalue">), then you could simplify the
code to use the pyparsing method makeHTMLTags. But the example I wrote
matches the example you posted.

-- Paul

# encoding=utf-8
from pyparsing import *

data = """
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
... snip ...
"""

# define <XXX> and </XXX> tags
CL = CaselessLiteral
h1,h2,cmnt,br = \
    map(Suppress,
        map(CL,["<%s>" % s for s in "h1 h2 comment br".split()]))
h1end,h2end,cmntEnd,divEnd = \
    map(Suppress,
        map(CL,["</%s>" % s for s in "h1 h2 comment div".split()]))
# h1,h1end = makeHTMLTags("h1")

# define special format for <div>, incl. optional quoted string "attribute"
div = "<" + CL("div") + Optional(QuotedString('"'))("name") + ">"
div.setParseAction(
    lambda toks: "name" in toks and toks.name.title() or "DIV")

# define <xxx>body</xxx> entries
h1Entry = h1 + SkipTo(h1end) + h1end
h2Entry = h2 + SkipTo(h2end) + h2end
comment = cmnt + SkipTo(cmntEnd) + cmntEnd
divEntry = div + SkipTo(divEnd) + divEnd

# just return nested tokens
grammar = (OneOrMore(Group(h1Entry +
            (Group(h2Entry +
                (OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)

results = grammar.parseString(data)
from pprint import pprint
pprint(results.asList())
print

# return nested tokens, with dict
grammar = Dict(OneOrMore(Group(
    h1Entry + Dict(Group(h2Entry +
        Dict(OneOrMore(Group(divEntry))))))))
grammar.ignore(br)
grammar.ignore(comment)

results = grammar.parseString(data)
print results.dump()

Prints:

[['Ros\xe9H1-1',
  ['Ros\xe9H2-1',
   ['DIV', 'Ros\xe9DIV-1'],
   ['Segment1', 'Ros\xe9SegmentDIV1-1'],
   ['Segment2', 'Ros\xe9SegmentDIV2-1'],
   ['Segment3', 'Ros\xe9SegmentDIV3-1']]],
 ['PinkH1-2',
  ['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]],
 ['BlackH1-3',
  ['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]],
 ['YellowH1-4',
  ['YellowH2-4',
   ['DIV', 'YellowDIV2-4'],
   ['Segment1', 'YellowSegmentDIV1-4'],
   ['Segment2', 'YellowSegmentDIV2-4']]]]

[['Ros\xe9H1-1', ['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'],
['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'],
['Segment3', 'Ros\xe9SegmentDIV3-1']]], ['PinkH1-2', ['PinkH2-2',
['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]], ['BlackH1-3',
['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]],
['YellowH1-4', ['YellowH2-4', ['DIV', 'YellowDIV2-4'],
['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]]
- BlackH1-3: [['BlackH2-3', ['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]]
  - BlackH2-3: [['DIV', 'BlackDIV2-3'], ['Segment1', 'BlackSegmentDIV1-3']]
    - DIV: BlackDIV2-3
    - Segment1: BlackSegmentDIV1-3
- PinkH1-2: [['PinkH2-2', ['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]]
  - PinkH2-2: [['DIV', 'PinkDIV2-2'], ['Segment1', 'PinkSegmentDIV1-2']]
    - DIV: PinkDIV2-2
    - Segment1: PinkSegmentDIV1-2
- RoséH1-1: [['Ros\xe9H2-1', ['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]]
  - RoséH2-1: [['DIV', 'Ros\xe9DIV-1'], ['Segment1', 'Ros\xe9SegmentDIV1-1'], ['Segment2', 'Ros\xe9SegmentDIV2-1'], ['Segment3', 'Ros\xe9SegmentDIV3-1']]
    - DIV: RoséDIV-1
    - Segment1: RoséSegmentDIV1-1
    - Segment2: RoséSegmentDIV2-1
    - Segment3: RoséSegmentDIV3-1
- YellowH1-4: [['YellowH2-4', ['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]]
  - YellowH2-4: [['DIV', 'YellowDIV2-4'], ['Segment1', 'YellowSegmentDIV1-4'], ['Segment2', 'YellowSegmentDIV2-4']]
    - DIV: YellowDIV2-4
    - Segment1: YellowSegmentDIV1-4
    - Segment2: YellowSegmentDIV2-4

--
http://mail.python.org/mailman/listinfo/python-list
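
If you would rather work with plain dictionaries than with the pyparsing results object, the nested list printed by asList() can be folded into a dict of dicts with a few lines of ordinary Python. This is only a rough sketch, not part of the sample above: it is written for Python 3, the names parsed and to_nested_dict are made up for illustration, and it assumes each h1 entry holds exactly one h2 group, as in the printed output.

```python
# Two entries copied from the nested-list output shown in the post.
parsed = [
    ["PinkH1-2",
     ["PinkH2-2", ["DIV", "PinkDIV2-2"], ["Segment1", "PinkSegmentDIV1-2"]]],
    ["BlackH1-3",
     ["BlackH2-3", ["DIV", "BlackDIV2-3"], ["Segment1", "BlackSegmentDIV1-3"]]],
]


def to_nested_dict(entries):
    """Fold [[h1, [h2, [name, body], ...]], ...] into {h1: {h2: {name: body}}}.

    Hypothetical helper; assumes exactly one h2 group per h1 entry.
    """
    tree = {}
    for h1_text, h2_group in entries:
        # The first item of the group is the h2 text, the rest are [name, body] pairs.
        h2_text, div_pairs = h2_group[0], h2_group[1:]
        tree[h1_text] = {h2_text: {name: body for name, body in div_pairs}}
    return tree


tree = to_nested_dict(parsed)
print(tree["PinkH1-2"]["PinkH2-2"]["Segment1"])  # PinkSegmentDIV1-2
```

If two div entries under the same h2 share a name, the later one overwrites the earlier one in this sketch, so check your real data for duplicate names first.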