<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
> <tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
> <tag1 name="joe"/>
> <tag1 name="jack"/>
> <tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>

A pyparsing solution may not be a speed demon to run, but it doesn't take
long to write. Some short explanatory comments:

- makeHTMLTags returns a tuple of opening and closing tag expressions. This
  example does not use any closing tags, so it is simpler to discard them
  and keep only the zeroth return value.
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
  tag, which is presumably ignorable.
- Each match returned by the searchString generator includes named fields
  for the different tag attributes, making it easy to access the name and
  value attributes.
- The expression generated by makeHTMLTags will also handle tags with other
  surprising attributes that we didn't anticipate (such as
  "<br clear='all'/>" or
  "<tag2 value='adj__short__' modifier='adv__very__'/>").
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
  simple string slicing gets us the data we want.

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


from pyparsing import makeHTMLTags

# sample input, copied from the question
data = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

# keep only the opening-tag expression from each (open, close) tuple
tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

for tokens in patt.searchString(data):
    # [5:-2] strips the leading "adj__" and the trailing "__"
    print("%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2]))


Prints:

john, tall
jack, short


Printing tokens.dump() for the last match gives:

['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__
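
One caveat on the slicing step: [5:-2] only works while every value has a three-letter prefix such as "adj" or "adv". If longer prefixes might show up, something like the sketch below could be used instead. Note that strip_marker is a made-up helper name, not part of pyparsing, and it assumes the values always look like prefix__word__.

```python
def strip_marker(value):
    """Return the word between the double underscores.

    Hypothetical helper (not part of pyparsing). Assumes values shaped like
    'prefix__word__'; anything else is returned unchanged.
    """
    # split at the first "__": 'adj__tall__' -> ('adj', '__', 'tall__')
    prefix, sep, rest = value.partition("__")
    if sep and rest.endswith("__"):
        return rest[:-2]  # drop the trailing "__"
    return value


for sample in ("adj__tall__", "adv__very__", "plain"):
    print(strip_marker(sample))
```

In the loop from the post, this would take the place of the slice, i.e. strip_marker(tokens.startTag2.value).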