I'm kind of new to regular expressions, and I've spent hours trying to finesse a regular expression to build a substitution.
What I'd like to do is extract data elements from HTML and structure them so that they can be imported into a database more readily. No -- sorry -- I don't want to use BeautifulSoup (though I have for other projects). Humor me, please -- I'd really like to see if this can be done with just regular expressions.

Note that the output is referenced using named groups. My challenge is successfully matching the HTML tags between the first table row and the second table row. I'd appreciate any suggestions to improve the approach.

    import re

    rText = "<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>"

    rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME: \g<zname>\n', rText)

    print(rText)

    LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
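
For comparison, here is a minimal sketch of one way the row-by-row match might be done with regular expressions alone. It rests on a few assumptions: every row has exactly the four cells shown in the sample (license, linked name, parish, zip), the stray spaces in the sample are line-wrap artifacts rather than part of the data, and the group names zparish and zzip are made up here to go with the PARISH and ZIP fields in the output. The main change is that the name is matched with [^<]+, which stops at the next tag, whereas a class containing both \s and \S matches any character and, being greedy, runs on into the following rows. The trailing .+$ is dropped for the same reason, so the pattern matches one row at a time and sub() replaces each row in turn.

```python
import re

# Sample rows from the post, with the line-wrap spaces removed
# (assumption: those spaces are not in the real data).
html = (
    "<tr><td valign=top>8583</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=8583>"
    "New Horizon Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<tr><td valign=top>9371</td>"
    "<td valign=top><a href=lic_details.asp?lic_number=9371>"
    "Career Learning Center</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
)

# One pattern per table row. [^<]+ cannot cross a tag boundary,
# so each group stays inside its own cell.
# zparish and zzip are hypothetical group names for this sketch.
row_pattern = re.compile(
    r"<tr><td valign=top>(?P<zlicense>\d+)</td>"
    r"<td valign=top><a href=lic_details\.asp\?lic_number=\d+>"
    r"(?P<zname>[^<]+)</a></td>"
    r"<td valign=top>(?P<zparish>[^<]+)</td>"
    r"<td valign=top>(?P<zzip>\d+)</td></tr>"
)

# sub() replaces every non-overlapping match, i.e. every row.
result = row_pattern.sub(
    r"LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>\n",
    html,
)
print(result)
```

If the real pages vary in attribute order, whitespace, or number of cells, this pattern would need loosening, and rows that don't match are left in the text unchanged.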