Screw:

>>> import re
>>> html = """
... <tr>
...   <td valign=top>14313
...   </td>
...   <td valign=top><a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a>
...   </td>
...   <td valign=top>Jefferson
...   </td>
...   <td valign=top>70114
...   </td>
... </tr>
... <tr>
...   <td valign=top>8583
...   </td>
...   <td valign=top><a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a>
...   </td>
...   <td valign=top>Jefferson
...   </td>
...   <td valign=top>70114
...   </td>
... </tr>
... <tr>
...   <td valign=top>9371
...   </td>
...   <td valign=top><a href=lic_details.asp?lic_number=9371>Career RegEx Center</a>
...   </td>
...   <td valign=top>Jefferson
...   </td>
...   <td valign=top>70113
...   </td>
... </tr>"""

Hammer:

First remove line returns. Then remove extra spaces. Then insert a line return to restore logical rows on each </tr><tr> combination. For more information, see: http://www.qc4blog.com/?p=55

>>> s = re.sub(r'\n', '', html)
>>> s = re.sub(r'\s{2,}', '', s)
>>> s = re.sub('(</tr>)(<tr>)', r'\1\n\2', s)
>>> print(s)
<tr><td valign=top>14313</td><td valign=top><a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career RegEx Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>
>>> p = re.compile(
...     r"(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)"
...     r"(<td valign=top>)(<a href=lic_details\.asp)(\?lic_number=\d+)(>)"
...     r"(?P<zname>[\s\S\WA-Za-z0-9]*?)(</a>)(</td>)"
...     r"(?:<td valign=top>)(?P<zparish>[\s\WA-Za-z]+)(</td>)"
...     r"(<td valign=top>)(?P<zzip>\d+)(</td>)(</tr>)$", re.M)
>>> n = p.sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>', s)
>>> print(n)
LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113

The solution was to escape the period in the ".asp" string, i.e., "\.asp", so that it matches only a literal dot. I also had to rein in the pattern in the zname group by adding a "?" qualifier, which limits the "greediness" of the "*" metacharacter.

Now, who would like to turn that re.compile pattern into a MULTILINE expression, combining the re.M and re.X flags? The documentation says that one should be able to use the bitwise OR operator (e.g., re.M | re.X), but I sure couldn't get it to work.

Sometimes a hammer actually is the right tool if you hit the screw long and hard enough. I think I'll try to hit some more screws with my new hammer.

Good day.

On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>"
>
> rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113
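
On the open question of combining re.M and re.X: the failing attempt was not posted, so this is only a guess, but one likely culprit is that re.X makes the regex engine ignore literal whitespace in the pattern. Every "<td valign=top>" then stops matching unless the space is written as "[ ]" or "\ ". Below is a minimal sketch of the same pattern in verbose form. It keeps the named groups and character classes from the post and drops the unnamed capturing groups, which the substitution never references. The names rows, ROW, and naive are made up for this illustration.

```python
import re

# Three cleaned rows, one per line, shaped like the output of the cleanup step.
rows = "\n".join(
    "<tr><td valign=top>%s</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=%s>%s</a></td>"
    "<td valign=top>%s</td><td valign=top>%s</td></tr>"
    % (lic, lic, name, parish, zip_code)
    for lic, name, parish, zip_code in [
        ("14313", "Python Hammer Institute #2", "Jefferson", "70114"),
        ("8583", "New Screwdriver Technical Academy, Inc #4", "Jefferson", "70114"),
        ("9371", "Career RegEx Center", "Jefferson", "70113"),
    ]
)

# Under re.X a bare space in the pattern is ignored, so this never matches.
naive = re.compile(r"<td valign=top>", re.X)
print(naive.search("<td valign=top>"))  # None

# Verbose version: literal spaces are written as [ ] so re.X keeps them.
ROW = re.compile(r"""
    <tr><td[ ]valign=top>(?P<zlicense>\d+)</td>            # license number
    <td[ ]valign=top><a[ ]href=lic_details\.asp
        \?lic_number=\d+>
        (?P<zname>[\s\S\WA-Za-z0-9]*?)</a></td>            # name, non-greedy
    <td[ ]valign=top>(?P<zparish>[\s\WA-Za-z]+)</td>       # parish
    <td[ ]valign=top>(?P<zzip>\d+)</td></tr>$              # ZIP, end of row
""", re.M | re.X)

result = ROW.sub(
    r"LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>",
    rows,
)
print(result)
```

With the spaces protected this way, the bitwise OR of the two flags behaves as the documentation describes: re.M lets "$" match at the end of each row, and re.X allows the layout and comments.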