I'm trying to use pyparsing to write a screen scraper. I've defined some arbitrary HTML text as an opener and a closer; in between is the HTML data I want to extract. However, the data may contain the same characters as the closer (but not the exact same text, obviously). I'd like to get the *minimal* amount of data between them.
Here's an example (whitespace may differ):

from pyparsing import *

test=r"""<tr class="tableTopSpace"><td></td></tr>
<tr class="tableTitleDark"><td class="tableTitleDark">Job Information</td></tr><tr><td><table width="100%" border="0" cellspacing="3"><tr>
<td width="110" valign="top"><div align="right"><strong>Job Title: </strong></div></td>
<td class="ccDisplayCell">Big Old <B STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td></tr>
<tr>
<td width="110" valign="top"><div align="right"><strong>Employer: </strong></div></td>
<td width="200" nowrap class="ccDisplayCell"><table><tr><td colspan="2" valign="top">Global Megacorp</td></tr></table></td><td>
<script>
function escapecomp(){
}
"""

data=Combine(OneOrMore(Word(printables)), adjacent=False, joinString=" ")

title_open=Literal(r"""<td width="110" valign="top"><div align="right"><strong>Job Title: </strong></div></td>
<td class="ccDisplayCell">""")
title_open.suppress()
title_close=Literal(r"""</td>""")
title_close.suppress()

title=title_open + data + title_close
title2=title_open + (data | title_close)

>>> title.scanString(test).next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
>>> title2.scanString(test).next()
((['<td width="110" valign="top"><div align="right"><strong>Job Title:\n </strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man</td> </tr> <tr> <td width="110" valign="top"><div align="right"><strong>Employer: </strong></div></td> <td width="200" nowrap class="ccDisplayCell"><table><tr><td colspan="2" valign="top">Global Megacorp</td></tr></table></td> <td> <script> function escapecomp(){ }'], {}), 182, 656)
>>>

I'd expected title to work, but it doesn't match at all. ;( In other test variants, title2 gives extra stuff at the end, though not necessarily to the end of the string (due to unprintable characters, perhaps).
I want a ParseResult more like:

['<td width="110" valign="top"><div align="right"><strong>Job Title:\n </strong></div></td>\n<td class="ccDisplayCell">', 'Big Old <B STYLE="background-color:#FFEF95">Head Honcho</B> Boss Man', '</td>']

I sort of understand why title2 works as it does (the OneOrMore just slurps up everything), but for the life of me I can't figure out how to fix it. ;) Is there a way of writing something similar to RE's ".*?" ?

--Pete

-- 
Peter Fein
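
For reference, one possible way to get the ".*?" behaviour asked about above is pyparsing's SkipTo, which consumes text only up to the first position where a target expression matches. The sketch below is a guess at a fix, not a tested answer for the real page: the `html` sample is a simplified, hypothetical stand-in for the page in the post, and the opener literal is shortened to match it. It also assumes a pyparsing version that provides SkipTo and the `scanString` spelling used in the post (newer releases also offer `scan_string`).

```python
from pyparsing import Literal, SkipTo

# Hypothetical, simplified stand-in for the scraped page in the post.
html = (
    '<tr><td class="label">Job Title: </td>'
    '<td class="ccDisplayCell">Big Old <B STYLE="background-color:#FFEF95">'
    'Head Honcho</B> Boss Man</td></tr>'
    '<tr><td class="label">Employer: </td>'
    '<td class="ccDisplayCell">Global Megacorp</td></tr>'
)

# Opener and closer, shortened to fit the sample above.
title_open = Literal('<td class="label">Job Title: </td><td class="ccDisplayCell">')
title_close = Literal("</td>")

# SkipTo stops at the first place title_close can match, so the data token
# is the minimal text between opener and closer. Unlike
# OneOrMore(Word(printables)), it cannot swallow the closer itself.
title = title_open + SkipTo(title_close) + title_close

# scanString yields (tokens, start, end) for each match; take the first.
tokens, start, end = next(title.scanString(html))
print(list(tokens))
```

The original `title` likely fails because `<`, `/` and `>` are all in `printables`, so `data` consumes the `</td>` closer along with everything after it and pyparsing does not back up to give it back. Note also that `suppress()` returns a new expression rather than changing `title_open` in place, which happens to suit the desired result here since the opener and closer are meant to stay in the output.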