Gabriel Genellina wrote:
> On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler <[EMAIL PROTECTED]>
> wrote:
>
>> On 6/20/07, Gabriel Genellina <[EMAIL PROTECTED]> wrote:
>> [snip]
>> I agree that BeautifulSoup is probably the best tool for the job, but
>> this doesn't sound right to me. Since the OP doesn't care about tags
>> being properly nested, I don't see why a regex (albeit a tricky one)
>> wouldn't work. For example:
>> [snip]
>>
>> Granted, this misses out a few things (e.g. DOCTYPE declarations), but
>> those should be straightforward to handle.
>
> It doesn't handle a lot of things. For this input (not very special,
> just a few simple mistakes):
>
> <html>
> <a href="http://foo.com/baz.html>click here</a>
> <p>What if price<100? You lose.
> <p>What if HitPoints<-10? You are dead.
> <p>Assignment: target <-- any_expression
> Just a few last words.
> </html>
>
> the BeautifulSoup version gives:
>
> click here
> What if price<100? You lose.
> What if HitPoints<-10? You are dead.
> Assignment: target <-- any_expression
> Just a few last words.
>
> and the regular expression version gives:
>
> <a href="http://foo.com/baz.html>click here
> What if priceWhat if HitPointsAssignment: target
>
> Clearly the BeautifulSoup version gives the "right" result, or the
> "expected" one.
> It's hard to get that with only a regular expression; you need more
> power, and BeautifulSoup fills the gap.

Speak for yourself. If I'm writing an HTML syntax checker, I think I'll skip BeautifulSoup and use something that gives me the results that I expect, not the results that you expect.
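
For anyone who wants to reproduce the kind of text loss described in the quoted message, here is a minimal sketch. The regex actually posted in the thread was snipped, so the pattern below is only a hypothetical stand-in (the simplest "strip anything between < and >" approach), and the names SAMPLE, NAIVE_TAG and strip_tags_naive are made up for illustration. Its output will not match the quoted regex output exactly, but it fails in the same way.

```python
import re

# The malformed sample from the quoted message.
SAMPLE = """<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
"""

# Hypothetical stand-in pattern, NOT the regex snipped from the thread:
# treat everything from a '<' up to the next '>' as a tag.
NAIVE_TAG = re.compile(r"<[^>]*>")


def strip_tags_naive(html):
    """Delete every '<...>' span, whether or not it is a real tag."""
    return NAIVE_TAG.sub("", html)


stripped = strip_tags_naive(SAMPLE)
print(stripped)
```

With this pattern the bare "<" in "price<100" opens a "tag" that runs all the way to the ">" of the next <p>, so "You lose." is swallowed along with it. Whether that counts as wrong depends, as argued above, on what result you expect from the tool.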