On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler <[EMAIL PROTECTED]> wrote:

> On 6/20/07, Gabriel Genellina <[EMAIL PROTECTED]> wrote:
>> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <[EMAIL PROTECTED]>
>> wrote:
>>
>> > i have that string "<html>hello</a>world<anytag>ok" and i want to
>> > extract all the text , without html tags , the result should be some
>> > thing like that : helloworldok
>>
>> You can't use a regular expression for this task (no matter how
>> complicated you write it).
> [snip]
>
> I agree that BeautifulSoup is probably the best tool for the job, but
> this doesn't sound right to me. Since the OP doesn't care about tags
> being properly nested, I don't see why a regex (albeit a tricky one)
> wouldn't work. For example:
>
> regex = re.compile(r'''
>     <[^!]           # beginning of normal tag
>     ([^'">]*        # unquoted text...
>     |'[^']*'        # or single-quoted text...
>     |"[^"]*")*      # or double-quoted text
>     >               # end of tag
>     |<!--           # beginning of comment
>     ([^-]|-[^-])*
>     --\s*>          # end of comment
>     ''', re.VERBOSE)
> text = regex.sub('', html)
>
> Granted, this misses out a few things (e.g. DOCTYPE declarations), but
> those should be straightforward to handle.

It doesn't handle a lot of things. For this input (nothing very special,
just a few simple mistakes):

<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>

the BeautifulSoup version gives:

click here
What if price<100? You lose.
What if HitPoints<-10? You are dead.
Assignment: target <-- any_expression
Just a few last words.

and the regular expression version gives:

<a href="http://foo.com/baz.html>click here
What if priceWhat if HitPointsAssignment: target

Clearly the BeautifulSoup version gives the "right" result, or at least
the "expected" one: the regex leaves the broken <a> tag in place and
swallows everything from each stray "<" up to the next ">". It's hard to
get the expected result with only a regular expression; you need more
power, and BeautifulSoup fills the gap.

-- 
Gabriel Genellina
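
P.S. For anyone who wants to reproduce the regex half of the comparison,
here is a minimal sketch. The pattern is the one quoted above; the names
TAG_RE, strip_tags_with_regex and SAMPLE are only my own labels for this
illustration. The BeautifulSoup half is left out because it needs a
third-party package.

```python
import re

# The pattern proposed earlier in the thread, unchanged.
TAG_RE = re.compile(r'''
    <[^!]           # beginning of normal tag
    ([^'">]*        # unquoted text...
    |'[^']*'        # or single-quoted text...
    |"[^"]*")*      # or double-quoted text
    >               # end of tag
    |<!--           # beginning of comment
    ([^-]|-[^-])*
    --\s*>          # end of comment
    ''', re.VERBOSE)


def strip_tags_with_regex(html):
    """Delete everything the pattern considers a tag or a comment."""
    return TAG_RE.sub('', html)


# The slightly broken input from this message: an unclosed attribute
# quote and a few bare "<" characters inside ordinary text.
SAMPLE = """<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
"""

# The original poster's string comes out as requested.
print(strip_tags_with_regex("<html>hello</a>world<anytag>ok"))  # helloworldok

# The broken input does not: the <a ...> tag survives, and the text
# after each bare "<" is deleted up to the next ">".
print(strip_tags_with_regex(SAMPLE))
```

The first call shows why the regex looks adequate on tidy input; the
second shows where it silently drops text.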