On 6/20/07, Gabriel Genellina <[EMAIL PROTECTED]> wrote:
> En Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <[EMAIL PROTECTED]>
> wrote:
>
> > i have that string "<html>hello</a>world<anytag>ok" and i want to
> > extract all the text, without html tags, the result should be
> > something like that: helloworldok
> >
> > i have tried that:
> >
> > from re import findall
> >
> > chaine = """<html>hello</a>world<anytag>ok"""
> >
> > print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
> > >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
> >
> > the result is not correct! what would be the correct regex to use?
>
> You can't use a regular expression for this task (no matter how
> complicated you write it).
[snip]
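Gabriel's advice points toward using a real parser rather than a regex. As a minimal sketch of that route, here is the standard library's HTML parser collecting only character data (shown in Python 3 syntax, where the module is html.parser; the class name TextExtractor is just an illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect all character data, ignoring tags entirely."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextExtractor()
parser.feed('<html>hello</a>world<anytag>ok')
print(''.join(parser.chunks))  # -> helloworldok
```

Unlike a hand-rolled regex, the parser also copes with comments, entity references and attribute quoting for free.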
I agree that BeautifulSoup is probably the best tool for the job, but
this doesn't sound right to me. Since the OP doesn't care about tags
being properly nested, I don't see why a regex (albeit a tricky one)
wouldn't work. For example:

import re

regex = re.compile(r'''
    <[^!]           # beginning of normal tag
    ([^'">]*        # unquoted text...
    |'[^']*'        # or single-quoted text...
    |"[^"]*")*      # or double-quoted text
    >               # end of tag
    |<!--           # beginning of comment
    ([^-]|-[^-])*
    --\s*>          # end of comment
''', re.VERBOSE)

text = regex.sub('', html)

Granted, this misses a few things (e.g. DOCTYPE declarations), but
those should be straightforward to handle.

-- 
David
-- 
http://mail.python.org/mailman/listinfo/python-list
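For what it's worth, applying that regex to the OP's original string does produce the result he asked for (a quick check, using Python 3 print syntax):

```python
import re

# David's tag-stripping regex from the post above
# (DOCTYPE declarations and similar are still unhandled)
regex = re.compile(r'''
    <[^!]           # beginning of normal tag
    ([^'">]*        # unquoted text...
    |'[^']*'        # or single-quoted text...
    |"[^"]*")*      # or double-quoted text
    >               # end of tag
    |<!--           # beginning of comment
    ([^-]|-[^-])*
    --\s*>          # end of comment
''', re.VERBOSE)

chaine = '<html>hello</a>world<anytag>ok'
print(regex.sub('', chaine))  # -> helloworldok
```

The OP's own attempt failed because findall was matching fragments of text *between* tags, whereas the fix is to match the tags themselves and delete them with sub.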