On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <[EMAIL PROTECTED]> wrote:

> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text , without html tags , the result should be some
> thing like that : helloworldok
>
> i have tried that :
>
>     from re import findall
>
>     chaine = """<html>hello</a>world<anytag>ok"""
>
>     print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
> >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct ! what would be the correct regex to use ?

You can't use a regular expression for this task (no matter how
complicated you write it).
Use BeautifulSoup, which can handle invalid HTML like yours:

py> from BeautifulSoup import BeautifulSoup
py> chaine = """<html>hello</a>world<anytag>ok"""
py> soup = BeautifulSoup(chaine)
py> soup.findAll(text=True)
[u'hello', u'world', u'ok']

Get it from <http://www.crummy.com/software/BeautifulSoup/>

-- 
Gabriel Genellina
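
For readers on Python 3 without BeautifulSoup installed, here is a minimal
sketch of the same idea (let a parser find the tags and keep only the text
between them) using the standard library's html.parser instead. This is a
swapped-in technique, not what the reply above uses; the names TextExtractor
and extract_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text found between tags.
        self.chunks.append(data)


def extract_text(markup):
    """Return the list of text fragments found in markup."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.chunks


chaine = """<html>hello</a>world<anytag>ok"""
chunks = extract_text(chaine)
print(chunks)           # ['hello', 'world', 'ok']
print("".join(chunks))  # helloworldok
```

html.parser does not repair the document tree the way BeautifulSoup does, so
this sketch only fits cases where a flat list of text fragments is enough.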