On Apr 2, 4:05 pm, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
> > If the OP is constrained to standard libraries, then it may be a
> > question of defining what should be done more clearly. The extraneous
> > spaces can be removed by tokenizing the string and rejoining the
> > tokens. Replacing portions of a string with equivalents is standard
> > stuff. It might be preferable to create a function that will accept
> > lists of from and to strings and translate the entire string by
> > successively applying the replacements. From what I've seen so far,
> > that would be all the OP needs for this task. It might take a
> > half-dozen lines of code, plus the from/to table definition.
>
> The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
> code to clean up. Because your simple replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.
>
> Diez
But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:

    re.sub(r'<[^>]+>', '', s)

For whitespace:

    re.sub(r'\s+', ' ', s)

For character entities like &eacute; and numeric references like &#233;:

    re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s)
    re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s)

That's pretty much it. I'd like to see how this transformation can be
done with BeautifulSoup. Well, the last two regexps can be replaced
with this:

    unicode(BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])
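Putting the regexes above together, here is a rough sketch of how the whole cleanup might look as one function. It is written for Python 3, so it swaps unichr and htmlentitydefs for chr and html.entities, and the function name strip_markup is made up for illustration. It also decodes named and numeric references in a single pass, which is my own change so that something like &amp;#233; is not decoded twice. Like the regexes it is built from, it only removes things that look like tags and is not a real HTML parser.

```python
import re
from html.entities import name2codepoint


def strip_markup(s):
    """Remove tag-like text, decode entities, collapse whitespace (sketch)."""
    # Drop anything that looks like a tag, e.g. <br>, </br>, <a href="...">.
    s = re.sub(r'<[^>]+>', '', s)

    def decode(mo):
        ref = mo.group(1)
        try:
            if ref.startswith('#'):
                # Decimal character reference such as &#233;
                return chr(int(ref[1:]))
            # Named entity such as &eacute;
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the text as it was.
            return mo.group(0)

    # Single pass over both kinds of reference, so decoded output is not
    # scanned again.
    s = re.sub(r'&(#\d+|\w+);', decode, s)

    # Collapse whitespace runs; the strip() is an addition, not in the post.
    return re.sub(r'\s+', ' ', s).strip()


print(strip_markup('<br>foo <br> bar </br>'))  # prints: foo bar
```

If only the entity decoding is needed, Python 3's standard library also provides html.unescape, which may be a simpler choice than the hand-written substitution.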