Just Another Victim of the Ambient Morality wrote:
> I've done a google search on this but, amazingly, I'm the first guy to
> ever need this!

You cannot infer that from a Google search.

> So, how do I convert HTML to plaintext?  Something like this:
>
> <div>This is a string.</div>
>
> ...into:
>
> This is a string.
>
> Actually, the ideal would be a function that takes an HTML string and
> convert it into a string that the HTML would correspond to.  For instance,
> converting:
>
> <div>This & that
> or the other thing.</div>
>
> ...into:
>
> This & that or the other thing.
>
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).

So what you want to do is parse the HTML and extract its text content. There
are quite a few ways to do that, including lxml.html:

http://codespeak.net/lxml/dev/lxmlhtml.html

    >>> htmldata = """<div>This & that
    ... or the other thing.</div>"""
    >>> from lxml import html
    >>> print(html.fragment_fromstring(htmldata).text_content())

Stefan
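
For readers without lxml, the same idea (parse the HTML, keep only the text nodes, then collapse whitespace runs the way the original poster asked) can be sketched with the standard library alone. This is only a minimal sketch and not part of the original reply: it assumes Python 3's html.parser module, and the names TextExtractor and html_to_text are hypothetical helpers made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)


def html_to_text(htmldata):
    """Return the text content of htmldata with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(htmldata)
    parser.close()
    # str.split() with no argument splits on any run of whitespace,
    # so joining with a single space normalizes newlines, tabs and spaces
    return " ".join("".join(parser.chunks).split())


print(html_to_text("<div>This &amp; that\n   or the other thing.</div>"))
# prints: This & that or the other thing.
```

Note that this sketch keeps every text node, so the contents of script and style elements would end up in the result too; a real converter would probably want to skip those.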