[wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]
You haven't specified what you mean by "extracting" ASCII, but I'll assume that you want to start by eliminating HTML tags and comments, which is easy enough with a couple of regular expressions:
>>> import re
>>> comments = re.compile('<!--.*?-->', re.DOTALL)
>>> tags = re.compile('<.*?>', re.DOTALL)
>>> def striptags(text):
...     text = re.sub(comments, '', text)
...     text = re.sub(tags, '', text)
...     return text
...
>>> def collapsenewlines(text):
...     return "\n".join(line for line in text.splitlines() if line)
...
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> source = f.read()
>>> text = collapsenewlines(striptags(source))
>>>
This will of course misbehave if there is a "<" without a matching ">", and probably in other cases too. But it does not care whether the HTML is well-formed.
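To make the stray "<" problem concrete, here is a small sketch that repeats the two patterns from the session and feeds them some made-up input (the sample strings are mine, purely for illustration):

```python
import re

# Same patterns as in the session: non-greedy, and DOTALL so that
# tags and comments spanning several lines are matched too.
comments = re.compile('<!--.*?-->', re.DOTALL)
tags = re.compile('<.*?>', re.DOTALL)

def striptags(text):
    text = comments.sub('', text)   # remove comments first
    text = tags.sub('', text)       # then everything that looks like a tag
    return text

# Ordinary markup is handled as intended.
print(repr(striptags("<p>hi<!-- note --></p>")))   # 'hi'

# A literal "<" followed later by a real tag: everything from the
# stray "<" up to the next ">" is swallowed, including " b and ".
print(repr(striptags("a < b and <i>c</i>")))       # 'a c'

# A literal "<" with no ">" after it is simply left in the output.
print(repr(striptags("x < y")))                    # 'x < y'
```

So the damage is either lost text or a leftover "<", depending on what follows it.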
This leaves you with the additional task of substituting the HTML character entities, e.g. "&nbsp;", not all of which will have ASCII representations.
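One possible way to handle the entities, as a rough sketch only: it assumes Python 3 (for html.unescape in the standard library, unlike the Python 2 session above), and the helper name to_ascii is my own invention. It decodes the entities, lets Unicode normalization map what it can to plain characters, and silently drops the rest, which may or may not be what you want.

```python
import html
import unicodedata

def to_ascii(text):
    """Hypothetical helper: decode HTML entities, then reduce to ASCII."""
    # Turn entities such as &amp; or &nbsp; into the characters they name.
    text = html.unescape(text)
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters (e.g. the no-break space) to plain ones.
    text = unicodedata.normalize("NFKD", text)
    # Anything that still has no ASCII representation is dropped.
    return text.encode("ascii", "ignore").decode("ascii")

print(to_ascii("caf&eacute; &amp; bar&nbsp;x"))   # cafe & bar x
print(to_ascii("&copy;2008"))                     # 2008
```

Note that this should run after the tags are stripped, since "&lt;" and "&gt;" decode to "<" and ">" and would otherwise be mistaken for markup.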
HTH
Michael
-- http://mail.python.org/mailman/listinfo/python-list