On Sat, 10 Jul 2010 16:24:23 +0000, mattia wrote:
> Hi all, I'm using py3k and the urllib package to download web pages. Can
> you suggest a package that can translate HTML character entities such as
> "&egrave;", "&ograve;", "&eacute;" into the corresponding correctly
> encoded characters?
>
> Thanks,
> Mattia

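For the entity question quoted above, here is a minimal sketch. It assumes Python 3.4 or newer, where the standard library provides html.unescape (that function did not exist in the py3k releases of 2010). The names decode_entities and sample are only illustrative.

```python
import html


def decode_entities(text):
    """Convert named and numeric HTML character references to characters."""
    # html.unescape handles named references (&egrave;) as well as
    # decimal (&#8364;) and hexadecimal (&#x20ac;) ones in a single pass.
    return html.unescape(text)


# Illustrative input only; any str containing character references works.
sample = "perch&eacute; &egrave; cos&igrave;: 5 &#8364;"
decoded = decode_entities(sample)
```

Note that the function works on already decoded text (str), so the bytes read from the network still have to be decoded with the page's charset first.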
Basically I'm trying to get an HTML page and strip out all the tags to obtain just plain text. John Nagle and Christian Heimes somehow figured out what I'm trying to do ;-)

Here is what I've done so far, thanks to your suggestions:

    import lxml.html
    import lxml.html.clean
    import urllib.request
    import urllib.parse
    from html.entities import entitydefs
    import re
    import sys

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"}

    def replace(m):
        # Map a named entity such as &egrave; to its character.
        if m.group(1) in entitydefs:
            return entitydefs[m.group(1)]
        else:
            # Unknown name: leave the original text (including & and ;) untouched.
            return m.group(0)

    def test(page):
        req = urllib.request.Request(page, None, HEADERS)
        response = urllib.request.urlopen(req)
        # Use the charset from the Content-Type header, falling back to iso-8859-1.
        charset = response.info().get_content_charset()
        if charset is not None:
            html = response.read().decode(charset)
        else:
            html = response.read().decode("iso-8859-1")
        html = re.sub(r"&(\w+);", replace, html)
        cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, style=True)
        html = cleaner.clean_html(html)
        # create the element tree
        tree = lxml.html.document_fromstring(html)
        txt = tree.text_content()
        for x in txt.split():
            # DOS shell is not able to print characters like u'\u20ac' - why???
            try:
                print(x)
            except UnicodeEncodeError:
                continue

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            print("Usage:", sys.argv[0], "<webpage>")
            print("Example:", sys.argv[0], "http://www.bing.com")
            sys.exit()
        test(sys.argv[1])

Any further tips will be appreciated.

Ciao,
Mattia
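P.S. On the "why???" in the comment about the DOS shell: a likely explanation, stated as an assumption, is that the console uses a legacy code page such as cp437 or cp850, which has no euro sign, so print() raises UnicodeEncodeError when it tries to encode the character. A minimal sketch of a workaround that substitutes unencodable characters instead of skipping the whole word (console_safe is a hypothetical helper name, not part of any library):

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the console's reported encoding, then to ASCII.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes '?' for anything that cannot be encoded,
    # so the round trip never raises UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)
```

In the loop above this would be used as print(console_safe(x)), so that a word containing one unprintable character still shows up with a placeholder.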