On Sat, 10 Jul 2010 18:09:12 +0100, MRAB wrote:

> mattia wrote:
>> Hi all, I'm using py3k and the urllib package to download web pages.
>> Can you suggest a package that can translate reserved characters in
>> HTML like "è", "ò", "é" into the corresponding
>> correct encoding?
>>
> import re
> from html.entities import entitydefs
>
> # The downloaded web page will be bytes, so decode it to a string.
> webpage = downloaded_page.decode("iso-8859-1")
>
> # Then decode the HTML entities.
> webpage = re.sub(r"&(\w+);", lambda m: entitydefs[m.group(1)], webpage)

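A side note for anyone trying the snippet quoted above: as far as I can tell, the lambda raises KeyError as soon as the page contains something that looks like an entity but is not in entitydefs. Below is a minimal sketch of a more forgiving variant. The helper name decode_entities and the sample bytes are my own placeholders, not part of MRAB's code. The last line uses html.unescape, which is in the standard library from Python 3.4 onwards and also handles numeric references such as &#233;, so on a newer interpreter it may be all you need.

```python
import html
import re
from html.entities import entitydefs


def decode_entities(text):
    """Replace named HTML entities with the characters they stand for."""
    def replace(match):
        # Leave the text untouched when the name is not a known entity,
        # instead of raising KeyError.
        return entitydefs.get(match.group(1), match.group(0))
    return re.sub(r"&(\w+);", replace, text)


# Hypothetical sample standing in for the bytes returned by urllib.
downloaded_page = b"perch&eacute; &egrave; cos&igrave; &unknown;"

# Decode the bytes first, then the entities (same two steps as above).
webpage = decode_entities(downloaded_page.decode("iso-8859-1"))
print(webpage)

# Alternative on Python 3.4+: html.unescape covers named and numeric forms.
print(html.unescape("&egrave; &#233;"))
```

The iso-8859-1 codec is carried over from the quoted code; a real page may declare a different charset in its headers, in which case that one should be used instead.
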
Thanks, very useful. I didn't know about the entitydefs dictionary.

-- 
http://mail.python.org/mailman/listinfo/python-list