Re: Web page special characters encoding

John Nagle Sat, 10 Jul 2010 14:52:57 -0700

On 7/10/2010 2:03 PM, mattia wrote:

On Sat, 10 Jul 2010 18:09:12 +0100, MRAB wrote:

mattia wrote:

Hi all, I'm using py3k and the urllib package to download web pages.
Can you suggest a package that can translate HTML character entities
such as "&egrave;", "&ograve;", "&eacute;" into the corresponding
characters?

import re
from html.entities import entitydefs

# The downloaded web page will be bytes, so decode it to a string.
webpage = downloaded_page.decode("iso-8859-1")

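# Then decode the named HTML entities, leaving unknown names unchanged
# (a plain entitydefs[...] lookup would raise KeyError on them).
webpage = re.sub(r"&(\w+);", lambda m: entitydefs.get(m.group(1), m.group(0)), webpage)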


Thanks, very useful, didn't know about the entitydefs dictionary.


   You also need to decode the HTML numeric escapes, such as "&#233;"
and "&#xE9;".  Expect that in real-world HTML, out-of-range values will
occasionally appear.
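
As a rough sketch of that advice (not code from this thread), the function
below handles named, decimal, and hexadecimal references in one pass and
leaves anything it cannot decode unchanged. The name decode_entities and the
choice to keep bad references as-is are assumptions made for illustration;
other policies, such as substituting U+FFFD, are equally reasonable.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")

def decode_entities(text):
    """Replace named and numeric HTML references; keep undecodable ones."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                codepoint = int(body[2:], 16)    # hexadecimal escape
            elif body.startswith("#"):
                codepoint = int(body[1:])        # decimal escape
            else:
                codepoint = name2codepoint[body]  # named entity
            return chr(codepoint)
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return _ENTITY_RE.sub(replace, text)
```

Called on a decoded page, decode_entities(webpage) would replace the regex
step above. On Python 3.4 and later, the standard library's html.unescape()
covers the same ground and is probably the simpler choice.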

                                        John Nagle

--
http://mail.python.org/mailman/listinfo/python-list
