Ezio Melotti <ezio.melo...@gmail.com> added the comment:

> Christian Heimes wrote:
>   There is no generic and simple way to detect the encoding of a
>   remote site. Sometimes the encoding is mentioned in the HTTP header,
>   sometimes it's embedded in the <head> section of the HTML document.

FWIW, for HTML pages the encoding can be specified in at least 3 places:
* the HTTP headers, e.g. "Content-Type: text/html; charset=utf-8";
* the XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>;
* the <meta> tag, e.g. <meta http-equiv="Content-Type"
  content="text/html; charset=utf-8">.

Browsers usually follow this order when looking for the encoding, meaning that 
the HTTP headers have the highest priority.  The XML declaration is sometimes 
(mis)used in (X)HTML pages.
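
For the HTTP headers case, the information is in fact already reachable from 
the object urlopen() returns, since its .headers attribute is an 
email.message.Message.  A minimal sketch (the URL and the output are just 
placeholders for illustration):

>>> from urllib.request import urlopen
>>> page = urlopen('http://www.example.com/')  # placeholder URL
>>> # get_content_charset() parses the charset parameter out of the
>>> # Content-Type header, returning None if it isn't there
>>> page.headers.get_content_charset()
'utf-8'
>>> page.read().decode(page.headers.get_content_charset() or 'utf-8')
'...'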

Anyway, since urlopen() is a generic function that can download anything, it 
shouldn't look at XML declarations and meta tags -- that's something parsers 
should take care of.
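
Just to illustrate what "taking care of it at the parser level" could look 
like, here is a rough sketch that scans the markup for a charset declared in a 
<meta> tag (the class name is made up, and the permissive pre-decoding is only 
there so the tags can be scanned at all):

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class MetaCharsetParser(HTMLParser):
        """Record the first charset declared in a <meta> tag."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.charset = None
        def handle_starttag(self, tag, attrs):
            if tag != 'meta' or self.charset is not None:
                return
            attrs = dict(attrs)
            if 'charset' in attrs:                      # <meta charset="utf-8">
                self.charset = attrs['charset']
            elif attrs.get('http-equiv', '').lower() == 'content-type':
                content = attrs.get('content', '')      # ...; charset=utf-8
                if 'charset=' in content:
                    self.charset = content.partition('charset=')[2].strip()

    parser = MetaCharsetParser()
    raw_bytes = urlopen('http://www.example.com/').read()  # placeholder URL
    parser.feed(raw_bytes.decode('ascii', 'replace'))  # decode only to scan tags
    print(parser.charset)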

Regarding the implementation, wouldn't it be better to have a new 
method/attribute on the file-like object returned by urlopen()?
Maybe something like:
>>> page = urlopen(some_url)
>>> page.encoding  # get the encoding from the HTTP headers
'utf-8'
>>> page.decode()  # same as page.read().decode(page.encoding)
'...'

The advantage of adding these as a new method/attribute is that you can pass 
'page' around and other functions can get the decoded content back if/when 
they need it.  OTOH, other file-like objects don't have similar methods, so it 
might get a bit confusing.
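
For reference, here is a very rough sketch of what those two additions could 
amount to, written as a wrapper just for illustration (the class name is made 
up and is not a proposed API):

    from urllib.request import urlopen

    class DecodingResponse:  # hypothetical name, just for illustration
        def __init__(self, response, default='utf-8'):
            self.response = response
            # take the encoding from the Content-Type HTTP header, if any
            self.encoding = response.headers.get_content_charset() or default
        def decode(self):
            # same as read().decode(self.encoding), as in the example above
            return self.response.read().decode(self.encoding)
        def __getattr__(self, name):
            # delegate everything else to the wrapped file-like object
            return getattr(self.response, name)

    some_url = 'http://www.example.com/'  # placeholder URL
    page = DecodingResponse(urlopen(some_url))
    text = page.decode()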

----------
versions: +Python 3.3 -Python 3.2
