Ezio Melotti <ezio.melo...@gmail.com> added the comment:

> Christian Heimes wrote:
> There is no generic and simple way to detect the encoding of a
> remote site. Sometimes the encoding is mentioned in the HTTP header,
> sometimes it's embedded in the <head> section of the HTML document.
FWIW for HTML pages the encoding can be specified in at least 3 places:

* the HTTP headers: e.g. "content-type: text/html; charset=utf-8";
* the XML declaration: e.g. "<?xml version="1.0" encoding="utf-8" ?>";
* the <meta> tag: e.g. "<meta http-equiv="Content-Type" content="text/html; charset=utf-8">".

Browsers usually follow this order when looking for the encoding, meaning that the HTTP headers have the highest priority. The XML declaration is sometimes (mis)used in (X)HTML pages. Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.

Regarding the implementation, wouldn't it be better to have a new method and attribute on the file-like object returned by urlopen()? Maybe something like:

>>> page = urlopen(some_url)
>>> page.encoding  # get the encoding from the HTTP headers
'utf-8'
>>> page.decode()  # same as page.read().decode(page.encoding)
'...'

The advantage of having these as a new method/attribute is that you can pass the 'page' around and other functions can get back the decoded content if/when they need to. OTOH other file-like objects don't have similar methods, so it might get a bit confusing. (A rough sketch of one possible wrapper is appended at the end of this message.)

----------
versions: +Python 3.3 -Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4733>
_______________________________________
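
Just to illustrate the idea above (a rough sketch only, not code that exists in urllib): the proposed page.encoding/page.decode() could be prototyped today with a small wrapper that reads the charset parameter of the Content-Type header via headers.get_content_charset(). The class name DecodingResponse, the default_encoding fallback and the example URL are made up for this sketch:

from urllib.request import urlopen

class DecodingResponse:
    """Hypothetical wrapper sketching the proposed page.encoding/page.decode()."""

    def __init__(self, response, default_encoding='utf-8'):
        self._response = response
        # The response headers are an email.message.Message subclass,
        # so they can already extract the charset= value of Content-Type.
        self.encoding = (response.headers.get_content_charset()
                         or default_encoding)

    def decode(self):
        # Same as page.read().decode(page.encoding) in the example above.
        return self._response.read().decode(self.encoding)

    def __getattr__(self, name):
        # Delegate read(), info(), geturl(), etc. to the real response,
        # so the object still behaves like the usual file-like object.
        return getattr(self._response, name)

# page = DecodingResponse(urlopen('http://www.example.com/'))
# page.encoding   # e.g. 'utf-8', taken from the HTTP headers
# page.decode()   # the decoded body

The same behavior could of course live directly on the object returned by urlopen() rather than in a separate wrapper; the sketch just shows that everything needed is already available from the response headers.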