On Tue, 17 Mar 2009 10:55:21 +0000, R. David Murray wrote:

> mattia <ger...@gmail.com> wrote:
>> Hi all, can you tell me why the module urllib.request (py3) adds extra
>> characters (b'fef\r\n and \r\n0\r\n\r\n') in a simple example like the
>> following, while urllib2 (py2.6) correctly does not?
>>
>> py2.6
>> >>> import urllib2
>> >>> f = urllib2.urlopen("http://www.google.com").read()
>> >>> fd = open("google26.html", "w")
>> >>> fd.write(f)
>> >>> fd.close()
>>
>> py3
>> >>> import urllib.request
>> >>> f = urllib.request.urlopen("http://www.google.com").read()
>> >>> with open("google30.html", "w") as fd:
>> ...     print(f, file=fd)
>> ...
>> >>>
>>
>> Opening the two html pages with ff I've got different results (the
>> extra characters mentioned earlier), why?
>
> The problem isn't a difference between urllib2 and urllib.request, it is
> between fd.write and print. This produces the same result as your first
> example:
>
> >>> import urllib.request
> >>> f = urllib.request.urlopen("http://www.google.com").read()
> >>> with open("temp3.html", "wb") as fd:
> ...     fd.write(f)
>
> The "b'....'" is the stringified representation of a bytes object, which
> is what urllib.request returns in python3. Note the 'wb', which is a
> critical difference from the python2.6 case. If you omit the 'b' in
> python3, it will complain that you can't write bytes to the file object.
>
> The thing to keep in mind is that print converts its argument to string
> before writing it anywhere (that's the point of using it), and that
> bytes (or buffer) and string are very different types in python3.
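The write-versus-print distinction can be checked without touching the network. This is only a minimal sketch: the hard-coded `body` value is an assumption standing in for what urlopen(...).read() returns, and the in-memory `io.StringIO` and `io.BytesIO` objects stand in for files opened with "w" and "wb".

```python
import io

# Hypothetical stand-in for urllib.request.urlopen(...).read(), which returns bytes
body = b"<html>caf\xc3\xa9</html>"

# print() converts its argument with str(), so a text-mode file receives
# the b'...' representation of the bytes, escapes included
text_file = io.StringIO()          # stands in for open(..., "w")
print(body, file=text_file)
printed = text_file.getvalue()

# write() on a binary-mode file stores the bytes exactly as received
binary_file = io.BytesIO()         # stands in for open(..., "wb")
binary_file.write(body)
written = binary_file.getvalue()
```

Here `printed` starts with the characters b' and ends with a quote plus the newline that print appends, while `written` is byte-for-byte identical to `body`.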
In order to get the correct encoding I've come up with this:

>>> response = urllib.request.urlopen("http://www.google.com")
>>> print(response.read().decode(response.headers.get_charsets()[0]))
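One caveat with indexing get_charsets() directly: if the server sends no charset parameter, the first entry is None and decode() raises a TypeError. A possible variation, offered as a sketch rather than the definitive way to do it, uses get_content_charset() with a fallback. The helper name `decode_body` and the UTF-8 default are assumptions made for illustration, and a plain `email.message.Message` stands in for `response.headers` so the example runs offline.

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode body with the charset named in Content-Type, else a fallback.

    decode_body is a hypothetical helper; the utf-8 default is an assumption.
    """
    # get_content_charset() returns None when no charset parameter is present
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors="replace")


# Stand-in for response.headers, which is also an email.message.Message subclass
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
page = decode_body(b"caf\xe9", headers)

# With no Content-Type header at all, the fallback charset is used
plain = decode_body(b"ok", Message())
```

With a live response the call would look like decode_body(response.read(), response.headers).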