On Dec 22, 3:41 pm, "Glenn G. Chappell" <glenn.chapp...@gmail.com> wrote:
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
>
>     f = urllib.request.urlopen(url)
>     text = f.read()
>
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
>
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.
Well, I can't agree that it's an obvious bug (in Python 3). It might
be something worth raising a warning over in 2to3. Automatic encoding
detection and conversion to a string would also be a reasonable
wishlist item (see below). But it's not a bug.

> But apparently others disagree? This was mentioned in issue 3930
> (http://bugs.python.org/issue3930) back in September '08, but that
> issue is now closed, apparently because consistent behavior was
> achieved. But I figure consistently bad behavior is still bad.
>
> This change breaks pretty much every Python program that opens a
> webpage, doesn't it?

No. What if someone is using urllib to retrieve (say) a JPEG image? A
bytes object is what they'd want in Python 3. Also, many people were
already explicitly dealing with encodings in Python 2.5; the change
wouldn't affect them.

> 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?

It returns bytes because it doesn't know what encoding to use. This is
the appropriate behavior.

HOWEVER... a web page request often does know what encoding to use,
since it ostensibly has to parse the header. IF a url request's
"Content-type" is text, and/or the "Content-encoding" is given, it
would be reasonable for urllib to have an option to automatically
decode and return a string instead of bytes. (For all I know, it
already can do that.)

Carl Banks
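
For what it's worth, here is a rough sketch of doing that decoding by
hand in Python 3. It assumes the server declares the encoding as the
"charset" parameter of the Content-Type header, which is where it
normally travels. The helper names decode_body and fetch_text and the
UTF-8 fallback are my own choices for illustration, not part of urllib.

```python
import urllib.request
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type value.

    Falls back to `default` (an assumption, not a urllib rule) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset(failobj=default)
    # errors="replace" keeps a mislabelled page from raising.
    return raw.decode(charset, errors="replace")


def fetch_text(url, default="utf-8"):
    """Fetch a URL and return str; refuse non-text responses."""
    with urllib.request.urlopen(url) as f:
        # f.headers is an email.message.Message subclass.
        if f.headers.get_content_maintype() != "text":
            raise ValueError("not a text response: "
                             + f.headers.get_content_type())
        return decode_body(f.read(), f.headers.get("Content-Type", ""),
                           default)
```

Something fetching a JPEG would keep calling f.read() and get bytes,
so the explicit helper doesn't get in the way of that case.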