John Nagle wrote: > The code in urllib.quote fails on Unicode input, when > called by robotparser. > > That bit of code needs some attention. > - It still assumes ASCII goes up to 255, which hasn't been true in > Python > for a while now. > - The initialization may not be thread-safe; a table is being > initialized > on first use. The code is too clever and uncommented. > > "robotparser" was trying to check if a URL, > "http://www.highbeam.com/DynamicContent/%E2%80%9D/mysaved/privacyPref.asp%22" > could be accessed, and there are some wierd characters in there. Unicode > URLs are legal, so this is a real bug. > > Logged in as Bug #1712522.
There has been a related discussion: http://groups.google.com/group/comp.lang.python/browse_thread/thread/b331dc3625dbfc41/ce6e6a3c0635e340 IIRC the outcome was that while UTF-8 is recommended urllib.quote()/unquote() should not guess the encoding. What changes that would imply for robotparser I don't know... Peter -- http://mail.python.org/mailman/listinfo/python-list