Senthil Kumaran added the comment:

Hi Eduardo,
I tested further and do observe some very strange oddities.

On Mon, Sep 10, 2012 at 10:45 PM, Eduardo A. Bustamante López
<rep...@bugs.python.org> wrote:

> Also, I'm aware that you shouldn't normally worry about setting a specific
> user-agent to fetch the file. But that's not the case of Wikipedia. In my
> case, Wikipedia returned 403 for the urllib user-agent.

Yeah, this really surprised me. I would normally assume robots.txt to be
readable by any agent, but I think something odd is happening.

In 2.7, I do not see the problem, because the implementation is:

    import urllib

    class URLOpener(urllib.FancyURLopener):
        def __init__(self, *args):
            urllib.FancyURLopener.__init__(self, *args)
            self.errcode = 200

    opener = URLOpener()
    fobj = opener.open('http://en.wikipedia.org/robots.txt')
    print opener.errcode

This will print 200 and everything is fine. Also, note that robots.txt is
accessible.

In 3.3, the implementation is:

    import urllib.error
    import urllib.request

    try:
        fobj = urllib.request.urlopen('http://en.wikipedia.org/robots.txt')
    except urllib.error.HTTPError as err:
        print(err.code)

This gives 403. I would normally expect this to work without any issues.
But according to my analysis, when the User-agent is set to something that
contains a '-', the server rejects the request with 403.

What the above code does under the hood is this:

    import urllib.request

    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Python-urllib/3.3')]
    fobj = opener.open('http://en.wikipedia.org/robots.txt')
    print(fobj.getcode())

This also fails with 403 (raised as urllib.error.HTTPError by opener.open()).

In order to see it work, change the addheaders line to any one of:

    opener.addheaders = [('', '')]
    opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
    opener.addheaders = [('User-agent', 'KillerSpamBot')]

All of these work (as expected).

So the thing that surprises me is whether sending "Python-urllib/3.3" is a
mistake for "THAT server". Is this a server oddity on Wikipedia's side?
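The header experiment above can be repeated without touching the network. The following is only a minimal sketch: the stub server and its rejection rule (403 for any agent starting with "Python-urllib") are assumptions made for illustration and are not a statement about how Wikipedia's servers actually decide. The names StubHandler, probe and run_demo are hypothetical helpers, not part of urllib.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Stand-in server. Assumed rule: agents starting with 'Python-urllib' get 403."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 403 if agent.startswith("Python-urllib") else 200
        body = b"User-agent: *\nDisallow:\n" if status == 200 else b"Forbidden\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request logging on stderr


def probe(url, agent):
    """Fetch url with the given User-agent and return the HTTP status code."""
    # An empty ProxyHandler mapping keeps environment proxies out of the test.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.addheaders = [("User-agent", agent)]
    try:
        with opener.open(url, timeout=5) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return err.code


def run_demo(agents):
    """Start the stub server on a free local port and probe it once per agent."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/robots.txt" % server.server_port
        return {agent: probe(url, agent) for agent in agents}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    agents = ["Python-urllib/3.3", "Pythonurllib/3.3", "KillerSpamBot"]
    for agent, status in run_demo(agents).items():
        print(agent, status)
```

Pointing probe() at a real URL instead of the stub would show which agent strings a given server accepts, at the cost of depending on that server's current policy.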
(I ask because I referred to hg log to see since when we have been sending
Python-urllib/version, and it seems that it has been sent for a long time.)
I can't see how this should be fixed in urllib.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15851>
_______________________________________