Bugs item #813986, was opened at 2003-09-28 13:06
Message generated for change (Comment added) made by nagle
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=813986&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Martin v. Löwis (loewis)
Summary: robotparser interactively prompts for username and password

Initial Comment:
This is a rare occurrence, but if a /robots.txt file is password-protected on an http server, robotparser interactively prompts (via raw_input) for a username and password, because that is urllib's default behavior. One example of such a URL, at least at the time of this writing, is http://www.cosc.canterbury.ac.nz/robots.txt

Given that robotparser and robots.txt are all about *robots* (not interactive users), I don't think this interactive behavior is appropriate. Attached is a simple patch to robotparser.py that fixes this behavior, forcing urllib to return the 401 error it ought to.

Another issue is whether a 401 (Authorization Required) response for /robots.txt means that everything should be allowed or everything should be disallowed. I'm not sure what's "right". The spec says 'This file must be accessible via HTTP on the local URL "/robots.txt"', which I read to mean it should be accessible without a username and password. On the other hand, the current robotparser.py code says "if self.errcode == 401 or self.errcode == 403: self.disallow_all = 1", which has the opposite effect. I'll leave deciding which is most appropriate to the powers that be.
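The attached patch is not reproduced in this message. A minimal sketch of the approach described above, assuming Python 2's urllib API (the class name and the errcode bookkeeping are illustrative, not necessarily what the attached patch does): stop FancyURLopener from retrying a 401 with credentials obtained via raw_input, and record the status code so the caller can apply the 401/403 handling discussed above.

    import urllib

    class NonPromptingURLopener(urllib.FancyURLopener):
        """Opener for robots.txt that never asks for credentials."""

        def __init__(self, *args):
            urllib.FancyURLopener.__init__(self, *args)
            self.errcode = 200

        def http_error_default(self, url, fp, errcode, errmsg, headers):
            # Remember the status so the caller can decide between
            # disallow_all (401/403) and allow_all (other codes >= 400).
            self.errcode = errcode
            return urllib.FancyURLopener.http_error_default(
                self, url, fp, errcode, errmsg, headers)

        def http_error_401(self, url, fp, errcode, errmsg, headers, data=None):
            # FancyURLopener would normally retry with credentials obtained
            # through prompt_user_passwd()/raw_input(); treat 401 like any
            # other error so no interactive prompt ever appears.
            return self.http_error_default(url, fp, errcode, errmsg, headers)

    opener = NonPromptingURLopener()
    opener.open('http://www.cosc.canterbury.ac.nz/robots.txt')
    print opener.errcode    # e.g. 401 for a password-protected robots.txt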
----------------------------------------------------------------------

Comment By: John Nagle (nagle)
Date: 2007-04-21 16:53

Message:
Logged In: YES
user_id=5571
Originator: NO

The attached patch was never integrated into the distribution. This is still broken in Python 2.4 (Win32), Python 2.5 (Win32), and Python 2.5 (Linux). This stalled an overnight batch job for us. Very annoying.

Reproduce with:

    import robotparser
    url = 'http://mueblesmoraleda.com'   # whole site is password-protected
    parser = robotparser.RobotFileParser()
    parser.set_url(url)
    parser.read()   # prompts for a password

----------------------------------------------------------------------

Comment By: Wummel (calvin)
Date: 2003-09-29 13:24

Message:
Logged In: YES
user_id=9205

http://www.robotstxt.org/wc/norobots-rfc.html specifies that the consequence of a 401 or 403 return code is to restrict the whole site (i.e. disallow_all = 1). For the password input, the patch looks good to me. In the long term, robotparser.py should switch to urllib2.py anyway, and it should handle Transfer-Encoding: gzip.

----------------------------------------------------------------------
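As a rough illustration of the urllib2 approach suggested in the previous comment (a sketch only, assuming Python 2's urllib2 API; the function name is illustrative and gzip handling is not addressed): urllib2 raises HTTPError for a 401 rather than prompting, so the disallow_all/allow_all decision can be made from the exception.

    import urllib2

    def fetch_robots_txt(url):
        # Return (errcode, lines) for a robots.txt URL without ever
        # prompting for credentials.
        try:
            f = urllib2.urlopen(url)
        except urllib2.HTTPError, err:
            return err.code, []
        # urlopen only returns once the request succeeded.
        return 200, [line.strip() for line in f.readlines()]

    errcode, lines = fetch_robots_txt('http://www.example.com/robots.txt')
    if errcode in (401, 403):
        print 'disallow_all'    # password-protected or forbidden site
    elif errcode >= 400:
        print 'allow_all'       # no robots.txt; nothing is restricted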