In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote:
>    This bug, "[ 813986 ] robotparser interactively prompts for username
> and password", has been open since 2003. It killed a big batch job of
> ours last night.
>
>    Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com'  # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read()                       # Prompts for password
>
>    That's the standard, although silly, "urllib" behavior.

John,
robotparser is (IMO) suboptimal in a few other ways, too.

- It doesn't handle non-ASCII characters. (They're infrequent, but a
spider that sees thousands of robots.txt files in a short time can turn
"infrequent" into "daily".)

- It doesn't account for BOMs in robots.txt (which are rare). There's a
decoding sketch at the end of this message.

- It ignores any Expires header sent with the robots.txt.

- It swallows some ambiguous return codes (e.g. 503) that it ought to
pass up to the caller.

I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself), and I appreciate you posting a fix. One way to sidestep the
password prompt entirely is sketched at the end of this message. Here's
my code & documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/

Cheers
-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
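
For anyone bitten by the same prompt before a fixed robotparser ships,
here's a minimal sketch of the workaround I'd reach for: fetch
robots.txt with urllib2, which raises HTTPError instead of prompting,
and feed the body to parse(). The status-code policy below is my
assumption, not anything the module specifies:

import urllib2
import robotparser

def fetch_robots(robots_url):
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        body = urllib2.urlopen(robots_url).read()
    except urllib2.HTTPError, e:
        if e.code in (401, 403):
            parser.disallow_all = True   # protected robots.txt: deny all
        elif 400 <= e.code < 500:
            parser.allow_all = True      # no robots.txt at all: allow all
        else:
            raise                        # 5xx etc.: let the caller decide
    else:
        parser.parse(body.splitlines())
    return parser

After that, can_fetch() behaves as usual, and a 503 propagates so a
batch job can retry instead of silently treating the site as open.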
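
And since BOMs and non-ASCII bytes came up: a small sketch of the kind
of cleanup the stock parser skips. Treating the body as UTF-8 with
'replace' is my assumption here; robots.txt mandates no encoding:

import codecs

def clean_robots_body(raw):
    # Strip a UTF-8 byte-order mark, which some servers prepend.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Decode permissively so one bad byte doesn't kill a batch run.
    return raw.decode('utf-8', 'replace')

Run the raw bytes through that before splitting into lines and parsing,
and the "daily" non-ASCII surprises become shrug-and-continue events.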