In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote:
> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> >
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when the robotparser gets a 401 or 403 response when trying to fetch
> > robots.txt, it is equivalent to "Disallow: *".
> >
> > http://infix.se/2006/05/17/robotparser
>
> That explains it. It's an undocumented feature of "robotparser",
> as is the 'errcode' variable. The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
> updated.

Hi John,

Robotparser is probably following the never-approved RFC for robots.txt,
which is the closest thing there is to a standard. It says, "On server
response indicating access restrictions (HTTP Status Code 401 or 403) a
robot should regard access to the site completely restricted."
http://www.robotstxt.org/wc/norobots-rfc.html

If you're interested, I have a replacement for the robotparser module
that works a little better (IMHO) and which you might also find better
documented. I'm using it in production code:
http://nikitathespider.com/python/rerp/

Happy spidering

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
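
P.S. For anyone who wants to see the effect firsthand, here's a minimal
sketch against the stdlib robotparser (Python 2.5; I haven't run this
exact snippet, and the Wikipedia URL is only an illustration, so treat
the printed output as an assumption). Note that 'errcode' is
undocumented and may not survive future versions:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.read()   # fetched with urllib's default "Python-urllib" user agent

    # If the server answered the robots.txt request with 401 or 403,
    # robotparser sets its internal disallow_all flag, so can_fetch()
    # returns False for every URL -- even ones the real robots.txt
    # would allow.
    print rp.can_fetch("MyBot/1.0", "http://en.wikipedia.org/wiki/Python")

    # The undocumented attribute John mentions: the HTTP status code of
    # the robots.txt fetch itself (e.g. 403).
    print rp.errcode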