> I just discovered that the "robotparser" module interprets
> a 403 ("Forbidden") status on a "robots.txt" file as meaning
> "all access disallowed". That's unexpected behavior.
That's specified in the norobots RFC:

http://www.robotstxt.org/norobots-rfc.txt

    On server response indicating access restrictions (HTTP Status
    Code 401 or 403) a robot should regard access to the site
    completely restricted.

So if a site returns 403, we should assume that it did so deliberately,
and doesn't want to be indexed.

> A major site ("http://www.aplus.net/robot.txt") has their
> "robots.txt" file set up that way.

You should try "http://www.aplus.net/robots.txt" instead, which can be
accessed just fine.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
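For what it's worth, the same rule survives in today's urllib.robotparser (the Python 3 descendant of "robotparser"): when read() gets a 401 or 403 for robots.txt, it flips an internal disallow_all flag, after which can_fetch() refuses everything. A minimal sketch, setting that flag directly to stand in for the HTTP error (no network access; the URLs and user-agent name are made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")

# Stand-in for what read() does on an HTTPError with code 401/403:
# it sets disallow_all = True instead of parsing any rules.
rp.disallow_all = True

# With disallow_all set, can_fetch() denies every user agent and URL.
print(rp.can_fetch("MyBot", "http://example.com/page.html"))  # False
```

So a site that serves 403 for robots.txt is treated exactly as the RFC prescribes: completely off-limits to compliant robots.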