Python's "robots.txt" file parser may be misinterpreting a
special case.  Given a robots.txt file like this:

        User-agent: *
        Disallow: //
        Disallow: /account/registration
        Disallow: /account/mypro
        Disallow: /account/myint
        ...

the Python library "robotparser.RobotFileParser()" considers every page of
the site to be disallowed.  Apparently "Disallow: //" is being interpreted
as "Disallow: /"; even the home page of the site is locked out.  That
interpretation may be incorrect.
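
Here is a minimal sketch of how one can reproduce the reported behavior.
The module is "urllib.robotparser" in Python 3 (it was "robotparser" in
Python 2), and the robots.txt content below is abbreviated from the file
quoted above; the expected outputs in the comments reflect the behavior
described in this post, not a guaranteed result on every Python version.

        import urllib.robotparser

        robots_txt = """\
        User-agent: *
        Disallow: //
        Disallow: /account/registration
        """

        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_txt.splitlines())

        # If "Disallow: //" is treated like "Disallow: /", even the home
        # page is reported as off limits:
        print(rp.can_fetch("*", "http://ibm.com/"))          # reportedly False
        print(rp.can_fetch("*", "http://ibm.com/products"))  # reportedly False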

This is the robots.txt file for "http://ibm.com".
Some IBM operating systems recognize filenames starting with "//"
as a special case, much like a network root, so the file may be
trying to handle some problem of that sort.

The spec for "robots.txt", at

http://www.robotstxt.org/wc/norobots.html

says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved."  That suggests that "//" should only disallow
paths beginning with "//".
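
A literal reading of that prefix rule can be expressed as a small sketch;
the helper name here is mine, not part of any library:

        def disallowed_by_prefix(path, disallow_value):
            # Per the spec's wording, a rule matches only URLs whose
            # path starts with the rule's value.
            return path.startswith(disallow_value)

        print(disallowed_by_prefix("//mirror/index.html", "//"))  # True
        print(disallowed_by_prefix("/", "//"))                    # False
        print(disallowed_by_prefix("/products", "//"))            # False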

                                John Nagle
                                SiteTruth