New submission from Brian Bernstein <bernie9...@gmail.com>:

When attempting to parse a robots.txt file that has a blank line between allow/disallow rules, all rules after the blank line are ignored.
If a blank line occurs between the user-agent line and its rules, all of the rules for that user-agent are ignored. I am not sure whether blank lines between rules are allowed by the spec, but I am seeing them on a number of sites. For instance, http://www.whitehouse.gov/robots.txt has a blank line between the disallow rules and all other lines, including the associated user-agent line, which causes the Python RobotFileParser to ignore every rule. http://www.last.fm/robots.txt appears to separate its rules with arbitrary blank lines; the Python RobotFileParser only sees the first two rules, between the user-agent line and the next blank line.

If the parser were changed to simply ignore all blank lines, would that have any adverse effect on parsing robots.txt files? I am including a simple patch which ignores all blank lines and appears to find all rules in these robots.txt files.

----------
files: robotparser.py.patch
keywords: patch
messages: 146518
nosy: bernie9998
priority: normal
severity: normal
status: open
title: robotparser.RobotFileParser ignores rules preceded by a blank line
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file23538/robotparser.py.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13281>
_______________________________________
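To illustrate the proposed behavior (this is a standalone sketch, not the attached patch to robotparser.py), here is a minimal parse loop that skips blank lines entirely instead of treating them as record separators; the `parse_rules` function and its return format are my own invention for demonstration purposes:

```python
def parse_rules(text):
    """Parse robots.txt text into {user_agent: [(is_allow, path), ...]},
    ignoring blank lines rather than treating them as end-of-record."""
    rules = {}
    current = None
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines are skipped instead of ending the record
        if ':' not in line:
            continue  # malformed line, ignore
        field, value = line.split(':', 1)
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            current = rules.setdefault(value, [])
        elif field in ('allow', 'disallow') and current is not None:
            current.append((field == 'allow', value))
    return rules

# A file in the style of the sites mentioned above, with blank lines
# between the user-agent line and its rules:
sample = """User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/
"""
print(parse_rules(sample))  # {'*': [(False, '/cgi-bin/'), (False, '/tmp/')]}
```

With a record-separator interpretation of blank lines, the sample above would yield no rules at all for `*`; skipping blank lines recovers both disallow rules.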