New submission from Brian Bernstein <bernie9...@gmail.com>:

When attempting to parse a robots.txt file which has a blank line between 
allow/disallow rules, all rules after the blank line are ignored.

If a blank line occurs between the user-agent and its rules, all of the rules 
for that user-agent are ignored.

I am not sure whether blank lines between rules are allowed by the spec, but 
I am seeing them on a number of sites, for instance:

http://www.whitehouse.gov/robots.txt has a blank line between the disallow 
rules and all other lines, including the associated user-agent line, causing 
the Python RobotFileParser to ignore all of the rules.

http://www.last.fm/robots.txt appears to separate its rules with arbitrary 
blank lines.  The Python RobotFileParser only sees the first two rules, those 
between the user-agent line and the next blank line.

If the parser is changed to simply ignore all blank lines, would it have any 
adverse effect on parsing robots.txt files?

I am including a simple patch which ignores all blank lines and appears to find 
all rules from these robots.txt files.
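For reference, here is a minimal sketch of the same idea, independent of the 
attached patch: pre-filter blank lines from the robots.txt text before handing 
it to RobotFileParser.parse(), so a record split by stray blank lines is still 
parsed as one record.  The URL and rules below are made up for illustration:

```python
import urllib.robotparser  # the module is named "robotparser" in Python 2

# A robots.txt with a blank line between the user-agent and its rules,
# similar to the real-world files mentioned above (rules are made up).
text = """User-agent: *

Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
# Workaround / patch idea: drop blank lines so the record is parsed whole.
rp.parse([line for line in text.splitlines() if line.strip()])

print(rp.can_fetch("*", "http://example.com/private/page"))  # False: disallowed
print(rp.can_fetch("*", "http://example.com/public/page"))   # True: still allowed
```

Without the filtering, the blank line terminates the record at the point shown 
in the reports above.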

----------
files: robotparser.py.patch
keywords: patch
messages: 146518
nosy: bernie9998
priority: normal
severity: normal
status: open
title: robotparser.RobotFileParser ignores rules preceded by a blank line
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file23538/robotparser.py.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13281>
_______________________________________