New submission from Andre Burgaud <andre.burg...@gmail.com>:
As per the current Robots Exclusion Protocol internet draft, https://tools.ietf.org/html/draft-koster-rep-00#section-3.2, a robot should apply the rule with the longest matching path. urllib.robotparser instead applies the rules in the order they appear in the robots.txt file. Here is the relevant section of the spec:

===================
3.2. Longest Match

The following example shows that in the case of two rules, the longest one MUST be used for matching. In the following case, /example/page/disallowed.gif MUST be used for the URI example.com/example/page/disallowed.gif.

<CODE BEGINS>
User-Agent : foobot
Allow : /example/page/
Disallow : /example/page/disallowed.gif
<CODE ENDS>
===================

I'm attaching a simple test file "test_robot.py".

----------
components: Library (Lib)
files: test_robot.py
messages: 359181
nosy: gallicrooster
priority: normal
severity: normal
status: open
title: urllib.robotparser does not respect the longest match for the rule
type: behavior
versions: Python 3.8
Added file: https://bugs.python.org/file48815/test_robot.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39187>
_______________________________________
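For reference, here is a minimal sketch of the kind of check the attached test presumably performs (my own reconstruction, not the contents of test_robot.py; the robots.txt lines are copied from the draft's example). Under the longest-match rule, can_fetch() should return False for the disallowed URI, but urllib.robotparser applies the first matching rule in file order, so the shorter Allow rule wins:

import urllib.robotparser

# robots.txt content taken from the draft's section 3.2 example
ROBOTS_TXT = """\
User-agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Disallow rule is the longest match for this URI, so per the
# draft the fetch MUST be disallowed (expected: False). Because
# urllib.robotparser checks rules in file order, the Allow rule
# matches first and this prints True instead.
print(parser.can_fetch("foobot",
                       "https://example.com/example/page/disallowed.gif"))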