New submission from Andre Burgaud <andre.burg...@gmail.com>:
As per the current Robots Exclusion Protocol internet draft, https://tools.ietf.org/html/draft-koster-rep-00#section-3.2, a robot should apply the rule with the longest matching path. urllib.robotparser instead applies the rules in the order they appear in the robots.txt file. Here is the relevant section of the spec:

===================
3.2. Longest Match

The following example shows that in the case of two rules, the longest one MUST be used for matching. In the following case, /example/page/disallowed.gif MUST be used for the URI example.com/example/page/disallowed.gif.

<CODE BEGINS>
User-Agent : foobot
Allow : /example/page/
Disallow : /example/page/disallowed.gif
<CODE ENDS>
===================

I'm attaching a simple test file "test_robot.py".

----------
components: Library (Lib)
files: test_robot.py
messages: 359181
nosy: gallicrooster
priority: normal
severity: normal
status: open
title: urllib.robotparser does not respect the longest match for the rule
type: behavior
versions: Python 3.8
Added file: https://bugs.python.org/file48815/test_robot.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39187>
_______________________________________
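For reference, here is a minimal sketch of the kind of check the attached test presumably performs (my own reconstruction, not the contents of test_robot.py; the robots.txt lines are copied from the draft's example). Under the longest-match rule, can_fetch() should return False for the disallowed URI, but urllib.robotparser applies the first matching rule in file order, so the shorter Allow rule wins:

import urllib.robotparser

# robots.txt content taken from the draft's section 3.2 example
ROBOTS_TXT = """\
User-agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Disallow rule is the longest match for this URI, so per the
# draft the fetch MUST be disallowed (expected: False). Because
# urllib.robotparser checks rules in file order, the Allow rule
# matches first and this prints True instead.
print(parser.can_fetch("foobot",
                       "https://example.com/example/page/disallowed.gif"))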