[issue25400] robotparser doesn't return crawl delay for default entry
New submission from Peter Wirtz:

After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method for a robots.txt file that has a crawl-delay for * useragents always returns None.

Example:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
>>> parser.crawl_delay('*')
>>> print(parser.default_entry.delay)
120
>>>

Excerpt from https://www.carthage.edu/robots.txt:

User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin

I have written a patch that solves this. With the patch, the output is:

Python 3.6.0a0 (default:1aae9b6a6929+, Oct 9 2015, 22:08:05)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.robotparser
>>> parser = urllib.robotparser.RobotFileParser()
>>> parser.set_url('https://www.carthage.edu/robots.txt')
>>> parser.read()
>>> parser.crawl_delay('test_robotparser')
120
>>> parser.crawl_delay('*')
120
>>> print(parser.default_entry.delay)
120
>>>

This also applies to the request_rate method.

--
components: Library (Lib)
files: robotparser_crawl_delay.patch
keywords: patch
messages: 252971
nosy: pwirtz
priority: normal
severity: normal
status: open
title: robotparser doesn't return crawl delay for default entry
type: behavior
versions: Python 3.6
Added file: http://bugs.python.org/file40777/robotparser_crawl_delay.patch

___
Python tracker <http://bugs.python.org/issue25400>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
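The report above can be reproduced without network access by feeding the quoted robots.txt excerpt straight to the parser. The following is a minimal sketch, not the attached patch: the helper name default_crawl_delays and the inline ROBOTS_TXT string are illustrative assumptions, and it only uses the documented parse() and crawl_delay() methods. On an interpreter that includes a fix for this issue, both agents should fall back to the default entry; on an affected build they would come back as None.

```python
import urllib.robotparser

# Inline copy of the robots.txt excerpt quoted in the report, so the
# example does not depend on the remote site.
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
"""


def default_crawl_delays(agents):
    """Return {agent: crawl delay} for each agent (hypothetical helper)."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes an iterable of lines, as read() would after fetching.
    parser.parse(ROBOTS_TXT.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}


delays = default_crawl_delays(["test_robotparser", "*"])
print(delays)
```

Using parse() instead of set_url() and read() keeps the check deterministic, since the live robots.txt may change over time.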
[issue25400] robotparser doesn't return crawl delay for default entry
Peter Wirtz added the comment:

This fix breaks the unit tests, though. I am not sure how to go about checking those, as this would be my first contribution to Python and to an open source project in general.

--
[issue25400] robotparser doesn't return crawl delay for default entry
Peter Wirtz added the comment:

On further inspection of the tests, it appears that, as they are currently written, a test case can only be checked against one useragent at a time. I will attempt to rework the tests so that they work correctly. Any advice would be much appreciated.

--
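One way to check several useragents within a single test case is unittest's subTest context manager. The sketch below is only an illustration of that idea, not the structure of CPython's actual robotparser tests: the class name CrawlDelayTest, the agent name figtree, the robots.txt content, and the EXPECTED_DELAYS table are all hypothetical, and the expectations assume an interpreter in which the default-entry fallback works.

```python
import unittest
import urllib.robotparser

# Hypothetical robots.txt with one specific entry and one default entry.
ROBOTS_TXT = """\
User-agent: figtree
Crawl-delay: 3
Disallow: /tmp

User-agent: *
Crawl-delay: 120
Disallow: /cgi-bin
"""

# Hypothetical expectations: agent name -> expected crawl delay.
EXPECTED_DELAYS = {"figtree": 3, "test_robotparser": 120, "*": 120}


class CrawlDelayTest(unittest.TestCase):
    def setUp(self):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(ROBOTS_TXT.splitlines())

    def test_crawl_delay_per_agent(self):
        # Each agent is reported separately if its assertion fails.
        for agent, expected in EXPECTED_DELAYS.items():
            with self.subTest(agent=agent):
                self.assertEqual(self.parser.crawl_delay(agent), expected)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CrawlDelayTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

With subTest, a failure for one agent does not stop the remaining agents from being checked.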
[issue25400] robotparser doesn't return crawl delay for default entry
Peter Wirtz added the comment:

OK, for the meantime, I reworked the test so that it appears to test correctly and the tests pass. There does seem to be some magic in the test setup, so I hope I did not overlook anything. Here is the new patch.

--
Added file: http://bugs.python.org/file40784/robotparser_crawl_delay_v2.patch
[issue21475] Support the Sitemap extension in robotparser
Peter Wirtz added the comment:

I would like to tackle this issue. Should I wait for issue25400 to be resolved first?

--
nosy: +pwirtz

___
Python tracker <http://bugs.python.org/issue21475>
___
[issue21475] Support the Sitemap extension in robotparser
Peter Wirtz added the comment:

Here is a patch that provides support for the Sitemap extension.

--
keywords: +patch
Added file: http://bugs.python.org/file40791/robotparser_site_maps_v1.patch
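For context, here is a short sketch of what Sitemap support looks like from the caller's side. It does not reproduce the attached patch, whose contents are not shown in this thread; it relies on the site_maps() method that RobotFileParser provides in Python 3.8 and later. The helper name list_sitemaps and the example.com URLs are illustrative assumptions.

```python
import urllib.robotparser

# Hypothetical robots.txt content with two Sitemap lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
"""


def list_sitemaps(robots_txt):
    """Return the Sitemap URLs found in robots_txt, or None (hypothetical helper)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() exists in Python 3.8+; on older versions return None.
    site_maps = getattr(parser, "site_maps", None)
    return site_maps() if site_maps is not None else None


sitemaps = list_sitemaps(ROBOTS_TXT)
print(sitemaps)
```

The Sitemap directive is independent of any User-agent group, so its position in the file should not affect the result.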