Ryan Stokes created SOLR-12026:
----------------------------------
Summary: SimplePostTool with robots.txt
Key: SOLR-12026
URL: https://issues.apache.org/jira/browse/SOLR-12026
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SimplePostTool
Affects Versions: 7.2
Reporter: Ryan Stokes
[First issue here, apologies in advance for missteps.]
Three things that could improve working with robots.txt:
# When fetching the corresponding robots.txt for a URL, the port is ignored,
so the request defaults to :80. If nothing is listening on :80, robots.txt is
never found and the tool fetches the page anyway. isDisallowedByRobots()
could include url.getPort() when constructing strRobot (sketch 1 after this
list). This would help when testing your robots.txt on a non-standard port,
such as during development.
# Disallow directives are applied regardless of which User-agent they appear
under. parseRobotsTxt() could let a section naming SimplePostTool-crawler
override a blanket Disallow (sketch 2 after this list). This would help when
indexing your own site, which you've explicitly allowed SimplePostTool to
index. I don't know if that's a good practice, but it would help in testing.
# The User-agent header sent when fetching robots.txt is not
"SimplePostTool-crawler" but shows up as "Java/<version>". The code that sets
the header correctly in readPageFromUrl() could be reused in
isDisallowedByRobots() (sketch 3 after this list).
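Sketch 1 (port): a minimal illustration of the suggested change, not a patch
against the actual source; the names url and strRobot follow the description
above.
{code:java}
// Sketch only: append the explicit port (if any) so a robots.txt served
// on a non-standard port, e.g. during development, is actually consulted.
int port = url.getPort(); // -1 when the URL carries no explicit port
String strRobot = url.getProtocol() + "://" + url.getHost()
    + (port == -1 ? "" : ":" + port) + "/robots.txt";
{code}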
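Sketch 2 (User-agent sections): a rough, simplified take on what a
user-agent-aware parseRobotsTxt() could look like; the real method's
signature and internals may differ.
{code:java}
// Sketch only, not the actual Solr source (uses java.io.*, java.util.*,
// java.nio.charset.StandardCharsets). Tracks User-agent sections so that
// a "User-agent: SimplePostTool-crawler" section, when present, wins over
// the generic rules. Simplified: consecutive User-agent lines sharing one
// rule group are not handled.
protected List<String> parseRobotsTxt(InputStream is) throws IOException {
  List<String> generic = new ArrayList<>(); // rules outside our own section
  List<String> own = new ArrayList<>();     // rules for SimplePostTool-crawler
  boolean inOwnSection = false, sawOwnSection = false;
  BufferedReader r =
      new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
  String line;
  while ((line = r.readLine()) != null) {
    String[] parts = line.split("#"); // strip comments
    if (parts.length == 0) continue;
    line = parts[0].trim();
    if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
      inOwnSection =
          line.substring(11).trim().equalsIgnoreCase("SimplePostTool-crawler");
      sawOwnSection |= inOwnSection;
    } else if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
      String path = line.substring(9).trim();
      if (!path.isEmpty()) {
        (inOwnSection ? own : generic).add(path);
      }
    }
  }
  is.close();
  // If robots.txt addresses this crawler by name, honour only that section.
  return sawOwnSection ? own : generic;
}
{code}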
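Sketch 3 (request header): assuming the robots.txt stream is currently opened
via something like urlRobot.openStream(), which sends Java's default
User-Agent, opening the connection explicitly lets the header be set the same
way readPageFromUrl() sets it.
{code:java}
// Sketch only: identify as SimplePostTool-crawler when fetching robots.txt,
// instead of the default "Java/<version>" sent by URL.openStream().
URLConnection conn = urlRobot.openConnection();
conn.setRequestProperty("User-Agent", "SimplePostTool-crawler");
disallows = parseRobotsTxt(conn.getInputStream());
{code}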