[ https://issues.apache.org/jira/browse/SOLR-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl resolved SOLR-7114.
-------------------------------
    Resolution: Won't Fix

We have a new site without this issue, and the post tool is also not intended as a robust web crawler.

> SimplePostTool fails crawling lucene.apache.org due to missing <html> tag
> -------------------------------------------------------------------------
>
>                 Key: SOLR-7114
>                 URL: https://issues.apache.org/jira/browse/SOLR-7114
>             Project: Solr
>          Issue Type: Bug
>          Components: SimplePostTool
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: cms
>
> A bunch of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know
> the history of this; was it intentional? I tried to fix it, but it's a bit
> confusing. (This is a spinoff from SOLR-7107).
> Crawling lucene.apache.org with bin/post fails with 500 errors since Tika
> autodetect sees {{<head>}} as the first tag and believes it is XML :-)
> I *think* we're fine if all templates referred to from {{lib/path.pm}} have
> {{<html>}} tags added, and none of them include each other. Currently,
> {{core.html}} is both a top-page and also included from
> {{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}} for some
> reason.
> To reproduce the crawl errors:
> {code}
> bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
> {code}
> -We could in addition improve {{SimplePostTool}} to send a content-type hint
> to Tika.- *Update: The tool already does this*
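
The condition described in the issue (a page whose first element is {{<head>}} rather than {{<html>}}) can be checked for before pointing a crawler at a site. The following is only a rough sketch in Python, not how Tika's autodetection works internally; the helper names `first_tag` and `lacks_html_root` are hypothetical, and the regex-based scan assumes reasonably ordinary markup.

```python
import re
from typing import Optional

# Matches the start of an element tag; "<!DOCTYPE", "<!--" and "<?xml"
# are skipped because the character after "<" is not a letter.
_TAG_START = re.compile(r"<\s*([A-Za-z][A-Za-z0-9:-]*)")
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def first_tag(markup: str) -> Optional[str]:
    """Return the lowercase name of the first element tag, or None if there is none."""
    without_comments = _COMMENT.sub("", markup)
    match = _TAG_START.search(without_comments)
    return match.group(1).lower() if match else None


def lacks_html_root(markup: str) -> bool:
    """True when the first element in the page is anything other than <html>."""
    return first_tag(markup) != "html"
```

A page fetched from a template without the wrapping tags, for example `"<head><title>News</title></head><body></body>"`, would be flagged by `lacks_html_root`, while the same content wrapped in `<html>...</html>` would not.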