Jan Høydahl created SOLR-7114:
---------------------------------

             Summary: SimplePostTool fails crawling lucene.apache.org due to 
missing <html> tag
                 Key: SOLR-7114
                 URL: https://issues.apache.org/jira/browse/SOLR-7114
             Project: Solr
          Issue Type: Bug
          Components: SimplePostTool
            Reporter: Jan Høydahl
            Assignee: Jan Høydahl
            Priority: Minor
             Fix For: 5.1


A bunch of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the 
history of this; was it intentional? I tried to fix it, but it's a bit 
confusing. (This is a spinoff from SOLR-7107.)

Crawling lucene.apache.org with bin/post fails with 500 errors because Tika's 
auto-detection sees {{<head>}} as the first tag and concludes the page is XML :-)

I *think* we're fine if all templates referred to from {{lib/path.pm}} have 
{{<html>}} tags added and none of them include each other. Currently, 
{{core.html}} is both a top-level page and also included from 
{{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}} for some 
reason.
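
A rough way to spot affected templates or rendered pages is to look at the 
first element tag in each file. The sketch below is only that first-tag check; 
it is not Tika's actual detection logic, and the helper names {{first_tag}} and 
{{lacks_html_root}} are made up for illustration.

```python
import re

# Leading material that may precede the root element: whitespace,
# <!-- comments -->, <!DOCTYPE ...> declarations, <?xml ...?> instructions.
_PROLOG = re.compile(r"(?:\s+|<!--.*?-->|<![^>]*>|<\?.*?\?>)*", re.DOTALL)
_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def first_tag(markup):
    """Return the lower-cased name of the first element tag, or None."""
    start = _PROLOG.match(markup).end()
    match = _TAG.match(markup, start)
    return match.group(1).lower() if match else None


def lacks_html_root(markup):
    """True if the markup does not start with an <html> element."""
    return first_tag(markup) != "html"
```

Running {{lacks_html_root}} over the text of each template would list the ones 
that still need the tags added.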

To reproduce the crawl errors:
{code}
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
{code}

In addition, we could improve {{SimplePostTool}} to send a content-type hint to 
Tika.
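
A minimal sketch of how such a hint could be derived, assuming it comes from 
the crawled page's HTTP {{Content-Type}} header when one is available and from 
the URL's file extension otherwise. This is not {{SimplePostTool}} code; the 
function name {{content_type_hint}} and the fallback value are assumptions, and 
how the hint is then passed on to Tika is left open.

```python
import mimetypes
from urllib.parse import urlparse


def content_type_hint(url, http_content_type=None):
    """Pick a media type to send along with a crawled document."""
    if http_content_type:
        # Drop parameters such as "; charset=UTF-8".
        media_type = http_content_type.split(";", 1)[0].strip().lower()
        if media_type:
            return media_type
    # No usable header: guess from the extension of the URL path.
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    # Assumed fallback when nothing can be guessed.
    return guessed or "application/octet-stream"
```

With a hint of {{text/html}} for pages like {{corenews.html}}, the missing 
{{<html>}} tag would presumably matter less for type detection.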



