[
https://issues.apache.org/jira/browse/SOLR-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl updated SOLR-7114:
------------------------------
Description:
A bunch of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the
history of this; was it intentional? I tried to fix it, but it's a bit
confusing. (This is a spinoff from SOLR-7107.)
Crawling lucene.apache.org with bin/post fails with HTTP 500 errors because
Tika's auto-detection sees {{<head>}} as the first tag and believes the page is
XML :-)
I *think* we're fine if all templates referred to from {{lib/path.pm}} have
{{<html>}} tags added, and none of them include each other. Currently,
{{core.html}} is both a top-level page and also included from
{{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}} for some
reason.
To reproduce the crawl errors:
{code}
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
{code}
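The missing-root-tag condition described above can also be checked locally, without going through Solr. What follows is only a minimal sketch, assuming Python's standard {{html.parser}} module; the names {{FirstTagParser}}, {{first_tag}} and {{lacks_html_root}} are made up for illustration and are not part of Solr, Tika or the CMS. It reports the first start tag in a page, which per the description above is the tag that leads the auto-detection astray when it is {{<head>}}.

```python
from html.parser import HTMLParser


class FirstTagParser(HTMLParser):
    """Remembers the first start tag seen in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.first_tag = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; a DOCTYPE is a declaration,
        # not a start tag, so it never ends up here.
        if self.first_tag is None:
            self.first_tag = tag


def first_tag(markup):
    """Return the name of the first start tag in markup, or None if there is none."""
    parser = FirstTagParser()
    parser.feed(markup)
    parser.close()
    return parser.first_tag


def lacks_html_root(markup):
    """True when the page does not open with an <html> tag."""
    return first_tag(markup) != "html"
```

Feeding it the body of a fetched page (for example the {{corenews.html}} URL above) should show whether that template still needs the {{<html>}} wrapper.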
-We could in addition improve {{SimplePostTool}} to send a content-type hint to
Tika.- *Update: The tool already does this*
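For reference, a content-type hint of the kind mentioned above amounts to setting the {{Content-Type}} header on the request sent to the extracting handler. The following is a rough sketch only and not {{SimplePostTool}}'s actual code: the helper name {{build_extract_request}} is hypothetical, and it assumes a Solr instance at {{http://localhost:8983/solr}} with the {{/update/extract}} handler enabled. It only builds the request; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, collection, doc_id, body,
                          content_type="text/html"):
    """Build (but do not send) a POST to Solr's extracting handler.

    The explicit Content-Type header is the hint: it states that the
    body is HTML regardless of which tag the body starts with.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base}/{collection}/update/extract?{params}"
    return Request(url, data=body, headers={"Content-Type": content_type})


# Assumed values for illustration only.
req = build_extract_request(
    "http://localhost:8983/solr",
    "gettingstarted",
    "http://lucene.apache.org/core/corenews.html",
    b"<head><title>News</title></head><body>...</body>",
)
```

Passing {{req}} to {{urllib.request.urlopen}} would send it to a running Solr instance.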
was:
A bunch of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the
history of this; was it intentional? I tried to fix it, but it's a bit
confusing. (This is a spinoff from SOLR-7107.)
Crawling lucene.apache.org with bin/post fails with HTTP 500 errors because
Tika's auto-detection sees {{<head>}} as the first tag and believes the page is
XML :-)
I *think* we're fine if all templates referred to from {{lib/path.pm}} have
{{<html>}} tags added, and none of them include each other. Currently,
{{core.html}} is both a top-level page and also included from
{{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}} for some
reason.
To reproduce the crawl errors:
{code}
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
{code}
We could in addition improve {{SimplePostTool}} to send a content-type hint to
Tika.
> SimplePostTool fails crawling lucene.apache.org due to missing <html> tag
> -------------------------------------------------------------------------
>
> Key: SOLR-7114
> URL: https://issues.apache.org/jira/browse/SOLR-7114
> Project: Solr
> Issue Type: Bug
> Components: SimplePostTool
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: cms
> Fix For: 5.1
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)