[ https://issues.apache.org/jira/browse/SOLR-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336563#comment-15336563 ]
ASF GitHub Bot commented on SOLR-8981:
--------------------------------------
Github user tballison commented on the issue:
https://github.com/apache/lucene-solr/pull/44
The XHTMLContentHandler adds <body> and </body> itself. In out-of-the-box
Tika, the DefaultHtmlMapper does not include "body" in its list of
"SAFE_ELEMENTS", so the source HTML's "body" tag is never passed through and
we don't see the doubling in Tika. Solr's MostlyPassthroughHtmlMapper does pass
it through, which is why the tag is doubled there.
The solution is to suppress the body tag in Solr's
MostlyPassthroughHtmlMapper.
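A minimal sketch of what that suppression could look like. It assumes the mapper contract is that returning null from mapSafeElement drops the tag while its content is kept. The interface below is a hypothetical stand-in for Tika's HtmlMapper so the example compiles without Tika on the classpath, and BodySuppressingMapper is an illustrative name, not the actual Solr class or patch.

```java
import java.util.Locale;

// Hypothetical stand-in for Tika's HtmlMapper, declared locally so the
// sketch is self-contained. The real interface lives in Tika's HTML parser.
interface HtmlMapperSketch {
    String mapSafeElement(String name);
    boolean isDiscardElement(String name);
    String mapSafeAttribute(String elementName, String attributeName);
}

// Passes almost everything through, but suppresses <body> because the
// XHTMLContentHandler already emits its own <body> and </body>.
class BodySuppressingMapper implements HtmlMapperSketch {

    // Assumption: a null return means "do not emit this tag".
    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return "body".equals(lower) ? null : lower;
    }

    // Keep every element's content; nothing is discarded here.
    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    // Pass attributes through, lowercased.
    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase(Locale.ROOT);
    }
}
```

With a mapper like this, the source document's own body tag would be dropped and only the one added by the handler would remain, while tags such as div are still passed through.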
> Upgrade to Tika 1.13 when it is available
> -----------------------------------------
>
> Key: SOLR-8981
> URL: https://issues.apache.org/jira/browse/SOLR-8981
> Project: Solr
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Uwe Schindler
> Priority: Minor
>
> Tika 1.13 should be out within a month. This includes PDFBox 2.0.0 and a
> number of other upgrades and improvements.
> If there are any showstoppers in 1.13 from Solr's side or requests before we
> roll 1.13, let us know.