[
https://issues.apache.org/jira/browse/LUCENE-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036908#comment-14036908
]
Shawn Heisey commented on LUCENE-5763:
--------------------------------------
On the {{&lang;}} and {{&rang;}} difference: Will a filter like
ICUFoldingFilter reduce these to the plain ASCII < and > regardless of which
entity interpretation is used? If so, it won't affect me, and it might be
worth mentioning in the HTMLStripCharFilter javadocs.
Would it be useful at all to have a config option for the HTML version?
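
To make the difference concrete, here is a minimal sketch that uses only the JDK, not Lucene or ICU. It assumes the code points from the published entity tables: HTML 4 maps &lang; and &rang; to U+2329 and U+232A, while HTML5 maps them to U+27E8 and U+27E9. The class and method names (EntityInterpretations, resolve, nfkc, foldsToAscii) are hypothetical and exist only for illustration. It checks what plain NFKC normalization does to each code point, which may or may not match what ICUFoldingFilter does, since that filter applies further foldings beyond NFKC.

```java
import java.text.Normalizer;

final class EntityInterpretations {
    // Code points assigned to &lang; / &rang; by each HTML version
    // (taken from the HTML 4.01 and HTML5 named entity tables).
    static final int HTML4_LANG = 0x2329; // LEFT-POINTING ANGLE BRACKET
    static final int HTML4_RANG = 0x232A; // RIGHT-POINTING ANGLE BRACKET
    static final int HTML5_LANG = 0x27E8; // MATHEMATICAL LEFT ANGLE BRACKET
    static final int HTML5_RANG = 0x27E9; // MATHEMATICAL RIGHT ANGLE BRACKET

    /** Hypothetical helper: resolve "lang" or "rang" under the chosen HTML version. */
    static int resolve(String name, boolean html5) {
        switch (name) {
            case "lang": return html5 ? HTML5_LANG : HTML4_LANG;
            case "rang": return html5 ? HTML5_RANG : HTML4_RANG;
            default: throw new IllegalArgumentException("unknown entity: " + name);
        }
    }

    /** Apply JDK NFKC normalization to a single code point. */
    static String nfkc(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** True if NFKC alone turns the code point into ASCII '<' or '>'. */
    static boolean foldsToAscii(int codePoint) {
        String s = nfkc(codePoint);
        return s.equals("<") || s.equals(">");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"lang", "rang"}) {
            for (boolean html5 : new boolean[] {false, true}) {
                int cp = resolve(name, html5);
                int normalized = nfkc(cp).codePointAt(0);
                System.out.printf("&%s; as %s: U+%04X, NFKC gives U+%04X, ASCII: %b%n",
                        name, html5 ? "HTML5" : "HTML4", cp, normalized, foldsToAscii(cp));
            }
        }
    }
}
```

With NFKC alone, the HTML 4 code points change to the CJK angle brackets U+3008 and U+3009, the HTML5 code points are left unchanged, and none of the four becomes ASCII < or >. Whether ICUFoldingFilter goes further would need to be checked against the filter itself.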
> HTMLStripCharFilter += HTML5
> -----------------------------
>
> Key: LUCENE-5763
> URL: https://issues.apache.org/jira/browse/LUCENE-5763
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/analysis
> Reporter: Steve Rowe
> Priority: Minor
>
> HTMLStripCharFilter knows some specific things about HTML4 (like named
> character entities, which are converted to the corresponding characters), but
> not about HTML5.
> HTML5 has way more named character entities: 2,231 vs 259 by my count.
> There's probably other stuff to do, e.g. there are new tags.