[ 
https://issues.apache.org/jira/browse/LUCENE-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037272#comment-14037272
 ] 

Steve Rowe commented on LUCENE-5763:
------------------------------------

bq. On the {{&lang;}} and {{&rang;}} difference: Will a filter like 
ICUFoldingFilter reduce these to the plain ASCII < and > regardless of which 
entity interpretation is used? 

No, ICUFoldingFilter doesn't fold (leaves intact) the HTML5 
{{&lang;}}/{{&rang;}} (left: U+27E8; right: U+27E9), but folds the 
HTML4 ones (left: U+2329; right: U+232A) to full-width CJK angle brackets 
U+3008 and U+3009, respectively.  This [2007 WHATWG 
email|http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-July/012108.html]
 mentions that earlier drafts of HTML5 mapped {{&lang;}}/{{&rang;}} to 
these full-width CJK characters.
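
The split between the two pairs can be reproduced without Lucene. The sketch 
below uses only the JDK's java.text.Normalizer and is not ICUFoldingFilter 
itself. It assumes that the folding seen for the HTML4 pair comes from Unicode 
normalization, since U+2329 and U+232A have canonical decompositions to U+3008 
and U+3009 while U+27E8 and U+27E9 have none. The class and method names are 
made up for illustration.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class AngleBracketNormalization {

    // Plain NFKC normalization from the JDK; an approximation of one part of
    // what a folding filter does, not the Lucene filter itself.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Renders each code point of s as U+XXXX so the output is readable.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String[] inputs = {
            "\u2329", "\u232A",   // HTML4 &lang; / &rang;
            "\u27E8", "\u27E9"    // HTML5 &lang; / &rang;
        };
        for (String in : inputs) {
            System.out.println(describe(in) + " -> " + describe(nfkc(in)));
        }
    }
}
```

Under that assumption, the HTML4 pair comes out as U+3008/U+3009 and the HTML5 
pair is unchanged, which matches what ICUFoldingFilter was observed to do.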

And ASCIIFoldingFilter doesn't cover either of the blocks in question, so 
wouldn't fold any of these characters.
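
To see which blocks are in question, here is a small JDK-only sketch that 
looks up the Unicode block of each code point mentioned above. It does not 
call ASCIIFoldingFilter; the class and method names are made up for 
illustration.

```java
public class AngleBracketBlocks {

    // Returns the Unicode block that contains the given code point.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {
            0x2329, 0x232A,   // HTML4 &lang; / &rang;
            0x27E8, 0x27E9,   // HTML5 &lang; / &rang;
            0x3008, 0x3009    // CJK angle brackets
        };
        for (int cp : codePoints) {
            System.out.println(String.format("U+%04X", cp) + " is in " + blockOf(cp));
        }
    }
}
```

The HTML4 pair sits in Miscellaneous Technical, the HTML5 pair in 
Miscellaneous Mathematical Symbols-A, and the CJK pair in CJK Symbols and 
Punctuation.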

For text search, typically punctuation like this is stripped as part of the 
tokenization process, so I don't see the folding filters' deficits here as 
terribly problematic.  

> HTMLStripCharFilter += HTML5 
> -----------------------------
>
>                 Key: LUCENE-5763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5763
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: modules/analysis
>            Reporter: Steve Rowe
>            Priority: Minor
>
> HTMLStripCharFilter knows some specific things about HTML4 (like named 
> character entities, which are converted to the corresponding characters), but 
> not about HTML5.
> HTML5 has way more named character entities: 2,231 vs 259 by my count.
> There's probably other stuff to do, e.g. there are new tags.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
