[
https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713510#comment-16713510
]
Robert Muir commented on LUCENE-8527:
-------------------------------------
It would be really nice. I don't think the tricky part is really segmentation
at all (as far as finding breaks), but rather the problem of assigning the
proper "label" to the token (tagging it as an emoji type).
So the code in the ICU tokenizer uses some properties to tag the "stuff
between breaks" as the emoji token type versus something else. I looked at the
latest JFlex, and it seems it would need those properties? And it's a little
tricky, e.g. the ordinary ASCII digit 7 is [:Emoji:] in Unicode. So that's why
the isEmoji there is a bit crazy.
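
To make the labeling problem concrete, here is a minimal sketch of that kind
of check. It is not the actual ICU tokenizer code: the class name, the method
names, and the tiny hard-coded code point ranges are all assumptions for
illustration (real code would consult the full Unicode emoji property data).
The idea is that a code point which is [:Emoji:] but normally plain text
(digits, '#', '*') only counts as emoji when an emoji presentation selector
(U+FE0F) or an enclosing keycap (U+20E3) follows it.

```java
// Hypothetical sketch; names and the sample ranges below are illustrative only.
final class EmojiLabelSketch {
    private static final int VARIATION_SELECTOR_16 = 0xFE0F;      // emoji presentation selector
    private static final int COMBINING_ENCLOSING_KEYCAP = 0x20E3; // keycap combining mark

    // Code points that carry the Emoji property but are ordinary text by default.
    // Only the keycap bases are listed here; this is a small sample, not the full set.
    static boolean isAmbiguousEmoji(int cp) {
        return cp == '#' || cp == '*' || (cp >= '0' && cp <= '9');
    }

    // Tiny stand-in for the Emoji property: keycap bases plus the Emoticons block.
    static boolean hasEmojiProperty(int cp) {
        return isAmbiguousEmoji(cp) || (cp >= 0x1F600 && cp <= 0x1F64F);
    }

    // Decides the label for the text between two breaks.
    static boolean isEmojiToken(CharSequence token) {
        if (token.length() == 0) {
            return false;
        }
        int first = Character.codePointAt(token, 0);
        if (!hasEmojiProperty(first)) {
            return false;
        }
        if (!isAmbiguousEmoji(first)) {
            return true;
        }
        // Ambiguous case: require evidence that an emoji sequence is being formed.
        int next = Character.charCount(first);
        if (next >= token.length()) {
            return false;
        }
        int trailer = Character.codePointAt(token, next);
        return trailer == VARIATION_SELECTOR_16 || trailer == COMBINING_ENCLOSING_KEYCAP;
    }

    public static void main(String[] args) {
        System.out.println(isEmojiToken("7"));             // false: bare digit
        System.out.println(isEmojiToken("7\uFE0F\u20E3")); // true: keycap sequence
        System.out.println(isEmojiToken("\uD83D\uDE00"));  // true: U+1F600
    }
}
```

With this shape, a bare "7" keeps its ordinary token type, and only the
keycap sequence gets the emoji label.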
> Upgrade JFlex to 1.7.0
> ----------------------
>
> Key: LUCENE-8527
> URL: https://issues.apache.org/jira/browse/LUCENE-8527
> Project: Lucene - Core
> Issue Type: Improvement
> Components: general/build, modules/analysis
> Reporter: Steve Rowe
> Priority: Minor
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently:
> [http://jflex.de/changelog.html#jflex-1.7.0]. We should upgrade.