[
https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714302#comment-16714302
]
Steve Rowe edited comment on LUCENE-8527 at 12/10/18 5:20 AM:
--------------------------------------------------------------
The attached patch passes most Lucene/Solr tests (failures are listed below),
including the test built from Unicode 9.0's word break test data:
{{WordBreakTestUnicode_9_0_0}}.
{quote}So the stuff in the ICU tokenizer uses some properties to tag the "stuff
between breaks" as emoji token type versus something else. I looked at latest
jflex, it seems it would need those props?
{quote}
Yes. JFlex 1.7.0 lacks the Emoji properties needed to tokenize emoji properly
and assign them the emoji token type, because those properties' definitions
are not included with the release-specific data. For Lucene's use it should be
possible to script pulling in the Unicode data to augment the scanner specs,
which would allow proper emoji tokenization/typing to work. (I've made a note
to add these properties to future JFlex releases.)
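As a rough illustration of the "script pulling in Unicode data" idea, here is a minimal Python sketch. It is not part of the patch. The sample text is a hypothetical excerpt shaped like Unicode's {{emoji-data.txt}} (not copied verbatim), the function names are made up for this example, and the generated {{\u{...}}} character-class form is an assumption that should be checked against the JFlex manual before being pasted into a scanner spec.

```python
import re

# Hypothetical excerpt in the style of emoji-data.txt:
# "<codepoint or range> ; <property> # comment". Not verbatim file contents.
SAMPLE = """\
# comment line
0023          ; Emoji                # hash sign
0030..0039    ; Emoji                # digits
1F3FB..1F3FF  ; Emoji_Modifier       # skin tones
1F600..1F64F  ; Emoji                # faces
"""

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text):
    """Return {property name: [(start, end), ...]} from emoji-data-style text."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        m = LINE.match(line)
        if m is None:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        props.setdefault(m.group(3), []).append((start, end))
    return props


def to_char_class(name, ranges):
    """Render ranges as a macro-style character class line (assumed syntax)."""
    parts = []
    for start, end in sorted(ranges):
        if start == end:
            parts.append("\\u{%04X}" % start)
        else:
            parts.append("\\u{%04X}-\\u{%04X}" % (start, end))
    return "%s = [%s]" % (name, "".join(parts))


if __name__ == "__main__":
    for prop, ranges in parse_ranges(SAMPLE).items():
        print(to_char_class(prop, ranges))
```

A real script would read the data file for the targeted Unicode/Emoji version instead of the inline sample, and would write the generated definitions wherever the build expects them.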
Failing tests with the patch:
{{ant test -Dtestcase=TestStandardAnalyzer
-Dtests.method=testRandomHugeStringsGraphAfter -Dtests.seed=B33609C22A50A253
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-VE
-Dtests.timezone=Africa/Blantyre -Dtests.asserts=true
-Dtests.file.encoding=UTF-8}}
{{ant test -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStrings
-Dtests.seed=DA01A0705C379738 -Dtests.slow=true -Dtests.badapples=true
-Dtests.locale=ru-RU -Dtests.timezone=Europe/Sarajevo -Dtests.asserts=true
-Dtests.file.encoding=ISO-8859-1}}
In both of these cases,
{{BaseTokenStreamTestCase.checkAnalysisConsistency()}} fails with unexpected
tokenization after randomly choosing to use a spoon-feed reader wrapper:
{{MockReaderWrapper}}. If I disable the wrapping with those seeds, the tests
pass. I'm not sure yet what's going wrong; I'll work on a simplified test case
that demonstrates the problem.
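For readers unfamiliar with the term, the following Python sketch shows the general idea of a spoon-feed reader and the kind of consistency a tokenizer is expected to keep under it. It is only an analogy: {{SpoonFeedReader}} and {{tokenize}} are hypothetical names for this example and are not Lucene's {{MockReaderWrapper}} or the JFlex-generated scanner.

```python
import io
import random


class SpoonFeedReader:
    """Wraps a text stream and hands back only a few characters per read()."""

    def __init__(self, inner, rng, max_chunk=3):
        self.inner = inner
        self.rng = rng
        self.max_chunk = max_chunk

    def read(self, size=-1):
        # Return fewer characters than requested, chosen at random.
        limit = self.rng.randint(1, self.max_chunk)
        if size is not None and size >= 0:
            limit = min(limit, size)
        return self.inner.read(limit)


def tokenize(reader, bufsize=16):
    """Toy whitespace tokenizer that must cope with tokens split across reads."""
    tokens, current = [], []
    while True:
        chunk = reader.read(bufsize)
        if chunk == "":
            break  # end of input
        for ch in chunk:
            if ch.isspace():
                if current:
                    tokens.append("".join(current))
                    current = []
            else:
                current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    text = "upgrade JFlex to 1.7.0 \U0001F600 ok"
    plain = tokenize(io.StringIO(text))
    fed = tokenize(SpoonFeedReader(io.StringIO(text), random.Random(42)))
    print(plain == fed)
```

The property being checked is that tokenization does not depend on how the input happens to be chunked; the failures above suggest that something in the patched scanner violates this for the given seeds.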
> Upgrade JFlex to 1.7.0
> ----------------------
>
> Key: LUCENE-8527
> URL: https://issues.apache.org/jira/browse/LUCENE-8527
> Project: Lucene - Core
> Issue Type: Improvement
> Components: general/build, modules/analysis
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
> Attachments: LUCENE-8527.patch
>
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently:
> [http://jflex.de/changelog.html#jflex-1.7.0]. We should upgrade.