[
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924773#comment-16924773
]
Namgyu Kim commented on LUCENE-8966:
------------------------------------
However, I just found a bug :(
Input : "4......4사이즈"
Expected : [4] [......] [4] [사이즈]
Actual : [4] *[.] [.....]* [4] [사이즈]
{code:java}
// This test should pass once the bug is fixed.
public void testDuplicatePunctuation() throws IOException {
  assertAnalyzesTo(analyzerWithPunctuation, "4......4사이즈",
      new String[]{"4", "......", "4", "사이즈"},  // expected tokens
      new int[]{0, 1, 7, 8},                       // start offsets
      new int[]{1, 7, 8, 11},                      // end offsets
      new int[]{1, 1, 1, 1}                        // position increments
  );
}
{code}
I think we need to fix it.
If it is okay to fix it within this JIRA issue, I'll post an additional patch.
Otherwise I'll create a new issue.
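To make the expected grouping concrete, here is a minimal standalone sketch that splits an input into runs of the same character class, so consecutive punctuation stays in one token. It is only an illustration under my own assumptions: `RunSplitter`, `Kind`, `kindOf` and `split` are hypothetical names, and this is not how KoreanTokenizer works internally (it does not consult the dictionary at all).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class RunSplitter {

    // Coarse character classes assumed for this sketch.
    enum Kind { DIGIT, PUNCTUATION, OTHER }

    // Classifies a code point using the Unicode general category.
    public static Kind kindOf(int codePoint) {
        if (Character.isDigit(codePoint)) {
            return Kind.DIGIT;
        }
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return Kind.PUNCTUATION;
            default:
                return Kind.OTHER;
        }
    }

    // Splits the input wherever the character class changes, so a run of
    // identical-class characters (e.g. "......") is emitted as one piece.
    public static List<String> split(String input) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        Kind current = null;
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            Kind kind = kindOf(codePoint);
            if (current != null && kind != current) {
                runs.add(input.substring(start, i));
                start = i;
            }
            current = kind;
            i += Character.charCount(codePoint);
        }
        if (start < input.length()) {
            runs.add(input.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(split("4......4사이즈"));
    }
}
```

With this grouping, the six dots stay together and each digit is separated from the neighbouring punctuation and Hangul, which matches the expected tokens in the test above.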
> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer
> groups characters of unknown words if they belong to the same script or an
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the
> rest in Latin) but this rule doesn't work well on digits since they are
> considered common with other scripts. For instance the input "44사이즈" is kept
> as is even though "사이즈" is part of the dictionary. We should restore the
> original behavior and split any unknown word when a digit is followed by a
> character of another type.
> This issue was first discovered in
> [https://github.com/elastic/elasticsearch/issues/46365]