Jim Ferenczi created LUCENE-8966:
------------------------------------
Summary: KoreanTokenizer should split unknown words on digits
Key: LUCENE-8966
URL: https://issues.apache.org/jira/browse/LUCENE-8966
Project: Lucene - Core
Issue Type: Improvement
Reporter: Jim Ferenczi
Since LUCENE-XXX the Korean tokenizer groups the characters of an unknown word
if they belong to the same script or an inherited one. This works for inputs
such as Мoscow (with a Cyrillic М and the rest in Latin), but the rule doesn't
work well on digits, since they are considered common with other scripts. For
instance, the input "44사이즈" is kept as is even though "사이즈" is part of the
dictionary. We should restore the original behavior and split any unknown
word when a digit is followed by a character of another type.
This issue was first discovered in
[https://github.com/elastic/elasticsearch/issues/46365]
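
To make the proposed rule concrete, here is a minimal, standalone sketch of
"split when a digit is followed by another type". It is not the KoreanTokenizer
code and does not use any Lucene API; the class DigitSplitSketch and the method
splitAfterDigits are hypothetical names introduced only for illustration, and
"another type" is approximated here as "any code point that is not a digit".

```java
import java.util.ArrayList;
import java.util.List;

public class DigitSplitSketch {

    // Hypothetical helper (not part of Lucene): splits a run of characters
    // wherever a digit is immediately followed by a non-digit code point.
    public static List<String> splitAfterDigits(String unknownWord) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int i = 0;
        boolean prevIsDigit = false;
        while (i < unknownWord.length()) {
            int cp = unknownWord.codePointAt(i);
            boolean isDigit = Character.isDigit(cp);
            // Boundary: the previous code point was a digit, this one is not.
            if (i > start && prevIsDigit && !isDigit) {
                parts.add(unknownWord.substring(start, i));
                start = i;
            }
            prevIsDigit = isDigit;
            i += Character.charCount(cp);
        }
        if (start < unknownWord.length()) {
            parts.add(unknownWord.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitAfterDigits("44사이즈")); // [44, 사이즈]
        System.out.println(splitAfterDigits("Мoscow"));   // [Мoscow]
    }
}
```

With a split like this, "사이즈" becomes a separate segment that can be looked up
in the dictionary, while mixed-script input without digits such as Мoscow stays
grouped. The sketch only models the boundary decision; how the real tokenizer
detects character types is not shown here.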