[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

Jim Ferenczi (Jira) Thu, 05 Sep 2019 02:06:06 -0700


     [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jim Ferenczi updated LUCENE-8966:
---------------------------------
    Description: 
Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
groups characters of unknown words if they belong to the same script or an 
inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
rest in Latin), but the rule doesn't work well on digits since they are 
considered common with other scripts. For instance, the input "44사이즈" is kept 
as is even though "사이즈" is part of the dictionary. We should restore the 
original behavior and split any unknown word if a digit is followed by another 
type.
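
To illustrate the grouping problem described above, here is a minimal, 
self-contained sketch. It is not the KoreanTokenizer code: the class name 
UnknownWordGrouping, the method group and the splitOnDigits flag are 
hypothetical, and the script check is a simplified assumption based only on 
java.lang.Character. It shows why digits, which are in the COMMON script, get 
glued to the Hangul that follows, and how an extra digit boundary check 
separates them.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's KoreanTokenizer.
public class UnknownWordGrouping {

  // Assumption: COMMON and INHERITED characters are compatible with any script.
  static boolean isCommonOrInherited(UnicodeScript s) {
    return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
  }

  // Two adjacent code points are grouped if their scripts match,
  // or if either one is COMMON/INHERITED (digits are COMMON).
  static boolean sameScript(int a, int b) {
    UnicodeScript sa = UnicodeScript.of(a);
    UnicodeScript sb = UnicodeScript.of(b);
    return sa == sb || isCommonOrInherited(sa) || isCommonOrInherited(sb);
  }

  // Splits text into chunks of adjacent, script-compatible code points.
  // With splitOnDigits, a digit/non-digit transition also ends the chunk.
  static List<String> group(String text, boolean splitOnDigits) {
    List<String> chunks = new ArrayList<>();
    if (text.isEmpty()) {
      return chunks;
    }
    int start = 0;
    int prev = text.codePointAt(0);
    int i = Character.charCount(prev);
    while (i < text.length()) {
      int cur = text.codePointAt(i);
      boolean boundary = !sameScript(prev, cur)
          || (splitOnDigits && Character.isDigit(prev) != Character.isDigit(cur));
      if (boundary) {
        chunks.add(text.substring(start, i));
        start = i;
      }
      prev = cur;
      i += Character.charCount(cur);
    }
    chunks.add(text.substring(start));
    return chunks;
  }

  public static void main(String[] args) {
    System.out.println(group("44사이즈", false)); // prints [44사이즈]
    System.out.println(group("44사이즈", true));  // prints [44, 사이즈]
  }
}
```

With the digit check enabled, "사이즈" becomes a separate chunk that a 
dictionary lookup could then match.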

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]

  was:
Since LUCENE-XXX the Korean tokenizer groups characters of unknown words if 
they belong to the same script or an inherited one. This is ok for inputs like 
Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work 
well on digits since they are considered common with other scripts. For 
instance the input "44사이즈" is kept as is even though "사이즈" is part of the 
dictionary. We should restore the original behavior and split any unknown 
words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]


> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>
>                 Key: LUCENE-8966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8966
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin), but the rule doesn't work well on digits since they are 
> considered common with other scripts. For instance, the input "44사이즈" is 
> kept as is even though "사이즈" is part of the dictionary. We should restore 
> the original behavior and split any unknown word if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
