Jim Ferenczi created LUCENE-8548:
------------------------------------

             Summary: Reevaluate scripts boundary break in Nori's tokenizer
                 Key: LUCENE-8548
                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Jim Ferenczi


This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
{noformat}
Tokens are split on different character POS types (which seem to not quite line 
up with Unicode character blocks), which leads to weird results for non-CJK 
tokens:

εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) 
+ μί/SL(Foreign language)
ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
k/SL(Foreign language) + ̚/SY(Other symbol)
Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
лтичко/SL(Foreign language) + ̄/SY(Other symbol)
don't is tokenized as don + t; same for don’t (with a curly apostrophe).
אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
While it is still possible to find these words using Nori, there are many more 
chances for false positives when the tokens are split up like this. In 
particular, individual numbers and combining diacritics are indexed separately 
(e.g., in the Cyrillic example above), which can lead to a performance hit on 
large corpora like Wiktionary or Wikipedia.

Workaround: use a character filter to get rid of combining diacritics before 
Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
cases, though.

Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
Combining diacritics should not trigger token splits. Non-CJK text should be 
tokenized on spaces and punctuation, not by character type shifts. 
Apostrophe-like characters should not trigger token splits (though I could see 
someone disagreeing on this one).{noformat}
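
For reference, a quick way to see these splits is to run text through KoreanTokenizer directly and print each token with its left POS tag, along these lines (a minimal sketch, assuming lucene-analyzers-nori and its dependencies are on the classpath; the class name and sample strings are only for illustration):
{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoriScriptSplitRepro {
  public static void main(String[] args) throws IOException {
    // Sample strings taken from the report above.
    String[] samples = {"εἰμί", "Ба̀лтичко̄", "don't", "Мoscow"};
    for (String text : samples) {
      try (KoreanTokenizer tokenizer = new KoreanTokenizer()) {
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
        tokenizer.setReader(new StringReader(text));
        tokenizer.reset();
        StringBuilder out = new StringBuilder(text).append(" ->");
        while (tokenizer.incrementToken()) {
          // For the Greek example this should print something like: ε/SL ἰ/SY μί/SL
          out.append(' ').append(term).append('/').append(pos.getLeftPOS());
        }
        tokenizer.end();
        System.out.println(out);
      }
    }
  }
}
{code}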
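
The character-filter workaround mentioned in the report could look roughly like this (a sketch only; it strips Unicode combining marks, category Mn, with PatternReplaceCharFilter from analyzers-common before the text reaches the tokenizer, and as noted above it does not help the Greek, Hebrew, or apostrophe cases; the analyzer class name is arbitrary):
{code:java}
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

public class DiacriticStrippingKoreanAnalyzer extends Analyzer {

  // Remove combining marks (Unicode general category Mn) before tokenization,
  // so inputs like Ба̀лтичко̄ are no longer broken on the diacritics.
  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{Mn}");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new PatternReplaceCharFilter(COMBINING_MARKS, "", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    return new TokenStreamComponents(tokenizer);
  }
}
{code}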