[
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697336#comment-16697336
]
Christophe Bismuth commented on LUCENE-8548:
--------------------------------------------
I've made some progress and opened PR
[#505|https://github.com/apache/lucene-solr/pull/505] to share it with you.
Feel free to stop me at any point, as I don't want to waste your time.
Here is what has been done so far:
* Break on script boundaries using the built-in JDK API (see the first sketch below),
* Track character classes in a growing byte array (see the second sketch below),
* I feel a tiny bit lost when it comes to extracting costs: should I call
{{unkDictionary.lookupWordIds(characterId, wordIdRef)}} for each tracked
character class?
* The {{мoscow}} word is correctly parsed in the Graphviz output below ...
* ... but the test still fails at this
[line|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java#L199],
and I still have to understand why.
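To make the first point a bit more concrete, here is a rough standalone sketch (illustrative names only, not the actual PR code) of the boundary check I have in mind: {{Character.UnicodeScript}} gives the script of each code point, and COMMON/INHERITED code points (punctuation, combining diacritics) are treated as neutral so they never force a break on their own.
{code:java}
// Rough sketch only (names are illustrative, not the PR code): detect script
// boundaries with the built-in JDK API, treating COMMON/INHERITED code points
// (punctuation, combining diacritics) as neutral so they never force a break.
import java.lang.Character.UnicodeScript;

public class ScriptBoundarySketch {

  static boolean isCommonOrInherited(UnicodeScript script) {
    return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
  }

  static boolean sameScript(int prevCp, int cp) {
    UnicodeScript prev = UnicodeScript.of(prevCp);
    UnicodeScript cur = UnicodeScript.of(cp);
    return prev == cur || isCommonOrInherited(prev) || isCommonOrInherited(cur);
  }

  public static void main(String[] args) {
    // Cyrillic letters with combining diacritics: the combining marks are
    // INHERITED, so no boundary is reported inside the word.
    String word = "Ба̀лтичко̄";
    int prev = word.codePointAt(0);
    for (int i = Character.charCount(prev); i < word.length(); ) {
      int cp = word.codePointAt(i);
      if (!sameScript(prev, cp)) {
        System.out.println("script boundary before offset " + i);
      }
      prev = cp;
      i += Character.charCount(cp);
    }
  }
}
{code}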
!screenshot-1.png!
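And here is, equally roughly, how I track the character classes and where the cost question above comes from. The class name and wiring are made up for illustration; {{charDef.getCharacterClass(c)}} and {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} are the Nori dictionary calls as I understand them, so please correct me if I misread them.
{code:java}
// Sketch only, with made-up class and field names: track the unknown-word
// character class of each scanned character in a growing byte array, then
// (this is my question) look up the unknown word IDs per tracked class.
import org.apache.lucene.analysis.ko.dict.CharacterDefinition;
import org.apache.lucene.analysis.ko.dict.UnknownDictionary;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.IntsRef;

class CharacterClassTrackingSketch {

  private final UnknownDictionary unkDictionary;
  private final CharacterDefinition charDef;

  // Character classes of the characters scanned so far, grown on demand.
  private byte[] characterClasses = new byte[16];
  private int length = 0;

  CharacterClassTrackingSketch(UnknownDictionary unkDictionary, CharacterDefinition charDef) {
    this.unkDictionary = unkDictionary;
    this.charDef = charDef;
  }

  /** Record the unknown-word character class of one scanned character. */
  void track(char c) {
    characterClasses = ArrayUtil.grow(characterClasses, length + 1);
    characterClasses[length++] = charDef.getCharacterClass(c);
  }

  /** The open question: call lookupWordIds once per tracked character class? */
  void lookupCosts() {
    IntsRef wordIdRef = new IntsRef();
    for (int i = 0; i < length; i++) {
      int characterId = characterClasses[i];
      unkDictionary.lookupWordIds(characterId, wordIdRef);
      // ... wordIdRef would then feed the lattice / cost computation ...
    }
  }
}
{code}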
> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: screenshot-1.png, testCyrillicWord.dot.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite
> line up with Unicode character blocks), which leads to weird results for
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) +
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) +
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many
> more chances for false positives when the tokens are split up like this. In
> particular, individual numbers and combining diacritics are indexed
> separately (e.g., in the Cyrillic example above), which can lead to a
> performance hit on large corpora like Wiktionary or Wikipedia.
> Workaround: use a character filter to get rid of combining diacritics before
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits.
> Combining diacritics should not trigger token splits. Non-CJK text should be
> tokenized on spaces and punctuation, not by character type shifts.
> Apostrophe-like characters should not trigger token splits (though I could
> see someone disagreeing on this one).{noformat}
>