[
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701958#comment-16701958
]
Christophe Bismuth commented on LUCENE-8548:
--------------------------------------------
Thanks a lot for sharing this [~jim.ferenczi], and no worries at all, the first
iteration was an interesting journey! I think taking the time to read about the
Viterbi algorithm would help me some more, so let's add it to my own todo list :D
I diffed your patch against {{master}} and stepped through the new tests with a
debugger, and I think I understand the big picture. Among other things, I had
completely missed the {{if (isCommonOrInherited(scriptCode) &&
isCommonOrInherited(sc) == false)}} condition, which is essential.
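To check that I read it right, here is a minimal standalone sketch of that rule as I
understand it (the helper below is my own paraphrase built on
{{java.lang.Character.UnicodeScript}}, not the actual patch code):
{code:java}
import java.lang.Character.UnicodeScript;

// My own paraphrase of the script-attachment rule, outside the tokenizer loop.
public class ScriptBoundarySketch {

  // COMMON (punctuation, digits, ...) and INHERITED (combining marks)
  // carry no concrete script of their own.
  static boolean isCommonOrInherited(UnicodeScript script) {
    return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
  }

  // scriptCode: script of the current run, sc: script of the next character.
  // A split only happens when both sides carry a concrete script and they
  // differ; when the run's script is still COMMON/INHERITED but the next
  // character is concrete (the condition quoted above), the run adopts the
  // concrete script instead of breaking.
  static boolean sameScriptRun(UnicodeScript scriptCode, UnicodeScript sc) {
    if (isCommonOrInherited(sc)) {
      return true; // combining diacritics, apostrophes, ... attach to the run
    }
    if (isCommonOrInherited(scriptCode)) {
      return true; // the run had no concrete script yet, it adopts the new one
    }
    return scriptCode == sc;
  }

  public static void main(String[] args) {
    UnicodeScript latin = UnicodeScript.of('a');
    UnicodeScript combiningGrave = UnicodeScript.of('\u0300'); // INHERITED
    UnicodeScript cyrillicEm = UnicodeScript.of('\u043C');     // Cyrillic 'м'
    System.out.println(sameScriptRun(latin, combiningGrave)); // true: no split
    System.out.println(sameScriptRun(latin, cyrillicEm));     // false: split
  }
}
{code}
If I got the intent backwards somewhere, please tell me.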
I still have one more question: could you please explain what information is
contained in the {{wordIdRef}} variable and what the
{{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The
debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even
though 42 is a really great number, I'm a tiny bit lost in there ...
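To make the question concrete, here is how far I got while stepping through
{{KoreanTokenizer}}; the standalone scaffolding below (loading the bundled
dictionary through {{UnknownDictionary.getInstance()}}) is only my guess at how
to poke at it outside the tokenizer:
{code:java}
import org.apache.lucene.analysis.ko.dict.CharacterDefinition;
import org.apache.lucene.analysis.ko.dict.UnknownDictionary;
import org.apache.lucene.util.IntsRef;

// My current reading of the unknown-word lookup, outside the tokenizer.
public class UnknownLookupSketch {
  public static void main(String[] args) throws Exception {
    UnknownDictionary unkDictionary = UnknownDictionary.getInstance();
    CharacterDefinition charDef = unkDictionary.getCharacterDefinition();

    // 1. Map the character to its character class (HANGUL, GREEK, CYRILLIC, ...).
    int characterId = charDef.getCharacterClass('\u03B5'); // Greek 'ε'

    // 2. lookupWordIds() points wordIdRef at the list of unknown-word entries
    //    registered for that character class, is that right?
    IntsRef wordIdRef = new IntsRef();
    unkDictionary.lookupWordIds(characterId, wordIdRef);

    // 3. If I understand correctly, each word id is one candidate entry with
    //    its own connection ids and word cost, and the tokenizer adds them all
    //    to the lattice so that Viterbi can pick the cheapest path.
    for (int i = 0; i < wordIdRef.length; i++) {
      int wordId = wordIdRef.ints[wordIdRef.offset + i];
      System.out.println(wordId
          + " leftId=" + unkDictionary.getLeftId(wordId)
          + " rightId=" + unkDictionary.getRightId(wordId)
          + " cost=" + unkDictionary.getWordCost(wordId));
    }
  }
}
{code}
Does {{wordIdRef.length}} then simply reflect how many entries are declared for
that character class in the unknown-word definitions?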
> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8548.patch, screenshot-1.png,
> testCyrillicWord.dot.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite
> line up with Unicode character blocks), which leads to weird results for
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) +
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) +
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many
> more chances for false positives when the tokens are split up like this. In
> particular, individual numbers and combining diacritics are indexed
> separately (e.g., in the Cyrillic example above), which can lead to a
> performance hit on large corpora like Wiktionary or Wikipedia.
> Workaround: use a character filter to get rid of combining diacritics before
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits.
> Combining diacritics should not trigger token splits. Non-CJK text should be
> tokenized on spaces and punctuation, not by character type shifts.
> Apostrophe-like characters should not trigger token splits (though I could
> see someone disagreeing on this one).{noformat}
>