[
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706907#comment-16706907
]
Jim Ferenczi commented on LUCENE-8548:
--------------------------------------
{quote}
I still have one more question, could you please explain what information is
contained in the {{wordIdRef}} variable and what the
{{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The
debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even
though 42 is a really great number, I'm a tiny bit lost in there ...
{quote}
The {{wordIdRef}} is an indirection that maps an id to an array of word ids.
These word ids are the candidates for the current word: for known words, each
word id points to the part of speech, cost, etc. of one variant of the word
(noun, verb, ...), and for character ids it points to the cost and part of
speech used for words written with that character class (mapped from the
Unicode scripts). I don't know how you found 36 or 42, but unknown words
should always have a single wordIdRef.
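To make the indirection concrete, here is a minimal sketch of how that lookup
is typically consumed. It assumes the {{UnknownDictionary}} and
{{CharacterDefinition}} classes from the analysis-ko module and {{IntsRef}}
from core; the exact accessors may differ slightly between Lucene versions, so
verify against your checkout:
{code:java}
import org.apache.lucene.analysis.ko.dict.CharacterDefinition;
import org.apache.lucene.analysis.ko.dict.UnknownDictionary;
import org.apache.lucene.util.IntsRef;

public class WordIdRefDemo {
  public static void main(String[] args) throws Exception {
    UnknownDictionary unkDictionary = UnknownDictionary.getInstance();
    CharacterDefinition charDef = unkDictionary.getCharacterDefinition();

    // Map a character to its character class (GREEK, CYRILLIC, HANGUL, ...).
    char characterId = charDef.getCharacterClass('ε');

    // wordIdRef is only a view into the dictionary's word-id array: after the
    // call, ints[offset .. offset+length) holds the candidate word ids for
    // this character class.
    IntsRef wordIdRef = new IntsRef();
    unkDictionary.lookupWordIds(characterId, wordIdRef);

    for (int i = 0; i < wordIdRef.length; i++) {
      int wordId = wordIdRef.ints[wordIdRef.offset + i];
      // Each word id indexes the connection ids, word cost and part of speech
      // of one candidate entry.
      System.out.println("wordId=" + wordId
          + " cost=" + unkDictionary.getWordCost(wordId)
          + " leftId=" + unkDictionary.getLeftId(wordId)
          + " rightId=" + unkDictionary.getRightId(wordId));
    }
  }
}
{code}
For known words the same indirection applies, with the FST output serving as
the source id instead of the character class.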
> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-8548.patch, screenshot-1.png,
> testCyrillicWord.dot.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite
> line up with Unicode character blocks), which leads to weird results for
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) +
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) +
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) +
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many
> more chances for false positives when the tokens are split up like this. In
> particular, individual numbers and combining diacritics are indexed
> separately (e.g., in the Cyrillic example above), which can lead to a
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits.
> Combining diacritics should not trigger token splits. Non-CJK text should be
> tokenized on spaces and punctuation, not by character type shifts.
> Apostrophe-like characters should not trigger token splits (though I could
> see someone disagreeing on this one).{noformat}
>
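As an illustration of the character-filter work-around quoted above, a sketch
along the following lines is possible; it only strips combining marks
({{\p{Mn}}}) before Nori sees the text, so as noted it does not help the
Greek, Hebrew, or English cases. The analyzer wiring here is an assumption
built from the stock {{PatternReplaceCharFilter}} and the no-arg
{{KoreanTokenizer}} constructor, not the patch attached to this issue:
{code:java}
import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StripDiacriticsDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KoreanTokenizer();
        return new TokenStreamComponents(tokenizer);
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Strip nonspacing marks (combining diacritics) before Nori runs,
        // so they cannot trigger SL/SY boundary splits.
        return new PatternReplaceCharFilter(Pattern.compile("\\p{Mn}"), "", reader);
      }
    };

    try (TokenStream ts = analyzer.tokenStream("f", "Ба̀лтичко̄")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}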