[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

Jim Ferenczi (JIRA) Thu, 22 Nov 2018 08:10:58 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696072#comment-16696072
 ]


Jim Ferenczi commented on LUCENE-8548:
--------------------------------------

Yes, we should not depend on the icu module. You need to implement something 
specific anyway: the tokenizer in the Nori module reads its input through a 
rolling buffer, so it has its own logic to inspect the underlying characters.

{quote}

Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with {{outputUnknownUnigrams}} 
parameter)

{quote}

 

Currently, Nori breaks whenever the character class changes. The character 
classes are defined in the MeCab model 
[https://bitbucket.org/eunjeon/mecab-ko-dic/src/df15a487444d88565ea18f8250330276497cc9b9/seed/char.def?at=master&fileviewer=file-view-default]
 and we access them through the {{CharacterDefinition}} class. You can see the 
logic that groups unknown words here:

[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java#L729]

So instead of splitting on character class, you need to extend the logic to 
break on script boundaries. If the new block contains multiple character 
classes (that are compatible with each other), you still need to choose one 
character id to look up the costs associated with that block:

[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java#L745]

In that case, picking the first character id in the block should be enough.
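
To make the idea concrete, here is a rough, standalone sketch of grouping by script instead of by character class. It is not the code in {{KoreanTokenizer}}: it works on a plain {{CharSequence}} rather than the rolling buffer, it relies on the JDK's {{Character.UnicodeScript}} rather than {{CharacterDefinition}}, and the class and method names ({{ScriptRuns}}, {{runEnd}}, {{isNeutral}}) are made up for illustration. It assumes that COMMON and INHERITED characters (digits, combining diacritics, most punctuation) should never start a new block, and it only stops at whitespace, so any extra punctuation handling is left out.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper: finds blocks of characters that share one Unicode script. */
public final class ScriptRuns {
  private ScriptRuns() {}

  /** Assumption: COMMON and INHERITED code points are compatible with any script. */
  static boolean isNeutral(UnicodeScript script) {
    return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
  }

  /**
   * Returns the exclusive end offset of the block that starts at {@code start}.
   * The block ends at whitespace or at the first code point whose script differs
   * from the first non-neutral script seen in the block.
   */
  static int runEnd(CharSequence text, int start) {
    UnicodeScript runScript = null; // unknown until a non-neutral code point is seen
    int pos = start;
    while (pos < text.length()) {
      int cp = Character.codePointAt(text, pos);
      if (Character.isWhitespace(cp)) {
        break;
      }
      UnicodeScript script = UnicodeScript.of(cp);
      if (!isNeutral(script)) {
        if (runScript == null) {
          runScript = script;
        } else if (script != runScript) {
          break; // script boundary: the block ends before this code point
        }
      }
      pos += Character.charCount(cp);
    }
    return pos;
  }
}
```

With a helper like this, the caller would still take the character at {{start}} (the first character of the block) to pick the character id used for the cost lookup, as described above.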

 

 

> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
>                 Key: LUCENE-8548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

