I agree with Robert: let's not reinvent solutions that are already solved elsewhere. In an ideal world, wouldn't you want to be able to delegate tokenization of the Latin-script portions to StandardTokenizer? I know that's not possible today, and I wouldn't derail the work here trying to make it happen, since it would be a big shift, but personally I'd like to see some more discussion about composing Tokenizers.
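To make that a bit more concrete, here is a rough sketch of the script-run side of it, which is essentially what the ScriptIterator Robert points at below already does. This is my own toy code on top of ICU4J's UScript, not the actual Lucene class, and the ScriptRuns/Run names are made up: each codepoint gets a script value, and Common/Inherited characters (spaces, punctuation, combining marks) are folded into the surrounding run instead of starting a new one. A hypothetical composing tokenizer could then hand the Latin runs to StandardTokenizer and leave the Hangul runs to Nori.

import com.ibm.icu.lang.UScript;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: cut text into script runs, folding Common/Inherited into the surrounding run. */
public class ScriptRuns {

  /** A [start, end) slice of the input plus its resolved UScript code. */
  public static final class Run {
    public final int start, end, script;
    Run(int start, int end, int script) {
      this.start = start; this.end = end; this.script = script;
    }
  }

  public static List<Run> split(String text) {
    List<Run> runs = new ArrayList<>();
    int runStart = 0;
    int runScript = UScript.COMMON;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      int script = UScript.getScript(cp);
      boolean special = script == UScript.COMMON || script == UScript.INHERITED;
      if (!special && script != runScript && runScript != UScript.COMMON) {
        // real script change: close the current run and start a new one
        runs.add(new Run(runStart, i, runScript));
        runStart = i;
        runScript = script;
      } else if (!special && runScript == UScript.COMMON) {
        // first "real" script seen in this run
        runScript = script;
      }
      // Common/Inherited chars (spaces, punctuation, combining marks) never start a run
      i += Character.charCount(cp);
    }
    if (runStart < text.length()) {
      runs.add(new Run(runStart, text.length(), runScript));
    }
    return runs;
  }

  public static void main(String[] args) {
    String text = "서울 don't Ба̀лтичко̄";
    for (Run r : split(text)) {
      System.out.println(UScript.getName(r.script) + ": [" + text.substring(r.start, r.end) + "]");
    }
  }
}

The real ScriptIterator has more corner-case handling than this, but the shape is the same. (I also pasted a quick check of the script values ICU reports for the characters from the quoted report, below Robert's mail.)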
On Fri, Oct 26, 2018 at 3:53 AM Robert Muir (JIRA) <[email protected]> wrote:

>     [ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665014#comment-16665014 ]
>
> Robert Muir commented on LUCENE-8548:
> -------------------------------------
>
> As far as the suggested fix, why reinvent the wheel? In unicode each character gets assigned a script integer value. But there are special values such as "Common" and "Inherited", etc.
>
> See [https://unicode.org/reports/tr24/] or icutokenizer code
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java#L141]
>
> > Reevaluate scripts boundary break in Nori's tokenizer
> > -----------------------------------------------------
> >
> >                 Key: LUCENE-8548
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
> >             Project: Lucene - Core
> >          Issue Type: Improvement
> >            Reporter: Jim Ferenczi
> >            Priority: Minor
> >
> > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> > {noformat}
> > Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> >
> > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> >
> > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> >
> > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> >
> > don't is tokenized as don + t; same for don't (with a curly apostrophe).
> >
> > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> >
> > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> >
> > While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> >
> > Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> >
> > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).{noformat}
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
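For reference, here is what ICU4J reports for (what I believe are) the characters involved in the examples quoted above; the ScriptCheck class and the character list are just mine for illustration. The combining marks come back as Inherited and the Western apostrophes as Common (the Hebrew geresh is simply Hebrew), which is why the TR24-style handling would cover most of the reported cases without any block-by-block special casing:

import com.ibm.icu.lang.UScript;

// Quick check: print the Unicode script value (per UTR #24 / ICU) for some of the
// characters from the report. The character list is just my reading of the examples.
public class ScriptCheck {
  public static void main(String[] args) {
    int[] codepoints = {
      0x03B5, // GREEK SMALL LETTER EPSILON           -> Greek
      0x1F30, // GREEK SMALL LETTER IOTA WITH PSILI   -> Greek (Greek Extended block)
      0x0255, // LATIN SMALL LETTER C WITH CURL       -> Latin (IPA Extensions block)
      0x0300, // COMBINING GRAVE ACCENT               -> Inherited
      0x0304, // COMBINING MACRON                     -> Inherited
      0x0027, // APOSTROPHE                           -> Common
      0x2019, // RIGHT SINGLE QUOTATION MARK          -> Common
      0x05F3, // HEBREW PUNCTUATION GERESH            -> Hebrew
      0x041C, // CYRILLIC CAPITAL LETTER EM           -> Cyrillic
    };
    for (int cp : codepoints) {
      System.out.printf("U+%04X  %s%n", cp, UScript.getName(UScript.getScript(cp)));
    }
  }
}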
