[
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807508#comment-15807508
]
Robert Muir commented on LUCENE-7622:
-------------------------------------
BM25 does not make this harder. It just normalizes term frequency in a way that
isn't as brain dead as {{sqrt}}. And unlike Crappy^H^H^H^HDefaultSimilarity,
it's totally tunable without modifying source code, e.g. adjust the {{k1}} parameter
to your needs.
Sorry, you are wrong: it only makes this kind of thing way easier.
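
To make the comparison concrete, here is a minimal sketch of the two term-frequency curves in plain Java with no Lucene dependency. The class and method names are hypothetical, and the BM25 expression is the textbook form rather than a copy of Lucene's implementation. The parameter {{b}} and the document lengths are assumptions added so the formula is complete; in Lucene itself the corresponding knobs are the {{k1}} and {{b}} arguments of the {{BM25Similarity}} constructor.

```java
// Sketch only: illustrates sqrt(tf) versus the textbook BM25 tf factor.
// Names are hypothetical; this is not Lucene's scoring code.
final class TfNormalizationSketch {

    // sqrt-style term-frequency factor: keeps growing as freq grows.
    static double sqrtTf(double freq) {
        return Math.sqrt(freq);
    }

    // Textbook BM25 term-frequency factor:
    //   freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgDocLen))
    // It saturates toward (k1 + 1) as freq grows, so repeated occurrences
    // of a term contribute less and less.
    static double bm25Tf(double freq, double k1, double b,
                         double docLen, double avgDocLen) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return freq * (k1 + 1.0) / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double b = 0.75;          // assumed length-normalization weight
        double docLen = 100.0;    // assumed: document of average length
        double avgDocLen = 100.0;
        for (int freq : new int[] {1, 2, 4, 16, 64}) {
            System.out.printf(
                "freq=%2d  sqrt=%.3f  bm25(k1=1.2)=%.3f  bm25(k1=0.4)=%.3f%n",
                freq,
                sqrtTf(freq),
                bm25Tf(freq, 1.2, b, docLen, avgDocLen),
                bm25Tf(freq, 0.4, b, docLen, avgDocLen));
        }
    }
}
```

Lowering {{k1}} flattens the curve sooner, which is the kind of tuning that needs no source change.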
> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> ----------------------------------------------------------------------------
>
> Key: LUCENE-7622
> URL: https://issues.apache.org/jira/browse/LUCENE-7622
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple, to catch any case where the same term
> text spans from the same position with the same position length. Such
> duplicate tokens are silly to add to the index, or to search at search time.
> Yet, this change produced many failures, and I looked briefly at them, and
> they are cases that I think are actually OK, e.g.
> {{PatternCaptureGroupTokenFilter}} capturing (..)(..) on the string {{ktkt}}
> will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
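
The duplicate check described above (same term text, same start position, same position length) can be sketched roughly as follows. This is not the attached patch: the {{Token}} class and {{findDuplicates}} are hypothetical stand-ins for the attributes a real TokenStream exposes, and the token list in {{main}} is modeled by hand on the assumption that both captures of (..)(..) on {{ktkt}} are stacked at the same position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: flags tokens that repeat an earlier (term, position, positionLength).
final class DuplicateTokenSketch {

    // Hypothetical token model; position is the absolute start position.
    static final class Token {
        final String term;
        final int position;
        final int positionLength;

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }

        // Key that is identical exactly when all three fields match.
        String key() {
            return position + ":" + positionLength + ":" + term;
        }
    }

    // Returns every token whose key was already seen earlier in the stream.
    static List<Token> findDuplicates(List<Token> tokens) {
        Set<String> seen = new HashSet<>();
        List<Token> duplicates = new ArrayList<>();
        for (Token t : tokens) {
            if (!seen.add(t.key())) {
                duplicates.add(t);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hand-modeled output for (..)(..) on "ktkt": the original token
        // plus two identical captures at the same position (an assumption).
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("ktkt", 0, 1));
        tokens.add(new Token("kt", 0, 1));
        tokens.add(new Token("kt", 0, 1));
        for (Token dup : findDuplicates(tokens)) {
            System.out.println("duplicate token: " + dup.term
                + " at position " + dup.position);
        }
    }
}
```

A token with the same text at a different position, or with a different position length, has a different key and is not flagged.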