[jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

Uwe Schindler (JIRA) Sat, 07 Jan 2017 07:13:21 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807640#comment-15807640
 ]


Uwe Schindler commented on LUCENE-7622:
---------------------------------------

Hi Robert. I know that you can tune this. Maybe I was a bit unclear. I wanted to 
say that, unlike with the stupid CrappyDefaultSim, it is no longer possible to 
boost terms almost without limit (for example, a document containing the same 
term 10000 times no longer beats all others). So repeating terms at the same 
position with a repeater token filter is still useful, but the effect is no 
longer so drastic. Sorry for being unclear. 🤓 Maybe I will change or remove 
the last sentence of my comment to clear up the misunderstanding.
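
To put rough numbers on the saturation point, here is a minimal sketch. It assumes that "CrappyDefaultSim" means the old TF-IDF default, whose term-frequency factor is sqrt(freq), and that the comparison is with BM25 using k1 = 1.2. Length normalization is fixed at 1 for simplicity, and the class and method names are made up for illustration.

```java
public class TermFrequencySaturation {

    /** Term-frequency factor of classic TF-IDF scoring: grows as sqrt(freq). */
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    /**
     * Term-frequency factor of BM25, with the length normalization term
     * fixed at 1 (assumption: the document has average length).
     * The value approaches k1 + 1 as freq grows.
     */
    static double bm25Tf(double freq, double k1) {
        return freq * (k1 + 1) / (freq + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed BM25 parameter value
        for (int freq : new int[] {1, 10, 100, 10000}) {
            System.out.printf("freq=%d classic=%.2f bm25=%.2f%n",
                    freq, classicTf(freq), bm25Tf(freq, k1));
        }
    }
}
```

Under these assumptions, 10000 repetitions give a factor of 100 with the sqrt-based formula but stay below 2.2 with BM25, so repeating a term still helps, just far less.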

> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-7622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple, to catch any case where the same term 
> text spans from the same position with the same position length. Such 
> duplicate tokens are silly to add to the index, or to search at search time.
> Yet, this change produced many failures, and I looked briefly at them, and 
> they are cases that I think are actually OK, e.g. 
> {{PatternCaptureGroupTokenFilter}} capturing (..)(..) on the string {{ktkt}} 
> will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
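
The duplicate condition described in the issue (same term text, same position, same position length) can be sketched in plain Java as follows. This is not the attached patch and does not use the Lucene API. The Token class and findDuplicates method are hypothetical stand-ins, and the sample tokens assume the filter keeps the original token alongside both capture groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateTokenCheck {

    /** Hypothetical stand-in for the attributes a token stream exposes. */
    static final class Token {
        final String term;
        final int position;       // absolute position (summed position increments)
        final int positionLength;

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** Returns every token whose (term, position, positionLength) was already seen. */
    static List<Token> findDuplicates(List<Token> tokens) {
        Set<List<Object>> seen = new HashSet<>();
        List<Token> duplicates = new ArrayList<>();
        for (Token t : tokens) {
            List<Object> key = Arrays.<Object>asList(t.term, t.position, t.positionLength);
            if (!seen.add(key)) {
                duplicates.add(t);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Capturing (..)(..) on "ktkt": both groups yield "kt" at the same position.
        List<Token> tokens = Arrays.asList(
                new Token("ktkt", 0, 1),
                new Token("kt", 0, 1),
                new Token("kt", 0, 1));
        System.out.println(findDuplicates(tokens).size()); // prints 1
    }
}
```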



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

