Michael McCandless created LUCENE-7622:
------------------------------------------

             Summary: Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
                 Key: LUCENE-7622
                 URL: https://issues.apache.org/jira/browse/LUCENE-7622
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael McCandless


The change to BaseTokenStreamTestCase (BTSTC) is quite simple: catch any case 
where the same term text appears at the same position with the same position 
length. Such duplicate tokens are pointless to add to the index, or to query at 
search time.
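
To make the proposed check concrete, here is a rough sketch of the idea in plain Java. It is not the actual BTSTC change: the {{Token}} class and {{findDuplicates}} method are hypothetical stand-ins for the term, position increment and position length attributes of a real token stream, and the sketch only assumes that a duplicate means the same (term, absolute position, position length) triple seen twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateTokenCheck {

  /** Hypothetical stand-in for the term, position increment and position length attributes. */
  static final class Token {
    final String term;
    final int posInc;
    final int posLen;

    Token(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** Describes every token that repeats an earlier (term, position, posLen) triple. */
  static List<String> findDuplicates(List<Token> tokens) {
    Set<String> seen = new HashSet<>();
    List<String> duplicates = new ArrayList<>();
    int pos = -1; // absolute position; a first token with posInc=1 lands on 0
    for (Token t : tokens) {
      pos += t.posInc;
      // Numeric fields come first, so the key stays unambiguous for any term text.
      String key = pos + "/" + t.posLen + "/" + t.term;
      if (!seen.add(key)) {
        duplicates.add("term=" + t.term + " pos=" + pos + " posLen=" + t.posLen);
      }
    }
    return duplicates;
  }

  public static void main(String[] args) {
    List<Token> tokens = Arrays.asList(
        new Token("ktkt", 1, 1),
        new Token("kt", 0, 1),
        new Token("kt", 0, 1));
    System.out.println(findDuplicates(tokens)); // prints [term=kt pos=0 posLen=1]
  }
}
```

A real check would presumably read these values from the stream's attributes and fail the test on the first duplicate rather than collect them.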

Yet this change produced many failures. I looked briefly at them, and some are 
cases that I think are actually OK, e.g. {{PatternCaptureGroupTokenFilter}} 
capturing {{(..)(..)}} on the string {{ktkt}} will create a duplicate token.
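
The {{ktkt}} case can be reproduced with {{java.util.regex}} alone. The sketch below does not use {{PatternCaptureGroupTokenFilter}}; it only shows that the two capture groups yield identical text, and it assumes the filter emits each captured group as a token at the position of the original token. The {{captures}} helper is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDuplicates {

  /** Collects the text of every non-null capture group for each match of the pattern. */
  static List<String> captures(String regex, String input) {
    List<String> out = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(input);
    while (m.find()) {
      for (int g = 1; g <= m.groupCount(); g++) {
        if (m.group(g) != null) {
          out.add(m.group(g));
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Both groups match "kt", so the same term text is produced twice.
    System.out.println(captures("(..)(..)", "ktkt")); // prints [kt, kt]
  }
}
```

Under that assumption the two {{kt}} tokens share term text, position and position length, which is exactly what the new check flags, even though the pattern and input are both reasonable.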

Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.




