Michael McCandless created LUCENE-7622:
------------------------------------------
Summary: Should BaseTokenStreamTestCase catch analyzers that
create duplicate tokens?
Key: LUCENE-7622
URL: https://issues.apache.org/jira/browse/LUCENE-7622
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
The change to BaseTokenStreamTestCase (BTSTC) is quite simple: it catches any
case where the same term text starts at the same position with the same position
length. Such duplicate tokens are pointless to add to the index, or to query at
search time.
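
To make the duplicate criterion concrete, here is a minimal sketch of such a check. It is not the actual BTSTC change and does not use Lucene classes: {{Token}} and {{findDuplicates}} are hypothetical stand-ins for the term, position increment, and position length attributes that a token stream exposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateTokenCheck {

    /** Hypothetical stand-in for the per-token attributes of a token stream. */
    static final class Token {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // number of positions this token spans

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Returns one entry for each token that repeats an earlier
     * (term, start position, position length) triple.
     */
    static List<String> findDuplicates(List<Token> tokens) {
        Set<List<Object>> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        int position = -1; // the first increment of 1 moves this to position 0
        for (Token t : tokens) {
            position += t.posInc;
            List<Object> key = Arrays.<Object>asList(t.term, position, t.posLen);
            if (!seen.add(key)) {
                duplicates.add(t.term + " at position " + position
                        + " with length " + t.posLen);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("ktkt", 1, 1));
        tokens.add(new Token("kt", 0, 1));
        tokens.add(new Token("kt", 0, 1)); // same term, position, and length
        System.out.println(findDuplicates(tokens));
    }
}
```

A token with the same term at the same position but a different position length is not flagged by this sketch, which matches the criterion as stated in the description.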
Yet this change produced many failures. I looked briefly at them, and some are
cases that I think are actually OK, e.g. {{PatternCaptureGroupTokenFilter}}
capturing {{(..)(..)}} on the string {{ktkt}} will create a duplicate token.
Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
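
The {{ktkt}} case can be reproduced with plain {{java.util.regex}}, without Lucene. The sketch below only shows that the two capture groups yield identical text; the class and method names are hypothetical, and how the filter turns each group into a token is taken from the description above, not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {

    /** Collects the text of every non-null capture group for each match of the regex. */
    static List<String> captureGroups(String regex, String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // Group 0 is the whole match; capture groups start at 1.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    groups.add(m.group(g));
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Both groups capture "kt", so the two captured strings are identical.
        System.out.println(captureGroups("(..)(..)", "ktkt")); // prints [kt, kt]
    }
}
```

Since the duplicate here follows directly from what the user's pattern asks for, it is arguably legitimate output, not an analyzer bug.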