DuplicateFilter has been mostly broken since Lucene's switch over to
segment-level filtering.
Since v2.9, calls to Filter.getDocIdSet no longer pass a "top-level" reader
that can access the whole index. Instead they pass a reader restricted to
a single segment's contents.
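To illustrate why per-segment calls matter for duplicate removal, here is a
rough sketch in plain Java. It does not use Lucene at all: the class
SegmentDedupSketch and its methods are hypothetical names made up for this
example, segments are modelled as simple lists of key values, and the
"keep the first occurrence" rule is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, NOT Lucene code: each inner list stands for the
// duplicate-key values of the documents in one segment.
public class SegmentDedupSketch {

    // Keep the first document seen for each key within a single list,
    // mimicking a filter that only sees the reader it was handed.
    public static List<String> dedupOneReader(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (seen.add(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    // Segment-level filtering: the filter is invoked once per segment
    // and has no memory of keys it saw in other segments.
    public static List<String> dedupPerSegment(List<List<String>> segments) {
        List<String> kept = new ArrayList<>();
        for (List<String> segment : segments) {
            kept.addAll(dedupOneReader(segment));
        }
        return kept;
    }

    // Top-level filtering (or an index merged down to one segment):
    // the filter sees every document in a single pass.
    public static List<String> dedupMerged(List<List<String>> segments) {
        List<String> all = new ArrayList<>();
        for (List<String> segment : segments) {
            all.addAll(segment);
        }
        return dedupOneReader(all);
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "c"));
        System.out.println(dedupPerSegment(segments)); // [a, b, b, c]
        System.out.println(dedupMerged(segments));     // [a, b, c]
    }
}
```

In this model the key "b" survives twice under per-segment filtering because
it occurs in both segments, while a single pass over all documents removes it.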
Becaus
https://issues.apache.org/jira/browse/LUCENE-2348 suggests there are
long-standing and probably still current issues with DuplicateFilter
and multiple segments. I'm not sure if this could explain what you
are seeing. You could try calling optimize(1) on your index writer,
which merges the index down to a single segment, and see if that makes
a difference.