heng-kuang-777 opened a new pull request, #17880: URL: https://github.com/apache/pinot/pull/17880
On consuming segments, Lucene operates in near-real-time mode, so recently ingested documents may not be visible to the IndexSearcher until the next SearcherManager refresh. When evaluating NOT TEXT_MATCH, the filter inversion ran over [0, numDocs), the full segment doc count, so tail documents that were not yet indexed appeared as false positives.

Fix by introducing `getSearchableDocCount()` on `TextIndexReader`. On realtime indexes it returns the number of documents currently visible to the Lucene searcher (updated on each refresh); on offline/sealed segments, where all docs are indexed, it returns -1. `TextMatchFilterOperator` now uses this count as the inversion universe instead of numDocs, so unindexed tail docs are excluded from NOT results.

Fixes #17809
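To illustrate the change in inversion universe described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `NotTextMatchSketch` and the method `invert` are hypothetical names, `java.util.BitSet` stands in for whatever bitmap type the operator actually uses, and the sketch assumes the visible documents form the prefix [0, searchableDocCount) of the segment. Clamping the count to numDocs is also an assumption made here for safety.

```java
import java.util.BitSet;

public class NotTextMatchSketch {

    /**
     * Hypothetical helper: computes the doc IDs matching NOT TEXT_MATCH.
     *
     * @param matchingDocIds     doc IDs that matched TEXT_MATCH
     * @param numDocs            total docs in the segment
     * @param searchableDocCount docs visible to the searcher, or -1 when
     *                           every doc is indexed (offline/sealed segment)
     */
    static BitSet invert(BitSet matchingDocIds, int numDocs, int searchableDocCount) {
        // Use the searchable count as the universe when it is known;
        // fall back to numDocs for the -1 sentinel.
        int universe = searchableDocCount >= 0
                ? Math.min(searchableDocCount, numDocs) // clamp is an assumption of this sketch
                : numDocs;

        BitSet result = (BitSet) matchingDocIds.clone();
        // Invert only inside [0, universe).
        result.flip(0, universe);
        // Drop any bits at or beyond the universe so tail docs never appear.
        result.clear(universe, Math.max(universe, result.length()));
        return result;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        matches.set(1);
        matches.set(3);

        // Consuming segment: 8 docs ingested, only the first 5 searchable.
        System.out.println(invert(matches, 8, 5));  // {0, 2, 4}

        // Sealed segment: -1 means all 8 docs are indexed.
        System.out.println(invert(matches, 8, -1)); // {0, 2, 4, 5, 6, 7}
    }
}
```

In the first call, docs 5 through 7 are ingested but not yet visible to the searcher, so they are left out of the NOT result instead of being reported as false positives.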
