[PR] Add inverted-index-based distinct operator with runtime cost heuristic [pinot]

via GitHub Fri, 13 Mar 2026 01:27:50 -0700


xiangfu0 opened a new pull request, #17872:
URL: https://github.com/apache/pinot/pull/17872


   ## Summary
   - Merge inverted index logic into `DistinctOperator` with a runtime cost 
heuristic that chooses between inverted-index and scan paths based on 
dictionary cardinality vs filtered doc count
   - Default cost ratio of 5 (derived from JMH benchmarking) can be overridden 
via the `invertedIndexDistinctCostRatio` query option
   - Opt-in via: `OPTION(useIndexBasedDistinctOperator=true)`
   
   ## Details
   
   ### How it works
   When `useIndexBasedDistinctOperator=true` is set and the column has both a 
dictionary and an inverted index, `DistinctOperator` evaluates a cost heuristic 
at runtime:
   - **Inverted index path**: iterates dictionary entries, does bitmap 
intersection with filter for each — cost proportional to `dictionaryCardinality`
   - **Scan path**: iterates filtered docs through projection — cost 
proportional to `filteredDocCount`
   - Decision: use inverted index when `dictionaryCardinality * costRatio <= 
filteredDocCount`
   
   The default `costRatio=5` was derived from JMH benchmarking 
(`BenchmarkInvertedIndexDistinct`) which showed the crossover varies by 
dictionary size (~30x for small dicts, ~5-10x for large dicts).
   
   ### Query options
   | Option | Type | Default | Description |
   |--------|------|---------|-------------|
   | `useIndexBasedDistinctOperator` | boolean | `false` | Opt-in to 
inverted-index-based distinct |
   | `invertedIndexDistinctCostRatio` | int | `5` | Cost ratio for the 
heuristic; higher values bias toward scan path |
   
   ### Files changed
   - `DistinctOperator.java` — dual-path execution with cost heuristic and 
query option override
   - `DistinctPlanNode.java` — wires inverted index context to DistinctOperator 
when eligible
   - `CommonConstants.java` / `QueryOptionsUtils.java` — new query option keys 
and getters
   - `OfflineClusterIntegrationTest.java` — 3 integration tests for correctness 
validation
   - `InvertedIndexDistinctCostHeuristicTest.java` — 8 unit tests for cost 
heuristic boundary behavior
   - `BenchmarkInvertedIndexDistinct.java` — JMH benchmark for inverted index 
vs scan crossover analysis
   
   ## Test plan
   - [x] Unit tests: `InvertedIndexDistinctCostHeuristicTest` (8 tests covering 
cost ratio boundaries, path selection, correctness)
   - [x] Existing tests: `DistinctQueriesTest` passes unchanged
   - [ ] Integration tests: `OfflineClusterIntegrationTest` (3 new tests for 
inverted index distinct with/without filter)
   - [ ] Verify no regression in existing distinct query behavior (opt-in only, 
disabled by default)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Add inverted-index-based distinct operator with runtime cost heuristic [pinot]

Reply via email to