xiangfu0 opened a new pull request, #17872: URL: https://github.com/apache/pinot/pull/17872
## Summary - Merge inverted index logic into `DistinctOperator` with a runtime cost heuristic that chooses between inverted-index and scan paths based on dictionary cardinality vs filtered doc count - Default cost ratio of 5 (derived from JMH benchmarking) can be overridden via the `invertedIndexDistinctCostRatio` query option - Opt-in via: `OPTION(useIndexBasedDistinctOperator=true)` ## Details ### How it works When `useIndexBasedDistinctOperator=true` is set and the column has both a dictionary and an inverted index, `DistinctOperator` evaluates a cost heuristic at runtime: - **Inverted index path**: iterates dictionary entries, does bitmap intersection with filter for each — cost proportional to `dictionaryCardinality` - **Scan path**: iterates filtered docs through projection — cost proportional to `filteredDocCount` - Decision: use inverted index when `dictionaryCardinality * costRatio <= filteredDocCount` The default `costRatio=5` was derived from JMH benchmarking (`BenchmarkInvertedIndexDistinct`) which showed the crossover varies by dictionary size (~30x for small dicts, ~5-10x for large dicts). ### Query options | Option | Type | Default | Description | |--------|------|---------|-------------| | `useIndexBasedDistinctOperator` | boolean | `false` | Opt-in to inverted-index-based distinct | | `invertedIndexDistinctCostRatio` | int | `5` | Cost ratio for the heuristic; higher values bias toward scan path | ### Files changed - `DistinctOperator.java` — dual-path execution with cost heuristic and query option override - `DistinctPlanNode.java` — wires inverted index context to DistinctOperator when eligible - `CommonConstants.java` / `QueryOptionsUtils.java` — new query option keys and getters - `OfflineClusterIntegrationTest.java` — 3 integration tests for correctness validation - `InvertedIndexDistinctCostHeuristicTest.java` — 8 unit tests for cost heuristic boundary behavior - `BenchmarkInvertedIndexDistinct.java` — JMH benchmark for inverted index vs scan crossover analysis ## Test plan - [x] Unit tests: `InvertedIndexDistinctCostHeuristicTest` (8 tests covering cost ratio boundaries, path selection, correctness) - [x] Existing tests: `DistinctQueriesTest` passes unchanged - [ ] Integration tests: `OfflineClusterIntegrationTest` (3 new tests for inverted index distinct with/without filter) - [ ] Verify no regression in existing distinct query behavior (opt-in only, disabled by default) 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
