zhuqi-lucas opened a new pull request, #22518:
URL: https://github.com/apache/datafusion/pull/22518
## Which issue does this PR close?
Phase 1 prototype for #22405.
## Rationale for this change
The current `skip_partial_aggregation_probe_ratio_threshold` (default 0.8)
is a single fixed knob: when measured `num_groups / input_rows` ≥ 0.8, partial
aggregation skips. This catches the high-cardinality / no-reduction case but
**misses the medium-ratio band** (~0.5–0.7) where partial aggregation is still
net-negative because *per-row cost* is high — heavy variable-length keys,
complex aggregates, etc.
ClickBench Q18 is the motivating example (issue body has the full numbers):
| Config | Q18 elapsed |
|---|---|
| Default `ratio_threshold = 0.8` | 2.72 s |
| `ratio_threshold = 0.6, probe_rows = 5000` | 1.57 s (**1.73× faster**) |
Measured ratio is 0.565 — well below 0.8 — so partial aggregation keeps
running and burns 17 s of compute across 12 partitions for only ~40 %
reduction. Lowering the global ratio threshold to 0.6 fixes Q18 but is likely
to regress lower-cost queries that currently benefit from partial agg at that
ratio.
This PR adds a **second, opt-in skip rule** based on the measured per-row
wall time, leaving the existing rule untouched.
## What changes are included in this PR?
**Three new config options** (all under `datafusion.execution`):
- `skip_partial_aggregation_use_cost_model` (bool, default **false**) —
turns the cost-aware rule on. Off by default → no behaviour change for existing
users.
- `skip_partial_aggregation_cost_ns_per_row` (u64, default 1000) — per-row
wall-time floor (ns) above which the new rule fires.
- `skip_partial_aggregation_cost_min_ratio` (f64, default 0.3) — below this
ratio partial agg is kept regardless of per-row cost (it's reducing too much to
be worth skipping).
**`SkipAggregationProbe` extended**
(`datafusion/physical-plan/src/aggregates/row_hash.rs`):
- Snapshots `baseline_metrics.elapsed_compute` at construction (cheap atomic
load).
- When the probe-rows window closes, computes `ns_per_row =
(elapsed_compute_now - snapshot) / input_rows`.
- Decision tree at the probe window:
1. **Rule 1** (existing): `ratio ≥ probe_ratio_threshold` → skip.
2. **Rule 2** (new, opt-in): `use_cost_model && ratio ≥ cost_min_ratio &&
ns_per_row > cost_ns_per_row_threshold` → skip.
- Neither rule fires → leave `is_locked = false`, re-evaluate on next batch
(existing behaviour).
The cost model leans on the fact that `elapsed_compute` already brackets
exactly the timed partial-aggregation work in this loop, so we don't add any
`Instant::now()` calls in the hot path — the timing-overhead budget (≤ 1 %) is
met automatically.
## Are these changes tested?
Five new `SkipAggregationProbe` unit tests in `row_hash.rs`:
- `skip_probe_cost_model_off_matches_legacy_ratio_check` — backwards-compat
guard.
- `skip_probe_cost_model_fires_in_medium_ratio_high_cost_band` — the Q18
case.
- `skip_probe_cost_model_does_not_fire_below_min_ratio` — high-reduction
guard.
- `skip_probe_cost_model_does_not_fire_when_cheap_per_row` — keeps cheap
aggs.
- `skip_probe_cost_model_still_honours_high_ratio_rule` — Rule 1 still wins.
Existing 100 aggregate tests + 8 aggregate SLT files still pass. `cargo
clippy -p datafusion-physical-plan --all-targets -- -D warnings` clean.
## Are there any user-facing changes?
Three additive config options. Default behaviour is **unchanged** until a
user opts in by setting `skip_partial_aggregation_use_cost_model = true`.
## Status: draft
Drafting for early review on the design / constants. The two numeric
defaults (`cost_ns_per_row = 1000`, `cost_min_ratio = 0.3`) are educated
guesses; **needs ClickBench-wide validation** to tune them and confirm no
regression on other queries before unflagging. Happy to wire that into the PR
once reviewers weigh in on the shape.
## Followups (not in this PR)
- **Phase 2**: periodic re-probe so a query whose distribution shifts
mid-stream can switch back (hash table state migration is the hard part —
single-direction skip is keep-as-is here).
- **Phase 3**: unify with the parquet adaptive-scan direction (#22450 /
arrow-rs#9968) under a shared `RuntimeStatsObserver` / decision framework.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]