Dandandan opened a new pull request, #22351:
URL: https://github.com/apache/datafusion/pull/22351
## Which issue does this PR close?
<!-- No issue yet; happy to file one if preferred. -->
- Closes #.
## Rationale for this change
ClickBench q18 groups by `extract(minute FROM
to_timestamp_seconds("EventTime"))`. With `EventTime` stored as `int64` seconds
in the partitioned ClickBench dataset, current execution first materializes a
`Timestamp(Second, _)` array via `to_timestamp_seconds`, then calls Arrow's
generic `date_part(_, DatePart::Minute)` kernel. That kernel handles all
timestamp time units and arbitrary timezones, so it does more work than is
necessary for the common `Timestamp(Second, UTC)` shape.
For `Timestamp(Second, UTC)` (and the no-timezone case, which the query path
treats as UTC), the minute is just `seconds.div_euclid(60).rem_euclid(60)`. The
fix is a narrow physical fast path that handles this case and otherwise falls
back to the existing Arrow kernel — including non-UTC timezones, where DST etc.
mean the minute can depend on the offset.
Measured on ClickBench q18 (`dfbench clickbench --query 18 --iterations 5`,
partitioned dataset, same machine):
| Build | Iterations ms | Avg ms | Delta |
| --- | --- | ---: | ---: |
| Pre-change release binary | 3249.4, 2618.7, 2070.9, 1979.3, 2029.5 |
2389.57 | baseline |
| Rebuilt with fast path | 1866.9, 2143.6, 2179.5, 1839.5, 1806.6 | 1967.22
| **-17.7%** |
## What changes are included in this PR?
Adds two new internal helpers in
`datafusion-functions/src/datetime/date_part.rs`:
- `date_part_minute(array)`: dispatches to the fast path when applicable,
else calls `date_part(array, DatePart::Minute)`.
- `timestamp_second_utc_minute(array)`: returns `Some(Int32Array)` for
`Timestamp(Second, _)` arrays whose timezone is one of `None`, `"UTC"`,
`"+00:00"`, `"+0000"`, `"+00"`. Uses Arrow's `unary` kernel to preserve the
input null buffer.
`DatePartFunc::invoke_with_args` routes the `IntervalUnit::Minute` branch
through `date_part_minute` instead of calling `date_part` directly. No behavior
change for any other unit or for non-UTC timestamps.
## Are these changes tested?
Two new unit tests in `datafusion-functions::datetime::date_part::tests`:
- `timestamp_second_utc_minute_fast_path_matches_arrow`: compares the fast
path output against `date_part(_, DatePart::Minute)` for a `Timestamp(Second,
UTC)` array covering positive, negative, boundary, and null values.
`div_euclid` / `rem_euclid` semantics are checked across the
seconds-since-epoch sign change.
- `timestamp_second_non_utc_minute_falls_back`: confirms a
`Timestamp(Second, +00:30)` input goes through Arrow and returns the
offset-shifted minute (30, 31, null).
```
cargo test -p datafusion-functions --lib datetime::date_part
cargo clippy -p datafusion-functions --all-targets --all-features -- -D
warnings
```
## Are there any user-facing changes?
No. This is an internal performance optimization for one specific
data-type/timezone combination — for all other inputs, behavior is identical to
before because the fast path falls back to the same Arrow kernel.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]