Dandandan opened a new pull request, #22351:
URL: https://github.com/apache/datafusion/pull/22351

   ## Which issue does this PR close?
   
   <!-- No issue yet; happy to file one if preferred. -->
   
   - Closes #.
   
   ## Rationale for this change
   
   ClickBench q18 groups by `extract(minute FROM 
to_timestamp_seconds("EventTime"))`. With `EventTime` stored as `int64` seconds 
in the partitioned ClickBench dataset, current execution first materializes a 
`Timestamp(Second, _)` array via `to_timestamp_seconds`, then calls Arrow's 
generic `date_part(_, DatePart::Minute)` kernel. That kernel handles all 
timestamp time units and arbitrary timezones, so it does more work than is 
necessary for the common `Timestamp(Second, UTC)` shape.
   
   For `Timestamp(Second, UTC)` (and the no-timezone case, which the query path 
treats as UTC), the minute is just `seconds.div_euclid(60).rem_euclid(60)`. The 
fix is a narrow physical fast path that handles this case and otherwise falls 
back to the existing Arrow kernel — including non-UTC timezones, where DST etc. 
mean the minute can depend on the offset.
   
   Measured on ClickBench q18 (`dfbench clickbench --query 18 --iterations 5`, 
partitioned dataset, same machine):
   
   | Build | Iterations ms | Avg ms | Delta |
   | --- | --- | ---: | ---: |
   | Pre-change release binary | 3249.4, 2618.7, 2070.9, 1979.3, 2029.5 | 
2389.57 | baseline |
   | Rebuilt with fast path | 1866.9, 2143.6, 2179.5, 1839.5, 1806.6 | 1967.22 
| **-17.7%** |
   
   ## What changes are included in this PR?
   
   Adds two new internal helpers in 
`datafusion-functions/src/datetime/date_part.rs`:
   
   - `date_part_minute(array)`: dispatches to the fast path when applicable, 
else calls `date_part(array, DatePart::Minute)`.
   - `timestamp_second_utc_minute(array)`: returns `Some(Int32Array)` for 
`Timestamp(Second, _)` arrays whose timezone is one of `None`, `"UTC"`, 
`"+00:00"`, `"+0000"`, `"+00"`. Uses Arrow's `unary` kernel to preserve the 
input null buffer.
   
   `DatePartFunc::invoke_with_args` routes the `IntervalUnit::Minute` branch 
through `date_part_minute` instead of calling `date_part` directly. No behavior 
change for any other unit or for non-UTC timestamps.
   
   ## Are these changes tested?
   
   Two new unit tests in `datafusion-functions::datetime::date_part::tests`:
   
   - `timestamp_second_utc_minute_fast_path_matches_arrow`: compares the fast 
path output against `date_part(_, DatePart::Minute)` for a `Timestamp(Second, 
UTC)` array covering positive, negative, boundary, and null values. 
`div_euclid` / `rem_euclid` semantics are checked across the 
seconds-since-epoch sign change.
   - `timestamp_second_non_utc_minute_falls_back`: confirms a 
`Timestamp(Second, +00:30)` input goes through Arrow and returns the 
offset-shifted minute (30, 31, null).
   
   ```
   cargo test -p datafusion-functions --lib datetime::date_part
   cargo clippy -p datafusion-functions --all-targets --all-features -- -D 
warnings
   ```
   
   ## Are there any user-facing changes?
   
   No. This is an internal performance optimization for one specific 
data-type/timezone combination — for all other inputs, behavior is identical to 
before because the fast path falls back to the same Arrow kernel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to