Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

via GitHub Thu, 10 Apr 2025 12:04:55 -0700


friendlymatthew commented on PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2794878604


   f906af55df1b9c9bbfa15513da2e8ca17580e783 is a proof of concept that selectively retries on a format err. _There are benchmark results below!_
   
   ## Quick overview 
   
   `to_char_scalar` is quite trivial: we deal with 1 format string for N `Date32`s, so the first format err triggers a single retry that treats the `Date32` array as a `Date64` array.
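
A minimal, std-only sketch of the scalar retry, not the code in the commit. `Date32` is modelled as days since the Unix epoch and `Date64` as milliseconds since the epoch; `format_parts`, `format_date32`, `format_date64`, and the six supported specifiers are assumptions made up for illustration, and the real formatter is not shown here.

```rust
// Stand-ins for Arrow's types: Date32 = days since the Unix epoch,
// Date64 = milliseconds since the Unix epoch.
const MS_PER_DAY: i64 = 86_400_000;

/// Days since 1970-01-01 -> (year, month, day), proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // March-based day of year
    let mp = (5 * doy + 2) / 153;
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Toy formatter supporting only %Y %m %d %H %M %S (hypothetical helper).
/// `ms_of_day` is None for a date-only value, so time specifiers are an error.
fn format_parts(days: i64, ms_of_day: Option<i64>, fmt: &str) -> Result<String, String> {
    let (y, m, d) = civil_from_days(days);
    let mut out = String::new();
    let mut chars = fmt.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        let spec = chars.next().ok_or("dangling '%'")?;
        match (spec, ms_of_day) {
            ('Y', _) => out.push_str(&format!("{y:04}")),
            ('m', _) => out.push_str(&format!("{m:02}")),
            ('d', _) => out.push_str(&format!("{d:02}")),
            ('H', Some(ms)) => out.push_str(&format!("{:02}", ms / 3_600_000)),
            ('M', Some(ms)) => out.push_str(&format!("{:02}", ms / 60_000 % 60)),
            ('S', Some(ms)) => out.push_str(&format!("{:02}", ms / 1_000 % 60)),
            ('H' | 'M' | 'S', None) => return Err(format!("%{spec} needs a time of day")),
            _ => return Err(format!("unsupported specifier %{spec}")),
        }
    }
    Ok(out)
}

fn format_date32(days: i32, fmt: &str) -> Result<String, String> {
    format_parts(i64::from(days), None, fmt)
}

fn format_date64(ms: i64, fmt: &str) -> Result<String, String> {
    format_parts(ms.div_euclid(MS_PER_DAY), Some(ms.rem_euclid(MS_PER_DAY)), fmt)
}

/// Scalar path: one format string for every date. The first format error
/// stops the Date32 pass and triggers one retry with every date as Date64.
fn to_char_scalar(dates: &[i32], fmt: &str) -> Result<Vec<String>, String> {
    dates
        .iter()
        .map(|&d| format_date32(d, fmt))
        .collect::<Result<Vec<_>, _>>()
        .or_else(|_| {
            dates
                .iter()
                .map(|&d| format_date64(i64::from(d) * MS_PER_DAY, fmt))
                .collect::<Result<Vec<_>, _>>()
        })
}
```

With this sketch, `to_char_scalar(&[20188], "%Y-%m-%d %H:%M:%S")` fails on the `Date32` pass and succeeds on the retry, while a date-only pattern never reaches the retry.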
   
   `to_char_array` is trickier since we deal with N format strings for N `Date32`s. _When_ a format error occurs at the i-th `Date32`, we retry only that specific date as a `Date64`. Rather than recast the entire input `Date32` array as a `Date64` array, we slice into the array and cast only the faulty `Date32`.
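
The per-element retry can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: plain slices stand in for Arrow arrays, and the hypothetical `format_date32` / `format_date64` helpers only tag which path handled a value instead of doing real formatting.

```rust
const MS_PER_DAY: i64 = 86_400_000;

/// Hypothetical check: does the pattern ask for a time of day?
fn has_time_specifier(fmt: &str) -> bool {
    ["%H", "%M", "%S"].iter().any(|spec| fmt.contains(*spec))
}

/// Toy Date32 formatter: fails on time specifiers, otherwise tags its output
/// so the chosen path is visible.
fn format_date32(days: i32, fmt: &str) -> Result<String, String> {
    if has_time_specifier(fmt) {
        return Err(format!("{fmt:?} has a time specifier"));
    }
    Ok(format!("date32[{days}] {fmt}"))
}

/// Toy Date64 formatter: accepts every pattern.
fn format_date64(ms: i64, fmt: &str) -> String {
    format!("date64[{ms}] {fmt}")
}

/// Array path: one format string per date. Only the element whose pattern
/// fails is widened to Date64. Returns the outputs and the number of casts.
fn to_char_array(dates: &[i32], fmts: &[&str]) -> (Vec<String>, usize) {
    let mut casts = 0usize;
    let out: Vec<String> = dates
        .iter()
        .zip(fmts)
        .map(|(&days, fmt)| {
            format_date32(days, fmt).unwrap_or_else(|_| {
                casts += 1; // widen just this one element
                format_date64(i64::from(days) * MS_PER_DAY, fmt)
            })
        })
        .collect();
    (out, casts)
}
```

Calling `to_char_array(&[1, 2, 3], &["%Y", "%Y %H", "%d"])` widens only the middle element; the other two stay on the `Date32` path.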
   
   Another method I considered was casting the input `Date32` array at _most_ once: on the first format err, we'd recast the entire input array as a `Date64` array so that subsequent format strings with time specifiers format without err.
   
   I don't really like this, because I'd like to cast from `Date32` to `Date64` very sparingly. Also, if we have 1000 `Date32`s and 1000 format strings, and it just happens that only the last format string contains time specifiers, we'd endure the pain of casting the entire input array as a `Date64` array just to format the last element.
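
To make that worst case concrete, here is a rough cost model that counts how many elements each strategy widens. It counts casts only, not run time, and `has_time_specifier`, `casts_selective`, and `casts_whole_array_once` are hypothetical names introduced for this sketch.

```rust
/// Hypothetical check: does the pattern ask for a time of day?
fn has_time_specifier(fmt: &str) -> bool {
    ["%H", "%M", "%S"].iter().any(|spec| fmt.contains(*spec))
}

/// Elements widened by selective retry: one per failing pattern.
fn casts_selective(fmts: &[&str]) -> usize {
    fmts.iter().filter(|f| has_time_specifier(f)).count()
}

/// Elements widened by the cast-at-most-once alternative: the whole array
/// as soon as any pattern fails, otherwise none.
fn casts_whole_array_once(fmts: &[&str]) -> usize {
    if fmts.iter().any(|f| has_time_specifier(f)) {
        fmts.len()
    } else {
        0
    }
}
```

For 1000 patterns where only the last one has time specifiers, `casts_selective` gives 1 and `casts_whole_array_once` gives 1000. The model ignores the fixed overhead of each individual retry, which matters once many patterns contain time specifiers.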
   
   
   ## Benchmarks
   ### Selective retry vs. main
   
   Selective retry doesn't disturb the existing date-only code path. Benchmarked against main, we see little to no difference.
   
   ```sh
   # The selective retry code was benched first. This is the benchmark result when running on main.
   
   to_char_array_date_only_patterns_1000
                           time:   [136.87 µs 137.13 µs 137.49 µs]
                           change: [-0.9784% -0.7511% -0.4721%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     7 (7.00%) high severe
   
   to_char_scalar_date_only_pattern_1000
                           time:   [101.26 µs 104.97 µs 108.40 µs]
                           change: [-1.5505% +2.7529% +7.4144%] (p = 0.21 > 0.05)
                           No change in performance detected.
   
   ```
   
   ### Eager cast vs. selective retry
   
   Since selective retry doesn't disturb the existing paths, the date-only cases run much faster than with the eager cast approach (about 21% for the array case and about 8% for the scalar case).
   
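   Plus, the selective retry approach is about 3% faster than the eager cast approach on `to_char_scalar_1000`.
   
   The tradeoff is that the new feature we want, formatting with time specifiers in the array path, suffers and is slower than with the eager cast approach: the datetime patterns regress by about 113% and the mixed patterns by about 67%.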
   
   ```sh
   # The eager cast code was benched first. This is the benchmark result when running the selective retry code.
   
   to_char_array_date_only_patterns_1000
                           time:   [137.99 µs 138.20 µs 138.44 µs]
                           change: [-21.564% -20.959% -20.538%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   to_char_array_datetime_patterns_1000
                           time:   [503.48 µs 506.55 µs 509.87 µs]
                           change: [+110.95% +112.69% +114.12%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     5 (5.00%) high mild
   
   to_char_array_mixed_patterns_1000
                           time:   [323.05 µs 325.66 µs 328.11 µs]
                           change: [+65.744% +66.705% +67.701%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   to_char_scalar_date_only_pattern_1000
                           time:   [96.153 µs 100.02 µs 103.88 µs]
                           change: [-12.058% -8.2654% -4.3595%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   to_char_scalar_datetime_pattern_1000
                           time:   [177.38 µs 186.04 µs 194.64 µs]
                           change: [-11.383% -7.3528% -3.2635%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 20 outliers among 100 measurements (20.00%)
     14 (14.00%) low mild
     6 (6.00%) high mild
   
   to_char_scalar_1000     time:   [312.03 ns 316.56 ns 320.91 ns]
                           change: [-4.3628% -3.1215% -1.7750%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   
   
   If you want to run these benchmarks yourself, you can run:
   
   ```sh
   # use this to compare the alternate approaches (eager cast vs. selective retry)
   cargo bench --package datafusion-functions "to_char"
   ```
   
   ```sh
   # use this to compare approaches to main
   cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000"
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

