friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2794878604
f906af55df1b9c9bbfa15513da2e8ca17580e783 is a proof of concept that selectively retries on a format error. _Benchmark results are below!_

## Quick overview

`to_char_scalar` is fairly simple: we deal with 1 format string for N `Date32`s, so the first format error triggers a retry that treats the `Date32` array as a `Date64` array.

`to_char_array` is trickier, since we deal with N format strings for N `Date32`s. _When_ a format error occurs at the ith `Date32`, we retry that specific date as a `Date64`. Rather than recast the entire input `Date32` array as a `Date64` array, we slice into the array and retrieve only the faulty `Date32`.

Another method I considered was casting the input `Date32` array at _most_ once: on the first format error, recast the entire input array as a `Date64` array, so that subsequent format strings with time specifiers format without error. I don't really like this, because I'd like to cast from `Date32` to `Date64` very sparingly. Also, if we have 1000 `Date32`s and 1000 format strings, and it just happens that only the last format string contains time specifiers, we'd pay for casting the entire input array to `Date64` just to format the last element.

## Benchmarks

### Selective retry vs. main

The selective retry doesn't disturb the existing code path. Benchmarked against main, we see little to no difference.

```sh
# The selective retry code was benched first. This is the benchmark result when running on main.
to_char_array_date_only_patterns_1000
                        time:   [136.87 µs 137.13 µs 137.49 µs]
                        change: [-0.9784% -0.7511% -0.4721%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  7 (7.00%) high severe

to_char_scalar_date_only_pattern_1000
                        time:   [101.26 µs 104.97 µs 108.40 µs]
                        change: [-1.5505% +2.7529% +7.4144%] (p = 0.21 > 0.05)
                        No change in performance detected.
```

### Eager cast vs. selective retry

Since selective retry doesn't disturb the existing paths, the date-only cases run much faster than with the eager cast approach (about 21% faster for `to_char_array_date_only_patterns_1000`). Selective retry is also about 3% faster than eager cast on `to_char_scalar_1000`.

The tradeoff is that the new functionality we want, formatting with time specifiers in `to_char_array`, is slower than with the eager cast approach: `to_char_array_datetime_patterns_1000` regresses by about 113% and `to_char_array_mixed_patterns_1000` by about 67%.

```sh
# The eager cast code was benched first. This is the benchmark result when running the selective retry code.
to_char_array_date_only_patterns_1000
                        time:   [137.99 µs 138.20 µs 138.44 µs]
                        change: [-21.564% -20.959% -20.538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

to_char_array_datetime_patterns_1000
                        time:   [503.48 µs 506.55 µs 509.87 µs]
                        change: [+110.95% +112.69% +114.12%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

to_char_array_mixed_patterns_1000
                        time:   [323.05 µs 325.66 µs 328.11 µs]
                        change: [+65.744% +66.705% +67.701%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

to_char_scalar_date_only_pattern_1000
                        time:   [96.153 µs 100.02 µs 103.88 µs]
                        change: [-12.058% -8.2654% -4.3595%] (p = 0.00 < 0.05)
                        Performance has improved.

to_char_scalar_datetime_pattern_1000
                        time:   [177.38 µs 186.04 µs 194.64 µs]
                        change: [-11.383% -7.3528% -3.2635%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  14 (14.00%) low mild
  6 (6.00%) high mild

to_char_scalar_1000     time:   [312.03 ns 316.56 ns 320.91 ns]
                        change: [-4.3628% -3.1215% -1.7750%] (p = 0.00 < 0.05)
                        Performance has improved.
```

If you want to run these benchmarks yourself, you can run:

```sh
# use this to compare the alternate approaches (eager cast vs. selective retry)
cargo bench --package datafusion-functions "to_char"
```

```sh
# use this to compare approaches to main
cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000"
```
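
To make the per-element selective retry from the quick overview concrete, here is a minimal std-only sketch. It is not the code in the commit. Everything in it is a hypothetical stand-in: a plain `i32` of days since the epoch plays the role of a `Date32`, an `i64` of milliseconds plays the role of a `Date64`, and `format_date32`, `format_date64`, `to_char_selective`, and `FormatError` are made-up names. The toy formatter only understands `%Y %m %d %H %M %S`.

```rust
// Hypothetical stand-ins: a "Date32" is days since the Unix epoch (i32),
// a "Date64" is milliseconds since the Unix epoch (i64).
const MS_PER_DAY: i64 = 86_400_000;

#[derive(Debug, PartialEq)]
struct FormatError;

/// Days since 1970-01-01 -> (year, month, day), proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // March-based month, [0, 11]
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Toy formatter. Date specifiers always work; time specifiers work only
/// when a time of day is available. Any other specifier is an error.
fn format_with(days: i64, ms_of_day: Option<i64>, fmt: &str) -> Result<String, FormatError> {
    let (y, m, d) = civil_from_days(days);
    let mut out = String::new();
    let mut chars = fmt.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match (chars.next(), ms_of_day) {
            (Some('Y'), _) => out.push_str(&format!("{:04}", y)),
            (Some('m'), _) => out.push_str(&format!("{:02}", m)),
            (Some('d'), _) => out.push_str(&format!("{:02}", d)),
            (Some('H'), Some(ms)) => out.push_str(&format!("{:02}", ms / 3_600_000)),
            (Some('M'), Some(ms)) => out.push_str(&format!("{:02}", ms / 60_000 % 60)),
            (Some('S'), Some(ms)) => out.push_str(&format!("{:02}", ms / 1_000 % 60)),
            _ => return Err(FormatError),
        }
    }
    Ok(out)
}

/// Date-only path: fails as soon as the format asks for a time component.
fn format_date32(days: i32, fmt: &str) -> Result<String, FormatError> {
    format_with(days as i64, None, fmt)
}

/// Date-and-time path: accepts both date and time specifiers.
fn format_date64(ms: i64, fmt: &str) -> Result<String, FormatError> {
    format_with(ms.div_euclid(MS_PER_DAY), Some(ms.rem_euclid(MS_PER_DAY)), fmt)
}

/// Selective retry: format each element on the date-only path first, and
/// convert only the element that failed to milliseconds before retrying.
fn to_char_selective(dates: &[i32], formats: &[&str]) -> Vec<Result<String, FormatError>> {
    dates
        .iter()
        .zip(formats)
        .map(|(&days, &fmt)| {
            format_date32(days, fmt)
                .or_else(|_| format_date64(days as i64 * MS_PER_DAY, fmt))
        })
        .collect()
}
```

Calling `to_char_selective(&[0, 18_993], &["%Y-%m-%d", "%Y-%m-%d %H:%M:%S"])` formats the first element on the date-only path and converts only the second one. That is the point of the approach: inputs that never hit a time specifier never pay for a conversion, while elements that do hit one pay for a failed first attempt plus the retry.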