friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797704916
_Note: I'm sorry about the super long write ups. I'm not trying to bike shed._ I was thinking about https://github.com/apache/datafusion/pull/15361/commits/f906af55df1b9c9bbfa15513da2e8ca17580e783 and realized we could do more. The selective retry approach will extract every failed `Date32` and cast it to a `Date64`. If we have M failed `Date32`'s, we'll still recast M times. It's just that every cast will have a single element in the array. We can use the cast even more sparingly, by keeping track of consecutive failed `Date32`s. Suppose you had a 1000 format attempts, and a contiguous block of those format attempts erred because they contained datetime specifiers: ```sh |----------|xxxxxxxxxxxxx|-------| valid invalid valid ``` For every invalid `Date32`, we'll create an array of size 1 and recast that array to a `Date64` array. https://github.com/apache/datafusion/pull/15361/commits/b379fe1b2927fc5c275670c70721df947a433c13 will keep track of consecutive failed `Date32` values. By batching the failed `Date32`'s, we can call cast _once_. And of course, we preserve the order. This approach also helps with readability and saves LOC, `to_char_inner` gets recursively invoked with the newly casted timestamp array. ## Benchmarks She leaves the existing code paths undisturbed like usual. While performance has slightly regressed, I really like how this approach looks and would prefer to call `cast` sparingly and make use of the batching. ```sh # I ran the initial selective retry approach first. # This run includes the new logic mentioned above. to_char_array_date_only_patterns_1000 time: [136.92 µs 137.15 µs 137.39 µs] change: [-3.1659% -2.3990% -1.6897%] (p = 0.00 < 0.05) Performance has improved. Found 19 outliers among 100 measurements (19.00%) 7 (7.00%) high mild 12 (12.00%) high severe to_char_array_datetime_patterns_1000 time: [543.24 µs 543.84 µs 544.47 µs] change: [+4.4460% +6.0837% +7.4847%] (p = 0.00 < 0.05) Performance has regressed. to_char_array_mixed_patterns_1000 time: [339.68 µs 342.60 µs 345.66 µs] change: [-5.3101% -2.9305% -0.8615%] (p = 0.01 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe to_char_scalar_date_only_pattern_1000 time: [95.107 µs 98.972 µs 102.90 µs] change: [-5.9031% -1.2243% +3.4337%] (p = 0.60 > 0.05) No change in performance detected. to_char_scalar_datetime_pattern_1000 time: [171.09 µs 179.74 µs 188.39 µs] change: [-4.0895% +1.3179% +7.1836%] (p = 0.64 > 0.05) No change in performance detected. to_char_scalar_1000 time: [315.63 ns 319.69 ns 323.50 ns] change: [-7.9240% -6.5685% -5.1341%] (p = 0.00 < 0.05) Performance has improved. ``` cc @Omega359 @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org