friendlymatthew commented on PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797704916

   _Note: I'm sorry about the super long write ups. I'm not trying to bike 
shed._
   
   I was thinking about 
https://github.com/apache/datafusion/pull/15361/commits/f906af55df1b9c9bbfa15513da2e8ca17580e783
 and realized we could do more.
   
   The selective retry approach will extract every failed `Date32` and cast it 
to a `Date64`. If we have M failed `Date32`'s, we'll still recast M times. It's 
just that every cast will have a single element in the array. 
   
   We can use the cast even more sparingly, by keeping track of consecutive 
failed `Date32`s. 
   
   Suppose you had a 1000 format attempts, and a contiguous block of those 
format attempts erred because they contained datetime specifiers:
   
   ```sh
   |----------|xxxxxxxxxxxxx|-------|
      valid      invalid       valid
   ```
   
   For every invalid `Date32`, we'll create an array of size 1 and recast that 
array to a `Date64` array. 
   
   
https://github.com/apache/datafusion/pull/15361/commits/b379fe1b2927fc5c275670c70721df947a433c13
 will keep track of  consecutive failed `Date32` values. By batching the failed 
`Date32`'s, we can call cast _once_. And of course, we preserve the order.
   
   This approach also helps with readability and saves LOC, `to_char_inner` 
gets recursively invoked with the newly casted timestamp array.
   
   ## Benchmarks
   
   She leaves the existing code paths undisturbed like usual. While performance 
has slightly regressed, I really like how this approach looks and would prefer 
to call `cast` sparingly and make use of the batching. 
   
   ```sh
   # I ran the initial selective retry approach first. 
   # This run includes the new logic mentioned above.
   
   to_char_array_date_only_patterns_1000
                           time:   [136.92 µs 137.15 µs 137.39 µs]
                           change: [-3.1659% -2.3990% -1.6897%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 19 outliers among 100 measurements (19.00%)
     7 (7.00%) high mild
     12 (12.00%) high severe
   
   to_char_array_datetime_patterns_1000
                           time:   [543.24 µs 543.84 µs 544.47 µs]
                           change: [+4.4460% +6.0837% +7.4847%] (p = 0.00 < 
0.05)
                           Performance has regressed.
   
   to_char_array_mixed_patterns_1000
                           time:   [339.68 µs 342.60 µs 345.66 µs]
                           change: [-5.3101% -2.9305% -0.8615%] (p = 0.01 < 
0.05)
                           Change within noise threshold.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   to_char_scalar_date_only_pattern_1000
                           time:   [95.107 µs 98.972 µs 102.90 µs]
                           change: [-5.9031% -1.2243% +3.4337%] (p = 0.60 > 
0.05)
                           No change in performance detected.
   
   to_char_scalar_datetime_pattern_1000
                           time:   [171.09 µs 179.74 µs 188.39 µs]
                           change: [-4.0895% +1.3179% +7.1836%] (p = 0.64 > 
0.05)
                           No change in performance detected.
   
   to_char_scalar_1000     time:   [315.63 ns 319.69 ns 323.50 ns]
                           change: [-7.9240% -6.5685% -5.1341%] (p = 0.00 < 
0.05)
                           Performance has improved.
   ```
   
   
   cc @Omega359 @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to