Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

via GitHub Wed, 02 Jul 2025 02:36:45 -0700


acking-you commented on PR #16647:
URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027160766


   This is the benchmark run with the default, unmodified test data 
(multiple large string columns):
   ```sh
   Benchmarking bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 50.2s.
   bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows
                           time:   [5.0435 s 5.0615 s 5.0813 s]
   Found 3 outliers among 10 measurements (30.00%)
     1 (10.00%) low mild
     2 (20.00%) high severe
   
   Benchmarking bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.6s or enable flat sampling.
   bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows
                           time:   [157.82 ms 160.78 ms 163.05 ms]
   
   ➜  arrow-datafusion git:(main) git checkout reuse_rows
   branch 'reuse_rows' set up to track 'origin/reuse_rows'.
   Switched to a new branch 'reuse_rows'
   ➜  arrow-datafusion git:(reuse_rows) cargo bench  --bench sort_preserving_merge -- --sample-size=10
   Benchmarking bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 51.2s.
   bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows
                           time:   [5.0404 s 5.0613 s 5.0831 s]
                           change: [-0.5635% -0.0039% +0.5493%] (p = 0.99 > 0.05)
                           No change in performance detected.
   
   Benchmarking bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows: Warming up for 3.0000 s
   Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.6s or enable flat sampling.
   bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows
                           time:   [155.99 ms 157.30 ms 159.18 ms]
                           change: [-3.1635% -1.4444% +0.3068%] (p = 0.15 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high mild
   ```
   The improvement on the test data above appears to be minimal: Criterion 
detects no change on either benchmark. I suspect the test strings are so long 
that copying them dominates the run time, making the memory allocation overhead 
negligible in comparison.
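
To make that hypothesis concrete, here is a minimal, std-only sketch of the allocation-reuse pattern. It is not the code from this PR: `encode_batch`, `encode_fresh`, and `encode_reused` are hypothetical helpers, and a plain `Vec<u8>` stands in for the `Rows` buffer. It only illustrates that reuse saves buffer allocations per batch, while the per-byte copy cost stays the same.

```rust
/// Hypothetical stand-in for row encoding: appends each value's bytes to `buf`.
fn encode_batch(batch: &[&str], buf: &mut Vec<u8>) {
    for value in batch {
        buf.extend_from_slice(value.as_bytes());
    }
}

/// Allocates a fresh buffer for every batch (the "before" pattern).
fn encode_fresh(batches: &[Vec<&str>]) -> usize {
    let mut total = 0;
    for batch in batches {
        let mut buf = Vec::new();
        encode_batch(batch, &mut buf);
        total += buf.len();
    }
    total
}

/// Clears and reuses one buffer across batches (the "after" pattern).
/// `Vec::clear` keeps the capacity, so later batches can skip reallocation.
fn encode_reused(batches: &[Vec<&str>]) -> usize {
    let mut buf = Vec::new();
    let mut total = 0;
    for batch in batches {
        buf.clear();
        encode_batch(batch, &mut buf);
        total += buf.len();
    }
    total
}

fn main() {
    // Illustrative data only, unrelated to the benchmark inputs.
    let batches = vec![vec!["aa", "bbb"], vec!["c", "dddd"]];
    // Both variants encode the same bytes; only the allocation pattern differs.
    assert_eq!(encode_fresh(&batches), encode_reused(&batches));
    println!("encoded {} bytes", encode_reused(&batches));
}
```

Both functions copy the same number of bytes, so the longer the strings, the smaller the share of time that the saved allocations can account for.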
   
   So I shortened the test strings, and the results are as follows:
   ```sh
   bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows
                           time:   [757.06 ms 760.87 ms 764.68 ms]
   
   bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows
                           time:   [209.89 ms 210.70 ms 211.52 ms]
   
   ➜  arrow-datafusion git:(main)  git checkout reuse_rows
   
   bench_merge_sorted_preserving/multiple_large_string_columns_with_1m_rows
                           time:   [755.94 ms 758.84 ms 762.58 ms]
                           change: [-0.9202% -0.2676% +0.4455%] (p = 0.47 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high mild
   
   bench_merge_sorted_preserving/multiple_u64_columns_with_1m_rows
                           time:   [209.22 ms 210.43 ms 212.07 ms]
                           change: [-0.8397% -0.1278% +0.7042%] (p = 0.78 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high severe
   ```
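   With shorter strings, the measured change on the large-string benchmark is 
slightly larger than before (-0.27% versus -0.004%), but Criterion still reports 
no statistically significant change (p = 0.47).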
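
For reading the `change:` lines in the logs above, the following is a simplified sketch of how such a verdict can be derived from the confidence interval and p-value. It is an approximation written for illustration, not Criterion's own code; the `Verdict` enum and `verdict` function are hypothetical names, and the thresholds assume a 0.05 significance level (as printed in the logs) and a 1% noise threshold.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// `lower` and `upper` bound the relative change as fractions
/// (for example, -0.9202% is passed as -0.009202).
fn verdict(lower: f64, upper: f64, p_value: f64) -> Verdict {
    const SIGNIFICANCE: f64 = 0.05; // matches "p = ... > 0.05" in the logs
    const NOISE: f64 = 0.01; // assumed noise threshold of 1%

    if p_value >= SIGNIFICANCE {
        // The difference is not statistically significant.
        return Verdict::NoChange;
    }
    if upper < -NOISE {
        Verdict::Improved // whole interval is faster by more than the noise
    } else if lower > NOISE {
        Verdict::Regressed // whole interval is slower by more than the noise
    } else {
        Verdict::WithinNoise
    }
}

fn main() {
    // Values from the short-string run on the reuse_rows branch.
    assert_eq!(verdict(-0.009202, 0.004455, 0.47), Verdict::NoChange);
    assert_eq!(verdict(-0.008397, 0.007042, 0.78), Verdict::NoChange);
    // Hypothetical values, not taken from any run in this thread.
    assert_eq!(verdict(-0.05, -0.02, 0.01), Verdict::Improved);
    println!("ok");
}
```

Under these assumptions, all four comparisons in this comment fall into the "no change" case because each p-value is above 0.05.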


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

