Kontinuation commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2659161217

   I have another interesting observation: a spilling sort can be faster than 
a memory-unbounded sort in DataFusion.
   
   I ran sort-tpch Q3 using this PR with 
https://github.com/apache/datafusion/pull/14642 cherry-picked onto it, and 
set `parquet.schema_force_view_types = false` to mitigate 
https://github.com/apache/datafusion/issues/12136#issuecomment-2656400964. Here 
are the results, measured on a cloud instance with an 
`Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz`:
   
   ```
   $ ./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 1000M -q 3 -n1
   Q3 iteration 0 took 93339.0 ms and returned 59986052 rows
   Q3 avg time: 93339.00 ms
   $ ./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 500M -q 3 -n1
   Q3 iteration 0 took 81831.2 ms and returned 59986052 rows
   Q3 avg time: 81831.18 ms
   $ ./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 200M -q 3 -n1
   Q3 iteration 0 took 77046.4 ms and returned 59986052 rows
   Q3 avg time: 77046.36 ms
   $ ./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 -q 3 -n1
   Q3 iteration 0 took 170416.1 ms and returned 59986052 rows
   Q3 avg time: 170416.10 ms
   ```
   
   When running without a memory limit, we merge a huge number of small sorted 
streams in a single pass, which seems to be bad for performance (about 170 s 
versus 77 s with a 200M limit). A memory limit forces us to merge before all the 
batches have been ingested, so we do several smaller merges first and then one 
final merge to produce the result set. Coalescing batches into larger sorted 
streams before merging seems like a good idea.
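
   To make the two merge shapes concrete, here is a minimal, self-contained 
sketch in plain Rust. It is not DataFusion's implementation: `kway_merge`, 
`cascaded_merge`, and the `fan_in` parameter are hypothetical names chosen for 
illustration, and the runs are plain `Vec<i64>` rather than record batches. 
`kway_merge` corresponds to merging every sorted stream at once, while 
`cascaded_merge` does smaller merges first and a final merge at the end, which 
is roughly what the memory limit forces.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted runs into one sorted run with a single k-way merge.
/// This models the "merge every small stream at once" case.
fn kway_merge(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = runs.iter().map(|r| r.len()).sum();
    let mut out = Vec::with_capacity(total);

    // Min-heap of (value, run index, offset within that run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }

    // Pop the smallest head, then refill the heap from the same run.
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}

/// Hypothetical cascaded merge: merge at most `fan_in` runs at a time and
/// repeat until a single run is left. This models "several smaller merges
/// first, then a final merge".
fn cascaded_merge(mut runs: Vec<Vec<i64>>, fan_in: usize) -> Vec<i64> {
    assert!(fan_in >= 2, "fan_in must be at least 2");
    while runs.len() > 1 {
        let mut next = Vec::with_capacity((runs.len() + fan_in - 1) / fan_in);
        let mut iter = runs.into_iter().peekable();
        while iter.peek().is_some() {
            // Take the next group of up to `fan_in` runs and merge them.
            let group: Vec<Vec<i64>> = iter.by_ref().take(fan_in).collect();
            next.push(kway_merge(group));
        }
        runs = next;
    }
    runs.pop().unwrap_or_default()
}
```

   Both functions return the same sorted output; they differ only in how many 
streams take part in each merge. The sketch shows the structure only and makes 
no claim about where the time goes in the real sort.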


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

