[I] Benchmark / program to test Spilling Joins [datafusion]

via GitHub Wed, 09 Apr 2025 21:59:32 -0700


alamb opened a new issue, #15664:
URL: https://github.com/apache/datafusion/issues/15664


   ### Is your feature request related to a problem or challenge?
   
   - Part of  https://github.com/apache/datafusion/issues/15271
   
   There are many interesting ideas on how to improve DataFusion while spilling, 
for example https://github.com/apache/datafusion/issues/15271 from @2010YOUY01 
and others.
   
   What I think we really need next to make progress in this area is a 
benchmark / agreed-upon way of measuring our progress so that we can improve and
   
   ### Describe the solution you'd like
   
   I would like a documented command / set of commands that is:
   1. Easy to run (and thus fast to test / iterate on)
   2. Exercises the spilling feature at different levels of memory pressure
   3. Spends most of its time sorting/spilling/merging (not generating output, 
for example)
   
   ### Describe alternatives you've considered
   
   Idea 1: Use some `datafusion-cli` features / flags and document them
   
   Idea 2: Add a new suite to bench.sh / `dfbench`: 
https://github.com/apache/datafusion/tree/main/benchmarks
   
   
   As for what to do, I suggest something relatively simple, like sorting the 
TPCH lineitem table with 200MB, 500MB, 1GB, 5GB, and 10GB of memory, for example.
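
A rough sketch of what Idea 1 could look like for the memory sweep suggested above. It only builds the command lines and does not run them. Everything specific here is an assumption rather than an agreed design: the `data/tpch/lineitem.parquet` path and the `l_orderkey` sort key are placeholders, and the `--memory-limit` / `--command` flags and the `200m`-style size strings should be checked against `datafusion-cli --help` before relying on them.

```python
import shlex

# Memory limits from the suggestion above, written in the unit-suffixed form
# that `datafusion-cli --memory-limit` is ASSUMED to accept.
MEMORY_LIMITS = ["200m", "500m", "1g", "5g", "10g"]

# Hypothetical location of the TPC-H lineitem data; adjust to the local setup.
LINEITEM_PATH = "data/tpch/lineitem.parquet"


def build_commands(table_path=LINEITEM_PATH, limits=MEMORY_LIMITS,
                   sort_key="l_orderkey"):
    """Return one datafusion-cli argv list per memory limit.

    Each command sorts the whole table, so tighter limits should push
    more of the sort to disk.
    """
    sql = f"SELECT * FROM '{table_path}' ORDER BY {sort_key};"
    return [
        ["datafusion-cli", "--memory-limit", limit, "--command", sql]
        for limit in limits
    ]


if __name__ == "__main__":
    # Print shell-ready commands so they can be pasted into documentation.
    for argv in build_commands():
        print(shlex.join(argv))
```

As written, the query still prints the sorted rows, so it does not yet satisfy requirement 3; how best to suppress or discard the output is left open.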
   
   ### Additional context
   
   _No response_


