NGA-TRAN opened a new issue, #18259:
URL: https://github.com/apache/datafusion/issues/18259

   This task is part of epic #18249
   
   ## Cost Model
   
   Do we want to pursue a traditional, complex cost model that estimates all 
work in a plan, or take a simpler approach? In practice, even the most detailed 
cost models often prove inaccurate despite significant effort.
   
   Or do we want to adopt a more intuitive approach, similar to the join
ranking strategy? Consider the following examples:
   
   - Is a plan with two merge joins better than one merge join and one hash 
join? How should we assign weights and make comparisons?
   - Is a merge join on a single stream consistently faster than a partitioned 
hash join across multiple streams? How do we evaluate and rank these scenarios?
   - Instead of using exact byte sizes, could we categorize input sizes as 
small, medium, or large and assign weights accordingly?
   
   This could serve as a quick and practical research project: define relevant 
properties, assign weights and criteria, and run simple experiments to compare 
estimated costs against actual runtimes.
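   The coarse, weight-based comparison above could be sketched roughly as
follows. This is a hypothetical illustration, not a proposed design: the
categories, weights, and row-count thresholds are all made up, and the
"plan" is reduced to a flat list of joins:

   ```rust
// Sketch of a coarse cost model: inputs are binned into
// Small/Medium/Large instead of exact byte sizes, and each join kind
// carries a relative weight. All numbers here are placeholders.

#[derive(Clone, Copy)]
enum InputSize {
    Small,
    Medium,
    Large,
}

impl InputSize {
    // Bin a row-count estimate into a coarse category
    // (thresholds are arbitrary, for illustration only).
    fn from_rows(rows: u64) -> Self {
        match rows {
            0..=10_000 => InputSize::Small,
            10_001..=10_000_000 => InputSize::Medium,
            _ => InputSize::Large,
        }
    }

    fn weight(self) -> u32 {
        match self {
            InputSize::Small => 1,
            InputSize::Medium => 4,
            InputSize::Large => 16,
        }
    }
}

#[derive(Clone, Copy)]
enum JoinKind {
    Merge, // merge join on a sorted stream
    Hash,  // hash join, possibly partitioned across streams
}

impl JoinKind {
    fn weight(self) -> u32 {
        match self {
            JoinKind::Merge => 1,
            JoinKind::Hash => 2,
        }
    }
}

// For this sketch a plan is just a list of (join kind, input size)
// pairs; its cost is the sum of join-weight * input-size-weight.
fn plan_cost(joins: &[(JoinKind, InputSize)]) -> u32 {
    joins.iter().map(|(j, s)| j.weight() * s.weight()).sum()
}

fn main() {
    // Two merge joins on medium inputs vs. one merge join plus one
    // hash join on a small input.
    let two_merge = [(JoinKind::Merge, InputSize::Medium); 2];
    let mixed = [
        (JoinKind::Merge, InputSize::Medium),
        (JoinKind::Hash, InputSize::from_rows(5_000)),
    ];
    println!("two merge joins: {}", plan_cost(&two_merge));
    println!("merge + hash:    {}", plan_cost(&mixed));
}
```

   The experiment would then be to run pairs of plans, record actual
runtimes, and check how often the cheaper-by-weight plan is also the
faster one, tuning the weights accordingly.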


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
