Re: [I] Pluggable expression-level statistics estimation (ExpressionAnalyzer) [datafusion]

via GitHub Tue, 24 Mar 2026 11:37:00 -0700


asolimando commented on issue #21120:
URL: https://github.com/apache/datafusion/issues/21120#issuecomment-4120538366


   > This work looks really exciting, thank you for making it happen!
   > 
   > One future challenge I see is that the planned work follows a bottom-up 
approach: we first build the infrastructure, then evolve the optimizer. The 
issue is that some low-level design decisions (e.g., using algorithm X to 
estimate NDV for expression Y) can be difficult to reason about, unless the 
reviewer is a CBO expert.
   > 
   > Perhaps we could also provide a top-down write-up, starting from the end 
goal (making certain workloads faster) and working step by step down to the 
local algorithm choices. A TLDR with references would likely help.
   
   Thanks for your reply @2010YOUY01. I agree that this proposal could use a 
document explaining the end goal and sketching how this improvement would lead 
us there; I will try to work on that in parallel.
   
   In general, work on statistics today faces a chicken-and-egg problem: we 
don't fully leverage statistics in planning because propagation is incomplete, 
and we can't prove with existing benchmarks that changes help because the 
statistics they produce are not consumed for planning in most cases. That's 
where https://github.com/apache/datafusion/pull/20292 could help while we 
improve the current situation.
   
   This proposal is, however, a little different from the other efforts tracked 
by existing epics like https://github.com/apache/datafusion/issues/8227 and 
https://github.com/apache/datafusion/issues/20766: it aims to introduce 
tooling that enables an override mechanism for downstream projects (with just a 
reasonable default). This links to @paleolimbot's 
[interest](https://github.com/apache/datafusion/discussions/21017#discussioncomment-16228504)
 (stats propagation for a specific type of stats) and your work on 
https://github.com/apache/datafusion/pull/19609: since people use statistics in 
very different ways, we need to provide an override mechanism.
   
   To make things a little more concrete, I could take a few examples from our 
existing benchmarks (probably TPC-DS) and show how planning could be improved 
with better statistics and propagation, as motivating examples.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


