asolimando commented on issue #21120: URL: https://github.com/apache/datafusion/issues/21120#issuecomment-4120538366
> This work looks really exciting, thank you for making it happen!
>
> One future challenge I see is that the planned work follows a bottom-up approach: we first build the infrastructure, then evolve the optimizer. The issue is that some low-level design decisions (e.g., using algorithm X to estimate NDV for expression Y) can be difficult to reason about, unless the reviewer is a CBO expert.
>
> Perhaps we could also provide a top-down write-up, starting from the end goal (making certain workloads faster) and working step by step down to the local algorithm choices. A TLDR with references would likely help.

Thanks for your reply @2010YOUY01. I agree that this proposal could use a document that explains the end goal and sketches how this improvement would lead us there. I will try to work on that in parallel.

In general, work around statistics today is a bit of a chicken-and-egg problem: we don't fully leverage statistics in planning because propagation is incomplete, and we can't prove with existing benchmarks that changes help because the output of those changes is not consumed for planning in most cases. That's where https://github.com/apache/datafusion/pull/20292 could help while we improve the current situation.

This proposal is, however, a little different from the other efforts tracked by existing epics like https://github.com/apache/datafusion/issues/8227 and https://github.com/apache/datafusion/issues/20766: it aims to introduce tooling that enables an override mechanism for downstream projects (with just a reasonable default). This links to @paleolimbot's [interest](https://github.com/apache/datafusion/discussions/21017#discussioncomment-16228504) (stats propagation for a specific type of stats) and your work on https://github.com/apache/datafusion/pull/19609: since people use statistics in very different ways, we need to provide an override mechanism.

To make things a little more concrete, I could take a few examples from our existing benchmarks (probably TPC-DS), and showcase how planning could be improved with better statistics and propagation, like a motivating example. WDYT?
