2010YOUY01 commented on PR #19609: URL: https://github.com/apache/datafusion/pull/19609#issuecomment-4701511864
> Apologies for the slightly ignorant comment, I am wandering in from the outside, but has any consideration been given to how this will perform for very wide/generated plans of the kind you might see in ML pipelines, etc? It might be appreciated for such cases to have this behaviour be controllable via a config toggle for users.

In practice, partitioned storage is usually clustered by a small number of columns: often a single sort/order column, or sometimes 2–3 Z-order columns. So for complex-expression pruning, we only want to evaluate expressions that involve those clustering columns. Evaluating expressions that involve unrelated columns on very wide tables could be wasteful, since most of the time those columns are unlikely to contribute useful pruning power.

For that reason, making this behavior configurable is definitely desirable, and it is also doable.

This is just my guess at the concerns around ML workloads and very wide tables, so I might not have understood them correctly. I'd be interested to learn more if that's the case!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
