2010YOUY01 commented on PR #19609:
URL: https://github.com/apache/datafusion/pull/19609#issuecomment-4701511864

   > Apologies for the slightly ignorant comment, I am wandering in from the 
outside, but has any consideration been given to how this will perform for very 
wide/generated plans of the kind you might see in ML pipelines, etc? It might 
be appreciated for such cases to have this behaviour be controllable via a 
config toggle for users.
   
   In practice, partitioned storage is usually clustered by a small number of 
columns: often a single sort/order column, or sometimes 2–3 Z-order columns.
   
   So for complex-expression pruning, we only want to evaluate expressions that 
reference those clustering columns. Evaluating expressions involving unrelated 
columns on very wide tables could be wasteful, since most of the time those 
columns are unlikely to contribute useful pruning power.
   
   For that reason, making this behavior configurable is definitely desirable, 
and it is also doable.
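
   To make the idea concrete, here is a minimal sketch of what such a filter might look like. It is not code from this PR and does not use DataFusion's actual types: `Predicate`, `PruningOptions`, `restrict_to_clustering_columns`, and `predicates_for_pruning` are all hypothetical names made up for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a filter expression: a label plus the
/// names of the columns it reads. Not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct Predicate {
    pub label: String,
    pub columns: Vec<String>,
}

/// Hypothetical config toggle; not an existing DataFusion option.
#[derive(Debug, Clone, Copy)]
pub struct PruningOptions {
    /// When true, only predicates whose columns are all clustering
    /// columns are considered for pruning.
    pub restrict_to_clustering_columns: bool,
}

/// Select the predicates worth evaluating for pruning.
///
/// With the toggle off, every predicate is returned unchanged. With it
/// on, a predicate is kept only if every column it references is a
/// clustering column (a predicate with no columns is kept as well).
pub fn predicates_for_pruning<'a>(
    predicates: &'a [Predicate],
    clustering_columns: &HashSet<&str>,
    options: PruningOptions,
) -> Vec<&'a Predicate> {
    predicates
        .iter()
        .filter(|p| {
            !options.restrict_to_clustering_columns
                || p.columns
                    .iter()
                    .all(|c| clustering_columns.contains(c.as_str()))
        })
        .collect()
}
```

   With a single clustering column `ts`, a predicate over `ts` alone would be kept, while one that also reads an unrelated feature column would be skipped whenever the toggle is on.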
   
   This is just my guess about the concerns around ML workloads and very wide 
tables, so I might not have understood them correctly. I'd be interested to 
learn more if that's the case!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

