adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708360151
> I'm not familiar with how DF is handling this currently, but a selectivity estimate based approach at plan time might be a good place to start.

The answer is: we are not. The only similar thing we do is use the column sizes (from Parquet metadata) to reorder the filters. I don't think we have enough information to do anything useful from statistics (this is probably why we haven't done so yet), but if arrow-rs at least exposed the selectivity of filters after each file is read (ideally each batch?), we could collect runtime filter selectivity statistics, so that as we open more files we can adapt our approach using the options you described above.

A further step would be for arrow-rs to allow us to rebuild/reshuffle our approach within a scan, but that may require more API churn. Adjusting between files should be pretty simple.
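To make the between-files adjustment concrete, here is a minimal sketch of what it could look like. It assumes arrow-rs would report, per filter, how many rows went in and how many passed, which it does not expose today. `FilterStats`, `record`, and `reorder` are hypothetical names for illustration only, not existing DataFusion or arrow-rs APIs.

```rust
/// Hypothetical per-filter bookkeeping kept across the files of one scan.
/// None of these names exist in DataFusion or arrow-rs.
#[derive(Debug, Clone)]
pub struct FilterStats {
    pub name: String,
    /// Size of the columns the filter reads (from Parquet metadata).
    pub column_bytes: u64,
    /// Rows evaluated by this filter so far.
    pub rows_in: u64,
    /// Rows that passed this filter so far.
    pub rows_out: u64,
}

impl FilterStats {
    pub fn new(name: &str, column_bytes: u64) -> Self {
        Self { name: name.to_string(), column_bytes, rows_in: 0, rows_out: 0 }
    }

    /// Record what was observed after a file (or batch) was read.
    /// Assumption: the reader can report these two counts per filter.
    pub fn record(&mut self, rows_in: u64, rows_out: u64) {
        self.rows_in += rows_in;
        self.rows_out += rows_out;
    }

    /// Fraction of rows that passed; `None` until something was observed.
    pub fn selectivity(&self) -> Option<f64> {
        if self.rows_in == 0 {
            None
        } else {
            Some(self.rows_out as f64 / self.rows_in as f64)
        }
    }
}

/// Reorder filters before opening the next file: filters that passed the
/// smallest fraction of rows go first, and column size breaks ties.
/// Filters with no observations are treated as passing every row.
pub fn reorder(filters: &mut [FilterStats]) {
    filters.sort_by(|a, b| {
        let sa = a.selectivity().unwrap_or(1.0);
        let sb = b.selectivity().unwrap_or(1.0);
        sa.total_cmp(&sb).then(a.column_bytes.cmp(&b.column_bytes))
    });
}
```

Before any file has been read, every filter counts as passing all rows, so the order falls back to column size alone, which matches what we do today. A real version would probably also weigh evaluation cost against selectivity rather than sorting on selectivity alone.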
