adriangb commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3715509979
Thanks so much @sdf-jkl, that's super useful info!

The ClickHouse resources seem to be more in line with Parquet row group pruning using statistics, which happens before this process. What we are talking about here is how to apply the filters during the scan itself, which comes after the `PREWHERE` / row group stats step.

One long-term vision for this is that we could "seed" the filter selectivities (instead of assuming they're all unknown). That's basically what you are proposing in `Before seeing your PR and comments in https://github.com/apache/datafusion/issues/3463 I was thinking about using more simple heuristics for sorting predicates.` We discussed that a bit in https://github.com/apache/datafusion/issues/3463#issuecomment-3708382916.

TL;DR: I think yes, using column statistics, sizes, a global cache, etc. would be better than making no assumptions, as this PR currently does, but we can improve that later. My goal for now is that performance is roughly no worse than without filter pushdown when there are no selective filters, and that when there are selective filters we can take advantage of them.
