adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708209559
Just throwing ideas at the wall in case it helps. I feel like the fundamental problem (and I may be wrong about this) is that filter pushdown has a rather large I/O and CPU efficiency cost: instead of doing one I/O operation and then evaluating the whole filter at once, we evaluate one filter at a time, triggering multiple I/O operations, doing less CPU work per pass, reading duplicate data in projections, and having to deal with all of the selection masks (which are apparently not cheap). This is all worth it when the filters are *very* selective, but when they are not, it's just not worth it.

So far I think (could be wrong about this) most efforts have focused on (1) making filter pushdown more efficient to evaluate and (2) caching to minimize duplicate I/O. Has there been any attempt to keep track of filter selectivity and use that to our advantage? For example, we could track the selectivity of each filter and use it to:

- Bump non-selective filters, or filters with pathological selection masks, back into the scan phase
- Re-order filters to optimize for total scan size rather than the scan size of each individual filter (large but selective > small but non-selective); there's a rough sketch of both ideas below

Sorry if this has been discussed before.
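To make the idea a bit more concrete, here's a minimal sketch in Rust. To be clear, none of this is existing DataFusion API: `SelectivityStats`, `TrackedFilter`, `io_rank`, and the threshold are all made-up names for illustration, and the per-row cost would have to come from somewhere real (column widths from file metadata, statistics, etc.). It just shows the shape of tracking observed selectivity per filter and using it both to bump non-selective filters back to the scan phase and to rank the rest:

```rust
/// Hypothetical sketch, not DataFusion API: running per-filter selectivity stats.
#[derive(Debug, Default)]
struct SelectivityStats {
    rows_seen: u64,
    rows_kept: u64,
}

impl SelectivityStats {
    /// Fold in the observed counts from one evaluated batch.
    fn record_batch(&mut self, seen: u64, kept: u64) {
        self.rows_seen += seen;
        self.rows_kept += kept;
    }

    /// Fraction of rows that survive this filter; optimistic 1.0 until we have data.
    fn selectivity(&self) -> f64 {
        if self.rows_seen == 0 {
            1.0
        } else {
            self.rows_kept as f64 / self.rows_seen as f64
        }
    }
}

struct TrackedFilter {
    name: &'static str,
    stats: SelectivityStats,
    /// Approximate bytes read per row to evaluate this filter
    /// (e.g. width of the columns it references); assumed known.
    bytes_per_row: f64,
}

impl TrackedFilter {
    /// Rows eliminated per byte read: the classic predicate-ordering rank.
    /// Higher is better (selective and cheap beats non-selective and wide).
    fn io_rank(&self) -> f64 {
        (1.0 - self.stats.selectivity()) / self.bytes_per_row.max(1.0)
    }
}

/// Split filters into "keep pushed down" and "evaluate in the scan phase",
/// then order the pushed-down ones by descending rank so the combination
/// aims at total scanned bytes rather than each filter's own scan size.
fn plan_filters(
    filters: Vec<TrackedFilter>,
    selectivity_threshold: f64,
) -> (Vec<TrackedFilter>, Vec<TrackedFilter>) {
    let (mut pushed, scan_phase): (Vec<_>, Vec<_>) = filters
        .into_iter()
        .partition(|f| f.stats.selectivity() < selectivity_threshold);
    pushed.sort_by(|a, b| b.io_rank().total_cmp(&a.io_rank()));
    (pushed, scan_phase)
}

fn main() {
    // A wide column but a very selective predicate: keeps 1% of rows.
    let mut large_selective = TrackedFilter {
        name: "url LIKE '%spam%'",
        stats: SelectivityStats::default(),
        bytes_per_row: 64.0,
    };
    large_selective.stats.record_batch(10_000, 100);

    // A narrow column but a non-selective predicate: keeps 95% of rows.
    let mut small_nonselective = TrackedFilter {
        name: "status = 200",
        stats: SelectivityStats::default(),
        bytes_per_row: 2.0,
    };
    small_nonselective.stats.record_batch(10_000, 9_500);

    let (pushed, scan) = plan_filters(vec![large_selective, small_nonselective], 0.5);
    for f in &pushed {
        println!("pushed down: {} (selectivity {:.2})", f.name, f.stats.selectivity());
    }
    for f in &scan {
        println!("scan phase:  {} (selectivity {:.2})", f.name, f.stats.selectivity());
    }
}
```

The `io_rank` heuristic here is just the textbook "rows eliminated per unit cost" rank; truly optimizing total scan size would probably need something smarter that accounts for which columns the remaining filters and the projection share, but the adaptive feedback loop (observe selectivity, then re-plan) is the part I'm curious about.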
