adriangb commented on issue #3463:
URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708209559

   Just throwing ideas at the wall in case it helps.
   
   I feel like the fundamental problem (and I may be wrong about this) is that filter pushdown carries a rather large I/O and CPU efficiency cost: instead of doing one I/O operation and then evaluating all of the filters at once, we evaluate one filter at a time, which triggers multiple I/O operations, does less CPU work per pass, reads duplicate data for the projection, and has to deal with all of the selection masks (which are apparently not cheap). This is all worth it when the filters are *very* selective, but when they are not selective it simply isn't.
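   
   To make that tradeoff concrete, here is a toy cost model. Every function name, constant, and cost term below is an assumption made up for illustration; it is not DataFusion's actual cost accounting:
   
   ```rust
   /// Cost of reading all needed columns once and evaluating every filter
   /// in a single pass over the decoded batches.
   fn scan_then_filter(n_filter_cols: usize, n_proj_cols: usize, col_cost: f64) -> f64 {
       (n_filter_cols + n_proj_cols) as f64 * col_cost
   }
   
   /// Cost of pushdown / late materialization: one pass per filter (each pass
   /// only touches rows that survived earlier filters), then a masked gather
   /// of the projected columns for the rows that survived everything.
   fn pushdown(
       selectivities: &[f64], // fraction of rows each filter keeps, in eval order
       n_proj_cols: usize,
       col_cost: f64,
       per_pass_overhead: f64, // extra I/O ops and mask bookkeeping per pass
       gather_penalty: f64,    // extra CPU per byte gathered through a mask
   ) -> f64 {
       let mut surviving = 1.0;
       let mut cost = 0.0;
       for s in selectivities {
           cost += surviving * col_cost + per_pass_overhead;
           surviving *= s;
       }
       cost + surviving * n_proj_cols as f64 * col_cost * (1.0 + gather_penalty)
   }
   
   fn main() {
       let (col, pass, gather) = (100.0, 5.0, 0.5);
       // Baseline: read 2 filter columns + 10 projected columns once -> 1200
       println!("scan+filter: {}", scan_then_filter(2, 10, col));
       // Very selective filters: pushdown is far cheaper -> 118.5
       println!("selective:   {}", pushdown(&[0.01, 0.5], 10, col, pass, gather));
       // Non-selective filters: per-pass overhead plus the masked gather make
       // pushdown *more* expensive than just scanning -> 1482.5
       println!("non-select:  {}", pushdown(&[0.9, 0.95], 10, col, pass, gather));
   }
   ```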
   
   So far I think (though I could be wrong about this) most efforts have focused on (1) making pushed-down filters more efficient to evaluate and (2) caching to minimize duplicate I/O.
   
   Have there been any attempts to keep track of filter selectivity and use that to our advantage? For example, we could track the observed selectivity of each filter and use it to (a rough sketch follows the list):
   - Bump non-selective filters, or filters with pathological selection masks, into the scan phase
   - Re-order filters to optimize for total scan size instead of the scan size of each filter (a large but selective filter should beat a small but non-selective one)
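   
   Here is a rough sketch of what that bookkeeping might look like. All the types, fields, and thresholds are hypothetical, not an existing DataFusion API:
   
   ```rust
   /// Running stats for one pushed-down filter.
   #[derive(Clone, Default)]
   struct FilterStats {
       rows_seen: u64,
       rows_kept: u64,
       bytes_per_row: f64, // how much column data the filter must decode per row
   }
   
   impl FilterStats {
       /// Update after evaluating the filter on a batch.
       fn record(&mut self, seen: u64, kept: u64) {
           self.rows_seen += seen;
           self.rows_kept += kept;
       }
   
       /// Observed fraction of rows kept (neutral prior before any data).
       fn selectivity(&self) -> f64 {
           if self.rows_seen == 0 {
               0.5
           } else {
               self.rows_kept as f64 / self.rows_seen as f64
           }
       }
   
       /// Rows eliminated per byte scanned: this is how a "large but selective"
       /// filter can outrank a "small but non-selective" one.
       fn rank(&self) -> f64 {
           (1.0 - self.selectivity()) / self.bytes_per_row.max(1.0)
       }
   }
   
   /// Split filters into (keep pushed down, bump out of pushdown) and order
   /// the pushed-down ones so the most effective filters run first.
   fn replan(filters: &[(usize, FilterStats)], bump_above: f64) -> (Vec<usize>, Vec<usize>) {
       let mut pushed = Vec::new();
       let mut bumped = Vec::new();
       for (id, stats) in filters {
           if stats.selectivity() > bump_above {
               bumped.push(*id); // keeps nearly everything: not worth its own pass
           } else {
               pushed.push((*id, stats.rank()));
           }
       }
       pushed.sort_by(|a, b| b.1.total_cmp(&a.1)); // best rank first
       (pushed.into_iter().map(|(id, _)| id).collect(), bumped)
   }
   ```
   
   The scan would call `record` as it evaluates each filter and periodically call `replan`, so a filter that looked selective on early row groups but stops paying its way gets demoted mid-scan.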
   
   Sorry if this has been discussed before.

