adriangb commented on PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2763043386

   @alamb I've achieved 2/3 goals:
   - I added wrapping of a `DynamicFilterSource` in a `PhysicalExpr` such that 
it can dynamically update itself to prune rows using filter pushdown _even on a 
single file_. After various experiments I went with 
https://github.com/apache/datafusion/pull/15301/commits/1f2fcd832ce7432e9ecb23d698ebf83823f48405#diff-cb5bce55559bce1aacd37171f2501f660cd564eb83dac28d331e39bef98aa227
 which is a hybrid of explicitly passing around a `DynamicFilterSource` and an 
opaque `PhysicalExpr`: the idea is to explicitly pass around the dynamic filter 
source so that operators can opt into it explicitly and do special handling, 
such as taking a snapshot for serializing across the wire or for 
`PruningPredicate`, but are still able to convert it to a `PhysicalExpr` when 
needed. It was necessary to take quite a bit of care with the `PhysicalExpr` 
wrapper implementation because, for example, filter pushdown remaps column 
indices by rewriting children, so we need to do dynamic rewriting of children. 
But after it was all wired up I think it is pretty nice.
   - I switched the dynamic aspect from polling (you ask `DynamicFilterSource` 
for new filters) to a push model (when the TopK updates it pushes the new 
filters into the shared state). I think this should be more performant, and it 
completely removes locks on the TopK heap. This could be made even more 
efficient if we bring back the `supports_dynamic_filter_pushdown() -> bool` 
method since we can avoid doing some work and setting up references / locks if 
we can know ahead of time if anything will need it or not.
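
   To make the push model concrete, here is a minimal, std-only sketch. It is 
not the code in this PR: `SharedFilter`, `LessThan`, and the toy `TopK` are 
hypothetical names invented for illustration, and a real dynamic filter would 
be a `PhysicalExpr` rather than a single integer threshold. The producer owns 
its heap without any lock and only writes to the shared state when its bound 
tightens; consumers read a snapshot whenever they want to prune.

```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a filter expression: keep rows where value < threshold.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LessThan {
    pub threshold: i64,
}

/// Shared state: written by the producer (the TopK), read by consumers (the scan).
#[derive(Debug, Default)]
pub struct SharedFilter {
    current: RwLock<Option<LessThan>>,
}

impl SharedFilter {
    pub fn new() -> Arc<Self> {
        Arc::new(Self::default())
    }

    /// Push side: the producer calls this when its bound changes.
    pub fn push(&self, filter: LessThan) {
        *self.current.write().unwrap() = Some(filter);
    }

    /// A fixed copy of the filter as of this instant (e.g. for serialization).
    pub fn snapshot(&self) -> Option<LessThan> {
        *self.current.read().unwrap()
    }

    /// Consumer side: should this value be kept given the current filter?
    pub fn keep(&self, value: i64) -> bool {
        match self.snapshot() {
            Some(f) => value < f.threshold,
            None => true, // no filter published yet, so nothing can be pruned
        }
    }
}

/// Toy TopK that keeps the k smallest values. The heap itself is not locked.
pub struct TopK {
    k: usize,
    heap: BinaryHeap<i64>, // max-heap: the top is the largest of the k smallest
    filter: Arc<SharedFilter>,
}

impl TopK {
    pub fn new(k: usize, filter: Arc<SharedFilter>) -> Self {
        Self { k, heap: BinaryHeap::new(), filter }
    }

    pub fn insert(&mut self, value: i64) {
        if self.heap.len() < self.k {
            self.heap.push(value);
        } else {
            match self.heap.peek().copied() {
                // The new value beats the current worst: replace it.
                Some(worst) if value < worst => {
                    self.heap.pop();
                    self.heap.push(value);
                }
                // Heap unchanged, so there is nothing new to publish.
                _ => return,
            }
        }
        // Once the heap is full, publish the bound that new rows must beat.
        if self.heap.len() == self.k {
            if let Some(worst) = self.heap.peek().copied() {
                self.filter.push(LessThan { threshold: worst });
            }
        }
    }
}
```

   In this sketch the only synchronization point is the `RwLock` inside 
`SharedFilter`, and it is touched on the write side only when the heap 
actually changes, which is the property the push model is after.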
   
   I was not able to get the global TopK case working because the work at the 
top level is not actually done by a global TopK: it is done by a 
`SortPreservingMergeExec`. I think in theory we can implement this 
optimization for it, but this PR is large enough as is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

