zhuqi-lucas commented on PR #22450:
URL: https://github.com/apache/datafusion/pull/22450#issuecomment-4702240019
> Did some quick review but need to find more time to look in details.
>
> I want to double check:
>
> 1. Logic for deciding when we need to go down a different path. I see this
is using `DynamicFilterTracker::classify` but I'm not sure if we should always
or optionally create a RowGroupPruner`, etc.
> 2. Logic for splitting runs. I thought the APIs we added in arrow-rs are
flexible enough to not require splitting runs, they allow pausing a run at row
group boundaries and changing it (e.g. skipping subsequent row groups) but
maybe it doesn't have the right apis...
Thanks @adriangb for the review!
(2) splitting runs: Already gone — 7e1aae26a collapses to a single decoder
paused at RG boundaries, rebuilt via
decoder.into_builder().with_row_groups(remaining).build() to skip pruned RGs.
RowGroupRun /
split_runs are deleted. The arrow-rs API is exactly enough.
(1) when to create RowGroupPruner: Agreed, current gate is too loose.
Static predicates were already fully consumed by prune_by_statistics at open
time, so a runtime pruner over them is wasted work. Will tighten
to classify(&predicate) == Watching, matching the fmt_extra marker.
Inline on snapshot_generation: Stale — the pruner already uses the
DynamicFilterTracker watcher from #22460. Will fix the comment in the same
follow-up.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]