alamb commented on issue #23197:
URL: https://github.com/apache/datafusion/issues/23197#issuecomment-4835392373
> Here is my guess on why such design decision was made before: because the
original window aggregation implementation was fairly naive: it buffered all
input before continuing execution. As a result, a downstream implementation was
upstreamed to support streaming and out-of-order input, where assuming fully
sorted input was not possible for their use cases. Given that goal, the
additional complexity made sense because it improved memory efficiency.
Yes I think that is mostly accurate (I think @akurmustafa / @mustafasrepo
did a lot of the initial work)
I think `BoundedWindowExec` was designed to support streaming sql engines
(e.g. where the ExecutionPlan was started and the incrementally produced
results as data was incrementally pushed through it)
> However, DataFusion's primary use case is still batched analytics. Under
that assumption, we can require the window operator to consume input that has
already been repartitioned and sorted by the window partition/order keys. This
makes the implementation both simpler and faster, and we can achieve the same
memory efficiency given bounded window frames.
I agree it makes the implementation simpler -- I suspect a fully specialized
operator could be made faster than spilling / resorting, but I also think we
could work on that after we have exhausted all options for making
BoundedWindowAggExec faster
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]