goldmedal commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2877380478
> That's a use case, but there are others too. Maybe one runs a forecast model, which is a little too complicated to "embed" into the query engine. In that case, we may still want the engine to maintain the internal state and do the bookkeeping on groups etc., but offload the numerics computation elsewhere where it is implemented.

I tried to conceive scenarios for using async aggregate functions, but I believe the use cases depend heavily on user needs. My initial thought is similar to @alamb's, intending to batch invoke an external function to process a batch. For aggregation, both computation and accumulation would run within the external function. So, it would be a single-stage aggregation, where a batch processed by AsyncFuncExec would result in an aggregated outcome (the intermediate process depends on the implementation of the external function). I'm not sure if I misunderstood anything; if so, please feel free to correct me.

Regarding the scenario you mentioned, which might require maintaining multi-stage aggregation (partial, final, final partitioned...), I think with the approach in this PR, we would need to provide a new physical plan (possibly called `AsyncAggregateExec`) to handle this scenario. It might need to accept async function inputs and allow passing accumulators into the async function. I haven't thought of a very clear use case yet, but users should still be able to define the behavior of `update_async` and `merge_async`. However, to some extent, this is like redoing the aggregation logic.

Perhaps the solution mentioned by @berkaysynnada, adding `evaluate_async` to `PhysicalExpr`, would be a more fundamental approach, but this might involve changes to the entire physical expression evaluation. I don't have a strong opinion on this point yet.
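To make the multi-stage scenario a bit more concrete, here is a rough sketch of how `update_async` and `merge_async` might fit together. Everything in it is hypothetical and none of it is DataFusion API: the `AsyncAccumulator` trait, the `RemoteMean` type, the `remote_sum` stand-in for an external service, the tiny `block_on` helper, and `two_stage_mean` are invented for illustration. Only the two method names come from the discussion. The engine side keeps the state and the partial/final bookkeeping, while the numeric work is awaited elsewhere.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical trait (an assumption, not part of DataFusion): an accumulator
// whose update and merge steps are async.
trait AsyncAccumulator {
    async fn update_async(&mut self, values: &[f64]);
    async fn merge_async(&mut self, partial_states: &[(f64, u64)]);
    fn state(&self) -> (f64, u64);
    fn evaluate(&self) -> Option<f64>;
}

// Stand-in for an external function; a real one would do I/O here.
async fn remote_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

// The engine-side state: a running sum and row count.
#[derive(Default)]
struct RemoteMean {
    sum: f64,
    count: u64,
}

impl AsyncAccumulator for RemoteMean {
    // Partial stage: offload the numerics, keep the bookkeeping locally.
    async fn update_async(&mut self, values: &[f64]) {
        self.sum += remote_sum(values).await;
        self.count += values.len() as u64;
    }

    // Final stage: combine the states produced by the partial accumulators.
    async fn merge_async(&mut self, partial_states: &[(f64, u64)]) {
        for (sum, count) in partial_states {
            self.sum += sum;
            self.count += count;
        }
    }

    fn state(&self) -> (f64, u64) {
        (self.sum, self.count)
    }

    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

// Minimal executor so the sketch runs with std only. It busy-polls, which is
// fine here because the futures above are always immediately ready.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Hypothetical driver: one partial accumulator per partition, then a final
// accumulator that merges the partial states.
fn two_stage_mean(partitions: &[Vec<f64>]) -> Option<f64> {
    block_on(async {
        let mut states = Vec::new();
        for batch in partitions {
            let mut partial = RemoteMean::default();
            partial.update_async(batch).await;
            states.push(partial.state());
        }
        let mut final_acc = RemoteMean::default();
        final_acc.merge_async(&states).await;
        final_acc.evaluate()
    })
}

fn main() {
    let partitions = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0]];
    println!("{:?}", two_stage_mean(&partitions));
}
```

This also shows the concern raised above: the driver ends up re-implementing the partial/final split that the existing aggregation operator already handles for synchronous accumulators.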
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org