goldmedal commented on PR #14837:
URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2877380478

   > That's a use case, but there are others too. Maybe one runs a forecast 
model, which is a little too complicated to "embed" into the query engine. In 
that case, we may still want the engine to maintain the internal state and do 
the bookkeeping on groups etc., but offload the numerics computation elsewhere 
where it is implemented.
   
   I tried to come up with scenarios for async aggregate functions, but I 
believe the use cases depend heavily on what users need.
   
   My initial thought is similar to @alamb's: invoke an external function once 
per batch. For aggregation, both the computation and the accumulation would run 
inside the external function. It would therefore be a single-stage aggregation, 
in which each batch processed by `AsyncFuncExec` yields an aggregated result 
(the intermediate steps depend on how the external function is implemented).
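   
   A rough sketch of the single-stage idea, not code from this PR: the engine 
hands a whole batch to an external async function and gets back the already 
aggregated result. `external_aggregate` and `block_on` are hypothetical names; 
the function body only stands in for a remote call, and the tiny executor is 
there so the example runs with the standard library alone.

```rust
use std::collections::BTreeMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal std-only executor (assumption: a real engine would use its runtime).
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Hypothetical external function: it groups and sums the batch itself,
// so the engine keeps no accumulator state between calls.
async fn external_aggregate(batch: &[(&str, f64)]) -> BTreeMap<String, f64> {
    let mut out = BTreeMap::new();
    for (key, value) in batch {
        *out.entry(key.to_string()).or_insert(0.0) += *value;
    }
    out
}

// One batch in, one aggregated result out: a single-stage aggregation.
fn run_single_stage() -> BTreeMap<String, f64> {
    let batch = [("a", 1.0), ("b", 2.0), ("a", 3.0)];
    block_on(external_aggregate(&batch))
}

fn main() {
    println!("{:?}", run_single_stage());
}
```

   In this shape there is no partial or final stage: whatever grouping happens 
is hidden inside the external call.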
   
   I'm not sure whether I've misunderstood anything; if so, please feel free to 
correct me.
   
   Regarding the scenario you mentioned, which might require multi-stage 
aggregation (partial, final, final partitioned, ...), I think the approach in 
this PR would need a new physical plan (possibly called `AsyncAggregateExec`) 
to handle it.
   
   That plan might need to accept async function inputs and allow accumulators 
to be passed into the async function. I haven't thought of a very clear use 
case yet, but users should still be able to define the behavior of 
`update_async` and `merge_async`.
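   
   To make the multi-stage variant concrete, here is a minimal sketch of what 
user-defined `update_async` and `merge_async` could look like. Everything 
except those two method names is an assumption made for illustration: the 
`AsyncAccumulator` trait, `RemoteSum`, `remote_sum`, and `block_on` are 
hypothetical and are not DataFusion APIs. It needs Rust 1.75 or newer for 
`async fn` in traits.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal std-only executor (assumption: a real engine would use its runtime).
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Hypothetical trait: the engine owns the state, the numerics run elsewhere.
trait AsyncAccumulator {
    async fn update_async(&mut self, values: &[f64]);
    async fn merge_async(&mut self, other: &Self);
    fn evaluate(&self) -> f64;
}

// Stand-in for a call to an external service that does the computation.
async fn remote_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

#[derive(Default)]
struct RemoteSum {
    total: f64,
}

impl AsyncAccumulator for RemoteSum {
    // Partial stage: fold one input batch into this accumulator.
    async fn update_async(&mut self, values: &[f64]) {
        self.total += remote_sum(values).await;
    }
    // Final stage: fold another accumulator's partial state into this one.
    async fn merge_async(&mut self, other: &Self) {
        self.total += remote_sum(&[other.total]).await;
    }
    fn evaluate(&self) -> f64 {
        self.total
    }
}

// Two partial accumulators (one per partition) merged into a final one.
fn run_two_stage() -> f64 {
    block_on(async {
        let mut p1 = RemoteSum::default();
        let mut p2 = RemoteSum::default();
        p1.update_async(&[1.0, 2.0]).await;
        p2.update_async(&[3.0, 4.0]).await;
        let mut merged = RemoteSum::default();
        merged.merge_async(&p1).await;
        merged.merge_async(&p2).await;
        merged.evaluate()
    })
}

fn main() {
    println!("{}", run_two_stage());
}
```

   The sketch mirrors the partial/final split: per-partition accumulators are 
updated independently, then merged, with each step awaiting the external call.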
   
   However, to some extent this amounts to redoing the aggregation logic. 
Perhaps the solution mentioned by @berkaysynnada, adding `evaluate_async` to 
`PhysicalExpr`, would be a more fundamental approach, but it might involve 
changes to physical expression evaluation as a whole. I don't have a strong 
opinion on this point yet.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

