metegenez commented on issue #12779:
URL: https://github.com/apache/datafusion/issues/12779#issuecomment-2404451412
> @metegenez, yes I agree that's a complicated.
>
> I think there are two case:
>
> 1. `select date_trunc(..., timestamp) bucket, avg(data) from records where
timestamp > now() - interval '7 days'` - I think we can solve this (albeit
approximately) by filtering on the `bucket` column to exclude values more than
7 days ago - e.g. run another `FilterExec` like `bucket > {dynamic lower bound
expr}`
> 2. For queries like `select avg(data) from records where timestamp > now()
- interval '7 days'` it's fundamentally impossible to exclude some of the old
data - you have one number like `avg(data) = 42`, and you can't filter out part
of it. I think solution here would be to rewrite (in the logical or physical
plan) into two aggregations, where the inner aggregation is of the form `select
date_trunc(..., timestamp) bucket, avg(data) from records where timestamp >
now() - interval '7 days'`, and the outer aggregation aggregates the result,
then apply the same filtering technique on the inner grouped agregation.
>
> Luckily the first case which is somewhat easier is more common I think.
The first method is the well-known pre-aggregation, which is widely used.
It's possible to automate this by writing a logical plan analyzer for the
operation, though it may not support all functions out of the box. Basic
functions like count, min, max, sum, last, and first are straightforward, but
more complex ones, like median, add complications.
In the second case, I recall implementing a retract mechanism for certain
aggregation functions to support streaming window operations. While it's not
achievable through SQL syntax, adding a physical operator to retract unwanted
data might be a viable solution, especially if the filtered column is ordered.
To sum up, the first method offers more flexibility when adjusting time
ranges, such as moving from the last hour to the last six hours, as long as the
data has already been pre-aggregated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]