Re: [I] `datafusion-query-cache` - caching intermediate results for faster repeated queries [datafusion]

via GitHub Thu, 10 Oct 2024 01:33:18 -0700


metegenez commented on issue #12779:
URL: https://github.com/apache/datafusion/issues/12779#issuecomment-2404451412


   > @metegenez, yes I agree that's a complicated.
   > 
   > I think there are two case:
   > 
   > 1. `select date_trunc(..., timestamp) bucket, avg(data) from records where 
timestamp > now() - interval '7 days'` - I think we can solve this (albeit 
approximately) by filtering on the `bucket` column to exclude values more than 
7 days ago - e.g. run another `FilterExec` like `bucket > {dynamic lower bound 
expr}`
   > 2. For queries like `select avg(data) from records where timestamp > now() 
- interval '7 days'` it's fundamentally impossible to exclude some of the old 
data - you have one number like `avg(data) = 42`, and you can't filter out part 
of it. I think solution here would be to rewrite (in the logical or physical 
plan) into two aggregations, where the inner aggregation is of the form `select 
date_trunc(..., timestamp) bucket, avg(data) from records where timestamp > 
now() - interval '7 days'`, and the outer aggregation aggregates the result, 
then apply the same filtering technique on the inner grouped agregation.
   > 
   > Luckily the first case which is somewhat easier is more common I think.
   
   The first method is the well-known pre-aggregation, which is widely used. 
It's possible to automate this by writing a logical plan analyzer for the 
operation, though it may not support all functions out of the box. Basic 
functions like count, min, max, sum, last, and first are straightforward, but 
more complex ones, like median, add complications.
   
   In the second case, I recall implementing a retract mechanism for certain 
aggregation functions to support streaming window operations. While it's not 
achievable through SQL syntax, adding a physical operator to retract unwanted 
data might be a viable solution, especially if the filtered column is ordered.
   
   To sum up, the first method offers more flexibility when adjusting time 
ranges, such as moving from the last hour to the last six hours, as long as the 
data has already been pre-aggregated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] `datafusion-query-cache` - caching intermediate results for faster repeated queries [datafusion]

Reply via email to