niebayes opened a new issue, #16453:
URL: https://github.com/apache/datafusion/issues/16453

   Our database recently implemented a windowing mechanism similar to Apache 
Flink, allowing data to be segmented as needed.
   For example, we support the following SQL statements:
   
   ```sql
   SELECT avg(value) FROM t GROUP BY tumble_window(ts, INTERVAL '10' MINUTE);
   SELECT avg(value) FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND);
   ```
   
   Our implementation is relatively straightforward:
   
   * We first design some dummy window UDFs, such as `tumble_window`, 
`session_window`, `slide_window`, etc.
   * Then we implement an optimizer that transforms the dummy window UDFs into 
a `WindowAssign` operator. It also rewrites the `Aggregate` operator by 
replacing the `GROUP BY` expressions with the window ID.
   * The `WindowAssign` operator assigns one or more window IDs to each input 
row, indicating which windows the row belongs to. It then performs a Cartesian 
product between each row and its assigned windows, producing an output batch 
that includes an extra `window_id` column.
   * The `Aggregate` operator then groups by the `window_id` column and 
performs aggregation accordingly.
   
   However, at the moment, we can only output the aggregation results per 
window. We cannot output additional window metadata such as `window_start`, 
`window_end`, or `window_duration`.
   For example, we would like to support the following query:
   
   ```sql
   SELECT window_start, window_end, window_duration, avg(value) FROM t GROUP BY 
session_window(ts, INTERVAL '30' SECOND);
   ```
   
   But this is not feasible, because DataFusion requires that projected columns 
must exist in the table schema.
   
   We also considered another approach: designing some dummy aggregate UDFs 
such as `window_start()`, `window_end()`, etc., and then writing the query as:
   
   ```sql
   SELECT window_start(), window_end(), window_duration(), avg(value) FROM t 
GROUP BY session_window(ts, INTERVAL '30' SECOND);
   ```
   
   But this approach is also not viable, because DataFusion requires aggregate 
functions to have at least one argument.
   
   Of course, we could write:
   
   ```sql
   SELECT window_start(ts), window_end(ts), window_duration(ts), avg(value) 
FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND);
   ```
   
   But this feels unnatural to us.
   
   Is there any other way we can implement the functionality we want?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to