niebayes opened a new issue, #16453: URL: https://github.com/apache/datafusion/issues/16453
Our database recently implemented a windowing mechanism similar to Apache Flink, allowing data to be segmented as needed. For example, we support the following SQL statements: ```sql SELECT avg(value) FROM t GROUP BY tumble_window(ts, INTERVAL '10' MINUTE); SELECT avg(value) FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND); ``` Our implementation is relatively straightforward: * We first design some dummy window UDFs, such as `tumble_window`, `session_window`, `slide_window`, etc. * Then we implement an optimizer that transforms the dummy window UDFs into a `WindowAssign` operator. It also rewrites the `Aggregate` operator by replacing the `GROUP BY` expressions with the window ID. * The `WindowAssign` operator assigns one or more window IDs to each input row, indicating which windows the row belongs to. It then performs a Cartesian product between each row and its assigned windows, producing an output batch that includes an extra `window_id` column. * The `Aggregate` operator then groups by the `window_id` column and performs aggregation accordingly. However, at the moment, we can only output the aggregation results per window. We cannot output additional window metadata such as `window_start`, `window_end`, or `window_duration`. For example, we would like to support the following query: ```sql SELECT window_start, window_end, window_duration, avg(value) FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND); ``` But this is not feasible, because DataFusion requires that projected columns must exist in the table schema. We also considered another approach: designing some dummy aggregate UDFs such as `window_start()`, `window_end()`, etc., and then writing the query as: ```sql SELECT window_start(), window_end(), window_duration(), avg(value) FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND); ``` But this approach is also not viable, because DataFusion requires aggregate functions to have at least one argument. Of course, we could write: ```sql SELECT window_start(ts), window_end(ts), window_duration(ts), avg(value) FROM t GROUP BY session_window(ts, INTERVAL '30' SECOND); ``` But this feels unnatural to us. Is there any other way we can implement the functionality we want? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org