Hi Shaoxuan, user-defined aggregates would be a great addition to the Table API / SQL. I completely agree that the current (internal) interface is not well suited as an external interface and needs to be redesigned if exposed to users.
We need to careful think about this new interface and how we can integrate it with the DataStream (and DataSet) API to support all required operations, esp. with respect to null aggregates and support for combining / merging. I agree that for efficient execution, we should avoid WindowFunctions (large state) and FoldFunction (not mergeable). If we need a new interface in the DataStream API, we need to discuss this in more detail. I think we need a bit more information about the proposed UDAGG interface to discuss how this can be mapped to DataStream operators. Support for retraction will be required for our future plans with the streaming Table API / SQL interface. Looking forward to your proposal, Fabian 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > Hello everyone, > > I am writing this email to propose a new User Defined Aggregate interface. > We were trying to leverage the existing Aggregate interface, but > unfortunately we realized that it is not sufficient to meet all our needs. > Here are the obstacles we have observed: > 1) The current aggregate interface is not very concise to users. One needs > to know the design details of the intermediate Row buffer before implements > an Aggregate. Seven functions are needed even for a simple Count aggregate. > We'd better to make the UDAGG interface much more concisely. > 2) the current aggregate function can be only applied on one single column. > There are many scenarios which require the aggregate function taking > multiple columns as the inputs. > 3) “Retraction” is not covered in the current Aggregate. > > For #1, I am thinking instead of letting users to manipulate the > intermediate buffer, we could potentially put the entire Aggregate instance > or a subclass instance of Aggregate to the Row buffer, such that the user > does not need to know how the Aggregate state is maintained by the > framework. > But to achieve this goal, we probably need a new dataStream API. The > existing reduce API does not work with two different types of inputs (in > this proposal, the inputs will be upstream values, and the instance of the > current accumulated Aggregate), while the fold API is not able to merge the > two Aggregate results (which is usually needed for merging two session > windows). > > For #3, besides the aggregate itself, there are a few other things need to > be taken care of to fully support the retractions. I will share a separate > concrete proposal about how to generate and process retractions, and how it > works along with this new proposed UDAGG. > > I would like really appreciate if you can share your opinions on this > proposal, especially for the needed dataStream API for #1. Also, if there > is any other good things you think to be better added for UDAGG, please > feel free to share with us. I will draft my proposal in a google doc and > share to the flink DEV group very soon. > > Thanks, > Shaoxuan >