Thanks for the clarification Shaoxuan. Cheers, Fabian
2017-01-22 4:08 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > Hi Fabian, > Thanks for the carefully checking on the proposal. > Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the > input and return types of the new proposed UDAGG functions are dynamically > given by the users ("[user defined xxx inputs/types]"). All embed built-in > functions for this new API have to be generated via codegen. I will update > Jira and doc. > > Thanks, > Shaoxuan > > > On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > > Hi Shaoxuan, > > > > thanks a lot for this great design doc. > > I think user defined aggregation functions are a very important feature > for > > the Table API and SQL. > > > > Have you thought about how the aggregation functions will be embedded in > > Flink functions? > > At the moment, we have a generic Flink function which is configured with > > aggregation functions, i.e., we do not leverage code generation here. > > Do you plan to embed built-in and user-defined aggregations functions > that > > implement the proposed API with code generation? > > > > Can you maybe extend the JIRA or design document with this information? > > > > Thank you, > > Fabian > > > > 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > > > > > Hi everyone, > > > I have drafted the design doc (link is provided below) for UDAGG, and > > > created the JIRA (FLINK-5564) to track the progress of this design. > > > Special thanks to Stephan and Fabian for their advice and help. > > > > > > Please check the design doc, feel free to share your comments in the > > google > > > doc: > > > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6 > > > 7yXOypY7Uh5gIOK2r-U/edit > > > > > > Regards, > > > Shaoxuan > > > > > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <fhue...@gmail.com> > > wrote: > > > > > > > Hi Shaoxuan, > > > > > > > > user-defined aggregates would be a great addition to the Table API / > > SQL. > > > > I completely agree that the current (internal) interface is not well > > > suited > > > > as an external interface and needs to be redesigned if exposed to > > users. > > > > > > > > We need to careful think about this new interface and how we can > > > integrate > > > > it with the DataStream (and DataSet) API to support all required > > > > operations, esp. with respect to null aggregates and support for > > > combining > > > > / merging. > > > > I agree that for efficient execution, we should avoid WindowFunctions > > > > (large state) and FoldFunction (not mergeable). If we need a new > > > interface > > > > in the DataStream API, we need to discuss this in more detail. > > > > I think we need a bit more information about the proposed UDAGG > > interface > > > > to discuss how this can be mapped to DataStream operators. > > > > > > > > Support for retraction will be required for our future plans with the > > > > streaming Table API / SQL interface. > > > > > > > > Looking forward to your proposal, > > > > Fabian > > > > > > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > > > > > > > > > Hello everyone, > > > > > > > > > > I am writing this email to propose a new User Defined Aggregate > > > > interface. > > > > > We were trying to leverage the existing Aggregate interface, but > > > > > unfortunately we realized that it is not sufficient to meet all our > > > > needs. > > > > > Here are the obstacles we have observed: > > > > > 1) The current aggregate interface is not very concise to users. > One > > > > needs > > > > > to know the design details of the intermediate Row buffer before > > > > implements > > > > > an Aggregate. Seven functions are needed even for a simple Count > > > > aggregate. > > > > > We'd better to make the UDAGG interface much more concisely. > > > > > 2) the current aggregate function can be only applied on one single > > > > column. > > > > > There are many scenarios which require the aggregate function > taking > > > > > multiple columns as the inputs. > > > > > 3) “Retraction” is not covered in the current Aggregate. > > > > > > > > > > For #1, I am thinking instead of letting users to manipulate the > > > > > intermediate buffer, we could potentially put the entire Aggregate > > > > instance > > > > > or a subclass instance of Aggregate to the Row buffer, such that > the > > > user > > > > > does not need to know how the Aggregate state is maintained by the > > > > > framework. > > > > > But to achieve this goal, we probably need a new dataStream API. > The > > > > > existing reduce API does not work with two different types of > inputs > > > (in > > > > > this proposal, the inputs will be upstream values, and the instance > > of > > > > the > > > > > current accumulated Aggregate), while the fold API is not able to > > merge > > > > the > > > > > two Aggregate results (which is usually needed for merging two > > session > > > > > windows). > > > > > > > > > > For #3, besides the aggregate itself, there are a few other things > > need > > > > to > > > > > be taken care of to fully support the retractions. I will share a > > > > separate > > > > > concrete proposal about how to generate and process retractions, > and > > > how > > > > it > > > > > works along with this new proposed UDAGG. > > > > > > > > > > I would like really appreciate if you can share your opinions on > this > > > > > proposal, especially for the needed dataStream API for #1. Also, if > > > there > > > > > is any other good things you think to be better added for UDAGG, > > please > > > > > feel free to share with us. I will draft my proposal in a google > doc > > > and > > > > > share to the flink DEV group very soon. > > > > > > > > > > Thanks, > > > > > Shaoxuan > > > > > > > > > > > > > > >