Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Shaoxuan Wang Sat, 21 Jan 2017 19:09:04 -0800

Hi Fabian,
Thanks for the carefully checking on the proposal.
Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the
input and return types of the new proposed UDAGG functions are dynamically
given by the users ("[user defined xxx inputs/types]"). All embed built-in
functions for this new API have to be generated via codegen. I will update
Jira and doc.


Thanks,
Shaoxuan


On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <[email protected]> wrote:

> Hi Shaoxuan,
>
> thanks a lot for this great design doc.
> I think user defined aggregation functions are a very important feature for
> the Table API and SQL.
>
> Have you thought about how the aggregation functions will be embedded in
> Flink functions?
> At the moment, we have a generic Flink function which is configured with
> aggregation functions, i.e., we do not leverage code generation here.
> Do you plan to embed built-in and user-defined aggregations functions that
> implement the proposed API with code generation?
>
> Can you maybe extend the JIRA or design document with this information?
>
> Thank you,
> Fabian
>
> 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[email protected]>:
>
> > Hi everyone,
> > I have drafted the design doc (link is provided below) for UDAGG, and
> > created the JIRA (FLINK-5564) to track the progress of this design.
> > Special thanks to Stephan and Fabian for their advice and help.
> >
> > Please check the design doc, feel free to share your comments in the
> google
> > doc:
> > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6
> > 7yXOypY7Uh5gIOK2r-U/edit
> >
> > Regards,
> > Shaoxuan
> >
> > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[email protected]>
> wrote:
> >
> > > Hi Shaoxuan,
> > >
> > > user-defined aggregates would be a great addition to the Table API /
> SQL.
> > > I completely agree that the current (internal) interface is not well
> > suited
> > > as an external interface and needs to be redesigned if exposed to
> users.
> > >
> > > We need to careful think about this new interface and how we can
> > integrate
> > > it with the DataStream (and DataSet) API to support all required
> > > operations, esp. with respect to null aggregates and support for
> > combining
> > > / merging.
> > > I agree that for efficient execution, we should avoid WindowFunctions
> > > (large state) and FoldFunction (not mergeable). If we need a new
> > interface
> > > in the DataStream API, we need to discuss this in more detail.
> > > I think we need a bit more information about the proposed UDAGG
> interface
> > > to discuss how this can be mapped to DataStream operators.
> > >
> > > Support for retraction will be required for our future plans with the
> > > streaming Table API / SQL interface.
> > >
> > > Looking forward to your proposal,
> > > Fabian
> > >
> > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[email protected]>:
> > >
> > > > Hello everyone,
> > > >
> > > > I am writing this email to propose a new User Defined Aggregate
> > > interface.
> > > > We were trying to leverage the existing Aggregate interface, but
> > > > unfortunately we realized that it is not sufficient to meet all our
> > > needs.
> > > > Here are the obstacles we have observed:
> > > > 1) The current aggregate interface is not very concise to users. One
> > > needs
> > > > to know the design details of the intermediate Row buffer before
> > > implements
> > > > an Aggregate. Seven functions are needed even for a simple Count
> > > aggregate.
> > > > We'd better to make the UDAGG interface much more concisely.
> > > > 2) the current aggregate function can be only applied on one single
> > > column.
> > > > There are many scenarios which require the aggregate function taking
> > > > multiple columns as the inputs.
> > > > 3) “Retraction” is not covered in the current Aggregate.
> > > >
> > > > For #1, I am thinking instead of letting users to manipulate the
> > > > intermediate buffer, we could potentially put the entire Aggregate
> > > instance
> > > > or a subclass instance of Aggregate to the Row buffer, such that the
> > user
> > > > does not need to know how the Aggregate state is maintained by the
> > > > framework.
> > > > But to achieve this goal, we probably need a new dataStream API. The
> > > > existing reduce API does not work with two different types of inputs
> > (in
> > > > this proposal, the inputs will be upstream values, and the instance
> of
> > > the
> > > > current accumulated Aggregate), while the fold API is not able to
> merge
> > > the
> > > > two Aggregate results (which is usually needed for merging two
> session
> > > > windows).
> > > >
> > > > For #3, besides the aggregate itself, there are a few other things
> need
> > > to
> > > > be taken care of to fully support the retractions. I will share a
> > > separate
> > > > concrete proposal about how to generate and process retractions, and
> > how
> > > it
> > > > works along with this new proposed UDAGG.
> > > >
> > > > I would like really appreciate if you can share your opinions on this
> > > > proposal, especially for the needed dataStream API for #1. Also, if
> > there
> > > > is any other good things you think to be better added for UDAGG,
> please
> > > > feel free to share with us. I will draft my proposal in a google doc
> > and
> > > > share to the flink DEV group very soon.
> > > >
> > > > Thanks,
> > > > Shaoxuan
> > > >
> > >
> >
>

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Reply via email to