Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Fabian Hueske Mon, 23 Jan 2017 02:38:54 -0800

Thanks for the clarification Shaoxuan.

Cheers, Fabian


2017-01-22 4:08 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>:

> Hi Fabian,
> Thanks for carefully checking the proposal.
> Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the
> input and return types of the new proposed UDAGG functions are dynamically
> given by the users ("[user defined xxx inputs/types]"). All embedded built-in
> functions for this new API have to be generated via codegen. I will update
> Jira and doc.
>
> Thanks,
> Shaoxuan
>
>
> On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> > Hi Shaoxuan,
> >
> > thanks a lot for this great design doc.
> > I think user defined aggregation functions are a very important feature
> > for the Table API and SQL.
> >
> > Have you thought about how the aggregation functions will be embedded in
> > Flink functions?
> > At the moment, we have a generic Flink function which is configured with
> > aggregation functions, i.e., we do not leverage code generation here.
> > Do you plan to embed built-in and user-defined aggregation functions
> > that implement the proposed API with code generation?
> >
> > Can you maybe extend the JIRA or design document with this information?
> >
> > Thank you,
> > Fabian
> >
> > 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>:
> >
> > > Hi everyone,
> > > I have drafted the design doc (link is provided below) for UDAGG, and
> > > created the JIRA (FLINK-5564) to track the progress of this design.
> > > Special thanks to Stephan and Fabian for their advice and help.
> > >
> > > Please check the design doc, and feel free to share your comments in
> > > the Google doc:
> > > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <fhue...@gmail.com> wrote:
> > >
> > > > Hi Shaoxuan,
> > > >
> > > > user-defined aggregates would be a great addition to the Table API /
> > > > SQL.
> > > > I completely agree that the current (internal) interface is not well
> > > > suited as an external interface and needs to be redesigned if exposed
> > > > to users.
> > > >
> > > > We need to think carefully about this new interface and how we can
> > > > integrate it with the DataStream (and DataSet) API to support all
> > > > required operations, esp. with respect to null aggregates and support
> > > > for combining / merging.
> > > > I agree that for efficient execution, we should avoid WindowFunctions
> > > > (large state) and FoldFunction (not mergeable). If we need a new
> > > > interface in the DataStream API, we need to discuss this in more detail.
> > > > I think we need a bit more information about the proposed UDAGG
> > > > interface to discuss how this can be mapped to DataStream operators.
> > > >
> > > > Support for retraction will be required for our future plans with the
> > > > streaming Table API / SQL interface.
> > > >
> > > > Looking forward to your proposal,
> > > > Fabian
> > > >
> > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > I am writing this email to propose a new User Defined Aggregate
> > > > > interface.
> > > > > We were trying to leverage the existing Aggregate interface, but
> > > > > unfortunately we realized that it is not sufficient to meet all our
> > > > > needs.
> > > > > Here are the obstacles we have observed:
> > > > > 1) The current aggregate interface is not very concise for users. One
> > > > > needs to know the design details of the intermediate Row buffer before
> > > > > implementing an Aggregate. Seven functions are needed even for a simple
> > > > > Count aggregate. We should make the UDAGG interface much more concise.
> > > > > 2) The current aggregate function can only be applied to a single
> > > > > column. There are many scenarios that require the aggregate function
> > > > > to take multiple columns as inputs.
> > > > > 3) “Retraction” is not covered in the current Aggregate.
> > > > >
> > > > > For #1, I am thinking that instead of letting users manipulate the
> > > > > intermediate buffer, we could potentially put the entire Aggregate
> > > > > instance, or an instance of a subclass of Aggregate, into the Row
> > > > > buffer, such that the user does not need to know how the Aggregate
> > > > > state is maintained by the framework.
> > > > > But to achieve this goal, we probably need a new DataStream API. The
> > > > > existing reduce API does not work with two different types of inputs
> > > > > (in this proposal, the inputs will be upstream values and the instance
> > > > > of the current accumulated Aggregate), while the fold API is not able
> > > > > to merge two Aggregate results (which is usually needed for merging
> > > > > two session windows).
> > > > >
> > > > > For #3, besides the aggregate itself, there are a few other things
> > > > > that need to be taken care of to fully support retractions. I will
> > > > > share a separate concrete proposal about how to generate and process
> > > > > retractions, and how this works along with the newly proposed UDAGG.
> > > > >
> > > > > I would really appreciate it if you could share your opinions on this
> > > > > proposal, especially on the DataStream API needed for #1. Also, if
> > > > > there is anything else you think should be added to UDAGG, please
> > > > > feel free to share it with us. I will draft my proposal in a Google
> > > > > doc and share it with the Flink dev group very soon.
> > > > >
> > > > > Thanks,
> > > > > Shaoxuan
> > > > >
> > > >
> > >
> >
>
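
For readers following the thread, the requirements raised in it (state kept
separate from the input type, merging of partial results, retraction, null
results for empty aggregates, and more than one input column) can be
illustrated with a small sketch. This is not the interface from the design doc
or from Flink: the names Aggregate, Weighted, Acc, WeightedAvg and aggregate
are all hypothetical, chosen only for illustration, and a weighted average is
assumed as the example because it needs two input columns.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from Flink or the design doc.
class UdaggSketch {

    /** Assumed shape of a user-facing aggregate: IN = input, ACC = state, OUT = result. */
    interface Aggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        void accumulate(ACC acc, IN input); // add one upstream value
        void retract(ACC acc, IN input);    // undo a previously accumulated value
        ACC merge(ACC a, ACC b);            // combine two partial results
        OUT getValue(ACC acc);              // may be null for an empty aggregate
    }

    /** Input made of two columns: a value and its weight. */
    static final class Weighted {
        final long value;
        final long weight;
        Weighted(long value, long weight) { this.value = value; this.weight = weight; }
    }

    /** Accumulator state; its type differs from the input type. */
    static final class Acc {
        long weightedSum;
        long weightSum;
    }

    static final class WeightedAvg implements Aggregate<Weighted, Acc, Double> {
        public Acc createAccumulator() { return new Acc(); }

        public void accumulate(Acc acc, Weighted in) {
            acc.weightedSum += in.value * in.weight;
            acc.weightSum += in.weight;
        }

        public void retract(Acc acc, Weighted in) {
            acc.weightedSum -= in.value * in.weight;
            acc.weightSum -= in.weight;
        }

        public Acc merge(Acc a, Acc b) {
            Acc out = new Acc();
            out.weightedSum = a.weightedSum + b.weightedSum;
            out.weightSum = a.weightSum + b.weightSum;
            return out;
        }

        public Double getValue(Acc acc) {
            // No accumulated weight means there is no defined average.
            return acc.weightSum == 0 ? null : (double) acc.weightedSum / acc.weightSum;
        }
    }

    /** Accumulates all inputs into a fresh accumulator, as a framework might do per window. */
    static <IN, ACC, OUT> ACC aggregate(Aggregate<IN, ACC, OUT> agg, List<IN> inputs) {
        ACC acc = agg.createAccumulator();
        for (IN in : inputs) {
            agg.accumulate(acc, in);
        }
        return acc;
    }

    public static void main(String[] args) {
        WeightedAvg agg = new WeightedAvg();
        Acc left = aggregate(agg, Arrays.asList(new Weighted(10, 1), new Weighted(20, 3)));
        Acc right = aggregate(agg, Arrays.asList(new Weighted(40, 1)));

        Acc merged = agg.merge(left, right);      // e.g. two session windows merging
        System.out.println(agg.getValue(merged)); // 22.0

        agg.retract(merged, new Weighted(40, 1)); // withdraw the earlier input
        System.out.println(agg.getValue(merged)); // 17.5
    }
}
```

In this sketch, accumulate takes an input and an accumulator of different
types, which is the case the thread says reduce cannot express, and merge
combines two accumulators, which is the case it says fold cannot express.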
