Hi jingcheng, I agree with Kurt's point. As you said "the user must know the keys of the output of UDTAGG clearly". If I understand correctly, the key information is strongly relative to the UDTAGG implementation. Users may call `flatAggregate` on a UDTAGG instance with different keys which may result in a wrong result. So I think it would be better to couple key information with UDTAGG interface (i.e. "Approach 3" in your design doc).
Regards, Jark On Thu, 4 Jul 2019 at 18:06, Kurt Young <ykt...@gmail.com> wrote: > Hi Jincheng, > > Thanks for the clarification. I think you just pointed out my concern by > yourself: > > > When a user uses a User-defined table aggregate function (UDTAGG), he > must understand the behavior of the UDTAGG, including the return type and > the characteristics of the returned data. such as the key fields. > > This indicates that the UDTAGG is somehow be classified to different types, > one will no key, one with key information. So the developer of the UDTAGG > should choose which type of this function should be. In this case, > my question would be, why don't we have explicit information about keys > such as we split UDTAGG to keyed UDTAGG and non-keyed UDTAGG. So the user > and the framework will have a better understanding of > this UDTAGG. `withKeys` solution is letting user to choose the key and it > seems it will only work correctly only if the user choose the *right* key > this UDTAGG has. > > Let me know if this makes sense to you. > > Best, > Kurt > > > On Thu, Jul 4, 2019 at 4:32 PM jincheng sun <sunjincheng...@gmail.com> > wrote: > > > Hi All, > > > > @Kurt Young <ykt...@gmail.com> one user-defined table aggregate function > > can be used in both with(out) keys case, and we do not introduce any > other > > aggregations. just like the explanation from @Hequn. > > > > @Hequn Cheng <chenghe...@gmail.com> thanks for your explanation! > > > > One thing should be mentioned here: > > > > When a user uses a User-defined table aggregate function (UDTAGG), he > must > > understand the behavior of the UDTAGG, including the return type and the > > characteristics of the returned data. such as the key fields. So > although > > using `withKeys` approach it is not rigorous enough(we do not need) but > > intuitive enough, considering that if `flatAggregate` is followed by an > > `upsertSink`, then the user must know the keys of the output of UDTAGG > > clearly, otherwise the keys of `upsertSink` cannot be defined. So I still > > prefer the `withKeys` solution by now. > > > > Looking forward to any feedback from all of you! > > > > Best, > > Jincheng > > > > > > > > Hequn Cheng <chenghe...@gmail.com> 于2019年7月1日周一 下午5:35写道: > > > >> Hi Kurt, > >> > >> Thanks for your questions. Here are my thoughts. > >> > >> > if I want to write such kind function, should I make sure that this > >> function is used with some keys? > >> The key information may not be used. We can also use RetractSink to emit > >> the table directly. > >> > >> > If I need a use case to calculate topn without key, should I write > >> another function or I can reuse previous one. > >> For the TopN example, you can reuse the previous function if you don't > >> care > >> about the key information. > >> > >> So, the key information is only an indicator(or a description), not an > >> operator, as Jincheng mentioned above. > >> We do not need to change the function logic and it will not add any > other > >> aggregations. > >> > >> BTW, we have three approaches in the document. Approach 1 defines keys > on > >> API level as we think it's more common to define keys on Table. > >> While approach 3 defines keys in the TableAggregateFunction which is > more > >> precise but it is not very clear for Table users. So, we should take all > >> these into consideration, and make the decision in this discussion > thread. > >> > >> You can take a look at the document and welcome any suggestions or other > >> better solutions. > >> > >> Best, Hequn > >> > >> > >> On Mon, Jul 1, 2019 at 12:13 PM Kurt Young <ykt...@gmail.com> wrote: > >> > >> > Hi Jincheng, > >> > > >> > Thanks for the clarification. Take 'TopN' for example, if I want to > >> write > >> > such kind function, > >> > should I make sure that this function is used with some keys? If I > need > >> a > >> > use case to calculate > >> > topn without key, should I write another function or I can reuse > >> previous > >> > one. > >> > > >> > I'm not sure about the idea of this does not involve semantic changes. > >> To > >> > me, it sounds like > >> > we are doing another nested aggregation inside the table > >> > which TableAggregateFunction emits. > >> > > >> > Maybe I'm not familiar with this function enough, hope you can help me > >> to > >> > understand. > >> > > >> > Best, > >> > Kurt > >> > > >> > > >> > On Mon, Jul 1, 2019 at 11:59 AM jincheng sun < > sunjincheng...@gmail.com> > >> > wrote: > >> > > >> > > Hi Kurt, > >> > > > >> > > Thanks for your questions, I am glad to share my thoughts here: > >> > > > >> > > My question is, will that effect the logic ofTableAggregateFunction > >> user > >> > > > wrote? Should the user know that there will a key and make some > >> changes > >> > > to > >> > > > this function? > >> > > > >> > > > >> > > No, the keys information depends on the implementation of the > >> > > TableAggregateFunction. > >> > > For example, for a `topN` user defined TableAggregateFunction, we > can > >> > only > >> > > use the `keys` if the `topN` contains `rankid` in the output. You > can > >> > > treat the > >> > > `keys` like an indicator. > >> > > > >> > > If not, how will framework deal with the output of user's > >> > > > TableAggregateFunction. if user output multiple rows with the > same > >> > key, > >> > > > should be latter one replace previous ones? > >> > > > >> > > > >> > > If a TableAggregateFunction outputs multiple rows with the same key, > >> the > >> > > latter one should replace the previous one, either with upsert mode > or > >> > > retract mode. i.e., Whether the user defines the Key or not, the > Flink > >> > > framework should ensure the correctness of the semantics. > >> > > > >> > > At present, the problem we are discussing does not involve semantic > >> > > changes. The definition of keys is to support non-window > >> flatAggregate on > >> > > upsert mode. (The upsert mode is already supported in the flink > >> > framework. > >> > > The current discussion only needs to inform the framework that the > >> keys > >> > > information, which is the `withKeys` API we discussing.) > >> > > > >> > > Welcome any other feedbacks :) > >> > > > >> > > Best, > >> > > Jincheng > >> > > > >> > > Kurt Young <ykt...@gmail.com> 于2019年7月1日周一 上午9:23写道: > >> > > > >> > > > Hi, > >> > > > > >> > > > I have a question about the key information of > >> TableAggregateFunction. > >> > > > IIUC, you need to define > >> > > > something like primary key or unique key in the result table of > >> > > > TableAggregateFunction, and also > >> > > > need a way to let user configure this through the API. My question > >> is, > >> > > will > >> > > > that effect the logic of > >> > > > TableAggregateFunction user wrote? Should the user know that there > >> > will a > >> > > > key and make some changes > >> > > > to this function? > >> > > > > >> > > > If so, what's the semantic the user should learned. If not, how > will > >> > > > framework deal with the output of user's > >> > > > TableAggregateFunction. For example, if user output multiple rows > >> with > >> > > the > >> > > > same key, should be latter one > >> > > > replace previous ones? > >> > > > > >> > > > Best, > >> > > > Kurt > >> > > > > >> > > > > >> > > > On Mon, Jul 1, 2019 at 7:19 AM jincheng sun < > >> sunjincheng...@gmail.com> > >> > > > wrote: > >> > > > > >> > > > > Hi hequn, Thanks for the reply! I think `withKeys` solution is > our > >> > > better > >> > > > > choice! > >> > > > > > >> > > > > > >> > > > > Hequn Cheng <chenghe...@gmail.com> 于2019年6月26日周三 下午5:11写道: > >> > > > > > >> > > > > > Hi Jincheng, > >> > > > > > > >> > > > > > Thanks for raising the discussion! > >> > > > > > The key information is very important for query optimizations. > >> It > >> > > would > >> > > > > be > >> > > > > > nice if we can use upsert mode to achieve better performance. > >> > > > > > > >> > > > > > +1 for the `withKeys` proposal. :) > >> > > > > > > >> > > > > > Best, Hequn > >> > > > > > > >> > > > > > > >> > > > > > On Wed, Jun 26, 2019 at 4:37 PM jincheng sun < > >> > > sunjincheng...@gmail.com > >> > > > > > >> > > > > > wrote: > >> > > > > > > >> > > > > > > Hi all, > >> > > > > > > > >> > > > > > > With the continuous efforts from the community, we already > >> > > supported > >> > > > > > > `flatAggregate`[1] on TableAPI in retract mode. I think It's > >> > better > >> > > > to > >> > > > > > add > >> > > > > > > upsert mode for `flatAggregate`. > >> > > > > > > > >> > > > > > > The result table of streaming non-window `flatAggregate` is > a > >> > table > >> > > > > > > contains updates. We can, of course, use a > >> > > RetractStreamTableSink[2] > >> > > > to > >> > > > > > > emit the table, but we can get better performance in upsert > >> mode. > >> > > > > > However, > >> > > > > > > due to the lack of keys, we can’t use an > >> UpsertStreamTableSink to > >> > > > emit > >> > > > > > the > >> > > > > > > table. We don’t have this problem for a normal aggregate as > it > >> > > emits > >> > > > a > >> > > > > > > single row for each group, so the unique keys are exactly > the > >> > same > >> > > > with > >> > > > > > the > >> > > > > > > group keys. While for a `flatAggregate`, its pretty > difference > >> > that > >> > > > due > >> > > > > > to > >> > > > > > > emits multi rows(a “sub-table”) for a single group. To solve > >> this > >> > > > > > problem, > >> > > > > > > we need to find a way to define keys on flatAggregate, so > >> that we > >> > > can > >> > > > > > also > >> > > > > > > use upsert sink to emit the result table after > flatAggregate. > >> > > > > > > > >> > > > > > > So, Aljoscha, Hequn and I prepared a design document for how > >> to > >> > > > define > >> > > > > > the > >> > > > > > > update keys for `flatAggregate` in upsert mode. The detail > >> can > >> > be > >> > > > > found > >> > > > > > > here: > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://docs.google.com/document/d/183qHo8PDG-xserDi_AbGP6YX9aPY0rVr80p3O3Gyz6U/edit?usp=sharing > >> > > > > > > > >> > > > > > > I appreciate it if you can have look at the document and any > >> > > comments > >> > > > > are > >> > > > > > > welcome! > >> > > > > > > > >> > > > > > > > >> > > > > > > Best, > >> > > > > > > > >> > > > > > > Jincheng > >> > > > > > > > >> > > > > > > > >> > > > > > > [1] > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97552739 > >> > > > > > > > >> > > > > > > [2] > >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/sourceSinks.html#defining-a-streamtablesource > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >