Thanks for the update. The proposal looks good to me now.
Best, Jark On Tue, 5 Jan 2021 at 14:44, Jingsong Li <jingsongl...@gmail.com> wrote: > Thanks for your proposal! Sebastian. > > +1 for SupportsAggregatePushDown. The above wonderful discussion has > solved many of my concerns. > > ## Semantic problems > > We may need to add some mechanisms or comments, because as far as I know, > the semantics of each database is actually different, which may need to be > reflected in your specific implementation. > > For example, the AVG output types of various databases may be different. > For example, MySQL outputs double, this is different from Flink. What > should we do? (Lucky, avg will be splitted into sum and count, But we also > need care about decimal and others) > > ## The phase of push-down rule > > I strongly recommend that you do not put it in the Volcano phase, which > may make the cost calculation very troublesome. > So in PHYSICAL_REWRITE? > > ## About interface > > For scalability, I slightly recommend that we introduce an `Aggregate` > interface, it contains `List<CallExpression> aggregateExpressions, int[] > groupingFields, DataType producedDataType` fields. In this way, we can add > fields easily without breaking compatibility. > > I think the current design is very good, just put forward some ideas. > > Best, > Jingsong > > On Tue, Jan 5, 2021 at 1:55 PM Sebastian Liu <liuyang0...@gmail.com> > wrote: > >> Hi Jark, >> >> Thx for your further feedback and help. The interface of >> SupportsAggregatePushDown may indeed need some adjustments. >> >> For (1) Agree: Yeah, the upstream only need to know if the TableSource can >> handle all of the aggregates. >> It's better to just return a boolean type to indicate whether all of >> aggregates push down was successful or not. [Resolved in proposal] >> >> For (2) Agree: The aggOutputDataType represent the produced data type of >> the new table source to make sure that the new table source can >> connect with the related exchange node. The format of this >> aggOutputDataType is groupedFields's type + agg function's return type. >> The reason for adding this parameter in this function is also to >> facilitate >> the user to build the final output type. I have changed this parameter >> to be producedDataType. [Resolved in proposal] >> >> For (3) Agree: Indeed, groupSet may mislead users, I have changed to use >> groupingFields. [Resolved in proposal] >> >> Thx again for the suggestion, looking for the further discussion. >> >> Jark Wu <imj...@gmail.com> 于2021年1月5日周二 下午12:05写道: >> >> > I'm also +1 for idea#2. >> > >> > Regarding to the updated interface, >> > >> > Result applyAggregates(List<CallExpression> aggregateExpressions, >> > int[] groupSet, DataType aggOutputDataType); >> > >> > final class Result { >> > private final List<CallExpression> acceptedAggregates; >> > private final List<CallExpression> remainingAggregates; >> > } >> > >> > I have following comments: >> > >> > 1) Do we need the composite Result return type? Is a boolean return type >> > enough? >> > From my understanding, all of the aggregates should be accepted, >> > otherwise the pushdown should fail. >> > Therefore, users don't need to distinguish which aggregates are >> > "accepted". >> > >> > 2) Does the `aggOutputDataType` represent the produced data type of the >> > new source, or just the return type of all the agg functions? >> > I would prefer to `producedDataType` just like >> > `SupportsReadingMetadata` to reduce the effort for users to concat a >> final >> > output type. >> > The return type of each agg function can be obtained from the >> > `CallExpression`. >> > >> > 3) What do you think about renaming `groupSet` to `grouping` or >> > `groupedFields` ? >> > The `groupSet` may confuse users that it relates to "grouping sets". >> > >> > >> > What do you think? >> > >> > Best, >> > Jark >> > >> > >> > >> > On Tue, 5 Jan 2021 at 11:04, Kurt Young <ykt...@gmail.com> wrote: >> > >> >> Sorry for the typo -_-! >> >> I meant idea #2. >> >> >> >> Best, >> >> Kurt >> >> >> >> >> >> On Tue, Jan 5, 2021 at 10:59 AM Sebastian Liu <liuyang0...@gmail.com> >> >> wrote: >> >> >> >>> Hi Kurt, >> >>> >> >>> Thx a lot for your feedback. If local aggregation is more like a >> >>> physical operator rather than logical >> >>> operator, I think your suggestion should be idea #2 which handle all >> in >> >>> the physical optimization phase? >> >>> >> >>> Looking forward for the further discussion. >> >>> >> >>> >> >>> Kurt Young <ykt...@gmail.com> 于2021年1月5日周二 上午9:52写道: >> >>> >> >>>> Local aggregation is more like a physical operator rather than >> logical >> >>>> operator. I would suggest going with idea #1. >> >>>> >> >>>> Best, >> >>>> Kurt >> >>>> >> >>>> >> >>>> On Wed, Dec 30, 2020 at 8:31 PM Sebastian Liu <liuyang0...@gmail.com >> > >> >>>> wrote: >> >>>> >> >>>> > Hi Jark, Thx a lot for your quick reply and valuable suggestions. >> >>>> > For (1): Agree: Since we are in the period of upgrading the new >> table >> >>>> > source api, >> >>>> > we really should consider the new interface for the new optimize >> >>>> rule. If >> >>>> > the new rule >> >>>> > doesn't use the new api, we'll have to upgrade it sooner or later. >> I >> >>>> have >> >>>> > change to use >> >>>> > the ability interface for the SupportsAggregatePushDown definition >> in >> >>>> above >> >>>> > proposal. >> >>>> > >> >>>> > For (2): Agree: Change to use CallExpression is a better choice, >> and >> >>>> have >> >>>> > resolved this >> >>>> > comment in the proposal. >> >>>> > >> >>>> > For (3): I suggest we first support the JDBC connector, as we don't >> >>>> have >> >>>> > Druid connector >> >>>> > and ES connector just has sink api at present. >> >>>> > >> >>>> > But perhaps the biggest question may be whether we should use idea >> 1 >> >>>> or >> >>>> > idea 2 in proposal. >> >>>> > >> >>>> > What do you think? After we reach the agreement on the proposal, >> our >> >>>> team >> >>>> > can drive to >> >>>> > complete this feature. >> >>>> > >> >>>> > Jark Wu <imj...@gmail.com> 于2020年12月29日周二 下午2:58写道: >> >>>> > >> >>>> > > Hi Sebastian, >> >>>> > > >> >>>> > > Thanks for the proposal. I think this is a great improvement for >> >>>> Flink >> >>>> > SQL. >> >>>> > > I went through the design doc and have following thoughts: >> >>>> > > >> >>>> > > 1) Flink has deprecated the legacy TableSource in 1.11 and >> proposed >> >>>> a new >> >>>> > > set of DynamicTableSource interfaces. Could you update your >> >>>> proposal to >> >>>> > > use the new interfaces? >> >>>> > > Follow the existing ability interfaces, e.g. >> >>>> > > SupportsFilterPushDown, SupportsProjectionPushDown. >> >>>> > > >> >>>> > > 2) Personally, I think CallExpression would be a better >> >>>> representation >> >>>> > than >> >>>> > > separate `FunctionDefinition` and args. Because, it would be >> easier >> >>>> to >> >>>> > know >> >>>> > > what's the index and type of the arguments. >> >>>> > > >> >>>> > > 3) It would be better to list which connectors will be supported >> in >> >>>> the >> >>>> > > plan? >> >>>> > > >> >>>> > > Best, >> >>>> > > Jark >> >>>> > > >> >>>> > > >> >>>> > > On Tue, 29 Dec 2020 at 00:49, Sebastian Liu < >> liuyang0...@gmail.com> >> >>>> > wrote: >> >>>> > > >> >>>> > > > Hi all, >> >>>> > > > >> >>>> > > > I'd like to discuss a new feature for the Blink Planner. >> >>>> > > > Aggregate operator of Flink SQL is currently fully done at >> Flink >> >>>> layer. >> >>>> > > > With the developing of storage, many downstream storage of >> Flink >> >>>> SQL >> >>>> > has >> >>>> > > > the ability to deal with Aggregation operator. >> >>>> > > > Pushing down Aggregate to data source layer will improve >> >>>> performance >> >>>> > from >> >>>> > > > the perspective of the network IO and computation overhead. >> >>>> > > > >> >>>> > > > I have drafted a design doc for this new feature. >> >>>> > > > >> >>>> > > > >> >>>> > > >> >>>> > >> >>>> >> https://docs.google.com/document/d/1kGwC_h4qBNxF2eMEz6T6arByOB8yilrPLqDN0QBQXW4/edit?usp=sharing >> >>>> > > > >> >>>> > > > Any comment or discussion is welcome. >> >>>> > > > >> >>>> > > > -- >> >>>> > > > >> >>>> > > > *With kind regards >> >>>> > > > ------------------------------------------------------------ >> >>>> > > > Sebastian Liu 刘洋 >> >>>> > > > Institute of Computing Technology, Chinese Academy of Science >> >>>> > > > Mobile\WeChat: +86—15201613655 >> >>>> > > > E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com> >> >>>> > > > QQ: 3239559* >> >>>> > > > >> >>>> > > >> >>>> > >> >>>> > >> >>>> > -- >> >>>> > >> >>>> > *With kind regards >> >>>> > ------------------------------------------------------------ >> >>>> > Sebastian Liu 刘洋 >> >>>> > Institute of Computing Technology, Chinese Academy of Science >> >>>> > Mobile\WeChat: +86—15201613655 >> >>>> > E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com> >> >>>> > QQ: 3239559* >> >>>> > >> >>>> >> >>> >> >>> >> >>> -- >> >>> >> >>> *With kind regards >> >>> ------------------------------------------------------------ >> >>> Sebastian Liu 刘洋 >> >>> Institute of Computing Technology, Chinese Academy of Science >> >>> Mobile\WeChat: +86—15201613655 >> >>> E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com> >> >>> QQ: 3239559* >> >>> >> >>> >> >> -- >> >> *With kind regards >> ------------------------------------------------------------ >> Sebastian Liu 刘洋 >> Institute of Computing Technology, Chinese Academy of Science >> Mobile\WeChat: +86—15201613655 >> E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com> >> QQ: 3239559* >> > > > -- > Best, Jingsong Lee >