Thanks for the updates, Jing! +1 to start the vote. Best, Jark
On Fri, 22 Jul 2022 at 20:10, Jing Ge <j...@ververica.com> wrote: > Hi, > > I have updated the FLIP. Please check again. If there are no other > concerns, I will start voting. Thank you all for your support! > > Best regards, > Jing > > On Fri, Jul 22, 2022 at 1:32 PM Jing Ge <j...@ververica.com> wrote: > > > Thanks Jark, fair enough, I will update the FLIP accordingly. > > > > Best regards, > > Jing > > > > On Fri, Jul 22, 2022 at 6:07 AM Jark Wu <imj...@gmail.com> wrote: > > > >> Hi Jing, > >> > >> I have some concerns about the isBulkGetSupported() approach. > >> 1. Catalog developers need to learn the contract between > >> `isBulkGetSupported()` and bulk get methods > >> 2. The contract of isBulkGetSupported() is fragile. Because developers > may > >> forget to override > >> `isBulkGetSupported` and be confused the implemented bulk methods are > >> not called. > >> 3. It's not extensible when we have more bulk methods in future > releases. > >> > >> I have discussed with Godfrey offline and come up with (maybe) a better > >> solution. > >> The default implementation of bulk methods should iterate on the single > >> get > >> method and wrap the result. > >> In this way, the planner can always call the bulk methods no matter > >> whether > >> the Catalog implements the bulk method or not. > >> This can also address the above concerns. > >> > >> What do you think? > >> > >> Best, > >> Jark > >> > >> > >> > >> > >> > >> On Thu, 21 Jul 2022 at 21:13, Jing Ge <j...@ververica.com> wrote: > >> > >> > Thanks Jingsong and Jark. I will create another FLIP to cover the > >> > optimization topic that partitions and partition stats could be > fetched > >> > from catalog in one single call. > >> > > >> > Thanks for the hint w.r.t. the compatibility issue. I have updated the > >> FLIP > >> > to provide all methods as default interface methods. A new > >> > method isBulkGetSupported() has been added. Would you like to take a > >> look > >> > at it again? > >> > > >> > Best regards, > >> > Jing > >> > > >> > On Wed, Jul 20, 2022 at 11:23 AM Jark Wu <imj...@gmail.com> wrote: > >> > > >> > > I agree with Jingsong. > >> > > > >> > > There are use cases to get partitions and partition stats in a > single > >> > call > >> > > to reduce the IO cost. > >> > > For example, extending Catalog#listPartitions to > >> > > Catalog#listPartitionsWithStats, > >> > > and extending Catalog#listPartitionsByFilter to > >> > > Catalog#listPartitionsWithStatsByFilter. > >> > > This allows the planner to choose the cheapest one to request > >> partitions > >> > > and stats. > >> > > But of course, this can be introduced in the future. > >> > > > >> > > Besides, I found that the FLIP doesn't cover compatibility properly. > >> > > Introducing two new methods in Catalog > >> > > will break all third-party catalog implementations. On the other > hand, > >> > the > >> > > planner should > >> > > have a way to identify whether the Catalog supports bulk get. > >> > > > >> > > Best, > >> > > Jark > >> > > > >> > > On Wed, 20 Jul 2022 at 16:43, Jingsong Li <jingsongl...@gmail.com> > >> > wrote: > >> > > > >> > > > Thanks for your reply. > >> > > > > >> > > > - Consider bulkGetPartitionStatistics, partition statistics are > >> > > > already in HiveMetastoreClient.listPartitions. But on our side, we > >> > > > need Catalog.getPartitions first, and then > >> > > > Catalog.bulkGetPartitionStatistics. > >> > > > > >> > > > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we > >> need > >> > > it. > >> > > > > >> > > > But to unify them, the current FLIP is acceptable. > >> > > > > >> > > > Best, > >> > > > Jingsong > >> > > > > >> > > > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <j...@ververica.com> > wrote: > >> > > > > > >> > > > > Hi Jingsong, > >> > > > > > >> > > > > Thanks for clarifying it. Are you suggesting a new method or > >> changing > >> > > the > >> > > > > name of the methods described in the FLIP? > >> > > > > Please see my answers and further questions below. > >> > > > > > >> > > > > Best regards, > >> > > > > Jing > >> > > > > > >> > > > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li < > >> jingsongl...@gmail.com> > >> > > > wrote: > >> > > > > > >> > > > > > Hi Jing, > >> > > > > > > >> > > > > > I understand that the statistics for partitions are currently > >> only > >> > > > > > used by Hive, so we can look at the Hive implementation: > >> > > > > > > >> > > > > > See HiveCatalog.getPartitionStatistics. > >> > > > > > To get the statistics, we actually get them from the > >> > > > > > org.apache.hadoop.hive.metastore.api.Partition object. > >> > > > > > > >> > > > > > >> > > > > Correct. Both new methods will do the same thing but in bulk > mode. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > According to HiveMetastore's API, partition-related operations > >> > > > > > actually get the partition as well as the statistics > >> information. > >> > > > > > > >> > > > > > >> > > > > I am not really sure which API it is. Methods in > >> HiveMetaStoreClient > >> > > that > >> > > > > are dealing with partitions will either return partitions or > >> > > > statisticObjs, > >> > > > > e.g. listPartitions(...) or getPartitionColumnStatistics(...) > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > So if the current partition statistics are just for Hive, can > we > >> > > > > > consider unifying it with Hive? > >> > > > > > > >> > > > > > >> > > > > Yes, we are on the same page. This FLIP is trying to unify it > with > >> > > > > HiveMetaStoreClient. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > For example, in PushPartitionIntoTableSourceScanRule, just use > >> > > > > > `listPartitionWithStats`, and adjust table statistics from > >> > > partitions. > >> > > > > > > >> > > > > > >> > > > > Does the bulkGetPartitionStatistics work in this case too? Or, > do > >> you > >> > > > mean > >> > > > > you need both partitions and related statistics returned by a > new > >> > > method > >> > > > > called `listPartitionWithStats`? > >> > > > > > >> > > > > > >> > > > > > Best, > >> > > > > > Jingsong > >> > > > > > > >> > > > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <j...@ververica.com> > >> > wrote: > >> > > > > > > > >> > > > > > > Thanks Jingsong for the suggestion. > >> > > > > > > > >> > > > > > > Do you mean using a different naming convention? There is a > >> > thought > >> > > > and > >> > > > > > > description in the FLIP about using "list" or "bulkGet": > >> > > > > > > > >> > > > > > > - bulkGetPartitionStatistics(...) has been chosen over > >> > > > > > > listPartitionStatistics(...), because, comparing to > >> database > >> > and > >> > > > > > partition > >> > > > > > > that are static and can be listed, statistics are more > >> dynamic > >> > > and > >> > > > > > will > >> > > > > > > need more computation logic to create, therefore using > >> "get" > >> > is > >> > > > > > > semantically more feasible than list. The "bulk" gives > >> users > >> > the > >> > > > hint > >> > > > > > that > >> > > > > > > this method will work in the bulk mode and return a > >> collection > >> > > of > >> > > > > > instances. > >> > > > > > > > >> > > > > > > > >> > > > > > > As a reference, we can see that no method in > MetaStoreClient, > >> > that > >> > > > > > > calculates statistics, uses the "list" naming convention. > >> > > > > > > > >> > > > > > > Best regards, > >> > > > > > > Jing > >> > > > > > > > >> > > > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li < > >> > > jingsongl...@gmail.com> > >> > > > > > wrote: > >> > > > > > > > >> > > > > > > > Thanks for starting this discussion. > >> > > > > > > > > >> > > > > > > > Have we considered introducing a listPartitionWithStats() > in > >> > > > Catalog? > >> > > > > > > > > >> > > > > > > > Best, > >> > > > > > > > Jingsong > >> > > > > > > > > >> > > > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu < > imj...@gmail.com> > >> > > wrote: > >> > > > > > > > > > >> > > > > > > > > Hi Jing, > >> > > > > > > > > > >> > > > > > > > > Thanks for starting this discussion. The bulk fetch is a > >> > great > >> > > > > > > > improvement > >> > > > > > > > > for the optimizer. > >> > > > > > > > > The FLIP looks good to me. > >> > > > > > > > > > >> > > > > > > > > Best, > >> > > > > > > > > Jark > >> > > > > > > > > > >> > > > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge < > j...@ververica.com> > >> > > wrote: > >> > > > > > > > > > >> > > > > > > > > > Hi devs, > >> > > > > > > > > > > >> > > > > > > > > > After having multiple discussions with Jark and > >> Goldfrey, > >> > I'd > >> > > > like > >> > > > > > to > >> > > > > > > > start > >> > > > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1], > >> which > >> > > will > >> > > > > > > > > > significantly improve the performance by providing the > >> bulk > >> > > > fetch > >> > > > > > > > > > capability for table and column statistics. > >> > > > > > > > > > > >> > > > > > > > > > Currently the statistics information about tables can > >> only > >> > be > >> > > > > > fetched > >> > > > > > > > from > >> > > > > > > > > > the catalog by each given partition iteratively. Since > >> > > getting > >> > > > > > > > statistics > >> > > > > > > > > > information from catalogs is a very heavy operation, > in > >> > order > >> > > > to > >> > > > > > > > improve > >> > > > > > > > > > the query performance, we’d better provide > >> functionality to > >> > > > fetch > >> > > > > > the > >> > > > > > > > > > statistics information of a table for all given > >> partitions > >> > in > >> > > > one > >> > > > > > shot. > >> > > > > > > > > > > >> > > > > > > > > > Based on the manual performance test, for 2000 > >> partitions, > >> > > the > >> > > > cost > >> > > > > > > > will be > >> > > > > > > > > > improved from 10s to 2s. The improvement result is > 500%. > >> > > > > > > > > > > >> > > > > > > > > > [1] > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions > >> > > > > > > > > > > >> > > > > > > > > > Best regards, > >> > > > > > > > > > Jing > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > > >> > > >> > > >