Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-22 Thread Jark Wu
Thanks for the updates, Jing! +1 to start the vote. Best, Jark On Fri, 22 Jul 2022 at 20:10, Jing Ge wrote: > Hi, > > I have updated the FLIP. Please check again. If there are no other > concerns, I will start voting. Thank you all for your support! > > Best regards, > Jing > > On Fri, Jul 22,

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-22 Thread Jing Ge
Hi, I have updated the FLIP. Please check again. If there are no other concerns, I will start voting. Thank you all for your support! Best regards, Jing On Fri, Jul 22, 2022 at 1:32 PM Jing Ge wrote: > Thanks Jark, fair enough, I will update the FLIP accordingly. > > Best regards, > Jing > > O

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-22 Thread Jing Ge
Thanks Jark, fair enough, I will update the FLIP accordingly. Best regards, Jing On Fri, Jul 22, 2022 at 6:07 AM Jark Wu wrote: > Hi Jing, > > I have some concerns about the isBulkGetSupported() approach. > 1. Catalog developers need to learn the contract between > `isBulkGetSupported()` and bu

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-21 Thread Jark Wu
Hi Jing, I have some concerns about the isBulkGetSupported() approach. 1. Catalog developers need to learn the contract between `isBulkGetSupported()` and bulk get methods 2. The contract of isBulkGetSupported() is fragile. Because developers may forget to override `isBulkGetSupported` and be c
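The message is cut off above, so the exact failure mode Jark describes is not visible. As a rough illustration of how a capability flag paired with a bulk method can drift apart, here is a minimal sketch. Everything in it is an assumption made for illustration: `FlaggedCatalog`, `ForgetfulCatalog`, `fetch`, and the use of a row count as the "statistic" are hypothetical and are not Flink's API. Only the name `isBulkGetSupported` comes from the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class FlagContractSketch {

    /** Hypothetical catalog with a capability flag; not Flink's actual Catalog interface. */
    interface FlaggedCatalog {
        long getPartitionRowCount(String partition);

        /** Assumed default: bulk get is reported as unsupported unless overridden. */
        default boolean isBulkGetSupported() {
            return false;
        }

        /** Assumed default: no bulk implementation is available. */
        default List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
            throw new UnsupportedOperationException("bulk get not implemented");
        }
    }

    /** Implements the bulk method but forgets to override the flag. */
    static final class ForgetfulCatalog implements FlaggedCatalog {
        int singleCalls = 0;
        int bulkCalls = 0;

        @Override
        public long getPartitionRowCount(String partition) {
            singleCalls++;
            return partition.length(); // dummy statistic for the sketch
        }

        @Override
        public List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
            bulkCalls++;
            List<Long> counts = new ArrayList<>(partitions.size());
            for (String partition : partitions) {
                counts.add((long) partition.length());
            }
            return counts;
        }
    }

    /** Hypothetical caller that trusts the flag to choose between the two paths. */
    static List<Long> fetch(FlaggedCatalog catalog, List<String> partitions) {
        if (catalog.isBulkGetSupported()) {
            return catalog.bulkGetPartitionRowCounts(partitions);
        }
        List<Long> counts = new ArrayList<>(partitions.size());
        for (String partition : partitions) {
            counts.add(catalog.getPartitionRowCount(partition));
        }
        return counts;
    }

    public static void main(String[] args) {
        ForgetfulCatalog catalog = new ForgetfulCatalog();
        fetch(catalog, Arrays.asList("dt=2022-07-20", "dt=2022-07-21"));
        // The bulk implementation exists but is never reached because the flag still says false.
        System.out.println("single calls: " + catalog.singleCalls + ", bulk calls: " + catalog.bulkCalls);
    }
}
```

In this sketch the result is still correct, so nothing fails visibly: the caller just falls back to one call per partition, which is the kind of contract a catalog developer would have to learn and remember.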

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-21 Thread Jing Ge
Thanks Jingsong and Jark. I will create another FLIP to cover the optimization topic that partitions and partition stats could be fetched from catalog in one single call. Thanks for the hint w.r.t. the compatibility issue. I have updated the FLIP to provide all methods as default interface methods
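To make the compatibility point concrete, here is a minimal sketch of how a bulk method added as a default interface method leaves existing implementations untouched. The types are simplified assumptions for illustration: `SimpleCatalog`, `LegacyCatalog`, `BatchingCatalog`, and the row-count "statistic" are hypothetical and do not reproduce the signatures proposed in the FLIP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class DefaultBulkSketch {

    /** Hypothetical, simplified catalog; not Flink's actual Catalog interface. */
    interface SimpleCatalog {
        /** Pre-existing single-partition lookup (a row count stands in for statistics). */
        long getPartitionRowCount(String partition);

        /** New bulk method with a default body, so existing implementations keep compiling. */
        default List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
            List<Long> counts = new ArrayList<>(partitions.size());
            for (String partition : partitions) {
                counts.add(getPartitionRowCount(partition));
            }
            return counts;
        }
    }

    /** Written before the bulk method existed; it simply inherits the default. */
    static final class LegacyCatalog implements SimpleCatalog {
        int singleCalls = 0;

        @Override
        public long getPartitionRowCount(String partition) {
            singleCalls++;
            return partition.length(); // dummy statistic for the sketch
        }
    }

    /** Overrides the bulk method to answer all partitions in one call. */
    static final class BatchingCatalog implements SimpleCatalog {
        int singleCalls = 0;
        int bulkCalls = 0;

        @Override
        public long getPartitionRowCount(String partition) {
            singleCalls++;
            return partition.length();
        }

        @Override
        public List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
            bulkCalls++;
            List<Long> counts = new ArrayList<>(partitions.size());
            for (String partition : partitions) {
                counts.add((long) partition.length());
            }
            return counts;
        }
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("dt=2022-07-20", "dt=2022-07-21");
        LegacyCatalog legacy = new LegacyCatalog();
        BatchingCatalog batching = new BatchingCatalog();
        legacy.bulkGetPartitionRowCounts(partitions);
        batching.bulkGetPartitionRowCounts(partitions);
        System.out.println("legacy single calls: " + legacy.singleCalls);
        System.out.println("batching bulk calls: " + batching.bulkCalls);
    }
}
```

With this shape there is no separate capability flag to keep in sync: a catalog that does nothing gets the per-partition fallback, and a catalog that can batch overrides one method.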

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-20 Thread Jark Wu
I agree with Jingsong. There are use cases to get partitions and partition stats in a single call to reduce the IO cost. For example, extending Catalog#listPartitions to Catalog#listPartitionsWithStats, and extending Catalog#listPartitionsByFilter to Catalog#listPartitionsWithStatsByFilter. This a
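The IO argument can be sketched with a toy in-memory store that counts round trips. This is only an illustration under stated assumptions: `InMemoryMetastore`, its `roundTrips` counter, and the `Map<String, Long>` return types are hypothetical, and only the method name `listPartitionsWithStats` is taken from the message above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class CombinedListingSketch {

    /** Hypothetical metastore stand-in; each public method call counts as one round trip. */
    static final class InMemoryMetastore {
        private final Map<String, Long> rowCounts = new LinkedHashMap<>();
        int roundTrips = 0;

        InMemoryMetastore() {
            rowCounts.put("dt=2022-07-20", 100L);
            rowCounts.put("dt=2022-07-21", 250L);
        }

        /** Step 1 of the two-step approach: list the partitions. */
        List<String> listPartitions() {
            roundTrips++;
            return new ArrayList<>(rowCounts.keySet());
        }

        /** Step 2 of the two-step approach: fetch statistics for the listed partitions. */
        Map<String, Long> bulkGetPartitionStatistics(List<String> partitions) {
            roundTrips++;
            Map<String, Long> result = new LinkedHashMap<>();
            for (String partition : partitions) {
                result.put(partition, rowCounts.get(partition));
            }
            return result;
        }

        /** Combined call: partitions and their statistics in a single round trip. */
        Map<String, Long> listPartitionsWithStats() {
            roundTrips++;
            return new LinkedHashMap<>(rowCounts);
        }
    }

    public static void main(String[] args) {
        InMemoryMetastore twoStep = new InMemoryMetastore();
        Map<String, Long> viaTwoSteps = twoStep.bulkGetPartitionStatistics(twoStep.listPartitions());

        InMemoryMetastore combined = new InMemoryMetastore();
        Map<String, Long> viaOneCall = combined.listPartitionsWithStats();

        System.out.println("same result: " + viaTwoSteps.equals(viaOneCall));
        System.out.println("round trips: " + twoStep.roundTrips + " vs " + combined.roundTrips);
    }
}
```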

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-20 Thread Jingsong Li
Thanks for your reply. - Consider bulkGetPartitionStatistics, partition statistics are already in HiveMetastoreClient.listPartitions. But on our side, we need Catalog.getPartitions first, and then Catalog.bulkGetPartitionStatistics. - Consider bulkGetPartitionColumnStatistics, yes, as you said, w

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-20 Thread Jing Ge
Hi Jingsong, Thanks for clarifying it. Are you suggesting a new method or changing the name of the methods described in the FLIP? Please see my answers and further questions below. Best regards, Jing On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li wrote: > Hi Jing, > > I understand that the statis

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-19 Thread Jingsong Li
Hi Jing, I understand that the statistics for partitions are currently only used by Hive, so we can look at the Hive implementation: See HiveCatalog.getPartitionStatistics. To get the statistics, we actually get them from the org.apache.hadoop.hive.metastore.api.Partition object. According to Hi

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-19 Thread Jing Ge
Thanks Jingsong for the suggestion. Do you mean using a different naming convention? There is a thought and description in the FLIP about using "list" or "bulkGet": - bulkGetPartitionStatistics(...) has been chosen over listPartitionStatistics(...), because, compared to database and partit

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-15 Thread godfrey he
Hi Jing, Thanks for driving this, LGTM. Best, Godfrey On Fri, Jul 15, 2022 at 11:38, Jingsong Li wrote: > > Thanks for starting this discussion. > > Have we considered introducing a listPartitionWithStats() in Catalog? > > Best, > Jingsong > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu wrote: > > > > Hi Ji

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-14 Thread Jingsong Li
Thanks for starting this discussion. Have we considered introducing a listPartitionWithStats() in Catalog? Best, Jingsong On Fri, Jul 15, 2022 at 10:08 AM Jark Wu wrote: > > Hi Jing, > > Thanks for starting this discussion. The bulk fetch is a great improvement > for the optimizer. > The FLIP l

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-14 Thread Jark Wu
Hi Jing, Thanks for starting this discussion. The bulk fetch is a great improvement for the optimizer. The FLIP looks good to me. Best, Jark On Fri, 8 Jul 2022 at 17:36, Jing Ge wrote: > Hi devs, > > After having multiple discussions with Jark and Godfrey, I'd like to start > a discussion on

[DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-08 Thread Jing Ge
Hi devs, After having multiple discussions with Jark and Godfrey, I'd like to start a discussion on the mailing list w.r.t. FLIP-247[1], which will significantly improve the performance by providing the bulk fetch capability for table and column statistics. Currently the statistics information a