Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Jing Ge Fri, 22 Jul 2022 05:08:58 -0700

Hi,

I have updated the FLIP. Please check again. If there are no other
concerns, I will start voting. Thank you all for your support!


Best regards,
Jing

On Fri, Jul 22, 2022 at 1:32 PM Jing Ge <[email protected]> wrote:

> Thanks Jark, fair enough, I will update the FLIP accordingly.
>
> Best regards,
> Jing
>
> On Fri, Jul 22, 2022 at 6:07 AM Jark Wu <[email protected]> wrote:
>
>> Hi Jing,
>>
>> I have some concerns about the isBulkGetSupported() approach.
>> 1. Catalog developers need to learn the contract between
>> `isBulkGetSupported()` and bulk get methods
>> 2. The contract of isBulkGetSupported() is fragile. Because developers may
>> forget to override
>>    `isBulkGetSupported` and be confused the implemented bulk methods are
>> not called.
>> 3. It's not extensible when we have more bulk methods in future releases.
>>
>> I have discussed with Godfrey offline and come up with (maybe) a better
>> solution.
>> The default implementation of bulk methods should iterate on the single
>> get
>> method and wrap the result.
>> In this way, the planner can always call the bulk methods no matter
>> whether
>> the Catalog implements the bulk method or not.
>> This can also address the above concerns.
>>
>> What do you think?
>>
>> Best,
>> Jark
>>
>>
>>
>>
>>
>> On Thu, 21 Jul 2022 at 21:13, Jing Ge <[email protected]> wrote:
>>
>> > Thanks Jingsong and Jark. I will create another FLIP to cover the
>> > optimization topic that partitions and partition stats could be fetched
>> > from catalog in one single call.
>> >
>> > Thanks for the hint w.r.t. the compatibility issue. I have updated the
>> FLIP
>> > to provide all methods as default interface methods. A new
>> > method isBulkGetSupported() has been added. Would you like to take a
>> look
>> > at it again?
>> >
>> > Best regards,
>> > Jing
>> >
>> > On Wed, Jul 20, 2022 at 11:23 AM Jark Wu <[email protected]> wrote:
>> >
>> > > I agree with Jingsong.
>> > >
>> > > There are use cases to get partitions and partition stats in a single
>> > call
>> > > to reduce the IO cost.
>> > > For example, extending Catalog#listPartitions to
>> > > Catalog#listPartitionsWithStats,
>> > > and extending Catalog#listPartitionsByFilter to
>> > > Catalog#listPartitionsWithStatsByFilter.
>> > > This allows the planner to choose the cheapest one to request
>> partitions
>> > > and stats.
>> > > But of course, this can be introduced in the future.
>> > >
>> > > Besides, I found that the FLIP doesn't cover compatibility properly.
>> > > Introducing two new methods in Catalog
>> > > will break all third-party catalog implementations. On the other hand,
>> > the
>> > > planner should
>> > > have a way to identify whether the Catalog supports bulk get.
>> > >
>> > > Best,
>> > > Jark
>> > >
>> > > On Wed, 20 Jul 2022 at 16:43, Jingsong Li <[email protected]>
>> > wrote:
>> > >
>> > > > Thanks for your reply.
>> > > >
>> > > > - Consider bulkGetPartitionStatistics, partition statistics are
>> > > > already in HiveMetastoreClient.listPartitions. But on our side, we
>> > > > need Catalog.getPartitions first, and then
>> > > > Catalog.bulkGetPartitionStatistics.
>> > > >
>> > > > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we
>> need
>> > > it.
>> > > >
>> > > > But to unify them, the current FLIP is acceptable.
>> > > >
>> > > > Best,
>> > > > Jingsong
>> > > >
>> > > > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <[email protected]> wrote:
>> > > > >
>> > > > > Hi Jingsong,
>> > > > >
>> > > > > Thanks for clarifying it. Are you suggesting a new method or
>> changing
>> > > the
>> > > > > name of the methods described in the FLIP?
>> > > > > Please see my answers and further questions below.
>> > > > >
>> > > > > Best regards,
>> > > > > Jing
>> > > > >
>> > > > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <
>> [email protected]>
>> > > > wrote:
>> > > > >
>> > > > > > Hi Jing,
>> > > > > >
>> > > > > > I understand that the statistics for partitions are currently
>> only
>> > > > > > used by Hive, so we can look at the Hive implementation:
>> > > > > >
>> > > > > > See HiveCatalog.getPartitionStatistics.
>> > > > > > To get the statistics, we actually get them from the
>> > > > > > org.apache.hadoop.hive.metastore.api.Partition object.
>> > > > > >
>> > > > >
>> > > > > Correct. Both new methods will do the same thing but in bulk mode.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > According to HiveMetastore's API, partition-related operations
>> > > > > > actually get the partition as well as the statistics
>> information.
>> > > > > >
>> > > > >
>> > > > > I am not really sure which API it is. Methods in
>> HiveMetaStoreClient
>> > > that
>> > > > > are dealing with partitions will either return partitions or
>> > > > statisticObjs,
>> > > > > e.g. listPartitions(...) or getPartitionColumnStatistics(...)
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > So if the current partition statistics are just for Hive, can we
>> > > > > > consider unifying it with Hive?
>> > > > > >
>> > > > >
>> > > > > Yes, we are on the same page. This FLIP is trying to unify it with
>> > > > > HiveMetaStoreClient.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > For example, in PushPartitionIntoTableSourceScanRule, just use
>> > > > > > `listPartitionWithStats`, and adjust table statistics from
>> > > partitions.
>> > > > > >
>> > > > >
>> > > > > Does the bulkGetPartitionStatistics work in this case too? Or, do
>> you
>> > > > mean
>> > > > > you need both partitions and related statistics returned by a new
>> > > method
>> > > > > called `listPartitionWithStats`?
>> > > > >
>> > > > >
>> > > > > > Best,
>> > > > > > Jingsong
>> > > > > >
>> > > > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <[email protected]>
>> > wrote:
>> > > > > > >
>> > > > > > > Thanks Jingsong for the suggestion.
>> > > > > > >
>> > > > > > > Do you mean using a different naming convention? There is a
>> > thought
>> > > > and
>> > > > > > > description in the FLIP about using "list" or "bulkGet":
>> > > > > > >
>> > > > > > >    - bulkGetPartitionStatistics(...) has been chosen over
>> > > > > > >    listPartitionStatistics(...), because, comparing to
>> database
>> > and
>> > > > > > partition
>> > > > > > >    that are static and can be listed, statistics are more
>> dynamic
>> > > and
>> > > > > > will
>> > > > > > >    need more computation logic to create, therefore using
>> "get"
>> > is
>> > > > > > >    semantically more feasible than list. The "bulk" gives
>> users
>> > the
>> > > > hint
>> > > > > > that
>> > > > > > >    this method will work in the bulk mode and return a
>> collection
>> > > of
>> > > > > > instances.
>> > > > > > >
>> > > > > > >
>> > > > > > > As a reference, we can see that no method in MetaStoreClient,
>> > that
>> > > > > > > calculates statistics, uses the "list" naming convention.
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > > Jing
>> > > > > > >
>> > > > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <
>> > > [email protected]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Thanks for starting this discussion.
>> > > > > > > >
>> > > > > > > > Have we considered introducing a listPartitionWithStats() in
>> > > > Catalog?
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Jingsong
>> > > > > > > >
>> > > > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <[email protected]>
>> > > wrote:
>> > > > > > > > >
>> > > > > > > > > Hi Jing,
>> > > > > > > > >
>> > > > > > > > > Thanks for starting this discussion. The bulk fetch is a
>> > great
>> > > > > > > > improvement
>> > > > > > > > > for the optimizer.
>> > > > > > > > > The FLIP looks good to me.
>> > > > > > > > >
>> > > > > > > > > Best,
>> > > > > > > > > Jark
>> > > > > > > > >
>> > > > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <[email protected]>
>> > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi devs,
>> > > > > > > > > >
>> > > > > > > > > > After having multiple discussions with Jark and
>> Goldfrey,
>> > I'd
>> > > > like
>> > > > > > to
>> > > > > > > > start
>> > > > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1],
>> which
>> > > will
>> > > > > > > > > > significantly improve the performance by providing the
>> bulk
>> > > > fetch
>> > > > > > > > > > capability for table and column statistics.
>> > > > > > > > > >
>> > > > > > > > > > Currently the statistics information about tables can
>> only
>> > be
>> > > > > > fetched
>> > > > > > > > from
>> > > > > > > > > > the catalog by each given partition iteratively. Since
>> > > getting
>> > > > > > > > statistics
>> > > > > > > > > > information from catalogs is a very heavy operation, in
>> > order
>> > > > to
>> > > > > > > > improve
>> > > > > > > > > > the query performance, we’d better provide
>> functionality to
>> > > > fetch
>> > > > > > the
>> > > > > > > > > > statistics information of a table for all given
>> partitions
>> > in
>> > > > one
>> > > > > > shot.
>> > > > > > > > > >
>> > > > > > > > > > Based on the manual performance test, for 2000
>> partitions,
>> > > the
>> > > > cost
>> > > > > > > > will be
>> > > > > > > > > > improved from 10s to 2s. The improvement result is 500%.
>> > > > > > > > > >
>> > > > > > > > > > [1]
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
>> > > > > > > > > >
>> > > > > > > > > > Best regards,
>> > > > > > > > > > Jing
>> > > > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Reply via email to