Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Jing Ge Thu, 21 Jul 2022 06:13:17 -0700

Thanks Jingsong and Jark. I will create another FLIP to cover the
optimization topic that partitions and partition stats could be fetched
from catalog in one single call.


Thanks for the hint w.r.t. the compatibility issue. I have updated the FLIP
to provide all methods as default interface methods. A new
method isBulkGetSupported() has been added. Would you like to take a look
at it again?

Best regards,
Jing

On Wed, Jul 20, 2022 at 11:23 AM Jark Wu <[email protected]> wrote:

> I agree with Jingsong.
>
> There are use cases to get partitions and partition stats in a single call
> to reduce the IO cost.
> For example, extending Catalog#listPartitions to
> Catalog#listPartitionsWithStats,
> and extending Catalog#listPartitionsByFilter to
> Catalog#listPartitionsWithStatsByFilter.
> This allows the planner to choose the cheapest one to request partitions
> and stats.
> But of course, this can be introduced in the future.
>
> Besides, I found that the FLIP doesn't cover compatibility properly.
> Introducing two new methods in Catalog
> will break all third-party catalog implementations. On the other hand, the
> planner should
> have a way to identify whether the Catalog supports bulk get.
>
> Best,
> Jark
>
> On Wed, 20 Jul 2022 at 16:43, Jingsong Li <[email protected]> wrote:
>
> > Thanks for your reply.
> >
> > - Consider bulkGetPartitionStatistics, partition statistics are
> > already in HiveMetastoreClient.listPartitions. But on our side, we
> > need Catalog.getPartitions first, and then
> > Catalog.bulkGetPartitionStatistics.
> >
> > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we need
> it.
> >
> > But to unify them, the current FLIP is acceptable.
> >
> > Best,
> > Jingsong
> >
> > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > Thanks for clarifying it. Are you suggesting a new method or changing
> the
> > > name of the methods described in the FLIP?
> > > Please see my answers and further questions below.
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <[email protected]>
> > wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > I understand that the statistics for partitions are currently only
> > > > used by Hive, so we can look at the Hive implementation:
> > > >
> > > > See HiveCatalog.getPartitionStatistics.
> > > > To get the statistics, we actually get them from the
> > > > org.apache.hadoop.hive.metastore.api.Partition object.
> > > >
> > >
> > > Correct. Both new methods will do the same thing but in bulk mode.
> > >
> > >
> > > >
> > > > According to HiveMetastore's API, partition-related operations
> > > > actually get the partition as well as the statistics information.
> > > >
> > >
> > > I am not really sure which API it is. Methods in HiveMetaStoreClient
> that
> > > are dealing with partitions will either return partitions or
> > statisticObjs,
> > > e.g. listPartitions(...) or getPartitionColumnStatistics(...)
> > >
> > >
> > > >
> > > > So if the current partition statistics are just for Hive, can we
> > > > consider unifying it with Hive?
> > > >
> > >
> > > Yes, we are on the same page. This FLIP is trying to unify it with
> > > HiveMetaStoreClient.
> > >
> > >
> > > >
> > > > For example, in PushPartitionIntoTableSourceScanRule, just use
> > > > `listPartitionWithStats`, and adjust table statistics from
> partitions.
> > > >
> > >
> > > Does the bulkGetPartitionStatistics work in this case too? Or, do you
> > mean
> > > you need both partitions and related statistics returned by a new
> method
> > > called `listPartitionWithStats`?
> > >
> > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <[email protected]> wrote:
> > > > >
> > > > > Thanks Jingsong for the suggestion.
> > > > >
> > > > > Do you mean using a different naming convention? There is a thought
> > and
> > > > > description in the FLIP about using "list" or "bulkGet":
> > > > >
> > > > >    - bulkGetPartitionStatistics(...) has been chosen over
> > > > >    listPartitionStatistics(...), because, comparing to database and
> > > > partition
> > > > >    that are static and can be listed, statistics are more dynamic
> and
> > > > will
> > > > >    need more computation logic to create, therefore using "get" is
> > > > >    semantically more feasible than list. The "bulk" gives users the
> > hint
> > > > that
> > > > >    this method will work in the bulk mode and return a collection
> of
> > > > instances.
> > > > >
> > > > >
> > > > > As a reference, we can see that no method in MetaStoreClient, that
> > > > > calculates statistics, uses the "list" naming convention.
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <
> [email protected]>
> > > > wrote:
> > > > >
> > > > > > Thanks for starting this discussion.
> > > > > >
> > > > > > Have we considered introducing a listPartitionWithStats() in
> > Catalog?
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <[email protected]>
> wrote:
> > > > > > >
> > > > > > > Hi Jing,
> > > > > > >
> > > > > > > Thanks for starting this discussion. The bulk fetch is a great
> > > > > > improvement
> > > > > > > for the optimizer.
> > > > > > > The FLIP looks good to me.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <[email protected]>
> wrote:
> > > > > > >
> > > > > > > > Hi devs,
> > > > > > > >
> > > > > > > > After having multiple discussions with Jark and Goldfrey, I'd
> > like
> > > > to
> > > > > > start
> > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1], which
> will
> > > > > > > > significantly improve the performance by providing the bulk
> > fetch
> > > > > > > > capability for table and column statistics.
> > > > > > > >
> > > > > > > > Currently the statistics information about tables can only be
> > > > fetched
> > > > > > from
> > > > > > > > the catalog by each given partition iteratively. Since
> getting
> > > > > > statistics
> > > > > > > > information from catalogs is a very heavy operation, in order
> > to
> > > > > > improve
> > > > > > > > the query performance, we’d better provide functionality to
> > fetch
> > > > the
> > > > > > > > statistics information of a table for all given partitions in
> > one
> > > > shot.
> > > > > > > >
> > > > > > > > Based on the manual performance test, for 2000 partitions,
> the
> > cost
> > > > > > will be
> > > > > > > > improved from 10s to 2s. The improvement result is 500%.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jing
> > > > > > > >
> > > > > >
> > > >
> >
>

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Reply via email to