Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Jing Ge Fri, 22 Jul 2022 04:33:28 -0700

Thanks Jark, fair enough, I will update the FLIP accordingly.

Best regards,
Jing


On Fri, Jul 22, 2022 at 6:07 AM Jark Wu <[email protected]> wrote:

> Hi Jing,
>
> I have some concerns about the isBulkGetSupported() approach.
> 1. Catalog developers need to learn the contract between
> `isBulkGetSupported()` and bulk get methods
> 2. The contract of isBulkGetSupported() is fragile. Because developers may
> forget to override
>    `isBulkGetSupported` and be confused the implemented bulk methods are
> not called.
> 3. It's not extensible when we have more bulk methods in future releases.
>
> I have discussed with Godfrey offline and come up with (maybe) a better
> solution.
> The default implementation of bulk methods should iterate on the single get
> method and wrap the result.
> In this way, the planner can always call the bulk methods no matter whether
> the Catalog implements the bulk method or not.
> This can also address the above concerns.
>
> What do you think?
>
> Best,
> Jark
>
>
>
>
>
> On Thu, 21 Jul 2022 at 21:13, Jing Ge <[email protected]> wrote:
>
> > Thanks Jingsong and Jark. I will create another FLIP to cover the
> > optimization topic that partitions and partition stats could be fetched
> > from catalog in one single call.
> >
> > Thanks for the hint w.r.t. the compatibility issue. I have updated the
> FLIP
> > to provide all methods as default interface methods. A new
> > method isBulkGetSupported() has been added. Would you like to take a look
> > at it again?
> >
> > Best regards,
> > Jing
> >
> > On Wed, Jul 20, 2022 at 11:23 AM Jark Wu <[email protected]> wrote:
> >
> > > I agree with Jingsong.
> > >
> > > There are use cases to get partitions and partition stats in a single
> > call
> > > to reduce the IO cost.
> > > For example, extending Catalog#listPartitions to
> > > Catalog#listPartitionsWithStats,
> > > and extending Catalog#listPartitionsByFilter to
> > > Catalog#listPartitionsWithStatsByFilter.
> > > This allows the planner to choose the cheapest one to request
> partitions
> > > and stats.
> > > But of course, this can be introduced in the future.
> > >
> > > Besides, I found that the FLIP doesn't cover compatibility properly.
> > > Introducing two new methods in Catalog
> > > will break all third-party catalog implementations. On the other hand,
> > the
> > > planner should
> > > have a way to identify whether the Catalog supports bulk get.
> > >
> > > Best,
> > > Jark
> > >
> > > On Wed, 20 Jul 2022 at 16:43, Jingsong Li <[email protected]>
> > wrote:
> > >
> > > > Thanks for your reply.
> > > >
> > > > - Consider bulkGetPartitionStatistics, partition statistics are
> > > > already in HiveMetastoreClient.listPartitions. But on our side, we
> > > > need Catalog.getPartitions first, and then
> > > > Catalog.bulkGetPartitionStatistics.
> > > >
> > > > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we
> need
> > > it.
> > > >
> > > > But to unify them, the current FLIP is acceptable.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <[email protected]> wrote:
> > > > >
> > > > > Hi Jingsong,
> > > > >
> > > > > Thanks for clarifying it. Are you suggesting a new method or
> changing
> > > the
> > > > > name of the methods described in the FLIP?
> > > > > Please see my answers and further questions below.
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <
> [email protected]>
> > > > wrote:
> > > > >
> > > > > > Hi Jing,
> > > > > >
> > > > > > I understand that the statistics for partitions are currently
> only
> > > > > > used by Hive, so we can look at the Hive implementation:
> > > > > >
> > > > > > See HiveCatalog.getPartitionStatistics.
> > > > > > To get the statistics, we actually get them from the
> > > > > > org.apache.hadoop.hive.metastore.api.Partition object.
> > > > > >
> > > > >
> > > > > Correct. Both new methods will do the same thing but in bulk mode.
> > > > >
> > > > >
> > > > > >
> > > > > > According to HiveMetastore's API, partition-related operations
> > > > > > actually get the partition as well as the statistics information.
> > > > > >
> > > > >
> > > > > I am not really sure which API it is. Methods in
> HiveMetaStoreClient
> > > that
> > > > > are dealing with partitions will either return partitions or
> > > > statisticObjs,
> > > > > e.g. listPartitions(...) or getPartitionColumnStatistics(...)
> > > > >
> > > > >
> > > > > >
> > > > > > So if the current partition statistics are just for Hive, can we
> > > > > > consider unifying it with Hive?
> > > > > >
> > > > >
> > > > > Yes, we are on the same page. This FLIP is trying to unify it with
> > > > > HiveMetaStoreClient.
> > > > >
> > > > >
> > > > > >
> > > > > > For example, in PushPartitionIntoTableSourceScanRule, just use
> > > > > > `listPartitionWithStats`, and adjust table statistics from
> > > partitions.
> > > > > >
> > > > >
> > > > > Does the bulkGetPartitionStatistics work in this case too? Or, do
> you
> > > > mean
> > > > > you need both partitions and related statistics returned by a new
> > > method
> > > > > called `listPartitionWithStats`?
> > > > >
> > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <[email protected]>
> > wrote:
> > > > > > >
> > > > > > > Thanks Jingsong for the suggestion.
> > > > > > >
> > > > > > > Do you mean using a different naming convention? There is a
> > thought
> > > > and
> > > > > > > description in the FLIP about using "list" or "bulkGet":
> > > > > > >
> > > > > > >    - bulkGetPartitionStatistics(...) has been chosen over
> > > > > > >    listPartitionStatistics(...), because, comparing to database
> > and
> > > > > > partition
> > > > > > >    that are static and can be listed, statistics are more
> dynamic
> > > and
> > > > > > will
> > > > > > >    need more computation logic to create, therefore using "get"
> > is
> > > > > > >    semantically more feasible than list. The "bulk" gives users
> > the
> > > > hint
> > > > > > that
> > > > > > >    this method will work in the bulk mode and return a
> collection
> > > of
> > > > > > instances.
> > > > > > >
> > > > > > >
> > > > > > > As a reference, we can see that no method in MetaStoreClient,
> > that
> > > > > > > calculates statistics, uses the "list" naming convention.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jing
> > > > > > >
> > > > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <
> > > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for starting this discussion.
> > > > > > > >
> > > > > > > > Have we considered introducing a listPartitionWithStats() in
> > > > Catalog?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong
> > > > > > > >
> > > > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <[email protected]>
> > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Jing,
> > > > > > > > >
> > > > > > > > > Thanks for starting this discussion. The bulk fetch is a
> > great
> > > > > > > > improvement
> > > > > > > > > for the optimizer.
> > > > > > > > > The FLIP looks good to me.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <[email protected]>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi devs,
> > > > > > > > > >
> > > > > > > > > > After having multiple discussions with Jark and Goldfrey,
> > I'd
> > > > like
> > > > > > to
> > > > > > > > start
> > > > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1],
> which
> > > will
> > > > > > > > > > significantly improve the performance by providing the
> bulk
> > > > fetch
> > > > > > > > > > capability for table and column statistics.
> > > > > > > > > >
> > > > > > > > > > Currently the statistics information about tables can
> only
> > be
> > > > > > fetched
> > > > > > > > from
> > > > > > > > > > the catalog by each given partition iteratively. Since
> > > getting
> > > > > > > > statistics
> > > > > > > > > > information from catalogs is a very heavy operation, in
> > order
> > > > to
> > > > > > > > improve
> > > > > > > > > > the query performance, we’d better provide functionality
> to
> > > > fetch
> > > > > > the
> > > > > > > > > > statistics information of a table for all given
> partitions
> > in
> > > > one
> > > > > > shot.
> > > > > > > > > >
> > > > > > > > > > Based on the manual performance test, for 2000
> partitions,
> > > the
> > > > cost
> > > > > > > > will be
> > > > > > > > > > improved from 10s to 2s. The improvement result is 500%.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Jing
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Reply via email to