Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Jingsong Li Tue, 19 Jul 2022 19:28:09 -0700

Hi Jing,

I understand that the statistics for partitions are currently only
used by Hive, so we can look at the Hive implementation:


See HiveCatalog.getPartitionStatistics.
The statistics are actually read from the
org.apache.hadoop.hive.metastore.api.Partition object.

In the Hive Metastore API, partition-related operations return the
partition together with its statistics information.
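
To make this concrete, here is a rough sketch of how basic statistics
can be read out of a partition's parameter map. This is not the actual
HiveCatalog code: the class PartitionParamStats and the method readStat
are made up for illustration, and the map stands in for the parameters
carried by the Partition object. The key names are the ones I believe
Hive uses for basic stats (see StatsSetupConst); please double-check
them against your Hive version.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: reads basic stats from a partition's parameter map. */
final class PartitionParamStats {
    // Assumed parameter keys for Hive basic statistics.
    static final String NUM_ROWS = "numRows";
    static final String NUM_FILES = "numFiles";
    static final String TOTAL_SIZE = "totalSize";
    static final String RAW_DATA_SIZE = "rawDataSize";

    // Sentinel for "no usable value".
    static final long UNKNOWN = -1L;

    /** Returns the stat as a non-negative long, or UNKNOWN if absent, malformed, or negative. */
    static long readStat(Map<String, String> params, String key) {
        String value = params.get(key);
        if (value == null) {
            return UNKNOWN;
        }
        try {
            long parsed = Long.parseLong(value.trim());
            return parsed < 0 ? UNKNOWN : parsed;
        } catch (NumberFormatException e) {
            return UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the parameters of one partition.
        Map<String, String> params = new HashMap<>();
        params.put(NUM_ROWS, "1000");
        params.put(TOTAL_SIZE, "52428");

        System.out.println("rows  = " + readStat(params, NUM_ROWS));
        System.out.println("files = " + readStat(params, NUM_FILES));
    }
}
```

The point is only that once the Partition object has been fetched, the
statistics come along with it and no second metastore call is needed.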

So if the current partition statistics are just for Hive, can we
consider unifying it with Hive?

For example, in PushPartitionIntoTableSourceScanRule, just use
`listPartitionWithStats`, and adjust table statistics from partitions.
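
A minimal sketch of the "adjust table statistics from partitions" step,
assuming the rule already holds the per-partition statistics of the
remaining partitions. The names PartitionStatsMerge, Stats and merge are
hypothetical and not part of the Catalog API or the FLIP; using -1 for
"unknown" is also an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives table-level stats from per-partition stats. */
final class PartitionStatsMerge {
    // Assumption: a negative value means the statistic is unknown.
    static final long UNKNOWN = -1L;

    /** Simplified stand-in for per-partition statistics. */
    static final class Stats {
        final long rowCount;
        final long totalSize;

        Stats(long rowCount, long totalSize) {
            this.rowCount = rowCount;
            this.totalSize = totalSize;
        }
    }

    /** Adds two values; the result is UNKNOWN as soon as either side is unknown. */
    static long addOrUnknown(long a, long b) {
        return (a < 0 || b < 0) ? UNKNOWN : a + b;
    }

    /** Sums the stats of the given partitions; an empty list yields zeros. */
    static Stats merge(List<Stats> partitions) {
        long rows = 0L;
        long size = 0L;
        for (Stats p : partitions) {
            rows = addOrUnknown(rows, p.rowCount);
            size = addOrUnknown(size, p.totalSize);
        }
        return new Stats(rows, size);
    }

    public static void main(String[] args) {
        Stats merged = merge(Arrays.asList(new Stats(10, 100), new Stats(20, 200)));
        System.out.println(merged.rowCount + " rows, " + merged.totalSize + " bytes");
    }
}
```

If any partition has no row count, the merged row count stays unknown
rather than being underestimated, which seems the safer input for the
optimizer.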

Best,
Jingsong

On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <[email protected]> wrote:
>
> Thanks Jingsong for the suggestion.
>
> Do you mean using a different naming convention? The FLIP already
> discusses the choice between "list" and "bulkGet":
>
>    - bulkGetPartitionStatistics(...) has been chosen over
>    listPartitionStatistics(...) because, compared to databases and
>    partitions, which are static and can be listed, statistics are more
>    dynamic and need more computation logic to create; therefore "get" is
>    semantically more fitting than "list". The "bulk" hints to users that
>    this method works in bulk mode and returns a collection of instances.
>
>
> As a reference, no method in MetaStoreClient that calculates
> statistics uses the "list" naming convention.
>
> Best regards,
> Jing
>
> On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <[email protected]> wrote:
>
> > Thanks for starting this discussion.
> >
> > Have we considered introducing a listPartitionWithStats() in Catalog?
> >
> > Best,
> > Jingsong
> >
> > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <[email protected]> wrote:
> > >
> > > Hi Jing,
> > >
> > > Thanks for starting this discussion. The bulk fetch is a great
> > > improvement for the optimizer.
> > > The FLIP looks good to me.
> > >
> > > Best,
> > > Jark
> > >
> > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <[email protected]> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > After having multiple discussions with Jark and Goldfrey, I'd like to
> > > > start a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > > significantly improve the performance by providing the bulk fetch
> > > > capability for table and column statistics.
> > > >
> > > > Currently, statistics for a table can only be fetched from the
> > > > catalog one partition at a time. Since getting statistics from
> > > > catalogs is a very heavy operation, we should provide a way to
> > > > fetch the statistics of a table for all given partitions in one
> > > > shot in order to improve query performance.
> > > >
> > > > Based on a manual performance test, for 2000 partitions the cost
> > > > drops from 10s to 2s, i.e. roughly a 5x speedup.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> >
