Hi Jingsong,

Thanks for clarifying it. Are you suggesting a new method or changing the
name of the methods described in the FLIP?
Please see my answers and further questions below.

Best regards,
Jing

On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi Jing,
>
> I understand that the statistics for partitions are currently only
> used by Hive, so we can look at the Hive implementation:
>
> See HiveCatalog.getPartitionStatistics.
> To get the statistics, we actually get them from the
> org.apache.hadoop.hive.metastore.api.Partition object.
>

Correct. Both new methods will do the same thing but in bulk mode.


>
> According to HiveMetastore's API, partition-related operations
> actually get the partition as well as the statistics information.
>

I am not really sure which API it is. Methods in HiveMetaStoreClient that
are dealing with partitions will either return partitions or statisticObjs,
e.g. listPartitions(...) or getPartitionColumnStatistics(...)


>
> So if the current partition statistics are just for Hive, can we
> consider unifying it with Hive?
>

Yes, we are on the same page. This FLIP is trying to unify it with
HiveMetaStoreClient.


>
> For example, in PushPartitionIntoTableSourceScanRule, just use
> `listPartitionWithStats`, and adjust table statistics from partitions.
>

Does the bulkGetPartitionStatistics work in this case too? Or, do you mean
you need both partitions and related statistics returned by a new method
called `listPartitionWithStats`?


> Best,
> Jingsong
>
> On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <j...@ververica.com> wrote:
> >
> > Thanks Jingsong for the suggestion.
> >
> > Do you mean using a different naming convention? There is a thought and
> > description in the FLIP about using "list" or "bulkGet":
> >
> >    - bulkGetPartitionStatistics(...) has been chosen over
> >    listPartitionStatistics(...), because, comparing to database and
> partition
> >    that are static and can be listed, statistics are more dynamic and
> will
> >    need more computation logic to create, therefore using "get" is
> >    semantically more feasible than list. The "bulk" gives users the hint
> that
> >    this method will work in the bulk mode and return a collection of
> instances.
> >
> >
> > As a reference, we can see that no method in MetaStoreClient, that
> > calculates statistics, uses the "list" naming convention.
> >
> > Best regards,
> > Jing
> >
> > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
> >
> > > Thanks for starting this discussion.
> > >
> > > Have we considered introducing a listPartitionWithStats() in Catalog?
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > Hi Jing,
> > > >
> > > > Thanks for starting this discussion. The bulk fetch is a great
> > > improvement
> > > > for the optimizer.
> > > > The FLIP looks good to me.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <j...@ververica.com> wrote:
> > > >
> > > > > Hi devs,
> > > > >
> > > > > After having multiple discussions with Jark and Goldfrey, I'd like
> to
> > > start
> > > > > a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > > > significantly improve the performance by providing the bulk fetch
> > > > > capability for table and column statistics.
> > > > >
> > > > > Currently the statistics information about tables can only be
> fetched
> > > from
> > > > > the catalog by each given partition iteratively. Since getting
> > > statistics
> > > > > information from catalogs is a very heavy operation, in order to
> > > improve
> > > > > the query performance, we’d better provide functionality to fetch
> the
> > > > > statistics information of a table for all given partitions in one
> shot.
> > > > >
> > > > > Based on the manual performance test, for 2000 partitions, the cost
> > > will be
> > > > > improved from 10s to 2s. The improvement result is 500%.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > >
>

Reply via email to