Thanks for the updates, Jing!
+1 to start the vote.

Best,
Jark

On Fri, 22 Jul 2022 at 20:10, Jing Ge <j...@ververica.com> wrote:

> Hi,
>
> I have updated the FLIP. Please check again. If there are no other
> concerns, I will start voting. Thank you all for your support!
>
> Best regards,
> Jing
>
> On Fri, Jul 22, 2022 at 1:32 PM Jing Ge <j...@ververica.com> wrote:
>
> > Thanks Jark, fair enough, I will update the FLIP accordingly.
> >
> > Best regards,
> > Jing
> >
> > On Fri, Jul 22, 2022 at 6:07 AM Jark Wu <imj...@gmail.com> wrote:
> >
> >> Hi Jing,
> >>
> >> I have some concerns about the isBulkGetSupported() approach.
> >> 1. Catalog developers need to learn the contract between
> >> `isBulkGetSupported()` and bulk get methods
> >> 2. The contract of isBulkGetSupported() is fragile. Because developers
> may
> >> forget to override
> >>    `isBulkGetSupported` and be confused the implemented bulk methods are
> >> not called.
> >> 3. It's not extensible when we have more bulk methods in future
> releases.
> >>
> >> I have discussed with Godfrey offline and come up with (maybe) a better
> >> solution.
> >> The default implementation of bulk methods should iterate on the single
> >> get
> >> method and wrap the result.
> >> In this way, the planner can always call the bulk methods no matter
> >> whether
> >> the Catalog implements the bulk method or not.
> >> This can also address the above concerns.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Jark
> >>
> >>
> >>
> >>
> >>
> >> On Thu, 21 Jul 2022 at 21:13, Jing Ge <j...@ververica.com> wrote:
> >>
> >> > Thanks Jingsong and Jark. I will create another FLIP to cover the
> >> > optimization topic that partitions and partition stats could be
> fetched
> >> > from catalog in one single call.
> >> >
> >> > Thanks for the hint w.r.t. the compatibility issue. I have updated the
> >> FLIP
> >> > to provide all methods as default interface methods. A new
> >> > method isBulkGetSupported() has been added. Would you like to take a
> >> look
> >> > at it again?
> >> >
> >> > Best regards,
> >> > Jing
> >> >
> >> > On Wed, Jul 20, 2022 at 11:23 AM Jark Wu <imj...@gmail.com> wrote:
> >> >
> >> > > I agree with Jingsong.
> >> > >
> >> > > There are use cases to get partitions and partition stats in a
> single
> >> > call
> >> > > to reduce the IO cost.
> >> > > For example, extending Catalog#listPartitions to
> >> > > Catalog#listPartitionsWithStats,
> >> > > and extending Catalog#listPartitionsByFilter to
> >> > > Catalog#listPartitionsWithStatsByFilter.
> >> > > This allows the planner to choose the cheapest one to request
> >> partitions
> >> > > and stats.
> >> > > But of course, this can be introduced in the future.
> >> > >
> >> > > Besides, I found that the FLIP doesn't cover compatibility properly.
> >> > > Introducing two new methods in Catalog
> >> > > will break all third-party catalog implementations. On the other
> hand,
> >> > the
> >> > > planner should
> >> > > have a way to identify whether the Catalog supports bulk get.
> >> > >
> >> > > Best,
> >> > > Jark
> >> > >
> >> > > On Wed, 20 Jul 2022 at 16:43, Jingsong Li <jingsongl...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Thanks for your reply.
> >> > > >
> >> > > > - Consider bulkGetPartitionStatistics, partition statistics are
> >> > > > already in HiveMetastoreClient.listPartitions. But on our side, we
> >> > > > need Catalog.getPartitions first, and then
> >> > > > Catalog.bulkGetPartitionStatistics.
> >> > > >
> >> > > > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we
> >> need
> >> > > it.
> >> > > >
> >> > > > But to unify them, the current FLIP is acceptable.
> >> > > >
> >> > > > Best,
> >> > > > Jingsong
> >> > > >
> >> > > > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <j...@ververica.com>
> wrote:
> >> > > > >
> >> > > > > Hi Jingsong,
> >> > > > >
> >> > > > > Thanks for clarifying it. Are you suggesting a new method or
> >> changing
> >> > > the
> >> > > > > name of the methods described in the FLIP?
> >> > > > > Please see my answers and further questions below.
> >> > > > >
> >> > > > > Best regards,
> >> > > > > Jing
> >> > > > >
> >> > > > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <
> >> jingsongl...@gmail.com>
> >> > > > wrote:
> >> > > > >
> >> > > > > > Hi Jing,
> >> > > > > >
> >> > > > > > I understand that the statistics for partitions are currently
> >> only
> >> > > > > > used by Hive, so we can look at the Hive implementation:
> >> > > > > >
> >> > > > > > See HiveCatalog.getPartitionStatistics.
> >> > > > > > To get the statistics, we actually get them from the
> >> > > > > > org.apache.hadoop.hive.metastore.api.Partition object.
> >> > > > > >
> >> > > > >
> >> > > > > Correct. Both new methods will do the same thing but in bulk
> mode.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > According to HiveMetastore's API, partition-related operations
> >> > > > > > actually get the partition as well as the statistics
> >> information.
> >> > > > > >
> >> > > > >
> >> > > > > I am not really sure which API it is. Methods in
> >> HiveMetaStoreClient
> >> > > that
> >> > > > > are dealing with partitions will either return partitions or
> >> > > > statisticObjs,
> >> > > > > e.g. listPartitions(...) or getPartitionColumnStatistics(...)
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > So if the current partition statistics are just for Hive, can
> we
> >> > > > > > consider unifying it with Hive?
> >> > > > > >
> >> > > > >
> >> > > > > Yes, we are on the same page. This FLIP is trying to unify it
> with
> >> > > > > HiveMetaStoreClient.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > For example, in PushPartitionIntoTableSourceScanRule, just use
> >> > > > > > `listPartitionWithStats`, and adjust table statistics from
> >> > > partitions.
> >> > > > > >
> >> > > > >
> >> > > > > Does the bulkGetPartitionStatistics work in this case too? Or,
> do
> >> you
> >> > > > mean
> >> > > > > you need both partitions and related statistics returned by a
> new
> >> > > method
> >> > > > > called `listPartitionWithStats`?
> >> > > > >
> >> > > > >
> >> > > > > > Best,
> >> > > > > > Jingsong
> >> > > > > >
> >> > > > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <j...@ververica.com>
> >> > wrote:
> >> > > > > > >
> >> > > > > > > Thanks Jingsong for the suggestion.
> >> > > > > > >
> >> > > > > > > Do you mean using a different naming convention? There is a
> >> > thought
> >> > > > and
> >> > > > > > > description in the FLIP about using "list" or "bulkGet":
> >> > > > > > >
> >> > > > > > >    - bulkGetPartitionStatistics(...) has been chosen over
> >> > > > > > >    listPartitionStatistics(...), because, comparing to
> >> database
> >> > and
> >> > > > > > partition
> >> > > > > > >    that are static and can be listed, statistics are more
> >> dynamic
> >> > > and
> >> > > > > > will
> >> > > > > > >    need more computation logic to create, therefore using
> >> "get"
> >> > is
> >> > > > > > >    semantically more feasible than list. The "bulk" gives
> >> users
> >> > the
> >> > > > hint
> >> > > > > > that
> >> > > > > > >    this method will work in the bulk mode and return a
> >> collection
> >> > > of
> >> > > > > > instances.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > As a reference, we can see that no method in
> MetaStoreClient,
> >> > that
> >> > > > > > > calculates statistics, uses the "list" naming convention.
> >> > > > > > >
> >> > > > > > > Best regards,
> >> > > > > > > Jing
> >> > > > > > >
> >> > > > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <
> >> > > jingsongl...@gmail.com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Thanks for starting this discussion.
> >> > > > > > > >
> >> > > > > > > > Have we considered introducing a listPartitionWithStats()
> in
> >> > > > Catalog?
> >> > > > > > > >
> >> > > > > > > > Best,
> >> > > > > > > > Jingsong
> >> > > > > > > >
> >> > > > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <
> imj...@gmail.com>
> >> > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hi Jing,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for starting this discussion. The bulk fetch is a
> >> > great
> >> > > > > > > > improvement
> >> > > > > > > > > for the optimizer.
> >> > > > > > > > > The FLIP looks good to me.
> >> > > > > > > > >
> >> > > > > > > > > Best,
> >> > > > > > > > > Jark
> >> > > > > > > > >
> >> > > > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <
> j...@ververica.com>
> >> > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi devs,
> >> > > > > > > > > >
> >> > > > > > > > > > After having multiple discussions with Jark and
> >> Goldfrey,
> >> > I'd
> >> > > > like
> >> > > > > > to
> >> > > > > > > > start
> >> > > > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1],
> >> which
> >> > > will
> >> > > > > > > > > > significantly improve the performance by providing the
> >> bulk
> >> > > > fetch
> >> > > > > > > > > > capability for table and column statistics.
> >> > > > > > > > > >
> >> > > > > > > > > > Currently the statistics information about tables can
> >> only
> >> > be
> >> > > > > > fetched
> >> > > > > > > > from
> >> > > > > > > > > > the catalog by each given partition iteratively. Since
> >> > > getting
> >> > > > > > > > statistics
> >> > > > > > > > > > information from catalogs is a very heavy operation,
> in
> >> > order
> >> > > > to
> >> > > > > > > > improve
> >> > > > > > > > > > the query performance, we’d better provide
> >> functionality to
> >> > > > fetch
> >> > > > > > the
> >> > > > > > > > > > statistics information of a table for all given
> >> partitions
> >> > in
> >> > > > one
> >> > > > > > shot.
> >> > > > > > > > > >
> >> > > > > > > > > > Based on the manual performance test, for 2000
> >> partitions,
> >> > > the
> >> > > > cost
> >> > > > > > > > will be
> >> > > > > > > > > > improved from 10s to 2s. The improvement result is
> 500%.
> >> > > > > > > > > >
> >> > > > > > > > > > [1]
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> >> > > > > > > > > >
> >> > > > > > > > > > Best regards,
> >> > > > > > > > > > Jing
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Reply via email to