Re: [Proposal] Partition stats in Iceberg

Ajantha Bhat Tue, 07 Feb 2023 07:02:22 -0800

Hi Hrishi and Mayur, thanks for the inputs.

To get things moving I have frozen the scope of phase 1 implementation.
(Recently added the delete file stats to phase 1 too). You can find the
scope in the "Design for approval" section of the design doc.


That said, once we have phase 1 implemented, It should be very easy to add
a few more fields to it like the latest sequence number or last modified
time per partition.
I will be opening up the discussion about phase 2 schema again once phase 1
implementation is done.

Thanks,
Ajantha

On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> +1 for the initiative.
>
>
>
> We’ve been exploring options for storing last-modified-time per partition.
> It an important building block for data pipelines – especially if there is
> a dependency between jobs with strong consistency requirements.
>
>
>
> Is partition stats a good place for storing last-modified-time per
> partition?
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Ajantha Bhat <ajanthab...@gmail.com>
> *Sent:* Monday, January 23, 2023 11:56 AM
> *To:* dev@iceberg.apache.org
> *Subject:* Re: [Proposal] Partition stats in Iceberg
>
>
>
> Hi All,
>
> In the same design document (
> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
> ),
> I have added a section called
> *"Design for approval".  *It also contains a potential PR breakdown for
> the phase 1 implementation and future development scope.
> Please take a look and please vote if you think the design is ok.
>
> Thanks,
> Ajantha
>
>
>
> On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
> A big thanks to everyone who was involved in the review and the
> discussions so far.
>
> Please find the meeting minutes from the last iceberg sync about the
> partition stats.
>     a. Writers should not write the partition stats or any stats as of
> now.
>         Because it requires bumping the spec to V3. (We can have it as
> part of the v3 spec later on. But not anytime soon).
>     b. So, there can be an async way of generating the stats like ANALYZE
> table or call procedure.
>         Which will compute the stats till the current snapshot and store
> it as a partition stats file.
>     c. In phase 1, partition stats will just store the row_count and
> file_count per partition value as mentioned in the design document.
>         Later it can be enhanced to store puffin file location and other
> metrics per partition value.
>     d. These tuples are stored in a single sorted Avro/parquet file (we
> need to finalize this).
>     e. Each time "analyze table" will rewrite the whole stats file as
> keeping multiple delta files will just make the read path messy.
>         Also, even with million rows, it can be of a few MB size.
>         Once the writers start writing the stats (V3 spec), we can revisit
> storing as the delta files if there are any performance issues.
>
> The next immediate plan is to
>     a. Get these PRs merged (open points in existing StatictisFile
> interface added during Puffin)
>         #6267 <https://github.com/apache/iceberg/pull/6267>, #6090
> <https://github.com/apache/iceberg/pull/6090>, #6091
> <https://github.com/apache/iceberg/pull/6091>
>     b. Figure out how to give accurate stats with row-level deletes and
> how to mask dropped partition values from stats.
>         https://github.com/apache/iceberg/issues/6042
>     c. Standardize the `StatictisFile` interface to hold the parquet/Avro
> stats file (instead of always assuming it as a Puffin file)
>         and introduce a `StatisticsType` enum.
>     d. Conclude the storage format and get approval for the design.
>
> I will wait another week or two for some more people to take a look at the
> document
>
> before jumping into the implementation.
>
> Thanks,
> Ajantha.
>
>
>
> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
> Hi Ryan,
>
> are you saying that you think the partition-level stats should not be
> required? I think that would be best.
>
> I think there is some confusion here. Partition-level stats are
> required (hence the proposal).
> But does the writer always write it? (with the append/delete/replace
> operation)
> or writer skips writing it and then the user generates it using DML like
> "Analyze table" was the point of discussion.
> I think we can have both options with the writer stats writing controlled
> by a table property "write.stats.enabled"
>
>
>
> I’m all for improving the interface for retrieving stats. It’s a separate
> issue
>
> Agree. Let us discuss it in a separate thread.
>
> Thanks,
> Ajantha
>
>
>
> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>
> Ajantha, are you saying that you think the partition-level stats should
> not be required? I think that would be best.
>
> I’m all for improving the interface for retrieving stats. It’s a separate
> issue, but I think that Iceberg should provide both access to the Puffin
> files and metadata as well as a higher-level interface for retrieving
> information like a column’s NDV. Something like this:
>
> int ndv = 
> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>
>
>
> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
> Hi Ryan,
> Thanks a lot for the review and suggestions.
>
> but I think there is also a decision that we need to make before that:
> Should Iceberg require writers to maintain the partition stats?
>
> I think I would prefer to take a lazy approach and not assume that writers
> will keep the partition stats up to date,
>
> in which case we need a way to know which parts of a table are newer than
> the most recent stats.
>
>
>
> This is a common problem for existing table-level puffin stats too.  Not
> just for partition stats.
> As mentioned in the "integration with the current code" section point 8),
> I was planning to introduce a table property "write.stats.enabled" with a
> default value set to false.
>
> And as per point 7), I was planning to introduce an "ANALYZE table" or
> "CALL procedure" SQL (maybe table-level API too) to asynchronously
> compute the stats on demand from the previous checkpoints.
>
> But currently, `TableMetadata` doesn't have a clean Interface to provide
> the statistics file for the current snapshot.
> If stats are not present, we need another interface to provide a last
> successful snapshot id for which stats was computed.
> Also, there is some confusion around reusing the statistics file (because
> the spec only has a computed snapshot id, not the referenced snapshot id).
> I am planning to open up a PR to handle these interface updates this week.
> (same things as you suggested in the last Iceberg sync).
> This should serve as a good foundation to get insights for lazy &
> incremental stats computing.
>
>
> Thanks,
> Ajantha
>
>
>
> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>
> Thanks for writing this up, Ajantha! I think that we have all the upstream
> pieces in place to work on this so it's great to have a proposal.
>
>
>
> The proposal does a good job of summarizing the choices for how to store
> the data, but I think there is also a decision that we need to make before
> that: Should Iceberg require writers to maintain the partition stats?
>
>
>
> If we do want writers to participate, then we may want to make choices
> that are easier for writers. But I think that is going to be a challenge.
> Adding requirements for writers would mean that we need to bump the spec
> version. Otherwise, we aren't guaranteed that writers will update the files
> correctly. I think I would prefer to take a lazy approach and not assume
> that writers will keep the partition stats up to date, in which case we
> need a way to know which parts of a table are newer than the most recent
> stats.
>
>
>
> Ryan
>
>
>
> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
> Thanks Piotr for taking a look at it.
> I have replied to all the comments in the document.
> I might need your support in standardising the existing `StatisticsFile`
> interface to adopt partition stats as mentioned in the design.
>
>
>
> *We do need more eyes on the design. Once I get approval for the design, I
> can start the implementation.  *
> Thanks,
> Ajantha
>
>
>
> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
> Hi Ajantha,
>
>
>
> this is very interesting document, thank you for your work on this!
>
> I've added a few comments there.
>
>
>
> I have one high-level design comment so I thought it would be nicer to
> everyone if I re-post it here
>
>
>
> is "partition" the right level of keeping the stats?
> We do this in Hive, but was it an accidental choice? or just the only
> thing that was possible to be implemented many years ago?
>
>
>
> Iceberg allows to have higher number of partitions compared to Hive,
> because it scales better. But that means partition-level may or may not be
> the right granularity.
>
>
> A self-optimizing system would gather stats on "per query unit" basis --
> for example if i partition by [ day x country ], but usually query by day,
> the days are the "query unit" and from stats perspective country can be
> ignored.
> Having more fine-grained partitions may lead to expensive planning time,
> so it's not theoretical problem.
>
>
> I am not saying we should implement all this logic right now, but I think
> we should decouple partitioning scheme from stats partitions, to allow
>  query engine to become smarter.
>
>
>
>
>
> cc @Alexander Jo <alex...@starburstdata.com>
>
>
>
> Best
>
> PF
>
>
>
>
>
>
>
>
>
> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
> Hi Community,
> I did a proposal write-up for the partition stats in Iceberg.
> Please have a look and let me know what you think. I would like to work on
> it.
>
>
> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>
> Requirement background snippet from the above document.
>
> For some query engines that use cost-based-optimizer instead or along with
> rule-based-optimizer (like Dremio, Trino, etc), at the planning time,
> it is good to know the partition level stats like total rows per partition
> and total files per partition to take decisions for CBO (
> like deciding on the join reordering and join type, identifying the
> parallelism).
> Currently, the only way to do this is to read the partition info from 
> data_file
> in manifest_entry of the manifest file and compute partition-level
> statistics (the same thing that ‘partitions’ metadata table is doing *[see
> **Appendix A*
> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
> *]*).
> Doing this on each query is expensive. *Hence, this is a proposal for
> computing and storing partition-level stats for Iceberg tables and using
> them during queries.*
>
>
>
> Thanks,
> Ajantha
>
>
>
>
> --
>
> Ryan Blue
>
> Tabular
>
>
>
>
> --
>
> Ryan Blue
>
> Tabular
>
>

Re: [Proposal] Partition stats in Iceberg

Reply via email to