Re: [Proposal] Partition stats in Iceberg

Ajantha Bhat Mon, 23 Jan 2023 08:56:24 -0800

Hi All,

In the same design document (
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
),
I have added a section called
*"Design for approval". *It also contains a potential PR breakdown for the
phase 1 implementation and future development scope.
Please take a look and please vote if you think the design is ok.


Thanks,
Ajantha



On Mon, Dec 5, 2022 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> A big thanks to everyone who was involved in the review and the
> discussions so far.
>
> Please find the meeting minutes from the last iceberg sync about the
> partition stats.
>     a. Writers should not write the partition stats or any stats as of
> now.
>         Because it requires bumping the spec to V3. (We can have it as
> part of the v3 spec later on. But not anytime soon).
>     b. So, there can be an async way of generating the stats like ANALYZE
> table or call procedure.
>         Which will compute the stats till the current snapshot and store
> it as a partition stats file.
>     c. In phase 1, partition stats will just store the row_count and
> file_count per partition value as mentioned in the design document.
>         Later it can be enhanced to store puffin file location and other
> metrics per partition value.
>     d. These tuples are stored in a single sorted Avro/parquet file (we
> need to finalize this).
>     e. Each time "analyze table" will rewrite the whole stats file as
> keeping multiple delta files will just make the read path messy.
>         Also, even with million rows, it can be of a few MB size.
>         Once the writers start writing the stats (V3 spec), we can revisit
> storing as the delta files if there are any performance issues.
>
> The next immediate plan is to
>     a. Get these PRs merged (open points in existing StatictisFile
> interface added during Puffin)
>         #6267 <https://github.com/apache/iceberg/pull/6267>, #6090
> <https://github.com/apache/iceberg/pull/6090>, #6091
> <https://github.com/apache/iceberg/pull/6091>
>     b. Figure out how to give accurate stats with row-level deletes and
> how to mask dropped partition values from stats.
>         https://github.com/apache/iceberg/issues/6042
>     c. Standardize the `StatictisFile` interface to hold the parquet/Avro
> stats file (instead of always assuming it as a Puffin file)
>         and introduce a `StatisticsType` enum.
>     d. Conclude the storage format and get approval for the design.
>
> I will wait another week or two for some more people to take a look at the
> document
> before jumping into the implementation.
>
> Thanks,
> Ajantha.
>
> On Sat, Nov 26, 2022 at 8:25 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Hi Ryan,
>>
>> are you saying that you think the partition-level stats should not be
>>> required? I think that would be best.
>>
>> I think there is some confusion here. Partition-level stats are
>> required (hence the proposal).
>> But does the writer always write it? (with the append/delete/replace
>> operation)
>> or writer skips writing it and then the user generates it using DML like
>> "Analyze table" was the point of discussion.
>> I think we can have both options with the writer stats writing controlled
>> by a table property "write.stats.enabled"
>>
>> I’m all for improving the interface for retrieving stats. It’s a separate
>>> issue
>>
>> Agree. Let us discuss it in a separate thread.
>>
>> Thanks,
>> Ajantha
>>
>> On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Ajantha, are you saying that you think the partition-level stats should
>>> not be required? I think that would be best.
>>>
>>> I’m all for improving the interface for retrieving stats. It’s a
>>> separate issue, but I think that Iceberg should provide both access to the
>>> Puffin files and metadata as well as a higher-level interface for
>>> retrieving information like a column’s NDV. Something like this:
>>>
>>> int ndv = 
>>> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>>>
>>>
>>> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>> Thanks a lot for the review and suggestions.
>>>>
>>>> but I think there is also a decision that we need to make before that:
>>>>> Should Iceberg require writers to maintain the partition stats?
>>>>
>>>> I think I would prefer to take a lazy approach and not assume that
>>>>> writers will keep the partition stats up to date,
>>>>
>>>> in which case we need a way to know which parts of a table are newer
>>>>> than the most recent stats.
>>>>
>>>>
>>>> This is a common problem for existing table-level puffin stats too.
>>>> Not just for partition stats.
>>>> As mentioned in the "integration with the current code" section point
>>>> 8),
>>>> I was planning to introduce a table property "write.stats.enabled" with
>>>> a default value set to false.
>>>> And as per point 7), I was planning to introduce an "ANALYZE table" or
>>>> "CALL procedure" SQL (maybe table-level API too) to asynchronously
>>>> compute the stats on demand from the previous checkpoints.
>>>>
>>>> But currently, `TableMetadata` doesn't have a clean Interface to
>>>> provide the statistics file for the current snapshot.
>>>> If stats are not present, we need another interface to provide a last
>>>> successful snapshot id for which stats was computed.
>>>> Also, there is some confusion around reusing the statistics file
>>>> (because the spec only has a computed snapshot id, not the referenced
>>>> snapshot id).
>>>> I am planning to open up a PR to handle these interface updates
>>>> this week. (same things as you suggested in the last Iceberg sync).
>>>> This should serve as a good foundation to get insights for lazy &
>>>> incremental stats computing.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Thanks for writing this up, Ajantha! I think that we have all the
>>>>> upstream pieces in place to work on this so it's great to have a proposal.
>>>>>
>>>>> The proposal does a good job of summarizing the choices for how to
>>>>> store the data, but I think there is also a decision that we need to make
>>>>> before that: Should Iceberg require writers to maintain the partition 
>>>>> stats?
>>>>>
>>>>> If we do want writers to participate, then we may want to make choices
>>>>> that are easier for writers. But I think that is going to be a challenge.
>>>>> Adding requirements for writers would mean that we need to bump the spec
>>>>> version. Otherwise, we aren't guaranteed that writers will update the 
>>>>> files
>>>>> correctly. I think I would prefer to take a lazy approach and not assume
>>>>> that writers will keep the partition stats up to date, in which case we
>>>>> need a way to know which parts of a table are newer than the most recent
>>>>> stats.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Piotr for taking a look at it.
>>>>>> I have replied to all the comments in the document.
>>>>>> I might need your support in standardising the existing
>>>>>> `StatisticsFile` interface to adopt partition stats as mentioned in the
>>>>>> design.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *We do need more eyes on the design. Once I get approval for the
>>>>>> design, I can start the implementation. *
>>>>>> Thanks,
>>>>>> Ajantha
>>>>>>
>>>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <
>>>>>> pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi Ajantha,
>>>>>>>
>>>>>>> this is very interesting document, thank you for your work on this!
>>>>>>> I've added a few comments there.
>>>>>>>
>>>>>>> I have one high-level design comment so I thought it would be nicer
>>>>>>> to everyone if I re-post it here
>>>>>>>
>>>>>>> is "partition" the right level of keeping the stats?
>>>>>>>> We do this in Hive, but was it an accidental choice? or just the
>>>>>>>> only thing that was possible to be implemented many years ago?
>>>>>>>
>>>>>>>
>>>>>>>> Iceberg allows to have higher number of partitions compared to
>>>>>>>> Hive, because it scales better. But that means partition-level may or 
>>>>>>>> may
>>>>>>>> not be the right granularity.
>>>>>>>
>>>>>>>
>>>>>>>> A self-optimizing system would gather stats on "per query unit"
>>>>>>>> basis -- for example if i partition by [ day x country ], but usually 
>>>>>>>> query
>>>>>>>> by day, the days are the "query unit" and from stats perspective 
>>>>>>>> country
>>>>>>>> can be ignored.
>>>>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>>>>> time, so it's not theoretical problem.
>>>>>>>
>>>>>>>
>>>>>>>> I am not saying we should implement all this logic right now, but I
>>>>>>>> think we should decouple partitioning scheme from stats partitions, to
>>>>>>>> allow  query engine to become smarter.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Community,
>>>>>>>> I did a proposal write-up for the partition stats in Iceberg.
>>>>>>>> Please have a look and let me know what you think. I would like to
>>>>>>>> work on it.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>>>
>>>>>>>> Requirement background snippet from the above document.
>>>>>>>>
>>>>>>>>> For some query engines that use cost-based-optimizer instead or
>>>>>>>>> along with rule-based-optimizer (like Dremio, Trino, etc), at the 
>>>>>>>>> planning
>>>>>>>>> time,
>>>>>>>>> it is good to know the partition level stats like total rows per
>>>>>>>>> partition and total files per partition to take decisions for CBO (
>>>>>>>>> like deciding on the join reordering and join type, identifying
>>>>>>>>> the parallelism).
>>>>>>>>> Currently, the only way to do this is to read the partition info
>>>>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>>>>> partition-level statistics (the same thing that ‘partitions’ metadata 
>>>>>>>>> table
>>>>>>>>> is doing [see Appendix A
>>>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>>>>> ]).
>>>>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>>>>> for computing and storing partition-level stats for Iceberg tables and
>>>>>>>>> using them during queries.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ajantha
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

Re: [Proposal] Partition stats in Iceberg

Reply via email to