Hi Ryan,

> are you saying that you think the partition-level stats should not be
> required? I think that would be best.

I think there is some confusion here. Partition-level stats are required
(hence the proposal). The point of discussion was whether the writer always
writes them (with every append/delete/replace operation), or whether the
writer skips them and the user generates them later using DML like
"ANALYZE TABLE". I think we can have both options, with writer-side stats
writing controlled by a table property such as "write.stats.enabled".
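For example (just a sketch -- the property name here is the proposed one,
not an existing Iceberg property), a user could opt in per table through the
existing properties API:

    table.updateProperties()
        .set("write.stats.enabled", "true")  // proposed property, not in Iceberg yet
        .commit();

and writers would consult it before producing partition stats as part of a
commit.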
> I'm all for improving the interface for retrieving stats. It's a separate
> issue

Agree. Let us discuss it in a separate thread.

Thanks,
Ajantha

On Sat, Nov 26, 2022 at 12:12 AM Ryan Blue <b...@tabular.io> wrote:

> Ajantha, are you saying that you think the partition-level stats should
> not be required? I think that would be best.
>
> I'm all for improving the interface for retrieving stats. It's a separate
> issue, but I think that Iceberg should provide both access to the Puffin
> files and metadata as well as a higher-level interface for retrieving
> information like a column's NDV. Something like this:
>
> int ndv =
> table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");
>
> On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Hi Ryan,
>> Thanks a lot for the review and suggestions.
>>
>>> but I think there is also a decision that we need to make before that:
>>> Should Iceberg require writers to maintain the partition stats?
>>
>>> I think I would prefer to take a lazy approach and not assume that
>>> writers will keep the partition stats up to date,
>>
>>> in which case we need a way to know which parts of a table are newer
>>> than the most recent stats.
>>
>> This is a common problem for the existing table-level Puffin stats too,
>> not just for partition stats.
>> As mentioned in point 8) of the "integration with the current code"
>> section, I was planning to introduce a table property
>> "write.stats.enabled" with a default value of false.
>> And as per point 7), I was planning to introduce an "ANALYZE TABLE" or
>> "CALL procedure" SQL (maybe a table-level API too) to asynchronously
>> compute the stats on demand from the previous checkpoints.
>>
>> But currently, `TableMetadata` doesn't have a clean interface to provide
>> the statistics file for the current snapshot.
>> If stats are not present, we need another interface to provide the last
>> successful snapshot id for which stats were computed.
>> Also, there is some confusion around reusing the statistics file
>> (because the spec only has a computed snapshot id, not the referenced
>> snapshot id).
>> I am planning to open a PR for these interface updates this week (the
>> same things you suggested in the last Iceberg sync).
>> This should serve as a good foundation for lazy & incremental stats
>> computation.
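>>
>> To make the idea concrete, the kind of accessors I have in mind look
>> roughly like this (the names are only placeholders, nothing is final):
>>
>> // hypothetical additions, these do not exist in TableMetadata today
>> StatisticsFile stats = metadata.statisticsFileForSnapshot(currentSnapshotId);
>> // fallback when the current snapshot has no stats: the latest snapshot
>> // that does have stats, so engines can compute incrementally from there
>> long lastAnalyzedSnapshotId = metadata.lastSnapshotWithStatistics();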
>>
>> Thanks,
>> Ajantha
>>
>> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for writing this up, Ajantha! I think that we have all the
>>> upstream pieces in place to work on this, so it's great to have a
>>> proposal.
>>>
>>> The proposal does a good job of summarizing the choices for how to
>>> store the data, but I think there is also a decision that we need to
>>> make before that: Should Iceberg require writers to maintain the
>>> partition stats?
>>>
>>> If we do want writers to participate, then we may want to make choices
>>> that are easier for writers. But I think that is going to be a
>>> challenge. Adding requirements for writers would mean that we need to
>>> bump the spec version. Otherwise, we aren't guaranteed that writers
>>> will update the files correctly. I think I would prefer to take a lazy
>>> approach and not assume that writers will keep the partition stats up
>>> to date, in which case we need a way to know which parts of a table
>>> are newer than the most recent stats.
>>>
>>> Ryan
>>>
>>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Piotr for taking a look at it.
>>>> I have replied to all the comments in the document.
>>>> I might need your support in standardising the existing
>>>> `StatisticsFile` interface to adopt partition stats as mentioned in
>>>> the design.
>>>>
>>>> We do need more eyes on the design. Once I get approval for the
>>>> design, I can start the implementation.
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <
>>>> pi...@starburstdata.com> wrote:
>>>>
>>>>> Hi Ajantha,
>>>>>
>>>>> this is a very interesting document, thank you for your work on this!
>>>>> I've added a few comments there.
>>>>>
>>>>> I have one high-level design comment, so I thought it would be nicer
>>>>> to everyone if I re-post it here:
>>>>>
>>>>>> is "partition" the right level for keeping the stats?
>>>>>> We do this in Hive, but was it an accidental choice, or just the
>>>>>> only thing that was possible to implement many years ago?
>>>>>
>>>>>> Iceberg allows a higher number of partitions than Hive because it
>>>>>> scales better. But that means the partition level may or may not be
>>>>>> the right granularity.
>>>>>
>>>>>> A self-optimizing system would gather stats on a "per query unit"
>>>>>> basis -- for example, if I partition by [ day x country ] but
>>>>>> usually query by day, the days are the "query unit" and, from a
>>>>>> stats perspective, country can be ignored.
>>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>>> time, so it's not a theoretical problem.
>>>>>
>>>>>> I am not saying we should implement all this logic right now, but I
>>>>>> think we should decouple the partitioning scheme from the stats
>>>>>> partitions, to allow the query engine to become smarter.
>>>>>
>>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Community,
>>>>>> I did a proposal write-up for partition stats in Iceberg.
>>>>>> Please have a look and let me know what you think. I would like to
>>>>>> work on it.
>>>>>>
>>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>>
>>>>>> Requirement background snippet from the above document:
>>>>>>
>>>>>>> For some query engines that use a cost-based optimizer instead of
>>>>>>> or along with a rule-based optimizer (like Dremio, Trino, etc.),
>>>>>>> it is useful at planning time to know partition-level stats like
>>>>>>> total rows per partition and total files per partition, to make
>>>>>>> decisions in the CBO (like deciding on join reordering and join
>>>>>>> type, or identifying the parallelism).
>>>>>>> Currently, the only way to do this is to read the partition info
>>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>>> partition-level statistics (the same thing that the 'partitions'
>>>>>>> metadata table does [see Appendix A
>>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>>> ]).
>>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>>> for computing and storing partition-level stats for Iceberg tables
>>>>>>> and using them during queries.
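>>>>>>
>>>>>> To make that cost concrete, this is roughly what an engine has to
>>>>>> do today with the Java API (only a sketch: it assumes a recent 1.x
>>>>>> API, skips error handling, and lets the caller deal with
>>>>>> IOException):
>>>>>>
>>>>>> // aggregate row count and file count per partition by scanning manifests
>>>>>> Map<String, long[]> stats = new HashMap<>();  // partition -> {rows, files}
>>>>>> for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
>>>>>>   try (CloseableIterable<DataFile> files = ManifestFiles.read(manifest, table.io())) {
>>>>>>     for (DataFile file : files) {
>>>>>>       // keyed by the partition's string form only for brevity here
>>>>>>       long[] agg = stats.computeIfAbsent(file.partition().toString(), p -> new long[2]);
>>>>>>       agg[0] += file.recordCount();  // total rows in the partition
>>>>>>       agg[1] += 1;                   // total data files in the partition
>>>>>>     }
>>>>>>   }
>>>>>> }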
>>>>>>
>>>>>> Thanks,
>>>>>> Ajantha
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular