Ajantha, are you saying that the partition-level stats should not be required? I think that would be best.
I’m all for improving the interface for retrieving stats. It’s a separate issue, but I think that Iceberg should provide both access to the Puffin files and metadata, as well as a higher-level interface for retrieving information like a column’s NDV. Something like this:

    int ndv = table.findStat(Statistics.NDV).limitSnapshotDistance(3).forColumn("x");

On Thu, Nov 24, 2022 at 2:31 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi Ryan,
> Thanks a lot for the review and suggestions.
>
>> but I think there is also a decision that we need to make before that:
>> Should Iceberg require writers to maintain the partition stats?
>
>> I think I would prefer to take a lazy approach and not assume that
>> writers will keep the partition stats up to date,
>
>> in which case we need a way to know which parts of a table are newer
>> than the most recent stats.
>
> This is a common problem for the existing table-level Puffin stats too,
> not just for partition stats.
> As mentioned in point 8) of the "integration with the current code"
> section, I was planning to introduce a table property
> "write.stats.enabled" with a default value set to false.
> And as per point 7), I was planning to introduce an "ANALYZE table" or
> "CALL procedure" SQL (and maybe a table-level API too) to asynchronously
> compute the stats on demand from the previous checkpoints.
>
> But currently, `TableMetadata` doesn't have a clean interface to provide
> the statistics file for the current snapshot.
> If stats are not present, we need another interface to provide the last
> successful snapshot id for which stats were computed.
> Also, there is some confusion around reusing the statistics file
> (because the spec only has a computed snapshot id, not the referenced
> snapshot id).
> I am planning to open up a PR to handle these interface updates this
> week (the same things you suggested in the last Iceberg sync).
> This should serve as a good foundation for gaining insights into lazy &
> incremental stats computation.
>
> Thanks,
> Ajantha
>
> On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for writing this up, Ajantha! I think that we have all the
>> upstream pieces in place to work on this, so it's great to have a
>> proposal.
>>
>> The proposal does a good job of summarizing the choices for how to
>> store the data, but I think there is also a decision that we need to
>> make before that: Should Iceberg require writers to maintain the
>> partition stats?
>>
>> If we do want writers to participate, then we may want to make choices
>> that are easier for writers. But I think that is going to be a
>> challenge. Adding requirements for writers would mean that we need to
>> bump the spec version. Otherwise, we aren't guaranteed that writers
>> will update the files correctly. I think I would prefer to take a lazy
>> approach and not assume that writers will keep the partition stats up
>> to date, in which case we need a way to know which parts of a table are
>> newer than the most recent stats.
>>
>> Ryan
>>
>> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Thanks, Piotr, for taking a look at it.
>>> I have replied to all the comments in the document.
>>> I might need your support in standardising the existing
>>> `StatisticsFile` interface to adopt partition stats as mentioned in
>>> the design.
>>>
>>> We do need more eyes on the design. Once I get approval for the
>>> design, I can start the implementation.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen
>>> <pi...@starburstdata.com> wrote:
>>>
>>>> Hi Ajantha,
>>>>
>>>> this is a very interesting document, thank you for your work on
>>>> this! I've added a few comments there.
>>>>
>>>> I have one high-level design comment, so I thought it would be nicer
>>>> to everyone if I re-post it here:
>>>>
>>>>> is "partition" the right level of keeping the stats?
>>>>> We do this in Hive, but was it an accidental choice, or just the
>>>>> only thing that was possible to implement many years ago?
>>>>>
>>>>> Iceberg allows a higher number of partitions compared to Hive,
>>>>> because it scales better. But that means partition level may or may
>>>>> not be the right granularity.
>>>>>
>>>>> A self-optimizing system would gather stats on a "per query unit"
>>>>> basis -- for example, if I partition by [ day x country ] but
>>>>> usually query by day, the days are the "query unit" and, from a
>>>>> stats perspective, country can be ignored.
>>>>> Having more fine-grained partitions may lead to expensive planning
>>>>> time, so it's not a theoretical problem.
>>>>>
>>>>> I am not saying we should implement all this logic right now, but I
>>>>> think we should decouple the partitioning scheme from stats
>>>>> partitions, to allow the query engine to become smarter.
>>>>
>>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>>
>>>> Best
>>>> PF
>>>>
>>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat
>>>> <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Hi Community,
>>>>> I did a proposal write-up for partition stats in Iceberg.
>>>>> Please have a look and let me know what you think. I would like to
>>>>> work on it.
>>>>>
>>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>>
>>>>> Requirement background snippet from the above document:
>>>>>
>>>>>> For some query engines that use a cost-based optimizer instead of,
>>>>>> or along with, a rule-based optimizer (like Dremio, Trino, etc.),
>>>>>> it is good to know at planning time the partition-level stats,
>>>>>> like total rows per partition and total files per partition, to
>>>>>> make decisions for the CBO (like deciding on the join reordering
>>>>>> and join type, and identifying the parallelism).
>>>>>> Currently, the only way to do this is to read the partition info
>>>>>> from data_file in manifest_entry of the manifest file and compute
>>>>>> partition-level statistics (the same thing that the ‘partitions’
>>>>>> metadata table is doing [see Appendix A
>>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>>> ]).
>>>>>> Doing this on each query is expensive. Hence, this is a proposal
>>>>>> for computing and storing partition-level stats for Iceberg tables
>>>>>> and using them during queries.
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular
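The aggregation the quoted proposal snippet describes (walking the data_file entries of the manifest files and rolling them up into total rows and total files per partition, as the ‘partitions’ metadata table does on every query) can be sketched as a minimal self-contained model. This is an illustration only: the DataFileEntry and PartitionStats types and the aggregate method below are hypothetical stand-ins, not Iceberg API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-partition stats aggregation. DataFileEntry and
// PartitionStats are hypothetical stand-ins for the real manifest_entry /
// data_file structures, not actual Iceberg classes.
public class PartitionStatsSketch {

    // One data_file entry: its partition tuple (a string here for
    // simplicity) and its record count.
    record DataFileEntry(String partition, long recordCount) {}

    static class PartitionStats {
        long totalRows;
        long fileCount;
    }

    // One pass over the file entries, grouping by partition value.
    static Map<String, PartitionStats> aggregate(List<DataFileEntry> entries) {
        Map<String, PartitionStats> stats = new LinkedHashMap<>();
        for (DataFileEntry e : entries) {
            PartitionStats s =
                stats.computeIfAbsent(e.partition(), k -> new PartitionStats());
            s.totalRows += e.recordCount();
            s.fileCount += 1;
        }
        return stats;
    }

    public static void main(String[] args) {
        Map<String, PartitionStats> stats = aggregate(List.of(
            new DataFileEntry("day=2022-11-23", 1000),
            new DataFileEntry("day=2022-11-23", 500),
            new DataFileEntry("day=2022-11-24", 2000)));
        // prints:
        // day=2022-11-23: rows=1500, files=2
        // day=2022-11-24: rows=2000, files=1
        stats.forEach((p, s) -> System.out.println(
            p + ": rows=" + s.totalRows + ", files=" + s.fileCount));
    }
}
```

A stored partition-stats file would persist the output of one such pass, instead of the engine recomputing it during planning for every query.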