Hi Ryan,
Thanks a lot for the review and suggestions.

> but I think there is also a decision that we need to make before that:
> Should Iceberg require writers to maintain the partition stats?

> I think I would prefer to take a lazy approach and not assume that writers
> will keep the partition stats up to date, in which case we need a way to
> know which parts of a table are newer than the most recent stats.


This is a common problem for the existing table-level Puffin stats too, not
just for partition stats.
As mentioned in point 8 of the "integration with the current code" section,
I was planning to introduce a table property "write.stats.enabled" with a
default value of false.
And as per point 7, I was planning to introduce an "ANALYZE table" or
"CALL procedure" SQL command (maybe a table-level API too) to asynchronously
compute the stats on demand from the previous checkpoints.

But currently, `TableMetadata` doesn't have a clean interface to provide
the statistics file for the current snapshot.
If stats are not present, we need another interface to provide the last
successful snapshot id for which stats were computed.
Also, there is some confusion around reusing the statistics file (because
the spec only has a computed snapshot id, not the referenced snapshot id).
I am planning to open a PR to handle these interface updates this week
(the same things you suggested in the last Iceberg sync).
This should serve as a good foundation for gaining insights into lazy and
incremental stats computation.
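
As a rough illustration of the lookup described above (a hypothetical sketch
in Python, not the Iceberg API): `Snapshot`, `latest_snapshot_with_stats` and
`snapshots_newer_than_stats` are made-up names, and the stats files are
assumed to be keyed by the snapshot id they were computed for.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not the Iceberg class)."""
    snapshot_id: int
    parent_id: Optional[int]  # None for the first snapshot of the table


def latest_snapshot_with_stats(
    current_id: int,
    snapshots: Dict[int, Snapshot],
    stats_by_snapshot: Dict[int, str],
) -> Optional[int]:
    """Walk back from the current snapshot through its ancestors and return
    the first snapshot id that has a stats file, or None if there is none."""
    snapshot_id: Optional[int] = current_id
    while snapshot_id is not None:
        if snapshot_id in stats_by_snapshot:
            return snapshot_id
        snapshot_id = snapshots[snapshot_id].parent_id
    return None


def snapshots_newer_than_stats(
    current_id: int,
    snapshots: Dict[int, Snapshot],
    stats_by_snapshot: Dict[int, str],
) -> List[int]:
    """Return the snapshot ids (newest first) committed after the most recent
    stats; these are the ones a lazy, incremental job would still need to read."""
    newer: List[int] = []
    snapshot_id: Optional[int] = current_id
    while snapshot_id is not None and snapshot_id not in stats_by_snapshot:
        newer.append(snapshot_id)
        snapshot_id = snapshots[snapshot_id].parent_id
    return newer
```

With three linear snapshots 1 <- 2 <- 3 and stats computed only for snapshot 1,
the first function returns 1 and the second returns [3, 2], i.e. the part of
the history that is newer than the most recent stats.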

Thanks,
Ajantha

On Thu, Nov 24, 2022 at 12:50 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for writing this up, Ajantha! I think that we have all the upstream
> pieces in place to work on this so it's great to have a proposal.
>
> The proposal does a good job of summarizing the choices for how to store
> the data, but I think there is also a decision that we need to make before
> that: Should Iceberg require writers to maintain the partition stats?
>
> If we do want writers to participate, then we may want to make choices
> that are easier for writers. But I think that is going to be a challenge.
> Adding requirements for writers would mean that we need to bump the spec
> version. Otherwise, we aren't guaranteed that writers will update the files
> correctly. I think I would prefer to take a lazy approach and not assume
> that writers will keep the partition stats up to date, in which case we
> need a way to know which parts of a table are newer than the most recent
> stats.
>
> Ryan
>
> On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Thanks Piotr for taking a look at it.
>> I have replied to all the comments in the document.
>> I might need your support in standardising the existing `StatisticsFile`
>> interface to adopt partition stats as mentioned in the design.
>>
>> *We do need more eyes on the design. Once I get approval for the design,
>> I can start the implementation.*
>> Thanks,
>> Ajantha
>>
>> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi Ajantha,
>>>
>>> this is a very interesting document, thank you for your work on this!
>>> I've added a few comments there.
>>>
>>> I have one high-level design comment, so I thought it would be nicer for
>>> everyone if I re-posted it here:
>>>
>>>> Is "partition" the right level for keeping the stats?
>>>> We do this in Hive, but was it an accidental choice? Or just the only
>>>> thing that was possible to implement many years ago?
>>>>
>>>> Iceberg allows a higher number of partitions than Hive,
>>>> because it scales better. But that means partition level may or may not be
>>>> the right granularity.
>>>>
>>>> A self-optimizing system would gather stats on a "per query unit" basis
>>>> -- for example, if I partition by [ day x country ] but usually query by
>>>> day, the days are the "query unit" and, from a stats perspective, country
>>>> can be ignored.
>>>> Having more fine-grained partitions may lead to expensive planning
>>>> time, so it's not a theoretical problem.
>>>>
>>>> I am not saying we should implement all this logic right now, but I
>>>> think we should decouple the partitioning scheme from stats partitions to
>>>> allow the query engine to become smarter.
>>>
>>>
>>> cc @Alexander Jo <alex...@starburstdata.com>
>>>
>>> Best
>>> PF
>>>
>>>
>>>
>>>
>>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Community,
>>>> I did a proposal write-up for the partition stats in Iceberg.
>>>> Please have a look and let me know what you think. I would like to work
>>>> on it.
>>>>
>>>>
>>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>>
>>>> Requirement background snippet from the above document.
>>>>
>>>>> For some query engines that use a cost-based optimizer instead of or along
>>>>> with a rule-based optimizer (like Dremio, Trino, etc.), at planning time
>>>>> it is good to know partition-level stats, like total rows per
>>>>> partition and total files per partition, to make decisions for the CBO
>>>>> (like deciding on join reordering and join type, or identifying the
>>>>> parallelism).
>>>>> Currently, the only way to do this is to read the partition info from
>>>>> data_file in manifest_entry of the manifest file and compute partition-level
>>>>> statistics (the same thing that the ‘partitions’ metadata table is doing [see
>>>>> Appendix A
>>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>>> ]).
>>>>> Doing this on each query is expensive. Hence, this is a proposal for
>>>>> computing and storing partition-level stats for Iceberg tables and using
>>>>> them during queries.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Ajantha
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>
