Thanks for writing this up, Ajantha! I think that we have all the upstream
pieces in place to work on this so it's great to have a proposal.

The proposal does a good job of summarizing the choices for how to store
the data, but I think there is also a decision that we need to make before
that: Should Iceberg require writers to maintain the partition stats?

If we do want writers to participate, then we may want to make choices that
are easier for writers. But I think that is going to be a challenge. Adding
requirements for writers would mean that we need to bump the spec version.
Otherwise, we aren't guaranteed that writers will update the files
correctly. I think I would prefer to take a lazy approach and not assume
that writers will keep the partition stats up to date, in which case we
need a way to know which parts of a table are newer than the most recent
stats.

Ryan

On Wed, Nov 23, 2022 at 4:36 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Thanks Piotr for taking a look at it.
> I have replied to all the comments in the document.
> I might need your support in standardising the existing `StatisticsFile`
> interface to adopt partition stats as mentioned in the design.
>
>
>
> *We do need more eyes on the design. Once I get approval for the design, I
> can start the implementation. *
> Thanks,
> Ajantha
>
> On Wed, Nov 23, 2022 at 3:28 PM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi Ajantha,
>>
>> this is very interesting document, thank you for your work on this!
>> I've added a few comments there.
>>
>> I have one high-level design comment so I thought it would be nicer to
>> everyone if I re-post it here
>>
>> is "partition" the right level of keeping the stats?
>>> We do this in Hive, but was it an accidental choice? or just the only
>>> thing that was possible to be implemented many years ago?
>>
>>
>>> Iceberg allows to have higher number of partitions compared to Hive,
>>> because it scales better. But that means partition-level may or may not be
>>> the right granularity.
>>
>>
>>> A self-optimizing system would gather stats on "per query unit" basis --
>>> for example if i partition by [ day x country ], but usually query by day,
>>> the days are the "query unit" and from stats perspective country can be
>>> ignored.
>>> Having more fine-grained partitions may lead to expensive planning time,
>>> so it's not theoretical problem.
>>
>>
>>> I am not saying we should implement all this logic right now, but I
>>> think we should decouple partitioning scheme from stats partitions, to
>>> allow  query engine to become smarter.
>>
>>
>>
>> cc @Alexander Jo <alex...@starburstdata.com>
>>
>> Best
>> PF
>>
>>
>>
>>
>> On Mon, Nov 14, 2022 at 12:47 PM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Hi Community,
>>> I did a proposal write-up for the partition stats in Iceberg.
>>> Please have a look and let me know what you think. I would like to work
>>> on it.
>>>
>>>
>>> https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
>>>
>>> Requirement background snippet from the above document.
>>>
>>>> For some query engines that use cost-based-optimizer instead or along
>>>> with rule-based-optimizer (like Dremio, Trino, etc), at the planning time,
>>>> it is good to know the partition level stats like total rows per
>>>> partition and total files per partition to take decisions for CBO (
>>>> like deciding on the join reordering and join type, identifying the
>>>> parallelism).
>>>> Currently, the only way to do this is to read the partition info from 
>>>> data_file
>>>> in manifest_entry of the manifest file and compute partition-level
>>>> statistics (the same thing that ‘partitions’ metadata table is doing [see
>>>> Appendix A
>>>> <https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit#heading=h.s8iywtu7x8m6>
>>>> ]).
>>>> Doing this on each query is expensive. Hence, this is a proposal for
>>>> computing and storing partition-level stats for Iceberg tables and using
>>>> them during queries.
>>>
>>>
>>>
>>> Thanks,
>>> Ajantha
>>>
>>

-- 
Ryan Blue
Tabular

Reply via email to