I want to float this back up. I think this is a really good idea for
cross-engine support. I don't think we have to tie this to any specific spec
version, since these are just recommendations, so we can do this at any
time.
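
To make the cross-engine concern concrete, here is a minimal sketch of how a
shared list of documented keys could be used to spot naming drift between
writers. The helper name `undocumented_keys` and the two sample summaries are
hypothetical, and the key set is only an illustrative subset modeled on the
constants in the Java SnapshotSummary class linked in the original proposal
below; it is not the proposed table itself.

```python
# Illustrative subset of optional summary keys (assumption: modeled on the
# Java SnapshotSummary constants linked in the thread; not a complete list).
DOCUMENTED_KEYS = {
    "added-data-files",
    "deleted-data-files",
    "total-data-files",
    "added-records",
    "deleted-records",
    "total-records",
}


def undocumented_keys(summary: dict[str, str]) -> set[str]:
    """Return keys that are neither 'operation' nor in DOCUMENTED_KEYS.

    Hypothetical helper for illustration only.
    """
    if "operation" not in summary:
        # 'operation' is the one required field of a snapshot summary.
        raise ValueError("snapshot summary must contain 'operation'")
    return set(summary) - DOCUMENTED_KEYS - {"operation"}


# Two hypothetical writers recording the same commit with different spellings.
summary_a = {"operation": "append", "added-data-files": "3", "added-records": "1200"}
summary_b = {"operation": "append", "added_data_files": "3", "added-records": "1200"}

print(undocumented_keys(summary_a))
print(undocumented_keys(summary_b))
```

With a documented table, the underscore variant in the second summary shows
up as an unknown key instead of silently diverging from other engines.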

On Wed, Nov 27, 2024 at 1:31 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> This makes sense to me generally. I've tried a few times to search the
> spec for a list of possible snapshot summary properties, and was a bit
> surprised not to find them there. So I think this would be a nice addition.
>
> I'm curious if there's any historical reason it's not been included in the
> spec.
>
> Thanks
> Szehon
>
> On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> Thanks for driving this Honah!
>>
>> It's important to have a consistent naming scheme so that we don't need
>> to worry about edge cases when using multiple engines, and possibly have to
>> deal with migrations.
>>
>> Also, since users can store arbitrary key/value pairs in the summary
>> property, it's good to document the currently used properties to avoid
>> collision.
>>
>> I like the proposal to document all properties in a "snapshot summary"
>> table. This will provide a centralized place to view all possible key/value
>> pairs, similar to how FileIO configuration is handled in iceberg-python
>> <https://py.iceberg.apache.org/configuration/#s3>. Other
>> implementations can use this table as a reference.
>>
>> > This approach offers flexibility, as new fields can be added through
>> documentation updates without requiring specification changes.
>> This will save a lot of effort since specification changes require
>> greater scrutiny.
>>
>> > summary details would not be located near the Snapshot section, which
>> explains the summary field.
>> We can link the table to the Snapshot section.
>>
>>
>> Would love to hear others' thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> On Tue, Nov 26, 2024 at 2:50 PM Honah J. <hon...@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I’d like to propose an addition to the table specification to document
>>> optional fields in the snapshot summary.
>>>
>>> Currently, the snapshot summary includes a required operation field and
>>> various optional fields. While these optional fields—such as metrics and
>>> partition-level summaries—are supported by Java
>>> <https://github.com/apache/iceberg/blob/549674b3fc0cdb18d6cad3e2d6320236fba8c562/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L32-L64>
>>> and Python
>>> <https://github.com/HonahX/iceberg-python/blob/45d611fe351f6f3847bf329aa053d890d810e2b6/pyiceberg/table/snapshots.py#L36-L60>
>>> implementations, they are not officially documented. This creates risks of
>>> inconsistency as other implementations and engines adopt and interact with
>>> these fields.
>>>
>>> I propose adding a new section to the table specification to document
>>> these optional fields, ensuring consistent naming conventions and reducing
>>> ambiguity across implementations. While this is the primary proposal, it
>>> may also be worth discussing whether documenting these fields separately in
>>> Docs/Table would provide additional flexibility for future updates.
>>>
>>> I’d love to hear your thoughts, suggestions, or concerns about this
>>> proposal.
>>>
>>> Looking forward to the discussion!
>>>
>>> Links
>>>
>>>    - GitHub tracking issue:
>>>    https://github.com/apache/iceberg/issues/11659
>>>    - Proposal:
>>>    https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
>>>    - PR: https://github.com/apache/iceberg/pull/11660
>>>
>>>
>>> Best regards,
>>> Honah
>>>
>>
