I want to float this back up. I think this is a really good idea for cross-engine support. Since these are just recommendations, I don't think we have to tie this to any specific spec version, so we can do this at any time.
On Wed, Nov 27, 2024 at 1:31 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> This makes sense to me generally, I've tried a few times to search in the
> spec to find a list of possible snapshot summary properties, and was a bit
> surprised to not find them there. So I think this would be a nice addition.
>
> I'm curious if there's any historical reason it's not been included in the
> spec.
>
> Thanks
> Szehon
>
> On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> Thanks for driving this Honah!
>>
>> It's important to have a consistent naming scheme so that we don't need
>> to worry about edge cases when using multiple engines, and possibly have to
>> deal with migrations.
>>
>> Also, since users can store arbitrary key/value pairs in the summary
>> property, it's good to document the currently used properties to avoid
>> collision.
>>
>> I like the proposal to document all properties in a "snapshot summary"
>> table, this will ensure a centralized place to view all possible key/value
>> pairs, similar to how FileIO configuration is handled in iceberg-python
>> <https://py.iceberg.apache.org/configuration/#s3>. Other
>> implementations can use this table as a reference.
>>
>> > This approach offers flexibility, as new fields can be added through
>> > documentation updates without requiring specification changes.
>> This will save a lot of effort since specification changes require
>> greater scrutiny.
>>
>> > summary details would not be located near the Snapshot section, which
>> > explains the summary field.
>> We can link the table to the Snapshot section.
>>
>>
>> Would love to hear others' thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> On Tue, Nov 26, 2024 at 2:50 PM Honah J. <hon...@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I’d like to propose an addition to the table specification to document
>>> optional fields in the snapshot summary.
>>>
>>> Currently, the snapshot summary includes a required operation field and
>>> various optional fields. While these optional fields, such as metrics and
>>> partition-level summaries, are supported by Java
>>> <https://github.com/apache/iceberg/blob/549674b3fc0cdb18d6cad3e2d6320236fba8c562/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L32-L64>
>>> and Python
>>> <https://github.com/HonahX/iceberg-python/blob/45d611fe351f6f3847bf329aa053d890d810e2b6/pyiceberg/table/snapshots.py#L36-L60>
>>> implementations, they are not officially documented. This creates risks of
>>> inconsistency as other implementations and engines adopt and interact with
>>> these fields.
>>>
>>> I propose adding a new section to the table specification to document
>>> these optional fields, ensuring consistent naming conventions and reducing
>>> ambiguity across implementations. While this is the primary proposal, it
>>> may also be worth discussing whether documenting these fields separately in
>>> Docs/Table would provide additional flexibility for future updates.
>>>
>>> I’d love to hear your thoughts, suggestions, or concerns about this
>>> proposal.
>>>
>>> Looking forward to the discussion!
>>>
>>> Links
>>>
>>> - GitHub tracking issue: https://github.com/apache/iceberg/issues/11659
>>> - Proposal: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
>>> - PR: https://github.com/apache/iceberg/pull/11660
>>>
>>>
>>> Best regards,
>>> Honah
>>>
>>