I am surprised that schema-id is optional for a v2 snapshot. I believe the implementation already writes a schema-id for both v1 and v2 snapshots. Of course, snapshots written before schema-id was added do not have it.

I am working on using the appropriate schema when reading a snapshot in Spark. It is implemented for Spark 2, and it works as you understand it: get the schema-id for the snapshot, then look up the schema with that schema-id in the schemas list. It will be implemented for Spark 3 too, but there are some technical complications that need to be resolved first.

I also had a fallback: if the schema-id is null, we look through the history to find the metadata for the snapshot and read the schema from there. The fallback was removed from my original PR but will be submitted as a separate change.

The current behavior (and the behavior in Spark 2 before my change) is to use the current schema when reading any snapshot.
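To make the lookup order concrete, here is a minimal sketch in Python. It is not the Iceberg or Spark code. It assumes the table metadata JSON has already been parsed into plain dicts using the field names from the spec. The function names, the `older_metadata` argument, and the rule of matching an older metadata file by its `current-snapshot-id` are assumptions made only for illustration.

```python
from typing import Any, Dict, Iterable, Optional

Metadata = Dict[str, Any]  # parsed table metadata JSON (assumed shape)


def current_schema(metadata: Metadata) -> Optional[Dict[str, Any]]:
    """Return the current schema recorded in one metadata dict."""
    current_id = metadata.get("current-schema-id")
    for schema in metadata.get("schemas") or []:
        if schema.get("schema-id") == current_id:
            return schema
    # Older v1 metadata may carry only the single "schema" field.
    return metadata.get("schema")


def schema_for_snapshot(
    metadata: Metadata,
    snapshot_id: int,
    older_metadata: Iterable[Metadata] = (),
) -> Optional[Dict[str, Any]]:
    """Pick the schema to use when reading the given snapshot.

    older_metadata is a hypothetical, already-loaded list of previous
    metadata files; how they are located and read is out of scope here.
    """
    snapshot = next(
        (s for s in metadata.get("snapshots", [])
         if s.get("snapshot-id") == snapshot_id),
        None,
    )
    if snapshot is None:
        raise ValueError(f"unknown snapshot: {snapshot_id}")

    # 1. Preferred path: the snapshot records its schema-id.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None:
        for schema in metadata.get("schemas") or []:
            if schema.get("schema-id") == schema_id:
                return schema
        raise ValueError(f"schema-id {schema_id} not found in schemas")

    # 2. Fallback: find the older metadata in which this snapshot was the
    #    current one, and use that metadata's schema (assumed matching rule).
    for old in older_metadata:
        if old.get("current-snapshot-id") == snapshot_id:
            found = current_schema(old)
            if found is not None:
                return found

    # 3. Last resort: the table's current schema (the existing behavior).
    return current_schema(metadata)
```

A call such as `schema_for_snapshot(metadata, snapshot_id, older_metadata)` would return the schema dict for the snapshot, and only reaches the table's current schema when neither the schema-id nor the history yields one.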
On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com> wrote:

> Hi,
>
> I am trying to understand how to identify the schema for an Iceberg
> snapshot.
>
> Looking at the spec, I see the following:
>
> Snapshots
>
> A snapshot consists of the following fields:
>
> v1          v2          Field               Description
> *required*  *required*  snapshot-id         A unique long ID
> *optional*  *optional*  parent-snapshot-id  The snapshot ID of the snapshot’s parent. Omitted for any snapshot with no parent
>             *required*  sequence-number     A monotonically increasing long that tracks the order of changes to a table
> *required*  *required*  timestamp-ms        A timestamp when the snapshot was created, used for garbage collection and table inspection
> *optional*  *required*  manifest-list       The location of a manifest list for this snapshot that tracks manifest files with additional metadata
> *optional*              manifests           A list of manifest file locations. Must be omitted if manifest-list is present
> *optional*  *required*  summary             A string map that summarizes the snapshot changes, including operation (see below)
> *optional*  *optional*  schema-id           ID of the table’s current schema when the snapshot was created
>
> Also the table metadata portion of the spec says the following:
>
> v1          v2          Field               Description
> *optional*  *required*  schemas             A list of schemas, stored as objects with schema-id.
>
> For a v2 Iceberg table, my understanding is that the reader needs to do
> the following to figure out the schema of a snapshot:
>
> - Read the schema-id for the snapshot
> - Use the schemas field from the table metadata and find the schema
>   corresponding to the snapshot's schema-id
>
> Since schema-id is optional in V2 for a given snapshot, is this the
> correct approach? How does this work, if the schema-id field is missing?
>
> For a V1 Iceberg table, how do we determine the schema of a particular
> snapshot?
>
> Thanks
> Vivek