Re: Identifying the schema of an Iceberg snapshot

Wing Yew Poon Mon, 08 Nov 2021 08:54:06 -0800

I am surprised that schema-id is optional for a v2 snapshot.
I believe that the implementation now already writes a schema-id for both
v1 and v2 snapshots. Of course, snapshots written before schema-id was
added do not have it.
I am working on making Spark use the appropriate schema when reading a
snapshot. This is already implemented for Spark 2. It works as you
understand it -- get the schema-id for the snapshot, and look up the schema
by that schema-id in the table metadata's schemas. It will be implemented
for Spark 3 too, but there are some technical complications that need to be
resolved first. I also had a fallback -- if the schema-id is null, we look
through the history to find the metadata for the snapshot and read the
schema from there. The fallback was removed from my original PR but will be
submitted as a separate change.
The current behavior (and the behavior in Spark 2 before my change) is to
use the current schema when reading any snapshot.
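
To make the lookup and the fallback concrete, here is a minimal sketch in
Python over the table metadata JSON loaded as plain dicts. It is not the
Spark change itself. The function names and the previous_metadata parameter
are made up for illustration, and it assumes the caller has already loaded
the older metadata files. The field names (snapshots, schema-id, schemas,
current-schema-id, current-snapshot-id, schema) are the ones from the spec.

```python
def current_schema(metadata: dict) -> dict:
    """Return the current schema of one table-metadata document."""
    if "schemas" in metadata:
        wanted = metadata["current-schema-id"]
        for schema in metadata["schemas"]:
            if schema["schema-id"] == wanted:
                return schema
        raise KeyError(f"no schema with schema-id {wanted}")
    # Older v1 metadata may carry only a single "schema" field.
    return metadata["schema"]


def schema_for_snapshot(metadata: dict, snapshot_id: int,
                        previous_metadata=()) -> dict:
    """Pick the schema to use when reading the given snapshot.

    previous_metadata is a hypothetical parameter: older table-metadata
    documents (already parsed), used only when the snapshot has no schema-id.
    """
    snapshot = next(
        (s for s in metadata["snapshots"] if s["snapshot-id"] == snapshot_id),
        None,
    )
    if snapshot is None:
        raise KeyError(f"no snapshot with snapshot-id {snapshot_id}")

    # Normal path: the snapshot records the schema-id it was written with.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None:
        for schema in metadata.get("schemas", []):
            if schema["schema-id"] == schema_id:
                return schema
        raise KeyError(f"no schema with schema-id {schema_id}")

    # Fallback: find the older metadata in which this snapshot was current
    # and use the schema that was current at that time.
    for old in previous_metadata:
        if old.get("current-snapshot-id") == snapshot_id:
            return current_schema(old)

    # Last resort: the table's current schema, which matches the behavior
    # described above for reads without a schema-id lookup.
    return current_schema(metadata)
```

Returning the current schema as the last resort is only one possible
choice; raising an error there instead would make a missing schema-id
visible to the caller.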





On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <[email protected]>
wrote:

> Hi,
>
> I am trying to understand how to identify the schema for an Iceberg
> snapshot.
>
> Looking at the spec, I see the following:
> Snapshots
>
> A snapshot consists of the following fields:
>
> | v1         | v2         | Field              | Description |
> | *required* | *required* | snapshot-id        | A unique long ID |
> | *optional* | *optional* | parent-snapshot-id | The snapshot ID of the snapshot’s parent. Omitted for any snapshot with no parent |
> |            | *required* | sequence-number    | A monotonically increasing long that tracks the order of changes to a table |
> | *required* | *required* | timestamp-ms       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
> | *optional* | *required* | manifest-list      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
> | *optional* |            | manifests          | A list of manifest file locations. Must be omitted if manifest-list is present |
> | *optional* | *required* | summary            | A string map that summarizes the snapshot changes, including operation (see below) |
> | *optional* | *optional* | schema-id          | ID of the table’s current schema when the snapshot was created |
>
> Also the table metadata portion of the spec says the following:
>
> | v1         | v2         | Field   | Description |
> | *optional* | *required* | schemas | A list of schemas, stored as objects with schema-id. |
>
> For a v2 Iceberg table, my understanding is that the reader needs to do
> the following to figure out the schema of a snapshot:
>
>    - Read the schema-id for the snapshot
>    - Use the schemas field from the table metadata and find the schema
>    corresponding to the snapshot's schema-id
>
> Since schema-id is optional in V2 for a given snapshot, is this the
> correct approach? How does this work, if the schema-id field is missing?
>
> For a V1 Iceberg table, how do we determine the schema of a particular
> snapshot?
>
> Thanks
> Vivek
>
>
