I am surprised that the logic for obtaining the schema of a snapshot is
implemented in Spark 2 and Spark 3. Shouldn't this be part of the Iceberg
APIs? Basically, the Snapshot object would have an API that returns the
schema of the snapshot.
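
To make the lookup discussed below concrete, here is a minimal sketch,
not the actual Iceberg or Spark implementation. It assumes the table
metadata has already been parsed into a plain dict that uses the spec's
field names (snapshots, schemas, schema-id, current-schema-id); the
function name schema_for_snapshot is hypothetical. When a snapshot has
no schema-id, the sketch simply falls back to the table's current
schema, which is the existing behavior described below, not the
history-based fallback.

```python
from typing import Any, Dict


def schema_for_snapshot(table_metadata: Dict[str, Any], snapshot_id: int) -> Dict[str, Any]:
    """Return the schema a snapshot was written with (hypothetical helper).

    Assumes `table_metadata` is the parsed table metadata JSON as a dict.
    """
    # Find the snapshot entry by its snapshot-id.
    snapshot = None
    for candidate in table_metadata.get("snapshots", []):
        if candidate["snapshot-id"] == snapshot_id:
            snapshot = candidate
            break
    if snapshot is None:
        raise KeyError(f"unknown snapshot-id: {snapshot_id}")

    # schema-id is optional on a snapshot; older snapshots may not have it.
    schema_id = snapshot.get("schema-id")
    if schema_id is None:
        # Assumption: fall back to the table's current schema.
        schema_id = table_metadata["current-schema-id"]

    # Look the schema up by schema-id in the table's list of schemas.
    for schema in table_metadata.get("schemas", []):
        if schema["schema-id"] == schema_id:
            return schema
    raise KeyError(f"no schema with schema-id: {schema_id}")
```

If something like this lived behind a single API, each engine would not
need its own copy of the lookup and fallback logic.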

On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
wrote:

> I am surprised that schema-id is optional for a v2 snapshot.
> I believe that the implementation now already writes a schema-id for both
> v1 and v2 snapshots. Of course, snapshots written before schema-id was
> added do not have it.
> I am working on implementing using the appropriate schema when reading a
> snapshot in Spark. It is implemented for Spark 2. It is as you understand
> it -- get the schema-id for the snapshot, and look up the schema by
> schema-id from the schemas. It will be implemented for Spark 3 too, but
> there are some technical complications that need to be resolved first. I
> also had a fallback -- if the schema-id is null, then we will look through
> the history to find the metadata for the snapshot and read the schema from
> there. The fallback was removed from my original PR but will be submitted
> as a separate change.
> The current behavior (and the behavior in Spark 2 before my change) is to
> use the current schema when reading any snapshot.
>
> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> Hi,
>>
>> I am trying to understand how to identify the schema for an Iceberg
>> snapshot.
>>
>> Looking at the spec, I see the following:
>>
>> Snapshots
>>
>> A snapshot consists of the following fields:
>>
>> v1          v2          Field               Description
>> *required*  *required*  snapshot-id         A unique long ID
>> *optional*  *optional*  parent-snapshot-id  The snapshot ID of the
>>                                             snapshot’s parent. Omitted for
>>                                             any snapshot with no parent
>>             *required*  sequence-number     A monotonically increasing long
>>                                             that tracks the order of changes
>>                                             to a table
>> *required*  *required*  timestamp-ms        A timestamp when the snapshot was
>>                                             created, used for garbage
>>                                             collection and table inspection
>> *optional*  *required*  manifest-list       The location of a manifest list
>>                                             for this snapshot that tracks
>>                                             manifest files with additional
>>                                             metadata
>> *optional*              manifests           A list of manifest file
>>                                             locations. Must be omitted if
>>                                             manifest-list is present
>> *optional*  *required*  summary             A string map that summarizes the
>>                                             snapshot changes, including
>>                                             operation (see below)
>> *optional*  *optional*  schema-id           ID of the table’s current schema
>>                                             when the snapshot was created
>>
>> Also the table metadata portion of the spec says the following:
>>
>> v1          v2          Field               Description
>> *optional*  *required*  schemas             A list of schemas, stored as
>>                                             objects with schema-id.
>>
>> For a v2 Iceberg table, my understanding is that the reader needs to do
>> the following to figure out the schema of a snapshot:
>>
>>    - Read the schema-id for the snapshot
>>    - Use the schemas field from the table metadata and find the schema
>>    corresponding to the snapshot's schema-id
>>
>> Since schema-id is optional in V2 for a given snapshot, is this the
>> correct approach? How does this work, if the schema-id field is missing?
>>
>> For a V1 Iceberg table, how do we determine the schema of a particular
>> snapshot?
>>
>> Thanks
>> Vivek
>>
>>
