Re: Identifying the schema of an Iceberg snapshot

Wing Yew Poon Mon, 08 Nov 2021 09:36:56 -0800

There is logic needed in both core Iceberg (in BaseTableScan and
DataTableScan) and in each engine.



On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki <vi...@dremio.com> wrote:

> I am surprised that the logic of obtaining the schema for a snapshot is
> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
> Basically, the Snapshot object should have an API that returns the schema of
> the snapshot.
>
> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> I am surprised that schema-id is optional for a v2 snapshot.
>> I believe that the implementation now already writes a schema-id for both
>> v1 and v2 snapshots. Of course, snapshots written before schema-id was
>> added do not have it.
>> I am working on making Spark use the appropriate schema when reading a
>> snapshot. It is implemented for Spark 2. It is as you understand
>> it -- get the schema-id for the snapshot, and look up the schema by
>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>> there are some technical complications that need to be resolved first. I
>> also had a fallback -- if the schema-id is null, then we will look through
>> the history to find the metadata for the snapshot and read the schema from
>> there. The fallback was removed from my original PR but will be submitted
>> as a separate change.
>> The current behavior (and the behavior in Spark 2 before my change) is to
>> use the current schema when reading any snapshot.
>>
>>
>>
>>
>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to understand how to identify the schema for an Iceberg
>>> snapshot.
>>>
>>> Looking at the spec, I see the following:
>>> Snapshots
>>>
>>> A snapshot consists of the following fields:
>>>
>>> v1       | v2       | Field              | Description
>>> ---------+----------+--------------------+------------
>>> required | required | snapshot-id        | A unique long ID
>>> optional | optional | parent-snapshot-id | The snapshot ID of the snapshot’s parent. Omitted for any snapshot with no parent
>>>          | required | sequence-number    | A monotonically increasing long that tracks the order of changes to a table
>>> required | required | timestamp-ms       | A timestamp when the snapshot was created, used for garbage collection and table inspection
>>> optional | required | manifest-list      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata
>>> optional |          | manifests          | A list of manifest file locations. Must be omitted if manifest-list is present
>>> optional | required | summary            | A string map that summarizes the snapshot changes, including operation (see below)
>>> optional | optional | schema-id          | ID of the table’s current schema when the snapshot was created
>>>
>>> Also the table metadata portion of the spec says the following:
>>>
>>> v1       | v2       | Field   | Description
>>> ---------+----------+---------+------------
>>> optional | required | schemas | A list of schemas, stored as objects with schema-id.
>>>
>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>> the following to figure out the schema of a snapshot:
>>>
>>>    - Read the schema-id for the snapshot
>>>    - Use the schemas field from the table metadata and find the schema
>>>    corresponding to the snapshot's schema-id
>>>
>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>> correct approach? How does this work, if the schema-id field is missing?
>>>
>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>> snapshot?
>>>
>>> Thanks
>>> Vivek
>>>
>>>

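The lookup discussed in this thread (read the snapshot's schema-id, then find
the matching entry in the table metadata's schemas list) can be sketched
roughly as follows. This is only an illustration over table metadata that has
already been parsed into a Python dict, not the Iceberg or Spark
implementation. The function name schema_for_snapshot and the EXAMPLE_METADATA
values are made up for the example; the field names (snapshots, snapshot-id,
schema-id, schemas, current-schema-id, schema) are the ones from the spec.
When a snapshot has no schema-id, the sketch falls back to the table's current
schema, which is the behavior described above as the current one. The
history-based fallback mentioned in the thread is not shown.

```python
from typing import Any, Dict


def schema_for_snapshot(metadata: Dict[str, Any], snapshot_id: int) -> Dict[str, Any]:
    """Pick the schema to use when reading one snapshot (illustrative sketch)."""
    snapshot = next(
        (s for s in metadata.get("snapshots", []) if s["snapshot-id"] == snapshot_id),
        None,
    )
    if snapshot is None:
        raise ValueError(f"unknown snapshot-id: {snapshot_id}")

    # Index the table's schemas by their schema-id.
    schemas_by_id = {s["schema-id"]: s for s in metadata.get("schemas", [])}

    # Preferred path: the snapshot records the schema-id it was written with.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None:
        if schema_id not in schemas_by_id:
            raise ValueError(f"schema-id {schema_id} not found in schemas")
        return schemas_by_id[schema_id]

    # Assumed fallback: snapshot written before schema-id existed,
    # so use the table's current schema.
    current_id = metadata.get("current-schema-id")
    if current_id in schemas_by_id:
        return schemas_by_id[current_id]

    # Older v1 metadata may only carry a single "schema" field.
    return metadata["schema"]


# Hypothetical metadata, trimmed to the fields the lookup needs.
EXAMPLE_METADATA = {
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": [{"id": 1, "name": "id", "type": "long"}]},
        {"schema-id": 1, "fields": [{"id": 1, "name": "id", "type": "long"},
                                    {"id": 2, "name": "name", "type": "string"}]},
    ],
    "snapshots": [
        {"snapshot-id": 100, "schema-id": 0},  # records its schema-id
        {"snapshot-id": 200},                  # written without a schema-id
    ],
}
```

With this example metadata, snapshot 100 resolves to the schema with
schema-id 0, while snapshot 200 has no schema-id and therefore resolves to the
current schema (schema-id 1).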