Re: Identifying the schema of an Iceberg snapshot

Wing Yew Poon Mon, 08 Nov 2021 09:38:09 -0800

The fallback logic I mentioned will be in core Iceberg.
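
For readers of this thread, here is a minimal sketch of the lookup and fallback described in the quoted messages below. It is not the Iceberg implementation. It treats table metadata as a plain dict that uses the spec's JSON field names (schema-id, schemas, current-schema-id, snapshots, snapshot-id, current-snapshot-id, schema). The function names and the older_metadata parameter (a list of earlier metadata files that have already been loaded) are hypothetical. Returning the table's current schema as a last resort is an assumption of this sketch, chosen to mirror the pre-change behavior mentioned in the thread.

```python
from typing import Optional


def current_schema(metadata: dict) -> dict:
    """Return the current schema recorded in one table-metadata dict."""
    if "schemas" in metadata:
        current_id = metadata["current-schema-id"]
        for schema in metadata["schemas"]:
            if schema.get("schema-id") == current_id:
                return schema
    # Older metadata may carry only a single "schema" object.
    return metadata["schema"]


def schema_for_snapshot(metadata: dict, snapshot_id: int,
                        older_metadata: Optional[list] = None) -> dict:
    """Hypothetical helper: find the schema a snapshot was written with."""
    snapshot = next(
        (s for s in metadata["snapshots"] if s["snapshot-id"] == snapshot_id),
        None,
    )
    if snapshot is None:
        raise ValueError(f"unknown snapshot {snapshot_id}")

    # Normal path: the snapshot records a schema-id; look it up in schemas.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None:
        for schema in metadata.get("schemas", []):
            if schema.get("schema-id") == schema_id:
                return schema
        raise ValueError(f"schema-id {schema_id} not found in schemas")

    # Fallback: schema-id is missing, so search earlier metadata for the
    # file in which this snapshot was the current one.
    for old in older_metadata or []:
        if old.get("current-snapshot-id") == snapshot_id:
            return current_schema(old)

    # Last resort (assumption of this sketch): use the table's current schema.
    return current_schema(metadata)
```

Keeping the fallback behind an optional argument means a caller that has not loaded any earlier metadata files still gets an answer, at the cost of possibly reading an old snapshot with the newest schema.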


On Mon, Nov 8, 2021 at 9:35 AM Wing Yew Poon <wyp...@cloudera.com> wrote:

> There is logic needed in both core Iceberg (in BaseTableScan and
> DataTableScan) and in each engine.
>
>
> On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> I am surprised that the logic of obtaining the schema for a snapshot is
>> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
>> Basically, the Snapshot object should have an API that returns the schema
>> of the snapshot.
>>
>> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> I am surprised that schema-id is optional for a v2 snapshot.
>>> I believe that the implementation now already writes a schema-id for
>>> both v1 and v2 snapshots. Of course, snapshots written before schema-id was
>>> added do not have it.
>>> I am working on making Spark use the appropriate schema when reading a
>>> snapshot. It is implemented for Spark 2. It is as you understand
>>> it -- get the schema-id for the snapshot, and look up the schema by
>>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>>> there are some technical complications that need to be resolved first. I
>>> also had a fallback -- if the schema-id is null, then we will look through
>>> the history to find the metadata for the snapshot and read the schema from
>>> there. The fallback was removed from my original PR but will be submitted
>>> as a separate change.
>>> The current behavior (and the behavior in Spark 2 before my change) is
>>> to use the current schema when reading any snapshot.
>>>
>>>
>>>
>>>
>>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to understand how to identify the schema for an Iceberg
>>>> snapshot.
>>>>
>>>> Looking at the spec, I see the following:
>>>> Snapshots
>>>>
>>>> A snapshot consists of the following fields:
>>>> v1 | v2 | Field | Description
>>>> *required* | *required* | snapshot-id | A unique long ID
>>>> *optional* | *optional* | parent-snapshot-id | The snapshot ID of the snapshot’s parent. Omitted for any snapshot with no parent
>>>>            | *required* | sequence-number | A monotonically increasing long that tracks the order of changes to a table
>>>> *required* | *required* | timestamp-ms | A timestamp when the snapshot was created, used for garbage collection and table inspection
>>>> *optional* | *required* | manifest-list | The location of a manifest list for this snapshot that tracks manifest files with additional metadata
>>>> *optional* |            | manifests | A list of manifest file locations. Must be omitted if manifest-list is present
>>>> *optional* | *required* | summary | A string map that summarizes the snapshot changes, including operation (see below)
>>>> *optional* | *optional* | schema-id | ID of the table’s current schema when the snapshot was created
>>>>
>>>> Also the table metadata portion of the spec says the following:
>>>> v1 | v2 | Field | Description
>>>> *optional* | *required* | schemas | A list of schemas, stored as objects with schema-id.
>>>>
>>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>>> the following to figure out the schema of a snapshot:
>>>>
>>>>    - Read the schema-id for the snapshot
>>>>    - Use the schemas field from the table metadata and find the schema
>>>>    corresponding to the snapshot's schema-id
>>>>
>>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>>> correct approach? How does this work, if the schema-id field is missing?
>>>>
>>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>>> snapshot?
>>>>
>>>> Thanks
>>>> Vivek
>>>>
>>>>
