The fallback logic I mentioned will be in core Iceberg.
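
As a rough illustration of the lookup and the fallback (not the code from the PR): the sketch below works on the table metadata JSON as plain Python dicts rather than the Iceberg classes. The field names come from the spec; the function names and the `previous_metadata` argument (older metadata documents, already loaded by the caller) are assumptions made for the example.

```python
from typing import Optional


def current_schema(metadata: dict) -> dict:
    """Return the current schema recorded in one table-metadata document."""
    if "schemas" in metadata:
        wanted = metadata["current-schema-id"]
        for schema in metadata["schemas"]:
            if schema["schema-id"] == wanted:
                return schema
        raise LookupError(f"schema-id {wanted} is not in schemas")
    # Older v1 metadata may carry only the single `schema` field.
    return metadata["schema"]


def schema_for_snapshot(metadata: dict, snapshot_id: int,
                        previous_metadata: Optional[list] = None) -> dict:
    """Find the schema that was current when `snapshot_id` was written.

    `previous_metadata` is a hypothetical list of older metadata
    documents; how they are located and read is left to the caller.
    """
    snapshot = next((s for s in metadata["snapshots"]
                     if s["snapshot-id"] == snapshot_id), None)
    if snapshot is None:
        raise LookupError(f"unknown snapshot {snapshot_id}")

    # Normal path: the snapshot records its schema-id.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None:
        for schema in metadata.get("schemas", []):
            if schema["schema-id"] == schema_id:
                return schema
        raise LookupError(f"schema-id {schema_id} is not in schemas")

    # Fallback: the snapshot predates schema-id, so look for the older
    # metadata document in which it was the current snapshot.
    for old in previous_metadata or []:
        if old.get("current-snapshot-id") == snapshot_id:
            return current_schema(old)
    raise LookupError(f"no schema found for snapshot {snapshot_id}")
```

The sketch raises when nothing matches instead of silently returning the table's current schema, so a caller has to opt in to the old behavior explicitly.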

On Mon, Nov 8, 2021 at 9:35 AM Wing Yew Poon <wyp...@cloudera.com> wrote:

> There is logic needed in both core Iceberg (in BaseTableScan and
> DataTableScan) and in each engine.
>
>
> On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> I am surprised that the logic of obtaining the schema for a snapshot is
>> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
>> Basically, the Snapshot object has an API that returns the schema of the
>> snapshot.
>>
>> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> I am surprised that schema-id is optional for a v2 snapshot.
>>> I believe that the implementation now already writes a schema-id for
>>> both v1 and v2 snapshots. Of course, snapshots written before schema-id was
>>> added do not have it.
>>> I am working on using the appropriate schema when reading a snapshot in
>>> Spark. It is implemented for Spark 2. It is as you understand
>>> it -- get the schema-id for the snapshot, and look up the schema by
>>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>>> there are some technical complications that need to be resolved first. I
>>> also had a fallback -- if the schema-id is null, then we will look through
>>> the history to find the metadata for the snapshot and read the schema from
>>> there. The fallback was removed from my original PR but will be submitted
>>> as a separate change.
>>> The current behavior (and the behavior in Spark 2 before my change) is
>>> to use the current schema when reading any snapshot.
>>>
>>>
>>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to understand how to identify the schema for an Iceberg
>>>> snapshot.
>>>>
>>>> Looking at the spec, I see the following:
>>>>
>>>> Snapshots
>>>>
>>>> A snapshot consists of the following fields:
>>>>
>>>> v1          v2          Field               Description
>>>> *required*  *required*  snapshot-id         A unique long ID
>>>> *optional*  *optional*  parent-snapshot-id  The snapshot ID of the snapshot’s parent. Omitted for any snapshot with no parent
>>>>             *required*  sequence-number     A monotonically increasing long that tracks the order of changes to a table
>>>> *required*  *required*  timestamp-ms        A timestamp when the snapshot was created, used for garbage collection and table inspection
>>>> *optional*  *required*  manifest-list       The location of a manifest list for this snapshot that tracks manifest files with additional metadata
>>>> *optional*              manifests           A list of manifest file locations. Must be omitted if manifest-list is present
>>>> *optional*  *required*  summary             A string map that summarizes the snapshot changes, including operation (see below)
>>>> *optional*  *optional*  schema-id           ID of the table’s current schema when the snapshot was created
>>>>
>>>> Also, the table metadata portion of the spec says the following:
>>>>
>>>> v1          v2          Field               Description
>>>> *optional*  *required*  schemas             A list of schemas, stored as objects with schema-id.
>>>>
>>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>>> the following to figure out the schema of a snapshot:
>>>>
>>>> - Read the schema-id for the snapshot
>>>> - Use the schemas field from the table metadata and find the schema
>>>> corresponding to the snapshot's schema-id
>>>>
>>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>>> correct approach? How does this work if the schema-id field is missing?
>>>>
>>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>>> snapshot?
>>>>
>>>> Thanks
>>>> Vivek