There is logic needed in both core Iceberg (in BaseTableScan and DataTableScan) and in each engine.
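
As a rough illustration of the lookup discussed in the quoted thread below, here is a minimal Python sketch. It is not the code in BaseTableScan, DataTableScan, or any engine. It assumes the table metadata JSON has already been parsed into a plain dict, and the function name `schema_for_snapshot` is hypothetical. The field names (`snapshots`, `snapshot-id`, `schema-id`, `schemas`, `current-schema-id`, `schema`) are the ones used in the table spec. When a snapshot has no schema-id, the sketch simply falls back to the table's current schema; the history-based fallback mentioned in the thread is not shown.

```python
from typing import Any, Dict


def schema_for_snapshot(metadata: Dict[str, Any], snapshot_id: int) -> Dict[str, Any]:
    """Pick the schema to use when reading a snapshot (hypothetical helper).

    `metadata` is assumed to be the parsed table metadata JSON.
    """
    # Locate the snapshot entry by its snapshot-id.
    snapshot = next(
        (s for s in metadata.get("snapshots", []) if s["snapshot-id"] == snapshot_id),
        None,
    )
    if snapshot is None:
        raise ValueError(f"unknown snapshot-id: {snapshot_id}")

    # Index the table's schemas by schema-id (the list may be absent in v1).
    schemas = {s["schema-id"]: s for s in metadata.get("schemas", [])}

    # Preferred path: the snapshot records the schema-id it was written with.
    schema_id = snapshot.get("schema-id")
    if schema_id is not None and schema_id in schemas:
        return schemas[schema_id]

    # Fallback (an assumption of this sketch): use the table's current schema.
    current_id = metadata.get("current-schema-id")
    if current_id in schemas:
        return schemas[current_id]

    # Older v1 metadata may carry only a single top-level "schema".
    return metadata["schema"]
```

With this shape, a snapshot written before schema-id was added reads with the current schema, which matches the behavior described in the thread as the behavior before the change.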

On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki <vi...@dremio.com> wrote:

> I am surprised that the logic of obtaining the schema for a snapshot is
> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
> Basically, the Snapshot object has an API that returns the schema of the
> snapshot.
>
> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> I am surprised that schema-id is optional for a v2 snapshot.
>> I believe that the implementation now already writes a schema-id for both
>> v1 and v2 snapshots. Of course, snapshots written before schema-id was
>> added do not have it.
>> I am working on implementing using the appropriate schema when reading a
>> snapshot in Spark. It is implemented for Spark 2. It is as you understand
>> it -- get the schema-id for the snapshot, and look up the schema by
>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>> there are some technical complications that need to be resolved first. I
>> also had a fallback -- if the schema-id is null, then we will look through
>> the history to find the metadata for the snapshot and read the schema from
>> there. The fallback was removed from my original PR but will be submitted
>> as a separate change.
>> The current behavior (and the behavior in Spark 2 before my change) is to
>> use the current schema when reading any snapshot.
>>
>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki <vi...@dremio.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to understand how to identify the schema for an Iceberg
>>> snapshot.
>>>
>>> Looking at the spec, I see the following:
>>>
>>> Snapshots
>>>
>>> A snapshot consists of the following fields:
>>>
>>> v1          v2          Field               Description
>>> *required*  *required*  snapshot-id         A unique long ID
>>> *optional*  *optional*  parent-snapshot-id  The snapshot ID of the snapshot’s parent.
>>>                                             Omitted for any snapshot with no parent
>>>             *required*  sequence-number     A monotonically increasing long that tracks
>>>                                             the order of changes to a table
>>> *required*  *required*  timestamp-ms        A timestamp when the snapshot was created,
>>>                                             used for garbage collection and table inspection
>>> *optional*  *required*  manifest-list       The location of a manifest list for this snapshot
>>>                                             that tracks manifest files with additional metadata
>>> *optional*              manifests           A list of manifest file locations. Must be omitted
>>>                                             if manifest-list is present
>>> *optional*  *required*  summary             A string map that summarizes the snapshot changes,
>>>                                             including operation (see below)
>>> *optional*  *optional*  schema-id           ID of the table’s current schema when the
>>>                                             snapshot was created
>>>
>>> Also the table metadata portion of the spec says the following:
>>>
>>> v1          v2          Field               Description
>>> *optional*  *required*  schemas             A list of schemas, stored as objects with schema-id.
>>>
>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>> the following to figure out the schema of a snapshot:
>>>
>>> - Read the schema-id for the snapshot
>>> - Use the schemas field from the table metadata and find the schema
>>>   corresponding to the snapshot's schema-id
>>>
>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>> correct approach? How does this work if the schema-id field is missing?
>>>
>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>> snapshot?
>>>
>>> Thanks
>>> Vivek