Hi Iceberg community,

We have been working on converting our tables from the Hive table format to Iceberg. To make that switch transparent, we have introduced a number of Hive table features and compatibility modes in Iceberg and connected them to the Spark DataSource API. At a high level, given a table name that exists in both Hive and Iceberg formats (in the same catalog, like HMS, or in different catalogs behind a meta catalog), this architecture lets us switch whether users access the Hive or the Iceberg version of the table with just a property change on the table. We have used this to incrementally roll out almost all of our Hive tables to Iceberg readers, in preparation for switching the pointer to read from physical Iceberg metadata (which we have also experimented with).

Right now, those features reside in LinkedIn's temporary fork of Iceberg, which is also open source [1]. I am wondering whether this is something that would be of interest to the broader Iceberg community and, if so, whether we can contribute those features back. The features have been in production at LinkedIn for a number of months.
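As a rough illustration of the switching mechanism described above, readers keep using a single table name while the format they see is flipped via a table property. The property name and table name below are hypothetical, purely for illustration, and not necessarily what our fork uses:

```scala
// Hypothetical sketch of the property-driven switch; the property name
// "format.preference" and the table "db.events" are made up for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("format-switch-sketch").getOrCreate()

// Point readers of db.events at its Iceberg representation...
spark.sql("ALTER TABLE db.events SET TBLPROPERTIES ('format.preference' = 'iceberg')")

// ...or roll them back to the Hive representation, without changing reader code.
spark.sql("ALTER TABLE db.events SET TBLPROPERTIES ('format.preference' = 'hive')")

// Readers keep the same table name either way; the catalog layer decides which
// physical representation to expose through the Spark DataSource API.
val df = spark.table("db.events")
df.printSchema()
```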
The features include:

1. *Reading tables with only Hive metadata:* If a table was written by Hive (or Spark DataSource V1), we added a mechanism to construct an Iceberg in-memory snapshot on the fly from the table's directory structure, and let Iceberg readers read that snapshot.

2. *Iceberg schema lower casing:* Before Iceberg, when users read Hive tables from Spark, the returned schema is lowercase, since Hive stores all metadata in lower case. If users move to Iceberg, such readers could break once Iceberg returns proper-case schemas. This feature adds lowercasing for backward compatibility with existing scripts. It is controlled by an option and is not enabled by default.

3. *Hive table proper casing:* Conversely, we leverage the Avro schema to supplement the lowercase Hive schema when reading Hive tables. This is useful if someone wants proper-cased schemas while still in Hive mode (to be forward-compatible with Iceberg). The same flag used in (2) is used here.

4. *Supporting default value semantics:* Hive tables support declaring a default value in the table schema to use on schema evolution. Currently, Iceberg only supports optional fields, not fields with default values. More details on this, and some exceptions that apply, are in this Iceberg issue [2]. The feature is also on the Iceberg project roadmap [3].

5. *Converting union types to structs:* Compute engines support union types to varying degrees; some, like Spark, do not support them in table APIs. Trino did some work to convert union types to structs [4]. We have applied the same transformation in Iceberg so those tables can be read without breaking (in Spark) or restating the table. We have connected the conversion to the Spark table APIs, and users can now access tables with union types from Spark SQL (for both Hive and Iceberg tables); a rough sketch of the resulting struct shape is at the end of this mail. The struct schema is forward-compatible with Trino on Iceberg, since it is the same schema used by Trino's conversion. This means that current Trino + Hive users can leverage the change transparently, and it is a net new feature in Spark.

What does the community think about those features? It would be great to get agreement on the features folks think are relevant so we can plan the work to contribute them back or refactor them into a separate repo.

[1] https://github.com/linkedin/iceberg
[2] https://github.com/apache/iceberg/issues/2039
[3] https://github.com/apache/iceberg/blob/master/site/docs/roadmap.md
[4] https://github.com/trinodb/trino/pull/3483

Thanks,
Walaa.
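P.S. For item (5), here is a minimal sketch of the struct shape we have in mind, assuming the Trino-style layout from [4] (a discriminator "tag" field plus one fieldN per union alternative). The column and table names are illustrative, not taken from the fork:

```scala
// Rough sketch only: how a Hive/Avro uniontype<int, string> column could surface
// as an ordinary struct after the conversion. Field names follow the Trino-style
// convention (tag + field0, field1, ...); implementation details may differ.
import org.apache.spark.sql.types._

// uniontype<int, string>  =>  struct<tag: tinyint, field0: int, field1: string>
val unionAsStruct = StructType(Seq(
  StructField("tag", ByteType, nullable = false),      // which alternative is set
  StructField("field0", IntegerType, nullable = true), // populated when tag == 0
  StructField("field1", StringType, nullable = true)   // populated when tag == 1
))

// A table column that used to be a union then appears as a plain struct column:
val tableSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("payload", unionAsStruct, nullable = true)
))

println(tableSchema.treeString)
```

Readers would then access the union members with ordinary struct syntax (e.g. payload.field0) and use the tag field to tell which alternative was written, which is what keeps the schema readable from both Spark and Trino.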