Hi Iceberg community,

We have been working on converting our tables from the Hive table format to
Iceberg. To make that switch transparent, we have introduced a number of
Hive table features and compatibility modes in Iceberg and connected them to
the Spark DataSource API. At a high level, given a table name that exists in
both Hive and Iceberg formats (in the same catalog, such as HMS, or in
different catalogs behind a meta catalog), this architecture lets us switch
whether users access the Hive or the Iceberg version of the table with a
single table property change. We have used this to incrementally roll out
almost all of our Hive tables to Iceberg readers, in preparation for
switching the pointer to read from physical Iceberg metadata (which we also
experimented with). Right now, those features reside in LinkedIn's temporary
fork of Iceberg (also open source [1]). I am wondering whether this would be
of interest to the general Iceberg community and, if so, whether we can
contribute those features back. These features have been in production at
LinkedIn for a number of months. They include:

   1. *Reading tables with only Hive metadata:* For tables written by Hive
   (or Spark DataSource V1), we added a mechanism that constructs an
   in-memory Iceberg snapshot on the fly from the table's directory
   structure and lets Iceberg readers read that snapshot.
   2. *Iceberg schema lower casing:* Before Iceberg, when users read Hive
   tables from Spark, the returned schema is lowercase because Hive stores
   all metadata in lowercase. If users move to Iceberg, such readers could
   break once Iceberg returns the proper-cased schema. This feature adds
   lowercasing for backward compatibility with existing scripts. It is
   exposed as an option and is not enabled by default (a small Spark sketch
   of the effect appears after this list).
   3. *Hive table proper casing:* Conversely, we leverage the Avro schema
   to supplement the lowercase Hive schema when reading Hive tables. This is
   useful for users who want proper-cased schemas while still in Hive mode
   (to be forward-compatible with Iceberg). It is controlled by the same
   flag as (2).
   4. *Supporting default value semantics:* Hive tables support declaring a
   default value in the table schema to use on schema evolution. Currently,
   Iceberg only supports optional fields, not fields with default values.
   More details, along with some exceptions, are in this Iceberg issue [2].
   The feature is also on the Iceberg project roadmap [3].
   5. *Converting union types to structs:* Compute engines vary in their
   support for union types; some, like Spark, do not support them in table
   APIs. Trino did some work to convert union types to structs [4]. We have
   applied the same transformation in Iceberg so those tables can be read
   in Spark without breaking and without restating the table. We have
   connected the conversion to the Spark table APIs, so users can now access
   tables with union types from Spark SQL (both Hive and Iceberg tables).
   The struct schema is forward-compatible with Trino on Iceberg, since it
   matches the schema produced by Trino's conversion. This means current
   Trino + Hive users can transparently leverage the change, and it is a net
   new feature in Spark (a rough sketch of the resulting struct layout
   appears after this list).
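
To make (2) and (3) a bit more concrete, here is a minimal spark-shell
sketch (Scala) of the effect the lowercasing mode has on the schema a
reader sees. In our fork this happens inside the Iceberg source behind an
opt-in flag; the DataFrame transformation below only illustrates the
resulting schema, it is not the actual implementation, and the column names
are made up:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lowercase-sketch")
      .getOrCreate()
    import spark.implicits._

    // Rename top-level columns to lowercase -- roughly what the
    // compatibility mode returns to legacy readers.
    def lowercaseColumns(df: DataFrame): DataFrame =
      df.toDF(df.columns.map(_.toLowerCase): _*)

    // "MemberId" / "PageKey" stand in for a proper-cased Iceberg schema.
    val properCased = Seq((1L, "feed")).toDF("MemberId", "PageKey")
    lowercaseColumns(properCased).printSchema()  // memberid, pagekey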
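
For (5), the struct layout we target follows the Trino conversion [4]: a
tag field recording which branch of the union is set, plus one nullable
field per branch. A rough Scala sketch of the Spark schema that a Hive
column declared as uniontype<int,string> maps to (the field names follow my
reading of the Trino convention and should be treated as illustrative):

    import org.apache.spark.sql.types._

    // "tag" records which union branch is populated; the remaining fields
    // are nullable and only the tagged one is set per row.
    val unionAsStruct = StructType(Seq(
      StructField("tag", ByteType),
      StructField("field0", IntegerType),  // union branch 0: int
      StructField("field1", StringType)    // union branch 1: string
    ))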

What does the community think about these features? It would be great to
get agreement on which features folks find relevant so we can plan the work
to contribute them back or refactor them into a separate repo.

[1] https://github.com/linkedin/iceberg
[2] https://github.com/apache/iceberg/issues/2039
[3] https://github.com/apache/iceberg/blob/master/site/docs/roadmap.md
[4] https://github.com/trinodb/trino/pull/3483

Thanks,
Walaa.
