Hi everyone, bumping up this thread. Does it help to discuss each feature individually?

I am guessing (4. Default value support) is straightforward since it is already on the roadmap. If not, let us start with it.
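For a quick mental model of (4): the behavior we want is essentially what Avro already does with field defaults, where a field added with a default during schema evolution is filled in at read time for old files that lack it. A minimal sketch of declaring such a field with Avro's SchemaBuilder (illustrative only; the open question in [2] below is how to carry the same semantics in Iceberg metadata):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class DefaultValueSketch {
      // Evolved reader schema: "region" was added with a default, so reads
      // of old files written before the column existed return "unknown"
      // rather than null.
      static final Schema EVOLVED = SchemaBuilder.record("member").fields()
          .requiredLong("id")
          .name("region").type().stringType().stringDefault("unknown")
          .endRecord();
    }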
As a next step, it would be great if we could focus on (1. Reading tables with only Hive metadata) and (5. Supporting files that contain uniontypes). It is worth noting that (5) does not require Hive table support, i.e., (1), since it can occur when files with uniontype values are reused in Iceberg manifests. On the other hand, we have already started moving (1) to another repo, but we could save some steps if folks here think it is worth moving into Iceberg.
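To make (5) concrete: the mapping is the tag/fieldN layout from the Trino conversion [4], where a union becomes a struct carrying a tag that names the populated branch plus one nullable field per branch, so, e.g., uniontype<int,string> reads as a struct with tag, field0, and field1. A rough sketch of that schema transformation over Avro schemas (illustrative only; the names are made up and this is not the exact code in our fork):

    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class UnionToStructSketch {
      // Maps a union of N branches to a record with a "tag" discriminator
      // plus one nullable field per branch; for a given row, only the field
      // selected by "tag" is non-null. Assumes the union has no null branch.
      static Schema toTaggedStruct(Schema union, String recordName) {
        SchemaBuilder.FieldAssembler<Schema> fields =
            SchemaBuilder.record(recordName).fields();
        fields = fields.requiredInt("tag");
        List<Schema> branches = union.getTypes();
        for (int i = 0; i < branches.size(); i++) {
          Schema nullable = Schema.createUnion(
              Schema.create(Schema.Type.NULL), branches.get(i));
          fields = fields.name("field" + i).type(nullable).withDefault(null);
        }
        return fields.endRecord();
      }
    }

Because the resulting shape matches what the Trino conversion produces, readers that already handle the Trino layout keep working unchanged.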
Thanks,
Walaa.

On Thu, Jan 27, 2022 at 2:26 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Hi Iceberg community,
>
> We have been working on converting our tables from the Hive table format to Iceberg. To make that switch transparent, we have introduced a number of Hive table features and compatibility modes in Iceberg and connected them to the Spark DataSource API. At a high level, given a table name that exists in both Hive and Iceberg formats (in the same catalog, like HMS, or in different catalogs behind a meta catalog), this architecture lets us switch whether the user accesses the Hive or the Iceberg version of the table with just a property change on the table. We have used this to incrementally roll out almost all of our Hive tables to Iceberg readers, in preparation for switching the pointer to read from physical Iceberg metadata (which we have also experimented with). Right now, those features reside in LinkedIn's temporary fork of Iceberg (also open source [1]). I am wondering whether this would be of interest to the broader Iceberg community; if so, we can contribute the features back. They have been in production at LinkedIn for a number of months. The features include:
>
> 1. *Reading tables with only Hive metadata:* If a table was written by Hive (or Spark DataSource V1), we added a mechanism to construct an in-memory Iceberg snapshot on the fly from the table directory structure, and let Iceberg readers read that snapshot.
> 2. *Iceberg schema lower casing:* Before Iceberg, when users read Hive tables from Spark, the returned schema is lowercase because Hive stores all metadata in lowercase. If users move to Iceberg, such readers could break once Iceberg returns a proper-case schema. This feature adds lowercasing for backward compatibility with existing scripts. It is exposed as an option and is not enabled by default.
> 3. *Hive table proper casing:* Conversely, we leverage the Avro schema to supplement the lowercase Hive schema when reading Hive tables. This is useful if someone wants proper-cased schemas while still in Hive mode (to be forward-compatible with Iceberg). The same flag used in (2) is used here.
> 4. *Supporting default value semantics:* Hive tables support declaring a default value in the table schema to use on schema evolution. Currently, Iceberg supports optional fields but not fields with default values. More details, along with some exceptions that apply, are in this Iceberg issue [2]. The feature is also on the Iceberg project roadmap [3].
> 5. *Converting union types to structs:* Compute engines support union types to varying degrees. Some, like Spark, do not support them in table APIs. Trino did some work to convert union types to structs [4]. We have applied the same transformation in Iceberg so those tables can be read without breaking (in Spark) or restating the table. We have connected the conversion to Spark table APIs, and users can now access tables with uniontypes from Spark SQL (i.e., both Hive and Iceberg tables). The struct schema is forward-compatible with Trino on Iceberg, since it is the same schema the Trino conversion produces. This means that current Trino + Hive users can transparently leverage the change, and it is a net new feature in Spark.
>
> What does the community think about these features? It would be great to get agreement on the features folks think are relevant so we can plan the work of contributing them back or refactoring them into a separate repo.
>
> [1] https://github.com/linkedin/iceberg
> [2] https://github.com/apache/iceberg/issues/2039
> [3] https://github.com/apache/iceberg/blob/master/site/docs/roadmap.md
> [4] https://github.com/trinodb/trino/pull/3483
>
> Thanks,
> Walaa.
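P.S. For concreteness on (1): the core of the mechanism is turning a Hive partition directory listing into Iceberg DataFile entries, which an in-memory snapshot can then serve to Iceberg readers. Below is a minimal sketch of the listing step using stock Hadoop and Iceberg APIs; the class name, the hard-coded file format, and the record-count stub are illustrative assumptions, not the fork's actual code.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.PartitionSpec;

    public class HiveDirectoryScanSketch {
      // Wraps every file under one Hive partition directory as an Iceberg
      // DataFile; the per-partition lists are the raw material for an
      // on-the-fly snapshot.
      static List<DataFile> dataFilesFor(PartitionSpec spec, String partitionPath,
          Path partitionDir, Configuration conf) throws IOException {
        FileSystem fs = partitionDir.getFileSystem(conf);
        List<DataFile> files = new ArrayList<>();
        for (FileStatus status : fs.listStatus(partitionDir)) {
          if (!status.isFile()) {
            continue;
          }
          files.add(DataFiles.builder(spec)
              .withPath(status.getPath().toString())
              .withFormat(FileFormat.AVRO) // assumption; ORC/Parquet work the same way
              .withFileSizeInBytes(status.getLen())
              .withRecordCount(recordCount(status)) // see stub below
              .withPartitionPath(partitionPath) // e.g. "datepartition=2022-01-27"
              .build());
        }
        return files;
      }

      // Stub: a real implementation would read the row count from the file
      // footer or block metadata rather than fabricate one.
      static long recordCount(FileStatus status) {
        return 0L;
      }
    }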