Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:

* Continuing the discussion on the default value PR (already ongoing [1]).
* Filing the union type conversion PR (ETA end of next week).
* Moving the listing-based Hive table scan using Iceberg to a separate repo (likely open source). For this I expect to introduce some extension points to Iceberg, such as making some classes SPI. I hope the community is okay with that.
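To make the union-to-struct idea discussed in this thread concrete, here is a rough sketch of one common way to model a union as a struct: a `tag` field recording which branch is set, plus one optional field per non-null branch. This is purely illustrative; the names (`union_to_struct`, `tag`, `fieldN`) are my own, and this is not the implementation the PR will propose.

```python
def union_to_struct(branches):
    """Map a list of union branch type names to a struct-like schema dict.

    `branches` is e.g. ["null", "int", "string"]. The "null" branch (if
    present) is dropped, since nullability is modeled by making every
    branch field optional. The `tag` field records which branch is set.
    """
    fields = [{"name": "tag", "type": "int", "required": True}]
    for i, branch in enumerate(b for b in branches if b != "null"):
        fields.append({"name": f"field{i}", "type": branch, "required": False})
    return {"type": "struct", "fields": fields}

# A union ["null", "int", "string"] becomes a struct with fields:
# tag (int, required), field0 (int, optional), field1 (string, optional).
print(union_to_struct(["null", "int", "string"]))
```

At read time, a value from branch `i` would populate `tag = i` and `field{i}`, leaving the other branch fields null; this is why such structs can be read but, as discussed below, not written back as valid Iceberg data.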
By the way, Owen and I synced on the Hive casing behavior, and it is a bit more involved: in the Avro case, Hive lowercases all field names (including nested fields), but for other formats (we experimented with ORC and Text) it lowercases only top-level field names and preserves the case of inner fields. Hope this clarifies the confusion.

[1] https://github.com/apache/iceberg/pull/2496

Thanks,
Walaa

On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:

> Walaa, thanks for this list. I think most of these are definitely useful.
> I think the best one to focus on first is the default values, since those
> will make Iceberg tables behave more like standard SQL tables, which is
> the goal.
>
> I'm really curious to learn more about #1, but I don't think that I have
> enough detail to know whether it is something that fits in the Iceberg
> project. At Netflix, we had an alternative implementation of Hive and
> Spark tables (Spark tables are slightly different) that we similarly
> used. But we didn't write to both at the same time.
>
> For the others, I'm interested in hearing what other people in the
> community find valuable. I don't think I would use #2 or #3, for example.
> That's because we already support a flag for case-insensitive column
> resolution that is well supported throughout Iceberg. If you wanted to
> use alternative names, then I'd probably recommend just turning that on,
> although that may not be an option depending on how you're working with a
> table. It would work in Spark, though. This may be a better feature for
> your system that is built on Iceberg.
>
> Reading unions as structs has come up a couple of times, so it seems like
> people will want it. I think someone attempted to add this support in the
> past, but ran into issues because the spec is clear that these are NOT
> Iceberg files. There is no guarantee that other implementations will read
> them, and Iceberg cannot write them in this form.
> I'm fairly confident that not allowing unions to be written is a good
> choice, but I would support being able to read them.
>
> Ryan
>
> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read Hive
>>> tables from Spark, the returned schema is lowercase, since Hive stores
>>> all metadata in lowercase. If users move to Iceberg, such readers could
>>> break once Iceberg returns proper-case schemas. This feature adds
>>> lowercasing for backward compatibility with existing scripts. It is
>>> added as an option and is not enabled by default.
>>>
>> This isn't quite correct. Hive lowercases top-level columns. It does not
>> lowercase field names inside structs.
>>
>>> *3. Hive table proper casing:* Conversely, we leverage the Avro schema
>>> to supplement the lowercase Hive schema when reading Hive tables. This
>>> is useful if someone wants to still get proper-cased schemas while
>>> still in Hive mode (to be forward-compatible with Iceberg). The same
>>> flag used in (2) is used here.
>>>
>> Are there users of Avro schemas in Hive outside of LinkedIn? I've never
>> seen it used. I don't think you should tie #2 and #3 together.
>>
>> Supporting default values and union types are useful extensions.
>>
>> .. Owen
>
> --
> Ryan Blue
> Tabular