The union type conversion PR is up: https://github.com/apache/iceberg/pull/4242.

Thanks,
Walaa.

On Fri, Feb 11, 2022 at 8:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Thanks Ryan! Yes there is an active discussion on the PR on the spec
> aspect.
>
> On Fri, Feb 11, 2022 at 8:47 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Sounds great. Thanks for the update! That PR is on my list to take a look
>> at, but I still recommend starting with the spec changes. For example, how
>> should default values be stored in Iceberg metadata for each type?
>> Currently, the spec changes just mention defaults without going into detail
>> about how they are tracked and what rules there are about them.
>>
>> On Wed, Feb 9, 2022 at 6:32 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:
>>>
>>> * Continuing the discussion on the default value PR (already ongoing
>>>   [1]).
>>> * Filing the union type conversion PR (ETA end of next week).
>>> * Moving listing-based Hive table scan using Iceberg to a separate repo
>>>   (likely open source). For this I expect introducing some extension points
>>>   to Iceberg such as making some classes SPI. I hope that the community is
>>>   okay with that.
>>>
>>> By the way, Owen and I synced on the Hive casing behavior, and it is a
>>> bit more involved: Hive lowers the schema case for all fields (including
>>> nested fields) in the Avro case, but only lowers top-level field case and
>>> preserves inner field case for other formats (we experimented with ORC and
>>> Text). Hope this clarifies the confusion.
>>>
>>> [1] https://github.com/apache/iceberg/pull/2496
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Walaa, thanks for this list. I think most of these are definitely
>>>> useful. I think the best one to focus on first is the default values, since
>>>> those will make Iceberg tables behave more like standard SQL tables, which
>>>> is the goal.
>>>>
>>>> I'm really curious to learn more about #1, but I don't think that I
>>>> have enough detail to know whether it is something that fits in the Iceberg
>>>> project. At Netflix, we had an alternative implementation of Hive and Spark
>>>> tables (Spark tables are slightly different) that we similarly used. But we
>>>> didn't write to both at the same time.
>>>>
>>>> For the others, I'm interested in hearing what other people in the
>>>> community find valuable. I don't think I would use #2 or #3, for example.
>>>> That's because we already support a flag for case insensitive column
>>>> resolution that is well supported throughout Iceberg. If you wanted to use
>>>> alternative names, then I'd probably recommend just turning that on...
>>>> although that may not be an option depending on how you're working with a
>>>> table. It would work in Spark, though. This may be a better feature for
>>>> your system that is built on Iceberg.
>>>>
>>>> Reading unions as structs has come up a couple times so that seems like
>>>> people will want it. I think someone attempted to add this support in the
>>>> past, but ran into issues because the spec is clear that these are NOT
>>>> Iceberg files. There is no guarantee that other implementations will read
>>>> them and Iceberg cannot write them in this form. I'm fairly confident that
>>>> not allowing unions to be written is a good choice, but I would support
>>>> being able to read them.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read
>>>>>> Hive tables from Spark, the returned schema is lowercase since Hive
>>>>>> stores all metadata in lowercase mode. If users move to Iceberg, such
>>>>>> readers could break once Iceberg returns proper case schema. This
>>>>>> feature is to add lowercasing for backward compatibility with existing
>>>>>> scripts. This feature is added as an option and is not enabled by
>>>>>> default.
>>>>>>
>>>>>
>>>>> This isn't quite correct. Hive lowercases top-level columns. It does
>>>>> not lowercase field names inside structs.
>>>>>
>>>>>> *3. Hive table proper casing:* conversely, we leverage the Avro
>>>>>> schema to supplement the lower case Hive schema when reading Hive tables.
>>>>>> This is useful if someone wants to still get proper cased schemas while
>>>>>> still in the Hive mode (to be forward-compatible with Iceberg). The same
>>>>>> flag used in (2) is used here.
>>>>>>
>>>>>
>>>>> Are there users of Avro schemas in Hive outside of LinkedIn? I've
>>>>> never seen it used. I don't think you should tie #2 and #3 together.
>>>>>
>>>>> Supporting default values and union types are useful extensions.
>>>>>
>>>>> .. Owen
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
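
For reference, the Hive casing behavior described in the thread (all field names lowercased for Avro, only top-level names lowercased for other formats such as ORC and Text) can be sketched roughly as below. This is only an illustrative model, not Hive or Iceberg code: the nested-dict schema representation and the function names `lowercase_all`, `lowercase_top_level`, and `hive_visible_schema` are hypothetical, and list and map types are not handled.

```python
# Illustrative model only; these names are hypothetical, not Hive/Iceberg APIs.
# A schema is a dict mapping field name -> type string, or -> nested dict (struct).


def lowercase_all(schema):
    """Lowercase every field name, including names inside nested structs."""
    return {
        name.lower(): lowercase_all(ftype) if isinstance(ftype, dict) else ftype
        for name, ftype in schema.items()
    }


def lowercase_top_level(schema):
    """Lowercase only top-level field names; nested names keep their case."""
    return {name.lower(): ftype for name, ftype in schema.items()}


def hive_visible_schema(schema, file_format):
    """Return the schema as a reader would see it, per the thread's description.

    Assumption: "avro" lowercases all levels; any other format (ORC and Text
    were the ones tried in the thread) lowercases the top level only.
    """
    if file_format.lower() == "avro":
        return lowercase_all(schema)
    return lowercase_top_level(schema)


example = {"UserId": "long", "Address": {"ZipCode": "string", "Geo": {"Lat": "double"}}}
avro_view = hive_visible_schema(example, "avro")
orc_view = hive_visible_schema(example, "orc")
```

With the example schema, the Avro view has `zipcode` and `lat` in lowercase, while the ORC view keeps `ZipCode` and `Lat` under the lowercased top-level `address` field. A real implementation would also need to decide what happens when two names collide after lowercasing.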