Thanks Ryan! Yes, there is an active discussion of the spec aspect on the PR.
On Fri, Feb 11, 2022 at 8:47 AM Ryan Blue <b...@tabular.io> wrote:

> Sounds great. Thanks for the update! That PR is on my list to take a look
> at, but I still recommend starting with the spec changes. For example, how
> should default values be stored in Iceberg metadata for each type?
> Currently, the spec changes just mention defaults without going into
> detail about how they are tracked and what rules apply to them.
>
> On Wed, Feb 9, 2022 at 6:32 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:
>>
>> * Continuing the discussion on the default value PR (already ongoing [1]).
>> * Filing the union type conversion PR (ETA end of next week).
>> * Moving the listing-based Hive table scan using Iceberg to a separate
>> repo (likely open source). For this I expect to introduce some extension
>> points in Iceberg, such as making some classes part of the SPI. I hope
>> the community is okay with that.
>>
>> By the way, Owen and I synced on the Hive casing behavior, and it is a
>> bit more involved: Hive lowercases all field names (including nested
>> fields) in the Avro case, but only lowercases top-level field names and
>> preserves inner field names for other formats (we experimented with ORC
>> and Text). Hope this clarifies the confusion.
>>
>> [1] https://github.com/apache/iceberg/pull/2496
>>
>> Thanks,
>> Walaa.
>>
>> On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Walaa, thanks for this list. I think most of these are definitely
>>> useful. I think the best one to focus on first is the default values,
>>> since those will make Iceberg tables behave more like standard SQL
>>> tables, which is the goal.
>>>
>>> I'm really curious to learn more about #1, but I don't think that I
>>> have enough detail to know whether it is something that fits in the
>>> Iceberg project. At Netflix, we had an alternative implementation of
>>> Hive and Spark tables (Spark tables are slightly different) that we
>>> similarly used. But we didn't write to both at the same time.
>>>
>>> For the others, I'm interested in hearing what other people in the
>>> community find valuable. I don't think I would use #2 or #3, for
>>> example. That's because we already have a flag for case-insensitive
>>> column resolution that is well supported throughout Iceberg. If you
>>> wanted to use alternative names, then I'd probably recommend just
>>> turning that on... although that may not be an option depending on how
>>> you're working with a table. It would work in Spark, though. This may
>>> be a better feature for your system that is built on Iceberg.
>>>
>>> Reading unions as structs has come up a couple of times, so it seems
>>> like something people will want. I think someone attempted to add this
>>> support in the past, but ran into issues because the spec is clear that
>>> these are NOT Iceberg files. There is no guarantee that other
>>> implementations will read them, and Iceberg cannot write them in this
>>> form. I'm fairly confident that not allowing unions to be written is a
>>> good choice, but I would support being able to read them.
>>>
>>> Ryan
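For concreteness, the case-insensitive column resolution Ryan mentions above is exposed at scan time in the Iceberg Java API. A minimal sketch, assuming a table loaded elsewhere (the column name "memberid" is purely illustrative):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.types.Types;

public class CaseInsensitiveResolution {

  // Plan a scan that resolves column names without regard to case, so a
  // projection of "memberid" matches a column declared as "memberId".
  static TableScan planScan(Table table) {
    return table.newScan()
        .caseSensitive(false)
        .select("memberid");
  }

  // Schema lookups also have a case-insensitive variant.
  static Types.NestedField lookup(Schema schema) {
    return schema.caseInsensitiveFindField("memberid");
  }
}
```

In Spark, column resolution generally follows the session's spark.sql.caseSensitive setting, which is false by default.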
>>> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
>>> wrote:
>>>
>>>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read
>>>>> Hive tables from Spark, the returned schema is lowercase since Hive
>>>>> stores all metadata in lowercase. If users move to Iceberg, such
>>>>> readers could break once Iceberg returns a proper-case schema. This
>>>>> feature adds lowercasing for backward compatibility with existing
>>>>> scripts. It is added as an option and is not enabled by default.
>>>>>
>>>> This isn't quite correct. Hive lowercases top-level columns. It does
>>>> not lowercase field names inside structs.
>>>>
>>>>> *3. Hive table proper casing:* Conversely, we leverage the Avro
>>>>> schema to supplement the lowercase Hive schema when reading Hive
>>>>> tables. This is useful if someone wants to get proper-cased schemas
>>>>> while still in Hive mode (to be forward-compatible with Iceberg). The
>>>>> same flag used in (2) is used here.
>>>>>
>>>> Are there users of Avro schemas in Hive outside of LinkedIn? I've
>>>> never seen it used. I don't think you should tie #2 and #3 together.
>>>>
>>>> Supporting default values and union types are both useful extensions.
>>>>
>>>> .. Owen
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular
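As an aside on the union-as-struct reading discussed above: one common way to represent a non-option Avro union at read time is as a struct with a tag plus one optional field per branch. A minimal sketch using Iceberg's type API; the field names ("tag", "field0", "field1") and field IDs are illustrative assumptions, not necessarily the representation the eventual support uses:

```java
import org.apache.iceberg.types.Types;

public class UnionAsStructSketch {

  // An Avro union of [int, string] (a non-option union) modeled as an
  // Iceberg struct: "tag" records which branch is populated, and exactly
  // one of the optional branch fields is non-null per row.
  static Types.StructType unionAsStruct() {
    return Types.StructType.of(
        Types.NestedField.required(1, "tag", Types.IntegerType.get()),
        Types.NestedField.optional(2, "field0", Types.IntegerType.get()),
        Types.NestedField.optional(3, "field1", Types.StringType.get()));
  }
}
```

Writes of union types would still be rejected, consistent with Ryan's point that the spec does not allow them; the struct form is only a read-time view.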