Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:

* Continuing the discussion on the default value PR (already ongoing [1]).
* Filing the union type conversion PR (ETA end of next week).
* Moving the listing-based Hive table scan using Iceberg to a separate repo (likely open source). For this I expect to introduce some extension points to Iceberg, such as making some classes SPI. I hope the community is okay with that.
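To make the union-to-struct idea discussed in this thread concrete, here is a rough sketch of one common way to model a union as a struct: a `tag` field recording which branch is set, plus one optional field per non-null branch. This is purely illustrative; the names (`union_to_struct`, `tag`, `fieldN`) are my own, and this is not the implementation the PR will propose.

```python
def union_to_struct(branches):
    """Map a list of union branch type names to a struct-like schema dict.

    `branches` is e.g. ["null", "int", "string"]. The "null" branch (if
    present) is dropped, since nullability is modeled by making every
    branch field optional. The `tag` field records which branch is set.
    """
    fields = [{"name": "tag", "type": "int", "required": True}]
    for i, branch in enumerate(b for b in branches if b != "null"):
        fields.append({"name": f"field{i}", "type": branch, "required": False})
    return {"type": "struct", "fields": fields}

# A union ["null", "int", "string"] becomes a struct with fields:
# tag (int, required), field0 (int, optional), field1 (string, optional).
print(union_to_struct(["null", "int", "string"]))
```

At read time, a value from branch `i` would populate `tag = i` and `field{i}`, leaving the other branch fields null; this is why such structs can be read but, as discussed below, not written back as valid Iceberg data.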
By the way, Owen and I synced on the Hive casing behavior, and it is a bit more involved: in the Avro case, Hive lowercases all field names (including nested fields), but for other formats (we experimented with ORC and Text) it lowercases only top-level field names and preserves the case of inner fields. Hope this clarifies the confusion.

[1] https://github.com/apache/iceberg/pull/2496

Thanks,
Walaa

On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:

> Walaa, thanks for this list. I think most of these are definitely useful.
> I think the best one to focus on first is the default values, since those
> will make Iceberg tables behave more like standard SQL tables, which is
> the goal.
>
> I'm really curious to learn more about #1, but I don't think that I have
> enough detail to know whether it is something that fits in the Iceberg
> project. At Netflix, we had an alternative implementation of Hive and
> Spark tables (Spark tables are slightly different) that we similarly
> used. But we didn't write to both at the same time.
>
> For the others, I'm interested in hearing what other people in the
> community find valuable. I don't think I would use #2 or #3, for example.
> That's because we already support a flag for case-insensitive column
> resolution that is well supported throughout Iceberg. If you wanted to
> use alternative names, then I'd probably recommend just turning that on,
> although that may not be an option depending on how you're working with a
> table. It would work in Spark, though. This may be a better feature for
> your system that is built on Iceberg.
>
> Reading unions as structs has come up a couple of times, so it seems like
> people will want it. I think someone attempted to add this support in the
> past, but ran into issues because the spec is clear that these are NOT
> Iceberg files. There is no guarantee that other implementations will read
> them, and Iceberg cannot write them in this form.
> I'm fairly confident that not allowing unions to be written is a good
> choice, but I would support being able to read them.
>
> Ryan
>
> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read Hive
>>> tables from Spark, the returned schema is lowercase, since Hive stores
>>> all metadata in lowercase. If users move to Iceberg, such readers could
>>> break once Iceberg returns proper-case schemas. This feature adds
>>> lowercasing for backward compatibility with existing scripts. It is
>>> added as an option and is not enabled by default.
>>>
>> This isn't quite correct. Hive lowercases top-level columns. It does not
>> lowercase field names inside structs.
>>
>>> *3. Hive table proper casing:* Conversely, we leverage the Avro schema
>>> to supplement the lowercase Hive schema when reading Hive tables. This
>>> is useful if someone wants to still get proper-cased schemas while
>>> still in Hive mode (to be forward-compatible with Iceberg). The same
>>> flag used in (2) is used here.
>>>
>> Are there users of Avro schemas in Hive outside of LinkedIn? I've never
>> seen it used. I don't think you should tie #2 and #3 together.
>>
>> Supporting default values and union types are useful extensions.
>>
>> .. Owen
>
> --
> Ryan Blue
> Tabular