Thanks, Ryan! Yes, there is an active discussion about the spec aspect on the PR.
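
For anyone following along, here is one possible shape for recording a per-field
default in the schema metadata. This is purely illustrative and not necessarily
what the PR proposes; the key name and encoding are hypothetical:

  # Hypothetical sketch: a schema field carrying a default, written as the
  # Python equivalent of the JSON that would live in table metadata.
  field_with_default = {
      "id": 3,
      "name": "shipping_cost",     # illustrative column name
      "required": False,
      "type": "decimal(10,2)",
      "initial-default": "0.00",   # hypothetical key: value used when reading rows written before the column existed
  }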

On Fri, Feb 11, 2022 at 8:47 AM Ryan Blue <b...@tabular.io> wrote:

> Sounds great. Thanks for the update! That PR is on my list to take a look
> at, but I still recommend starting with the spec changes. For example, how
> should default values be stored in Iceberg metadata for each type?
> Currently, the spec changes just mention defaults without going into detail
> about how they are tracked and what rules there are about them.
>
> On Wed, Feb 9, 2022 at 6:32 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:
>>
>> * Continuing the discussion on the default value PR (already ongoing [1]).
>> * Filing the union type conversion PR (ETA end of next week).
>> * Moving the listing-based Hive table scan that uses Iceberg to a separate repo
>> (likely open source). For this, I expect to introduce some extension points
>> in Iceberg, such as making some classes part of the SPI. I hope the community is
>> okay with that.
>>
>> By the way, Owen and I synced on the Hive casing behavior, and it is a
>> bit more involved: Hive lowercases all field names (including nested
>> fields) for Avro, but for other formats it only lowercases top-level field
>> names and preserves the case of inner fields (we experimented with ORC and
>> Text). Hope this clarifies the confusion.
>>
>> [1] https://github.com/apache/iceberg/pull/2496
>>
>> Thanks,
>> Walaa.
>>
>>
>>
>> On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Walaa, thanks for this list. I think most of these are definitely
>>> useful. I think the best one to focus on first is the default values, since
>>> those will make Iceberg tables behave more like standard SQL tables, which
>>> is the goal.
>>>
>>> I'm really curious to learn more about #1, but I don't think that I have
>>> enough detail to know whether it is something that fits in the Iceberg
>>> project. At Netflix, we had an alternative implementation of Hive and Spark
>>> tables (Spark tables are slightly different) that we similarly used. But we
>>> didn't write to both at the same time.
>>>
>>> For the others, I'm interested in hearing what other people in the
>>> community find valuable. I don't think I would use #2 or #3, for example.
>>> That's because Iceberg already has a flag for case-insensitive column
>>> resolution that is well supported throughout the project. If you wanted to use
>>> alternative names, then I'd probably recommend just turning that on...
>>> although that may not be an option depending on how you're working with a
>>> table. It would work in Spark, though. This may be a better feature for
>>> your system that is built on Iceberg.
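>>>
>>> As a quick illustration (assuming Spark is the engine and that the
>>> session-level spark.sql.caseSensitive setting is the switch in play; the
>>> table and column names below are made up):
>>>
>>>   spark.conf.set("spark.sql.caseSensitive", "false")
>>>   # The column was created as "memberId"; with case-insensitive resolution
>>>   # the lowercase reference below still resolves against the table schema.
>>>   spark.table("db.events").select("memberid").show()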
>>>
>>> Reading unions as structs has come up a couple of times, so it seems like
>>> something people will want. I think someone attempted to add this support in the
>>> past, but ran into issues because the spec is clear that these are NOT
>>> Iceberg files. There is no guarantee that other implementations will read
>>> them and Iceberg cannot write them in this form. I'm fairly confident that
>>> not allowing unions to be written is a good choice, but I would support
>>> being able to read them.
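>>>
>>> To make the read-side mapping concrete, the shape that usually comes up
>>> (just a sketch, not a committed design; the field names are illustrative)
>>> is a struct with a tag plus one field per union branch:
>>>
>>>   # Sketch: describe an Avro union's branches as a struct with a "tag"
>>>   # recording which branch was set, plus one field per branch.
>>>   def union_as_struct(branches):
>>>       fields = [("tag", "int")]
>>>       fields += [("field%d" % i, t) for i, t in enumerate(branches)]
>>>       return {"type": "struct", "fields": fields}
>>>
>>>   union_as_struct(["string", "long"])
>>>   # -> {'type': 'struct', 'fields': [('tag', 'int'), ('field0', 'string'), ('field1', 'long')]}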
>>>
>>> Ryan
>>>
>>> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read
>>>>> Hive tables from Spark, the returned schema is lowercase since Hive stores
>>>>> all metadata in lowercase. If users move to Iceberg, such readers could
>>>>> break once Iceberg returns a proper-case schema. This feature adds
>>>>> lowercasing for backward compatibility with existing scripts. It is
>>>>> exposed as an option and is not enabled by default.
>>>>>
>>>>
>>>> This isn't quite correct. Hive lowercases top-level columns. It does
>>>> not lowercase field names inside structs.
>>>>
>>>>
>>>>> *3. Hive table proper casing:* Conversely, we leverage the Avro
>>>>> schema to supplement the lowercase Hive schema when reading Hive tables.
>>>>> This is useful if someone wants to get proper-case schemas while still
>>>>> in Hive mode (to be forward-compatible with Iceberg). The same
>>>>> flag used in (2) is used here.
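>>>>>
>>>>> As a sketch of the idea (assuming the table carries its Avro schema in the
>>>>> usual avro.schema.literal property; the schema and column names below are
>>>>> made up), the proper-case names can be recovered by matching the lowercase
>>>>> Hive names against the Avro field names case-insensitively:
>>>>>
>>>>>   import json
>>>>>
>>>>>   def proper_case_names(avro_schema_literal, hive_columns):
>>>>>       # Map each lowercase Hive column back to the cased name from the Avro schema.
>>>>>       cased = {f["name"].lower(): f["name"]
>>>>>                for f in json.loads(avro_schema_literal)["fields"]}
>>>>>       return [cased.get(c, c) for c in hive_columns]
>>>>>
>>>>>   schema = '{"type": "record", "name": "r", "fields": [{"name": "memberId", "type": "long"}]}'
>>>>>   proper_case_names(schema, ["memberid"])   # -> ['memberId']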
>>>>>
>>>>
>>>> Are there users of Avro schemas in Hive outside of LinkedIn? I've never
>>>> seen it used. I don't think you should tie #2 and #3 together.
>>>>
>>>> Supporting default values and union types are useful extensions.
>>>>
>>>> .. Owen
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
