You left the complex types off of your list (struct, map, array, uniontype). All of them have natural mappings in Iceberg, except for uniontype. Interval is supported on output, but not as a column type. Unfortunately, we have some tables with uniontype, so we'll need a solution for how to deal with it.
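To make that concrete, one possible way to represent a uniontype on the Iceberg side is a struct with a tag field plus one optional field per branch. The sketch below uses Iceberg's Java Types API; the class name, field names, and ids are purely illustrative and not a proposal for the actual encoding.

```java
import org.apache.iceberg.types.Types;

public class UnionTypeSketch {
  // Hypothetical encoding of Hive's uniontype<int,string> as an Iceberg struct:
  // a "tag" field records which branch is populated, and each branch becomes
  // an optional field. Field names and ids are illustrative only.
  static final Types.StructType UNION_AS_STRUCT = Types.StructType.of(
      Types.NestedField.required(1, "tag", Types.IntegerType.get()),     // 0 = int branch, 1 = string branch
      Types.NestedField.optional(2, "field0", Types.IntegerType.get()),  // value when the int branch is set
      Types.NestedField.optional(3, "field1", Types.StringType.get()));  // value when the string branch is set
}
```

Without some extra marker, a reader would see this as a plain struct, which is where the annotation idea below comes in.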
I'm generally in favor of a strict mapping in both type and column mappings. One piece that I think will help a lot is if we add type annotations to Iceberg so that, for example, we could mark a struct as actually being a uniontype. If someone has the use case where they need to support Hive's char or varchar types, it would make sense to define an attribute for the max length.

Vivekanand, what kind of conversions are you needing? Hive has a *lot* of conversions. Many of those conversions are more error-prone than useful. (For example, I seriously doubt anyone found Hive's conversion of timestamps to booleans useful...)

.. Owen

On Tue, Nov 24, 2020 at 3:46 PM Vivekanand Vellanki <vi...@dremio.com> wrote:

> One of the challenges we've had is that Hive is more flexible with schema
> evolution compared to Iceberg. Are you guys also looking at this aspect?
>
> On Tue, Nov 24, 2020 at 8:21 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Team,
>>
>> With Shardul we had a longer discussion yesterday about the schema
>> synchronization between Iceberg and Hive, and we thought that it would be
>> good to ask the opinion of the greater community too.
>>
>> We can have 2 sources for the schemas:
>>
>> 1. The Hive table definition / schema
>> 2. The Iceberg schema
>>
>> If we want Iceberg and Hive to work together, we have to find a way to
>> synchronize them, either by defining a master schema, or by defining a
>> compatibility matrix and conversions between them.
>> In previous Hive integrations we can see examples of both:
>>
>> - With Avro there is a possibility to read the schema from the data
>>   file directly, and the master schema is the one which is in Avro.
>> - With HBase you can provide a mapping between HBase columns by
>>   providing the *hbase.columns.mapping* table property.
>>
>> Maybe the differences are caused by how the storage format is perceived -
>> Avro being a simple storage format, HBase being an independent query
>> engine - but this is just a questionable opinion :)
>>
>> I would like us to decide how the Iceberg - Hive integration should be
>> handled.
>>
>> There are at least 2 questions:
>>
>> 1. How flexible should we be with the type mapping between Hive and
>>    Iceberg types?
>>    1. Shall we have a strict mapping - This way, if we have an Iceberg
>>       schema, we can immediately derive the Hive schema from it.
>>    2. Shall we be more relaxed on this - Automatic casting / conversions
>>       can be built into the integration, allowing the users to skip view
>>       and/or UDF creation for typical conversions.
>> 2. How flexible should we be with column mappings?
>>    1. Shall we have a strict 1-on-1 mapping - This way, if we have an
>>       Iceberg schema, we can immediately derive the Hive schema from it.
>>       We still have to omit Iceberg columns which do not have a
>>       representation available in Hive.
>>    2. Shall we allow flexibility on Hive table creation to choose
>>       specific Iceberg columns, instead of immediately creating a Hive
>>       table with all of the columns from the Iceberg table.
>>
>> Currently I would choose:
>>
>> - Strict type mapping, because of the following reasons:
>>   - Faster execution (we want as few checks and conversions as possible,
>>     since they will be executed for every record)
>>   - Complexity exponentially increases with every conversion
>> - Flexible column mapping:
>>   - I think it will be a typical situation that we have a huge Iceberg
>>     table storing the facts with a big number of columns, and we would
>>     like to create multiple Hive tables above that. The problem could be
>>     solved by creating the table and adding a view above that table, but
>>     I think it would be more user-friendly if we could avoid this extra
>>     step.
>>   - The added complexity is at table creation / query planning, which
>>     has a far smaller impact on the overall performance.
>>
>> I would love to hear your thoughts as well, since the choice should
>> really depend on the user base and the expected use-cases.
>>
>> Thanks,
>> Peter
>>
>>
>> Appendix 1 - Type mapping proposal:
>>
>> Iceberg type  | Hive2 type       | Hive3 type                    | Status
>> ------------- | ---------------- | ----------------------------- | ------
>> boolean       | BOOLEAN          | BOOLEAN                       | OK
>> int           | INTEGER          | INTEGER                       | OK
>> long          | BIGINT           | BIGINT                        | OK
>> float         | FLOAT            | FLOAT                         | OK
>> double        | DOUBLE           | DOUBLE                        | OK
>> decimal(P,S)  | DECIMAL(P,S)     | DECIMAL(P,S)                  | OK
>> binary        | BINARY           | BINARY                        | OK
>> date          | DATE             | DATE                          | OK
>> timestamp     | TIMESTAMP        | TIMESTAMP                     | OK
>> timestamptz   | TIMESTAMP        | TIMESTAMP WITH LOCAL TIMEZONE | TODO
>> string        | STRING           | STRING                        | OK
>> uuid          | STRING or BINARY | STRING or BINARY              | TODO
>> time          | -                | -                             | -
>> fixed(L)      | -                | -                             | -
>> -             | TINYINT          | TINYINT                       | -
>> -             | SMALLINT         | SMALLINT                      | -
>> -             | INTERVAL         | INTERVAL                      | -
>> -             | VARCHAR          | VARCHAR                       | -
>> -             | CHAR             | CHAR                          | -
>>
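To make the strict option concrete, here is a minimal sketch of a one-way primitive mapping along the lines of the Appendix 1 table, assuming Iceberg's Java Types API and Hive's TypeInfoFactory. The class and method names are invented for illustration; this is not the actual integration code.

```java
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class StrictTypeMappingSketch {
  // One-way, strict mapping of Iceberg primitive types to Hive TypeInfos,
  // roughly following the Appendix 1 table. Types with no Hive counterpart
  // (time, fixed, uuid) are rejected instead of being converted.
  static TypeInfo toHive(Type type) {
    switch (type.typeId()) {
      case BOOLEAN:   return TypeInfoFactory.booleanTypeInfo;
      case INTEGER:   return TypeInfoFactory.intTypeInfo;
      case LONG:      return TypeInfoFactory.longTypeInfo;
      case FLOAT:     return TypeInfoFactory.floatTypeInfo;
      case DOUBLE:    return TypeInfoFactory.doubleTypeInfo;
      case DATE:      return TypeInfoFactory.dateTypeInfo;
      case STRING:    return TypeInfoFactory.stringTypeInfo;
      case BINARY:    return TypeInfoFactory.binaryTypeInfo;
      case TIMESTAMP: // on Hive 3, timestamptz could instead map to TIMESTAMP WITH LOCAL TIME ZONE
        return TypeInfoFactory.timestampTypeInfo;
      case DECIMAL:
        Types.DecimalType decimal = (Types.DecimalType) type;
        return TypeInfoFactory.getDecimalTypeInfo(decimal.precision(), decimal.scale());
      default:
        throw new IllegalArgumentException("No strict Hive mapping for Iceberg type: " + type);
    }
  }
}
```

A relaxed mapping would put conversion logic behind each of these cases, which is exactly the per-record cost the strict option avoids.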