I agree that a 1-to-1 type mapping is the right option. Some additional mappings should be supported; I think it should be fine to use VARCHAR in DDL to produce a string column in Iceberg.
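
For illustration only, here is a rough sketch of what that kind of strict mapping could look like, assuming the Iceberg Java API's Types factory (the helper class and method names are made up; this is not the actual Hive integration code):

    import org.apache.iceberg.types.Type;
    import org.apache.iceberg.types.Types;

    // Hypothetical helper: a strict 1-to-1 mapping from Hive type names to Iceberg types.
    // VARCHAR(N)/CHAR(N) are accepted but drop their length and become plain strings;
    // anything without a clean Iceberg equivalent is rejected instead of being coerced.
    public class HiveTypeMapping {
      static Type toIceberg(String hiveType) {
        String t = hiveType.toLowerCase().trim();
        if (t.startsWith("varchar") || t.startsWith("char")) {
          return Types.StringType.get();  // the additional mapping: produce a string column
        }
        switch (t) {
          case "boolean":   return Types.BooleanType.get();
          case "int":       return Types.IntegerType.get();
          case "bigint":    return Types.LongType.get();
          case "float":     return Types.FloatType.get();
          case "double":    return Types.DoubleType.get();
          case "string":    return Types.StringType.get();
          case "binary":    return Types.BinaryType.get();
          case "date":      return Types.DateType.get();
          case "timestamp": return Types.TimestampType.withoutZone();
          default:
            // decimal(P,S) would be parsed into Types.DecimalType.of(P, S); tinyint,
            // smallint, interval, etc. have no 1-to-1 match and should fail fast.
            throw new IllegalArgumentException("No 1-to-1 Iceberg type for Hive type: " + hiveType);
        }
      }
    }
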
Iceberg is also strict about type promotion, and I don't think that we should confuse type promotion with how data is returned for an engine. Using the example of long to string from Vivekanand, it makes no sense for Hive to read all values as strings when another engine would read them as longs. This would also introduce write errors because Hive may attempt to write strings to the table when it must store longs. The right solution for this is type promotion, where the long column is promoted to string. Iceberg doesn't currently support that, but we can loosen those restrictions if we choose.
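
To make that concrete, here is a minimal sketch of what Iceberg's schema evolution API accepts today (assuming the Java API; the table handle and column names are made up for illustration): widening promotions go through UpdateSchema, while a long-to-string change is rejected rather than being handled by read-time coercion.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.types.Types;

    public class PromotionSketch {
      // Hypothetical example; "id", "ratio" and "code" are made-up column names.
      static void evolve(Table table) {
        // Promotions Iceberg allows today: int -> long, float -> double,
        // and widening the precision of a decimal.
        table.updateSchema()
            .updateColumn("id", Types.LongType.get())       // int -> long: allowed
            .updateColumn("ratio", Types.DoubleType.get())  // float -> double: allowed
            .commit();

        // long -> string is not an allowed promotion, so this update fails validation
        // instead of silently changing how every engine reads or writes the column.
        table.updateSchema()
            .updateColumn("code", Types.StringType.get())
            .commit();
      }
    }
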
I think the column mapping should also be 1-to-1. Hive would have trouble writing to a table if it didn't include all required columns. I think that the right thing is for all engines to provide uniform access to all columns.

Last, Iceberg must be the source of truth for schema information. Schema in the Hive metastore should always come from Iceberg, or else columns would be inaccessible and Hive would require extra DDL to access new columns. Hive should have the same view of an Iceberg table that any other engine would.

rb

On Wed, Nov 25, 2020 at 8:57 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:

> Hi Everyone,
>
> In Impala we face the same challenges. I think a strict 1-to-1 type mapping would be beneficial because that way we could derive the Iceberg schema from the Hive schema, not just the other way around. So we could just naturally create Iceberg tables via DDL.
>
> We should use the same type mapping for Hive and Impala, which is almost true already. The type mappings in Impala reflect the table where we have a 1-to-1 mapping. Impala also has a mapping between fixed(L) and CHAR(L), although the latter is restricted to UTF8 strings AFAIK. I added this mapping to Impala master but I'm not sure if it's correct. Currently Iceberg definitely dislikes it as it throws exceptions on the read path (because of the UTF8 annotation), so I will probably just remove it.
>
> Right now Impala throws an exception when there is no type mapping, for example when CREATE TABLE with VARCHAR, or when creating an external table from an Iceberg table that has UUID.
>
> For column mapping we currently also do a strict mapping. We consider the Iceberg table schema as the source of truth, and always derive the Hive schema from it. We cannot create an external table with only a subset of the columns. And as I mentioned above, we get an exception when facing an unsupported type.
> I think strict column mapping is useful because we might also want to write the tables. VIEWs are more explicit for reading only a subset of columns, in my opinion.
>
> Cheers,
> Zoltan
>
>
> On Wed, Nov 25, 2020 at 5:17 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
>
>> Some of the conversions we are seeing are:
>>
>>    - Decimal to Decimal; not just limited to increasing precision as with Iceberg
>>    - varchar<N> to string
>>    - numeric type to numeric type (float to Decimal, double to Decimal, Decimal to double, etc)
>>    - numeric type to string
>>
>>
>> On Tue, Nov 24, 2020 at 11:43 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
>>
>>> You left the complex types off of your list (struct, map, array, uniontype). All of them have natural mappings in Iceberg, except for uniontype. Interval is supported on output, but not as a column type. Unfortunately, we have some tables with uniontype, so we'll need a solution for how to deal with it.
>>>
>>> I'm generally in favor of a strict mapping in both type and column mappings. One piece that I think will help a lot is if we add type annotations to Iceberg so that, for example, we could mark a struct as actually being a uniontype. If someone has the use case where they need to support Hive's char or varchar types, it would make sense to define an attribute for the max length.
>>>
>>> Vivekanand, what kind of conversions are you needing? Hive has a *lot* of conversions. Many of those conversions are more error-prone than useful. (For example, I seriously doubt anyone found Hive's conversion of timestamps to booleans useful...)
>>>
>>> .. Owen
>>>
>>> On Tue, Nov 24, 2020 at 3:46 PM Vivekanand Vellanki <vi...@dremio.com> wrote:
>>>
>>>> One of the challenges we've had is that Hive is more flexible with schema evolution compared to Iceberg. Are you guys also looking at this aspect?
>>>>
>>>> On Tue, Nov 24, 2020 at 8:21 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> With Shardul we had a longer discussion yesterday about the schema synchronization between Iceberg and Hive, and we thought that it would be good to ask the opinion of the greater community too.
>>>>>
>>>>> We can have 2 sources for the schemas:
>>>>>
>>>>>    1. Hive table definition / schema
>>>>>    2. Iceberg schema.
>>>>>
>>>>> If we want Iceberg and Hive to work together, we have to find a way to synchronize them, either by defining a master schema, or by defining a compatibility matrix and conversions between them.
>>>>> In previous Hive integrations we can see examples for both:
>>>>>
>>>>>    - With Avro there is a possibility to read the schema from the data file directly, and the master schema is the one which is in Avro.
>>>>>    - With HBase you can provide a mapping between HBase columns by providing the *hbase.columns.mapping* table property.
>>>>>
>>>>> Maybe the differences are caused by how the storage format is perceived (Avro being a simple storage format, HBase being an independent query engine), but this is just a questionable opinion :)
>>>>>
>>>>> I would like us to decide how the Iceberg - Hive integration should be handled.
>>>>>
>>>>> There are at least 2 questions:
>>>>>
>>>>>    1. How flexible should we be with the type mapping between Hive and Iceberg types?
>>>>>       1. Shall we have a strict mapping - This way, if we have an Iceberg schema, we can immediately derive the Hive schema from it.
>>>>>       2. Shall we be more relaxed on this - Automatic casting / conversions can be built into the integration, allowing the users to skip view and/or UDF creation for typical conversions.
>>>>>    2. How flexible should we be with column mappings?
>>>>>       1. Shall we have a strict 1-on-1 mapping - This way, if we have an Iceberg schema, we can immediately derive the Hive schema from it. We still have to omit Iceberg columns which do not have a representation available in Hive.
>>>>>       2. Shall we allow flexibility on Hive table creation to choose specific Iceberg columns instead of immediately creating a Hive table with all of the columns from the Iceberg table.
>>>>>
>>>>> Currently I would choose:
>>>>>
>>>>>    - Strict type mapping, because of the following reasons:
>>>>>       - Faster execution (we want as few checks and conversions as possible, since they will be executed for every record)
>>>>>       - Complexity increases exponentially with every conversion
>>>>>    - Flexible column mapping:
>>>>>       - I think it will be a typical situation that we have a huge Iceberg table storing the facts with a big number of columns, and we would like to create multiple Hive tables above that. The problem could be solved by creating the table and adding a view above that table, but I think it would be more user-friendly if we could avoid this extra step.
>>>>>       - The added complexity is at table creation / query planning, which has a far smaller impact on the overall performance.
>>>>>
>>>>> I would love to hear your thoughts as well, since the choice should really depend on the user base and on the expected use-cases.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>>
>>>>> Appendix 1 - Type mapping proposal:
>>>>>
>>>>> Iceberg type    Hive2 type         Hive3 type                      Status
>>>>> boolean         BOOLEAN            BOOLEAN                         OK
>>>>> int             INTEGER            INTEGER                         OK
>>>>> long            BIGINT             BIGINT                          OK
>>>>> float           FLOAT              FLOAT                           OK
>>>>> double          DOUBLE             DOUBLE                          OK
>>>>> decimal(P,S)    DECIMAL(P,S)       DECIMAL(P,S)                    OK
>>>>> binary          BINARY             BINARY                          OK
>>>>> date            DATE               DATE                            OK
>>>>> timestamp       TIMESTAMP          TIMESTAMP                       OK
>>>>> timestamptz     TIMESTAMP          TIMESTAMP WITH LOCAL TIMEZONE   TODO
>>>>> string          STRING             STRING                          OK
>>>>> uuid            STRING or BINARY   STRING or BINARY                TODO
>>>>> time            -                  -                               -
>>>>> fixed(L)        -                  -                               -
>>>>> -               TINYINT            TINYINT                         -
>>>>> -               SMALLINT           SMALLINT                        -
>>>>> -               INTERVAL           INTERVAL                        -
>>>>> -               VARCHAR            VARCHAR                         -
>>>>> -               CHAR               CHAR                            -

--
Ryan Blue
Software Engineer
Netflix