Hi Team,

Shardul and I had a longer discussion yesterday about schema synchronization 
between Iceberg and Hive, and we thought it would be good to ask the opinion 
of the wider community too.

We can have two sources for the schemas:
- the Hive table definition / schema,
- the Iceberg schema.

If we want Iceberg and Hive to work together, we have to find a way to 
synchronize them: either by defining a master schema, or by defining a 
compatibility matrix and conversions between them.
In previous Hive integrations we can see examples for both:
- With Avro it is possible to read the schema from the data file directly, so 
the master schema is the one stored in Avro (see the sketch below).
- With HBase you can provide a mapping between Hive and HBase columns by 
setting the hbase.columns.mapping table property.
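
Just to make the Avro case concrete, here is a minimal Java sketch of reading 
the schema embedded in a data file with the plain Avro API (the file path is 
made up for the example):

  import java.io.File;
  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;

  public class AvroSchemaPeek {
    public static void main(String[] args) throws IOException {
      // The schema is stored in the data file header, so the reader can parse it for us
      File dataFile = new File("/tmp/facts.avro");  // hypothetical path
      try (DataFileReader<GenericRecord> reader =
               new DataFileReader<>(dataFile, new GenericDatumReader<>())) {
        Schema schema = reader.getSchema();  // this is the "master" schema
        System.out.println(schema.toString(true));
      }
    }
  }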

Maybe the differences are caused by how the storage format is perceived: Avro 
being a simple storage format, HBase being an independent query engine. But 
this is just my own, debatable opinion :)

I would like us to decide how Iceberg - Hive integration should be handled.

There are at least two questions:
1. How flexible should we be with the type mapping between Hive and Iceberg types?
   - Shall we have a strict mapping? This way, if we have an Iceberg schema, we 
     can immediately derive the Hive schema from it.
   - Shall we be more relaxed? Automatic casting / conversions could be built 
     into the integration, allowing users to skip view and/or UDF creation for 
     typical conversions.
2. How flexible should we be with column mappings?
   - Shall we have a strict 1-on-1 mapping? This way, if we have an Iceberg 
     schema, we can immediately derive the Hive schema from it. We would still 
     have to omit Iceberg columns which do not have a representation available 
     in Hive.
   - Shall we allow flexibility at Hive table creation to choose specific 
     Iceberg columns, instead of automatically creating a Hive table with all 
     of the columns from the Iceberg table?

Currently I would choose:
- Strict type mapping, for the following reasons:
  - Faster execution (we want as few checks and conversions as possible, since 
    they will be executed for every record)
  - Complexity increases exponentially with every conversion we add
- Flexible column mapping, because:
  - I think it will be a typical situation that we have a huge Iceberg table 
    storing the facts with a big number of columns, and we would like to create 
    multiple Hive tables above it. The problem could be solved by creating the 
    table and adding a view above that table, but I think it would be more 
    user-friendly if we could avoid this extra step (see the sketch after this 
    list).
  - The added complexity is at table creation / query planning, which has a far 
    smaller impact on the overall performance.
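
To make the flexible column mapping concrete, here is a minimal Java sketch of 
what I have in mind, using Iceberg's existing Schema.select() to project only 
the columns a given Hive table would expose (the schema and column names are 
made up, and how the selected column list would be provided - e.g. through a 
table property - is an open question):

  import org.apache.iceberg.Schema;
  import org.apache.iceberg.types.Types;

  public class ColumnMappingSketch {
    public static void main(String[] args) {
      // A wide Iceberg "facts" schema (only a few columns shown for brevity)
      Schema icebergSchema = new Schema(
          Types.NestedField.required(1, "event_id", Types.LongType.get()),
          Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
          Types.NestedField.optional(3, "user_id", Types.LongType.get()),
          Types.NestedField.optional(4, "payload", Types.StringType.get()));

      // Strict 1-on-1 mapping: the Hive table would expose every column above.
      // Flexible mapping: the Hive table declares only the columns it needs,
      // and the integration projects the Iceberg schema accordingly.
      Schema projected = icebergSchema.select("event_id", "user_id");

      System.out.println(projected.asStruct());
    }
  }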

I would love to hear your thoughts as well, since the choice should really 
depend on the user base and the expected use cases.

Thanks,
Peter


Appendix 1 - Type mapping proposal:
Iceberg type    Hive 2 type         Hive 3 type                       Status
boolean         BOOLEAN             BOOLEAN                           OK
int             INTEGER             INTEGER                           OK
long            BIGINT              BIGINT                            OK
float           FLOAT               FLOAT                             OK
double          DOUBLE              DOUBLE                            OK
decimal(P,S)    DECIMAL(P,S)        DECIMAL(P,S)                      OK
binary          BINARY              BINARY                            OK
date            DATE                DATE                              OK
timestamp       TIMESTAMP           TIMESTAMP                         OK
timestamptz     TIMESTAMP           TIMESTAMP WITH LOCAL TIME ZONE    TODO
string          STRING              STRING                            OK
uuid            STRING or BINARY    STRING or BINARY                  TODO
time            -                   -                                 -
fixed(L)        -                   -                                 -
-               TINYINT             TINYINT                           -
-               SMALLINT            SMALLINT                          -
-               INTERVAL            INTERVAL                          -
-               VARCHAR             VARCHAR                           -
-               CHAR                CHAR                              -
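
For illustration, a rough Java sketch of how the strict mapping above could 
look on the Hive side, using Hive's TypeInfoFactory (only the primitive types 
from the table are covered, the TODO rows are marked as such, and this is just 
a sketch of the idea, not a proposed implementation):

  import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
  import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
  import org.apache.iceberg.types.Type;
  import org.apache.iceberg.types.Types;

  public class StrictTypeMappingSketch {

    // Maps an Iceberg primitive type to the corresponding Hive TypeInfo,
    // following the table above. Types without a Hive representation are rejected.
    static TypeInfo toHiveType(Type type) {
      switch (type.typeId()) {
        case BOOLEAN:   return TypeInfoFactory.booleanTypeInfo;
        case INTEGER:   return TypeInfoFactory.intTypeInfo;
        case LONG:      return TypeInfoFactory.longTypeInfo;
        case FLOAT:     return TypeInfoFactory.floatTypeInfo;
        case DOUBLE:    return TypeInfoFactory.doubleTypeInfo;
        case DECIMAL: {
          Types.DecimalType decimal = (Types.DecimalType) type;
          return TypeInfoFactory.getDecimalTypeInfo(decimal.precision(), decimal.scale());
        }
        case BINARY:    return TypeInfoFactory.binaryTypeInfo;
        case DATE:      return TypeInfoFactory.dateTypeInfo;
        case TIMESTAMP: return TypeInfoFactory.timestampTypeInfo;  // timestamptz handling is TODO
        case STRING:    return TypeInfoFactory.stringTypeInfo;
        case UUID:      return TypeInfoFactory.stringTypeInfo;     // or binaryTypeInfo - TODO
        default:        // time, fixed(L) and nested types are not covered here
          throw new UnsupportedOperationException("No Hive type for " + type);
      }
    }
  }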
