Hey Everyone, Thank you for raising this issue and reaching out to the Impala community.
Let me clarify that the problem only occurs when a legacy Hive table written by Impala is later converted to Iceberg. When Impala writes directly into an Iceberg table, there is no interoperability problem.

The root cause is that Impala added support for the BINARY type only recently; until then, the STRING type served as a workaround for storing binary data. This is why Impala does not add the UTF8 annotation to STRING columns in legacy Hive tables. (Again, for Iceberg tables Impala does add the UTF8 annotation.) Later, when the table is converted to Iceberg, the migration process does not rewrite the data files -- neither Spark's migration procedure nor Impala's own ALTER TABLE CONVERT TO statement does.

My comments on the proposed solutions, plus another one (Approach C):

Approach A (promote BINARY to UTF8 during reads): I think it makes sense. The Parquet metadata also stores information about the writer, so if we want this to be a very targeted fix, we can check whether the writer was indeed Impala.

Approach B (Impala should annotate STRING columns with UTF8): This can probably only be changed in a new major version of Impala. Since Impala supports the BINARY type now, I think it makes sense to limit the STRING type to actual string data. As you already pointed out, this approach does not fix files that have already been written.

Approach C: The migration job could copy the data files but rewrite the file metadata where needed. This makes migration slower, but it is probably still faster than a CREATE TABLE AS SELECT.

On the Impala side, we certainly need to update our docs about migration and interoperability.

Cheers,
Zoltan

OpenInx <open...@gmail.com> wrote (on Tue, Dec 26, 2023, 7:40):

> Hi dev
>
> Sensordata [1] encountered an interesting Apache Impala & Iceberg bug
> in a real customer production environment.
> Their customers use Apache Impala to create a large number of Apache Hive
> tables in HMS, and ingested a PB-level dataset into those Hive tables
> (which were originally written by Apache Impala). In recent days, their
> customers migrated those Hive tables to Apache Iceberg tables, but failed
> to query their huge dataset in the Iceberg table format using Apache Spark.
>
> Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
> this issue; for more details please see below:
>
> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>
> We'd like to hear feedback and suggestions from both the Impala and
> Iceberg communities. I think both Jiajie and I would like
> to fix this issue once we have an aligned solution.
>
> Best Regards.
>
> 1. https://www.sensorsdata.com/en/