Hey Everyone, Thank you for raising this issue and reaching out to the Impala community.
Let me clarify that the problem only occurs when a legacy Hive table written by Impala is later converted to Iceberg. When Impala writes directly into an Iceberg table, there is no interoperability problem.

The root cause is that Impala added support for the BINARY type only recently; until then, the STRING type served as a workaround for storing binary data. This is why Impala does not add the UTF8 annotation to STRING columns in legacy Hive tables. (Again, for Iceberg tables Impala does add the UTF8 annotation.) Later, when the table is converted to Iceberg, the migration process does not rewrite the data files -- neither Spark's migration procedure nor Impala's own ALTER TABLE CONVERT TO statement does.

My comments on the proposed solutions, plus another one (Approach C):

Approach A (promote BINARY to UTF8 during reads): I think it makes sense. The Parquet metadata also stores information about the writer, so if we want this to be a very targeted fix, we can check whether the writer was indeed Impala.

Approach B (Impala should annotate STRING columns with UTF8): This can probably only be changed in a new major version of Impala. Since Impala supports the BINARY type now, I think it makes sense to limit the STRING type to actual string data. As you already pointed out, this approach does not fix files that have already been written.

Approach C: The migration job could copy the data files but rewrite the file metadata where needed. This makes migration slower, but it is probably still faster than a CREATE TABLE AS SELECT.

On the Impala side, we certainly need to update our docs about migration and interoperability.

Cheers,
Zoltan

OpenInx <open...@gmail.com> wrote (on Tue, Dec 26, 2023, 7:40):

> Hi dev
>
> Sensordata [1] encountered an interesting Apache Impala & Iceberg bug
> in a real customer production environment.
> Their customers use Apache Impala to create a large number of Apache Hive
> tables in HMS, and ingested a PB-level dataset into those Hive tables
> (which were originally written by Apache Impala). In recent days, their
> customers migrated those Hive tables to Apache Iceberg tables, but failed
> to query their huge dataset in the Iceberg table format using Apache Spark.
>
> Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
> this issue; for more details please see below:
>
> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>
> We'd like to hear feedback and suggestions from both the Impala and
> Iceberg communities. I think both Jiajie and I would like
> to fix this issue once we have an aligned solution.
>
> Best Regards.
>
> 1. https://www.sensorsdata.com/en/