Hi Zoltán,

Thanks for the issue. I think it's fair to wait for a new major release for this breaking change.
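Until then, users who want to check whether a given data file is affected can inspect the Parquet footer directly. A minimal sketch with pyarrow (the file path and the column name "s" are just placeholders):

    import pyarrow.parquet as pq

    # Placeholder path: any Parquet data file of the migrated table.
    pf = pq.ParquetFile("/path/to/data-file.parquet")

    # pyarrow maps a Parquet BYTE_ARRAY column without the UTF8/String
    # annotation to Arrow binary, and an annotated one to Arrow string.
    print(pf.schema_arrow)

    # "s" is a placeholder column name; binary means the annotation is
    # missing, string means the file is already annotated.
    print(pf.schema_arrow.field("s").type)

Files that still show binary here are the ones the reader-side promotion (Approach A) is meant to cover.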
Best Regards.

On Wed, Jan 3, 2024 at 11:16 PM Zoltán Borók-Nagy
<borokna...@cloudera.com.invalid> wrote:

> Hi,
>
> I created IMPALA-12675
> <https://issues.apache.org/jira/browse/IMPALA-12675> about annotating
> STRINGs with UTF8 by default. The code change should be trivial, but I'm
> afraid we will need to wait for a new major release with this (because
> users might store binary data in STRING columns, so it would be a
> breaking change for them). Until then, users can set
> PARQUET_ANNOTATE_STRINGS_UTF8 for themselves.
>
> Approach C: Yeah, if Approach A goes through then we don't really need
> to bother with this.
>
> Cheers,
>     Zoltan
>
> On Wed, Jan 3, 2024 at 2:02 PM OpenInx <open...@gmail.com> wrote:
>
>> Thanks Zoltan and Ryan for your feedback.
>>
>> I think we all agreed on adding an option to promote BINARY to STRING
>> (Approach A) on the Flink/Spark/Hive reader side, so that the
>> historical datasets already written by Impala into Hive tables can be
>> read correctly. Besides that, applying Approach B to future Apache
>> Impala releases also sounds reasonable to me; I think we can create a
>> PR in the Apache Impala repo at the same time as we apply Approach A
>> to the Iceberg repo.
>>
>> About Approach C, I guess those Parquet files would still need to be
>> rewritten entirely even though we only want to change the file
>> metadata, which may be costly. So I'm a bit hesitant to choose this
>> approach.
>>
>> Jiafei and I will try to create two PRs for the two things (A and B),
>> one for the Apache Iceberg repo and another one for the Apache Impala
>> repo.
>>
>> Best regards.
>>
>> On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue <b...@tabular.io> wrote:
>>
>> > Thanks for bringing this up and for finding the cause.
>> >
>> > I think we should add an option to promote binary to string
>> > (Approach A). That sounds pretty reasonable overall. I think it
>> > would be great if Impala also produced correct Parquet files, but
>> > that's beyond our control and there's, no doubt, a ton of data
>> > already in that format.
>> >
>> > This could also be part of our v3 work, where I think we intend to
>> > add binary to string type promotion to the format.
>> >
>> > On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy
>> > <borokna...@apache.org> wrote:
>> >
>> >> Hey Everyone,
>> >>
>> >> Thank you for raising this issue and reaching out to the Impala
>> >> community.
>> >>
>> >> Let me clarify that the problem only happens when there is a legacy
>> >> Hive table written by Impala, which is then converted to Iceberg.
>> >> When Impala writes into an Iceberg table there is no problem with
>> >> interoperability.
>> >>
>> >> The root cause is that Impala has only recently added support for
>> >> the BINARY type, so the STRING type could serve as a workaround to
>> >> store binary data. This is why Impala does not add the UTF8
>> >> annotation to STRING columns in legacy Hive tables. (Again, for
>> >> Iceberg tables Impala adds the UTF8 annotation.)
>> >>
>> >> Later, when the table is converted to Iceberg, the migration
>> >> process does not rewrite the data files; neither Spark nor Impala's
>> >> own ALTER TABLE CONVERT TO statement does.
>> >>
>> >> My comments on the proposed solutions, plus another one
>> >> (Approach C):
>> >>
>> >> Approach A (promote BINARY to UTF8 during reads): I think it makes
>> >> sense. The Parquet metadata also stores information about the
>> >> writer, so if we want this to be a very specific fix, we can check
>> >> whether the writer was indeed Impala.
>> >>
>> >> Approach B (Impala should annotate STRING columns with UTF8): This
>> >> can probably only be fixed in a new major version of Impala. Impala
>> >> supports the BINARY type now, so I think it makes sense to limit
>> >> the STRING type to actual string data. This approach does not fix
>> >> already written files, as you already pointed out.
>> >>
>> >> Approach C: The migration job could copy the data files but rewrite
>> >> the file metadata if needed. This makes migration slower, but it's
>> >> probably still faster than a CREATE TABLE AS SELECT.
>> >>
>> >> On the Impala side we surely need to update our docs about
>> >> migration and interoperability.
>> >>
>> >> Cheers,
>> >>     Zoltan
>> >>
>> >> On Tue, Dec 26, 2023 at 7:40 AM OpenInx <open...@gmail.com> wrote:
>> >>
>> >>> Hi dev,
>> >>>
>> >>> Sensordata [1] encountered an interesting Apache Impala & Iceberg
>> >>> bug in a real customer production environment. Their customers use
>> >>> Apache Impala to create a large number of Apache Hive tables in
>> >>> HMS, and have ingested PB-scale datasets into those Hive tables
>> >>> (which were originally written by Apache Impala). Recently, their
>> >>> customers migrated those Hive tables to Apache Iceberg tables, but
>> >>> then failed to query the huge datasets in the Iceberg table format
>> >>> using Apache Spark.
>> >>>
>> >>> Jiajie Feng (from Sensordata) and I wrote a simple demo to
>> >>> demonstrate this issue; for more details please see:
>> >>>
>> >>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>> >>>
>> >>> We'd like to hear feedback and suggestions from both the Impala
>> >>> and Iceberg communities. Both Jiajie and I would like to fix this
>> >>> issue once we have an aligned solution.
>> >>>
>> >>> Best Regards.
>> >>>
>> >>> 1. https://www.sensorsdata.com/en/
>> >>
>> > --
>> > Ryan Blue
>> > Tabular
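Regarding the writer check mentioned under Approach A in the quoted thread above: the Parquet footer's created_by field identifies the writing application, so a reader-side promotion could be restricted to files that Impala wrote. A minimal sketch of that check with pyarrow (the path is a placeholder, and matching on the substring "impala" is an assumption about how Impala tags its files):

    import pyarrow.parquet as pq

    # Placeholder path: one of the migrated table's data files.
    md = pq.ParquetFile("/path/to/data-file.parquet").metadata

    # created_by names the application that wrote the file.
    print(md.created_by)

    # Assumed heuristic: treat the file as Impala-written if the tag
    # mentions "impala"; the exact wording of the tag is not guaranteed.
    written_by_impala = md.created_by is not None and "impala" in md.created_by.lower()
    print(written_by_impala)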