Zoltán Borók-Nagy created IMPALA-14229:
------------------------------------------
Summary: Iceberg STRINGs are not UTF-8 aware in Impala
Key: IMPALA-14229
URL: https://issues.apache.org/jira/browse/IMPALA-14229
Project: IMPALA
Issue Type: Bug
Reporter: Zoltán Borók-Nagy
The Iceberg spec states that STRINGs are UTF-8 encoded:
[https://iceberg.apache.org/spec/#primitive-types]
However, Impala still mostly treats them as raw byte arrays. Because of this, several
things do not work. Some statements fail outright:
{noformat}
create table ice_str (s string)
partitioned by spec(truncate(2, s))
stored by iceberg;
> insert into ice_str values ('tüüü');
2025-07-15 17:25:39 [Exception] ERROR: Query f04bccbb7613822c:12d814bf00000000 failed:
RuntimeException: java.nio.charset.MalformedInputException: Input length = 1
CAUSED BY: MalformedInputException: Input length = 1
{noformat}
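One plausible reading of this failure, offered as an assumption rather than a confirmed root cause: if the truncate transform cuts the value at a byte offset instead of a character offset, then truncating 'tüüü' to width 2 keeps 't' plus only the first byte of the two-byte 'ü', which is not valid UTF-8. The Python sketch below reproduces that effect outside Impala. The helpers `truncate_bytes` and `is_valid_utf8` are hypothetical names used for illustration, not Impala code.

```python
def truncate_bytes(value: str, width: int) -> bytes:
    """Hypothetical byte-wise truncate: keep the first `width` bytes of the UTF-8 encoding."""
    return value.encode("utf-8")[:width]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 't' followed by three U+00FC characters, i.e. 'tüüü' (1 + 3 * 2 = 7 bytes in UTF-8).
value = "t" + "\u00fc" * 3

cut = truncate_bytes(value, 2)
print(cut)                 # b't\xc3'
print(is_valid_utf8(cut))  # False
```

Under the same assumption, 'üüü' truncated to 2 bytes happens to end on a character boundary, so it decodes without error but yields a single 'ü'.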
Other statements succeed but produce incorrect results:
{noformat}
> insert into ice_str values ('üüü');
> show files in ice_str
hdfs://localhost:20500/test-warehouse/ice_str/data/s_trunc=ü/xxx_data.0.parq
<== incorrect partition, Hive also URL-encodes the UTF-8 characters
> select s, length(s) from ice_str;
+-----+-----------+
| s   | length(s) |
+-----+-----------+
| üüü | 6         | <== length should be 3
+-----+-----------+
{noformat}
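The gap between the reported and the expected values can be illustrated with a short Python sketch. It assumes that the truncate width is meant to count Unicode code points, and that Impala currently counts bytes. The second point is an inference from the output above, not something verified against the Impala source.

```python
# Three U+00FC characters, i.e. 'üüü'. Each one takes two bytes in UTF-8.
value = "\u00fc" * 3
encoded = value.encode("utf-8")

byte_length = len(encoded)  # byte count, matching the length(s) = 6 shown above
char_length = len(value)    # code point count, the expected length of 3

# Byte-wise truncate(2): two bytes hold a single 'ü', matching s_trunc=ü above.
byte_truncated = encoded[:2].decode("utf-8")
# Character-wise truncate(2): two code points, 'üü'.
char_truncated = value[:2]

print(byte_length, char_length)        # 6 3
print(byte_truncated, char_truncated)  # ü üü
```

This sketch covers only the truncated partition value and the length. It does not model the URL-encoding of the partition path mentioned above.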