Csaba Ringhofer created IMPALA-14096:
----------------------------------------
Summary: Writing non-UTF8 partition values can lead to dirty writes
Key: IMPALA-14096
URL: https://issues.apache.org/jira/browse/IMPALA-14096
Project: IMPALA
Issue Type: Bug
Reporter: Csaba Ringhofer
{code}
create table tspart (s string) partitioned by (p string);
insert into tspart partition (p="a") values ("a");
insert into tspart partition (p="aa") values ("aa");
-- s is not valid utf8
insert into tspart partition (p="a") values (unhex("aa"));
-- insert the table into itself with p and s swapped, so one partition value will be concat(unhex("aa"), "a"), which is not valid UTF-8
insert into tspart partition (p) select p s_, concat(s, "a") p_ from tspart;
-- leads to error:
2025-05-26 11:47:03 [Exception] ERROR: Query da440f13f21ab301:79918f1100000000
failed:
Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME
hdfs://localhost:20500/test-warehouse/tspart/_impala_insert_staging/da440f13f21ab301_79918f1100000000/.da440f13f21ab301-79918f1100000002_588063374_dir/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt
TO
hdfs://localhost:20500/test-warehouse/tspart/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt)
failed, error was:
hdfs://localhost:20500/test-warehouse/tspart/_impala_insert_staging/da440f13f21ab301_79918f1100000000/.da440f13f21ab301-79918f1100000002_588063374_dir/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt
Error(5): Input/output error
select count(*) from tspart;
-- result: 3, the table looks unchanged
refresh tspart;
select count(*) from tspart;
-- result: 4, because an extra file was found by refresh
{code}
While dirty writes are a known issue in non-transactional tables, it should
not be this easy to reproduce them. The problem in this case is that the error
occurs while the files are being moved, so some files may already have been
moved to their final destination. Detecting the problematic partition names
earlier could ensure that files written for other partitions are not moved out
of the staging directory.
https://github.com/apache/impala/blob/f4e75510948bdb72f2d5206161fee12e5b6d0888/be/src/runtime/dml-exec-state.cc#L341
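The "detect earlier" idea can be sketched as a two-phase finalization: check every partition name first, and only start moving files once all names pass. The Python below is an illustration only, not Impala's actual code (the linked logic is C++ in dml-exec-state.cc). The helpers invalid_utf8_partitions and plan_moves, the staged mapping, and the file names are hypothetical, and it assumes that rejecting names which are not valid UTF-8 is the right check.

```python
from typing import Dict, List, Tuple


def invalid_utf8_partitions(partition_names: List[bytes]) -> List[bytes]:
    """Return the partition names that are not valid UTF-8."""
    bad = []
    for name in partition_names:
        try:
            # Strict decoding raises on any invalid byte sequence.
            name.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


def plan_moves(staged: Dict[bytes, List[str]]) -> List[Tuple[bytes, str]]:
    """Hypothetical finalization step: validate all partition names,
    then return the (partition, file) pairs that may be moved."""
    bad = invalid_utf8_partitions(list(staged))
    if bad:
        # Fail before any file leaves the staging directory.
        raise ValueError(f"non-UTF-8 partition names: {bad!r}")
    return [(part, f) for part, files in staged.items() for f in files]


# Two of the partition values produced by the repro: "aa" and unhex("aa") + "a".
# The file names are made up for the example.
staged = {
    b"p=aa": ["data.0.txt"],
    b"p=\xaaa": ["data.1.txt"],
}
```

With this input, plan_moves(staged) raises ValueError before returning any move, so the file staged for p=aa would also stay in the staging directory instead of becoming a dirty write.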