Eugene Koifman created HIVE-20901:
-------------------------------------

             Summary: running compactor when there is nothing to do produces duplicate data
                 Key: HIVE-20901
                 URL: https://issues.apache.org/jira/browse/HIVE-20901
             Project: Hive
          Issue Type: Bug
          Components: Transactions
    Affects Versions: 4.0.0
            Reporter: Eugene Koifman
            Assignee: Eugene Koifman

Suppose we run minor compaction twice via ALTER TABLE. The second compaction request should have nothing to do, but I don't think there is a check for that. The problem is visible in the context of HIVE-20823, where each compactor run produces a delta with a new visibility suffix, so we end up with something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
├── delete_delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delete_delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000001_0000
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000019
│   ├── _orc_acid_version
│   └── bucket_00000
├── delta_0000001_0000002_v0000021
│   ├── _orc_acid_version
│   └── bucket_00000
└── delta_0000002_0000002_0000
    ├── _orc_acid_version
    └── bucket_00000
{noformat}
i.e. two deltas with the same write ID range, which is bad. This probably happens today as well, but the new run produces a delta with the same name and clobbers the previous one, which may interfere with writers.

Need to investigate.
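
To make the symptom concrete, here is a minimal sketch of how the duplicate ranges in a listing like the one above could be detected from directory names alone. It is not Hive code: the class `DuplicateDeltaCheck`, the method `findDuplicateRanges`, and the regular expression are all hypothetical, and the sketch assumes that only directories ending in a `_v<number>` visibility suffix need to be compared.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: groups delta directory names by
// (type, min write ID, max write ID) and keeps only ranges seen more than once.
class DuplicateDeltaCheck {

    // Assumption: only names with a visibility suffix are compared, e.g.
    // delta_0000001_0000002_v0000019 or delete_delta_0000001_0000002_v0000021.
    private static final Pattern DELTA_WITH_VISIBILITY =
        Pattern.compile("(delete_delta|delta)_(\\d+)_(\\d+)_v(\\d+)");

    static Map<String, List<String>> findDuplicateRanges(List<String> dirNames) {
        Map<String, List<String>> byRange = new LinkedHashMap<>();
        for (String name : dirNames) {
            Matcher m = DELTA_WITH_VISIBILITY.matcher(name);
            if (!m.matches()) {
                continue; // no _v suffix, e.g. delta_0000001_0000001_0000
            }
            // Key is the type plus the write ID range, without the suffix.
            String key = m.group(1) + "_" + m.group(2) + "_" + m.group(3);
            byRange.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Drop ranges that are covered by a single directory.
        byRange.values().removeIf(dirs -> dirs.size() < 2);
        return byRange;
    }

    public static void main(String[] args) {
        List<String> dirs = List.of(
            "delete_delta_0000001_0000002_v0000019",
            "delete_delta_0000001_0000002_v0000021",
            "delta_0000001_0000001_0000",
            "delta_0000001_0000002_v0000019",
            "delta_0000001_0000002_v0000021",
            "delta_0000002_0000002_0000");
        // Prints each duplicated range with the directories that share it.
        findDuplicateRanges(dirs).forEach(
            (range, names) -> System.out.println(range + " -> " + names));
    }
}
```

For the listing above, both `delta_0000001_0000002` and `delete_delta_0000001_0000002` are reported, each with the `v0000019` and `v0000021` directories. This only detects the problem after the fact; the actual fix would presumably be a check before the compactor starts, which is what still needs investigating.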