John Sherman created HIVE-26472:
-----------------------------------

             Summary: Concurrent UPDATEs can cause duplicate rows
                 Key: HIVE-26472
                 URL: https://issues.apache.org/jira/browse/HIVE-26472
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 4.0.0-alpha-1
            Reporter: John Sherman
         Attachments: debug.diff

Concurrent UPDATEs to the same table can cause duplicate rows when two UPDATEs are assigned txnIds and writeIds like this (the UPDATE with the lower txnId holds the higher writeId and commits first):

UPDATE #1 = txnId: 100 writeId: 50 <--- commits first
UPDATE #2 = txnId: 101 writeId: 49

To replicate the issue:

I applied the attached debug.diff patch, which adds hive.lock.sleep.writeid (controls how long to sleep before acquiring a writeId) and hive.lock.sleep.post.writeid (controls how long to sleep after acquiring a writeId).

{code:java}
CREATE TABLE test_update(i int) STORED AS ORC TBLPROPERTIES('transactional'="true");
INSERT INTO test_update VALUES (1);

-- Start two beeline connections.

-- In connection #1, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.writeid=5s;
update test_update set i = 1 where i = 1;

-- Wait one second, then in connection #2, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.post.writeid=10s;
update test_update set i = 1 where i = 1;

-- After both updates complete, test_update is likely to contain two rows.
{code}

HIVE-24211 seems to address the case when:

UPDATE #1 = txnId: 100 writeId: 50
UPDATE #2 = txnId: 101 writeId: 49 <--- commits first

(I think this causes UPDATE #1 to detect that its snapshot is out of date because commitedTxn > UPDATE #1's txnId.)

A possible workaround is to set hive.driver.parallel.compilation = false, but this only helps when there is a single HS2 instance.
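
The difference between the two commit orderings described above can be shown with a toy model. This is only a sketch and not Hive's code: the class `StaleSnapshotSketch`, the holder `Update`, and the method `detectsStaleSnapshot` are hypothetical. The rule it encodes is just the reporter's guess about HIVE-24211, namely that the UPDATE committing second notices a stale snapshot when the already-committed txnId is greater than its own.

```java
public class StaleSnapshotSketch {

    // Hypothetical holder for the ids assigned to one UPDATE (not a Hive class).
    static final class Update {
        final String name;
        final long txnId;
        final long writeId;

        Update(String name, long txnId, long writeId) {
            this.name = name;
            this.txnId = txnId;
            this.writeId = writeId;
        }
    }

    // Assumed rule, taken from the reporter's guess and not from Hive's source:
    // the UPDATE that commits second sees its snapshot as stale only when the
    // txnId that already committed is greater than its own txnId.
    static boolean detectsStaleSnapshot(Update committedFirst, Update commitsSecond) {
        return committedFirst.txnId > commitsSecond.txnId;
    }

    public static void main(String[] args) {
        Update u1 = new Update("UPDATE #1", 100, 50);
        Update u2 = new Update("UPDATE #2", 101, 49);

        // HIVE-24211 ordering: UPDATE #2 (txnId 101) commits first.
        System.out.println("HIVE-24211 case detected: " + detectsStaleSnapshot(u2, u1));

        // HIVE-26472 ordering: UPDATE #1 (txnId 100) commits first.
        System.out.println("HIVE-26472 case detected: " + detectsStaleSnapshot(u1, u2));
    }
}
```

Under this assumed rule the first call returns true and the second returns false. That matches the report: when the UPDATE with the lower txnId but higher writeId commits first, a check that compares only txnIds does not fire.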