John Sherman created HIVE-26472:
-----------------------------------

             Summary: Concurrent UPDATEs can cause duplicate rows
                 Key: HIVE-26472
                 URL: https://issues.apache.org/jira/browse/HIVE-26472
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 4.0.0-alpha-1
            Reporter: John Sherman
         Attachments: debug.diff

Concurrent UPDATEs to the same table can produce duplicate rows when the two 
statements are assigned txnIds and writeIds in opposite orders and the one 
with the lower txnId commits first:
UPDATE #1 = txnId: 100 writeId: 50 <--- commits first
UPDATE #2 = txnId: 101 writeId: 49

To replicate the issue:
I applied the attached debug.diff patch, which adds hive.lock.sleep.writeid 
(the amount of time to sleep before acquiring a writeId) and 
hive.lock.sleep.post.writeid (the amount of time to sleep after acquiring a 
writeId).
{code:java}
CREATE TABLE test_update(i int) STORED AS ORC 
TBLPROPERTIES('transactional'="true");
INSERT INTO test_update VALUES (1);

-- Start two beeline connections.
-- In connection #1, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.writeid=5s;
update test_update set i = 1 where i = 1;

-- Wait one second, then in connection #2, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.post.writeid=10s;
update test_update set i = 1 where i = 1;

-- After both updates complete, test_update is likely to contain two rows.
{code}

HIVE-24211 seems to address the case when:
UPDATE #1 = txnId: 100 writeId: 50
UPDATE #2 = txnId: 101 writeId: 49 <--- commits first (I think this causes 
UPDATE #1 to detect that its snapshot is out of date because commitedTxn > 
UPDATE #1's txnId)
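
To make the guess above concrete, here is a toy sketch of the comparison I have 
in mind. It is not Hive's actual conflict-detection code; the class name 
SnapshotCheckSketch and the method looksStale are made up for illustration, and 
the rule "stale if the committed txnId is greater than my own txnId" is only my 
assumption about what HIVE-24211 checks.

```java
// Toy model of the hypothesis in this report; NOT Hive's real implementation.
public class SnapshotCheckSketch {

    // Assumed rule: a transaction treats its snapshot as out of date only
    // when a transaction with a HIGHER txnId has already committed.
    static boolean looksStale(long ownTxnId, long committedTxnId) {
        return committedTxnId > ownTxnId;
    }

    public static void main(String[] args) {
        // HIVE-24211 ordering: txn 101 commits first, then txn 100 checks.
        System.out.println("txn 100 after 101 committed: stale = "
                + looksStale(100L, 101L));

        // Ordering in this report: txn 100 commits first, then txn 101 checks.
        System.out.println("txn 101 after 100 committed: stale = "
                + looksStale(101L, 100L));
    }
}
```

Under that assumed rule, the second case never looks stale, which would match 
the ordering from this report where UPDATE #1 commits first and UPDATE #2 goes 
on to write from an old snapshot.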

A possible workaround is to set hive.driver.parallel.compilation = false, but 
this would only help when there is a single HS2 instance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
