[ https://issues.apache.org/jira/browse/HIVE-26875?focusedWorklogId=834941&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-834941 ]
ASF GitHub Bot logged work on HIVE-26875: ----------------------------------------- Author: ASF GitHub Bot Created on: 20/Dec/22 22:52 Start Date: 20/Dec/22 22:52 Worklog Time Spent: 10m Work Description: jfsii commented on code in PR #3887: URL: https://github.com/apache/hive/pull/3887#discussion_r1053821695 ########## ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java: ########## @@ -42,6 +42,7 @@ import org.apache.hadoop.hive.ql.Driver; import org.apache.hadoop.hive.ql.QueryState; import org.apache.hadoop.hive.ql.io.AcidUtils; +import org.apache.hadoop.hive.ql.metadata.HiveException; Review Comment: I don't think this is used, remove it. ########## ql/src/java/org/apache/hadoop/hive/ql/Driver.java: ########## @@ -555,7 +551,14 @@ private void prepareContext() throws CommandProcessorException { String originalCboInfo = context != null ? context.cboInfo : null; if (context != null && context.getExplainAnalyze() != AnalyzeState.RUNNING) { // close the existing ctx etc before compiling a new query, but does not destroy driver - closeInProcess(false); + if (!driverContext.isRetrial()) { + closeInProcess(false); + } else { + // On retrail we need to maintain information from the prior context. Such Review Comment: fix typo ########## ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java: ########## @@ -2465,6 +2468,113 @@ private void testConcurrent2MergeUpdatesConflict(boolean slowCompile) throws Exc Assert.assertTrue("Lost Update", isEqualCollection(res, asList("earl\t10", "amy\t10"))); } + // The intent of this test is to cause multiple conflicts to the same query to test the conflict retry functionality. + @Test + public void testConcurrentConflictRetry() throws Exception { + dropTable(new String[]{"target"}); + + driver2.getConf().setBoolVar(HiveConf.ConfVars.TXN_WRITE_X_LOCK, true); + driver.run("create table target(i int) stored as orc tblproperties ('transactional'='true')"); + driver.run("insert into target values (1),(1)"); + + DbTxnManager txnMgr2 = (DbTxnManager) TxnManagerFactory.getTxnManagerFactory().getTxnManager(conf); + swapTxnManager(txnMgr2); + // Start a query on driver2 + driver2.compileAndRespond("update target set i = 1 where i = 1"); + + swapTxnManager(txnMgr); + // this should cause a conflict which should cause driver2.run to invoke its conflict lambda + driver.run("update target set i = 1 where i = 1"); + swapTxnManager(txnMgr2); + + // This run should conflict with the above query and cause the "conflict lambda" to be execute, + // which will then also conflict with the driver2 query and cause it to retry. + AtomicInteger conflictCount = new AtomicInteger(); + driver2.run(() -> { + conflictCount.getAndIncrement(); + // on first conflict, + // we execute another query to cause an additional conflict + if (conflictCount.get() == 1) { + swapTxnManager(txnMgr); + // this should cause a conflict + try { + driver.run("update target set i = 1 where i = 1"); + } catch (Exception e) { + // do nothing + } + swapTxnManager(txnMgr2); + } + }); + + // we expected two conflicts + Assert.assertEquals(2, conflictCount.get()); + swapTxnManager(txnMgr); + // we expect two rows + driver.run("select * from target"); + List<String> res = new ArrayList<>(); + driver.getFetchTask().fetch(res); + Assert.assertEquals(2, res.size()); + } + + @Rule + public ExpectedException exceptionRule = ExpectedException.none(); + + @Test + public void testConcurrentConflictMaxRetryCount() throws Exception { + dropTable(new String[]{"target"}); + + driver2.getConf().setBoolVar(HiveConf.ConfVars.TXN_WRITE_X_LOCK, true); + + final int maxRetries = 4; + driver2.getConf().setIntVar(HiveConf.ConfVars.HIVE_TXN_MAX_RETRYSNAPSHOT_COUNT, maxRetries); + + driver.run("create table target(i int) stored as orc tblproperties ('transactional'='true')"); + driver.run("insert into target values (1),(1)"); + + DbTxnManager txnMgr2 = (DbTxnManager) TxnManagerFactory.getTxnManagerFactory().getTxnManager(conf); + swapTxnManager(txnMgr2); + // Start a query on driver2, we expect this query to never execute because of HIVE_TXN_MAX_RETRYSNAPSHOT_COUNT. + // We verify that it is never executed by counting the number of rows returned that have i = 1. + driver2.compileAndRespond("update target set i = 2 where i = 1"); + + swapTxnManager(txnMgr); + // this should cause a conflict which should cause driver2.run to invoke its conflict lambda + driver.run("update target set i = 1 where i = 1"); + swapTxnManager(txnMgr2); + + // This run should conflict with the above query and cause the "conflict lambda" to be execute, + // which will then also conflict with the driver2 query and cause it to retry. The intent here is + // to cause driver2's query to exceed HIVE_TXN_MAX_RETRYSNAPSHOT_COUNT and throw exception. + AtomicInteger conflictCount = new AtomicInteger(); + exceptionRule.expect(CommandProcessorException.class); + exceptionRule.expectMessage("Operation could not be executed, snapshot was outdated when locks were acquired."); + driver2.run(() -> { + conflictCount.getAndIncrement(); + // on first conflict, + // we execute another query to cause an additional conflict + if (conflictCount.get() <= maxRetries) { + swapTxnManager(txnMgr); + // this should cause a conflict + try { + // we use update here, to force a conflict but not create additional rows + driver.run("update target set i = 1 where i = 1"); + } catch (Exception e) { + // do nothing + } + swapTxnManager(txnMgr2); + } + }); + + // We expect maxRetries+1 conflicts Review Comment: I don't think this gets executed because of the exception rule. I'm going to rewrite this to not use exception rule and use a try / catch and some asserts. Issue Time Tracking ------------------- Worklog Id: (was: 834941) Time Spent: 20m (was: 10m) > Transaction conflict retry loop only executes once > -------------------------------------------------- > > Key: HIVE-26875 > URL: https://issues.apache.org/jira/browse/HIVE-26875 > Project: Hive > Issue Type: Bug > Components: HiveServer2 > Reporter: John Sherman > Assignee: John Sherman > Priority: Critical > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Currently the "conflict retry loop" only executes once. > [https://github.com/apache/hive/blob/ab4c53de82d4aaa33706510441167f2df55df15e/ql/src/java/org/apache/hadoop/hive/ql/Driver.java#L264] > The intent of this loop is to detect if a conflicting transaction has > committed while we were waiting to acquire locks. If there is a conflicting > transaction, it invalidates the snapshot, rolls-back the transaction, opens a > new transaction and tries to re-acquire locks (and then recompile). It then > checks again if a conflicting transaction has committed and if so, redoes the > above steps again, up to HIVE_TXN_MAX_RETRYSNAPSHOT_COUNT times. > However - isValidTxnState relies on getNonSharedLockedTable(): > [https://github.com/apache/hive/blob/ab4c53de82d4aaa33706510441167f2df55df15e/ql/src/java/org/apache/hadoop/hive/ql/DriverTxnHandler.java#L422] > which does: > {code:java} > private Set<String> getNonSharedLockedTables() { > if (CollectionUtils.isEmpty(driver.getContext().getHiveLocks())) { > return Collections.emptySet(); // Nothing to check > }{code} > getHiveLocks gets populated by lockAndRespond... HOWEVER - > compileInternal ends up calling compile which ends up calling > preparForCompile which ends up calling prepareContext which ends up > destroying the context with the information lockAndRespond populated. So when > the loop executes after all of this, it will never detect a 2nd conflict > because isValidTxnState will always return true (because it thinks there are > no locked objects). > This manifests as duplicate records being created during concurrent UPDATEs > if a transaction get conflicted twice. -- This message was sent by Atlassian Jira (v8.20.10#820010)