[ https://issues.apache.org/jira/browse/HIVE-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

     [ https://issues.apache.org/jira/browse/HIVE-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stamatis Zampetakis resolved HIVE-28565.
----------------------------------------
    Fix Version/s: 4.1.0
       Resolution: Fixed

Fixed in [https://github.com/apache/hive/commit/48a67a4f2cc7a65bf9aac4a1ed518958c5b00027]

Thanks for the review [~simhadri-g]!

> Reduce lock.sleep.duration.between.retries for tests
> ----------------------------------------------------
>
>                 Key: HIVE-28565
>                 URL: https://issues.apache.org/jira/browse/HIVE-28565
>             Project: Hive
>          Issue Type: Task
>      Security Level: Public(Viewable by anyone)
>          Components: Tests
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
>
>
> The default value of the hive.lock.numretries/metastore.lock.numretries
> property is 100. Combined with the hive.lock.sleep.between.retries property
> set to 60s, this can keep a test running and retrying for ~1.6 hours (6000s).
>
> Under normal circumstances tests should obtain a lock rapidly, but if
> something goes wrong then waiting ~1.6 hours just to see the test fail is
> unacceptable.
>
> I've hit this situation a couple of times, most recently in
> https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5249/12/tests/
> where
> TestCrudCompactorOnTez#testRebalanceCompactionWithParallelDeleteAsSecondPessimisticLock
> kept running for 1h 46m.
> {noformat}
> 2024-10-08T14:36:15,954 ERROR [main] lockmgr.DbLockManager: Unable to acquire locks for lockId=19 after 101 retries (retries took 6343541 ms). QueryId=jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854 LockResponse(lockid:19, state:WAITING)
> FAILED: Error in acquiring locks: Lock acquisition for LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, dbname:default, tablename:rebalance_test, operationType:DELETE, isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms. LockResponse(lockid:19, state:WAITING)
> 2024-10-08T14:36:15,967 ERROR [main] ql.Driver: FAILED: Error in acquiring locks: Lock acquisition for LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, dbname:default, tablename:rebalance_test, operationType:DELETE, isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms. LockResponse(lockid:19, state:WAITING)
> org.apache.hadoop.hive.ql.lockmgr.LockException: Lock acquisition for LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, dbname:default, tablename:rebalance_test, operationType:DELETE, isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms. LockResponse(lockid:19, state:WAITING)
>     at org.apache.hadoop.hive.ql.lockmgr.DbLockManager.lock(DbLockManager.java:155)
>     at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:464)
>     at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocksWithHeartbeatDelay(DbTxnManager.java:498)
>     at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:295)
>     at org.apache.hadoop.hive.ql.lockmgr.HiveTxnManagerImpl.acquireLocks(HiveTxnManagerImpl.java:81)
>     at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:100)
>     at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocksInternal(DriverTxnHandler.java:338)
>     at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocks(DriverTxnHandler.java:240)
>     at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocksIfNeeded(DriverTxnHandler.java:147)
>     at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:335)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:179)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:143)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:130)
>     at org.apache.hadoop.hive.ql.txn.compactor.TestCompactorBase.executeStatementOnDriver(TestCompactorBase.java:171)
>     at org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez.testRebalanceCompactionWithParallelDeleteAsSecond(TestCrudCompactorOnTez.java:143)
>     at org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez.testRebalanceCompactionWithParallelDeleteAsSecondPessimisticLock(TestCrudCompactorOnTez.java:102)
> {noformat}
> I propose to set the respective properties to some small value (e.g., 5) when
> running the tests, so that a test fails fast when it cannot obtain a lock
> instead of wasting resources for nothing.
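
For reference, the worst-case wait quoted in the description (100 retries with a 60s sleep, about 6000s) can be reproduced with a small calculation. The sketch below is illustrative only: the class `LockRetryBudget` and its method are hypothetical helpers, not part of Hive, and the 5-second value assumes the proposed "5" is applied to the sleep interval, which the issue text does not spell out.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hive: estimates how long a lock
// acquisition can keep sleeping before it gives up.
public class LockRetryBudget {

    // Upper bound on sleep time, assuming one sleep per retry.
    public static Duration worstCaseWait(int numRetries, Duration sleepBetweenRetries) {
        if (numRetries < 0 || sleepBetweenRetries.isNegative()) {
            throw new IllegalArgumentException("retries and sleep must be non-negative");
        }
        return sleepBetweenRetries.multipliedBy(numRetries);
    }

    public static void main(String[] args) {
        // Values from the issue description: 100 retries, 60s between retries.
        Duration current = worstCaseWait(100, Duration.ofSeconds(60));
        // Assumed test setting: same retry count, 5s between retries.
        Duration reduced = worstCaseWait(100, Duration.ofSeconds(5));

        System.out.println("current worst case (s): " + current.getSeconds());
        System.out.println("reduced worst case (s): " + reduced.getSeconds());
    }
}
```

Under these assumptions a stuck test would give up after minutes rather than hours. The actual values chosen for the test configuration are in the linked commit.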