Eugene Chung created HIVE-25663:
-----------------------------------

             Summary: Need to modify table/partition lock acquisition retry for Zookeeper option
                 Key: HIVE-25663
                 URL: https://issues.apache.org/jira/browse/HIVE-25663
             Project: Hive
          Issue Type: Improvement
          Components: Locking
            Reporter: Eugene Chung
            Assignee: Eugene Chung
         Attachments: image-2021-10-30-11-54-42-164.png

 
{code:java}
LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE;
SET hive.query.timeout.seconds=5;
SELECT * FROM default.my_table WHERE log_date='2021-10-30' LIMIT 10;
{code}
 

If you execute the three statements above in the same session, the last SELECT 
is cancelled with a timeout error. The problem is that, if you are using 
ZooKeeperHiveLockManager, 'show locks' still shows a SHARED lock on 
default.my_table, and it remains for 100 minutes with the default retry settings.

!image-2021-10-30-11-54-42-164.png|width=873,height=411!

I will explain the problem step by step.

 

A SELECT statement that reads data from a partitioned table 

 
{code:java}
SELECT * FROM my_table WHERE log_date='2021-10-30' LIMIT 10{code}
 

needs two SHARED locks, acquired in this order:
 * default.my_table
 * default.my_table@log_date=2021-10-30
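
As a rough illustration of why the acquisition order matters, here is a minimal sketch of acquiring a list of locks parent-first. The class name, the acquireAll helper and the tryLock/unlock callbacks are hypothetical and are not Hive's API. The point is only that the table lock is already held while the partition lock is still being attempted, so whatever happens during that attempt decides how long the table lock stays visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class OrderedLockSketch {

    /**
     * Hypothetical helper: tries each lock name in the given order.
     * If one attempt fails, the locks already held are released in
     * reverse order and an empty list is returned.
     */
    static List<String> acquireAll(List<String> names,
                                   Predicate<String> tryLock,
                                   Consumer<String> unlock) {
        List<String> held = new ArrayList<>();
        for (String name : names) {
            if (tryLock.test(name)) {
                held.add(name);
                continue;
            }
            // Roll back everything acquired so far, newest first.
            for (int i = held.size() - 1; i >= 0; i--) {
                unlock.accept(held.get(i));
            }
            return new ArrayList<>();
        }
        return held;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "default.my_table",
                "default.my_table@log_date=2021-10-30");
        List<String> released = new ArrayList<>();
        // Simulate an EXCLUSIVE lock on the partition: only the table lock succeeds.
        List<String> held = acquireAll(names, n -> !n.contains("@"), released::add);
        System.out.println("held=" + held + " released=" + released);
    }
}
```

In this sketch the failed partition attempt returns immediately, so the table lock is released right away. If the attempt instead blocks in a long retry loop, the table lock stays held for the whole duration of that loop.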

Suppose an EXCLUSIVE lock on the partition already exists before the SELECT 
runs. This is easy to simulate with a statement like the following:

 
{code:java}
LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE{code}
 

The SELECT cannot acquire the SHARED lock on the partition, so it retries as 
specified by the two configurations below. With the default values, it keeps 
retrying for 100 minutes.
 * hive.lock.sleep.between.retries=60s
 * hive.lock.numretries=100
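
The 100-minute figure is simply the sleep interval multiplied by the retry count. A small sketch of that arithmetic follows; the class and method names are made up for illustration, and the time spent in each lock attempt itself is ignored.

```java
import java.time.Duration;

public class LockRetryBudget {

    /** Worst-case time spent sleeping between retries (lock attempt time not included). */
    static Duration worstCaseWait(Duration sleepBetweenRetries, int numRetries) {
        return sleepBetweenRetries.multipliedBy(numRetries);
    }

    public static void main(String[] args) {
        // Defaults quoted above: hive.lock.sleep.between.retries=60s, hive.lock.numretries=100
        Duration total = worstCaseWait(Duration.ofSeconds(60), 100);
        System.out.println(total.toMinutes() + " minutes"); // prints "100 minutes"
    }
}
```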

 

If hive.query.timeout.seconds is set to 5, the SELECT is cancelled 5 seconds 
later and the client returns with a timeout error, but the SHARED lock on 
my_table remains. This is because [the current ZooKeeperHiveLockManager 
just logs the 
InterruptedException|https://github.com/apache/hive/blob/8a8e03d02003aa3543f46f595b4425fd8c156ad9/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java#L326]
 and continues retrying the lock. It also means that the query processing 
thread keeps running for 100 minutes (by default) even though the query has 
been cancelled. If the same query is executed 3 times, you will see 3 threads, 
each with a thread dump like the one below:

 
{code:java}
"HiveServer2-Background-Pool: Thread-154" #154 prio=5 os_prio=0 
tid=0x00007f0ac91cb000 nid=0x13d25 waiting on condition [0x00007f0aa2ce2000]
 java.lang.Thread.State: TIMED_WAITING (sleeping)
 at java.lang.Thread.sleep(Native Method)
 at org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:303)
 at org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:207)
 at org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager.acquireLocks(DummyTxnManager.java:199)
 at org.apache.hadoop.hive.ql.Driver.acquireLocks(Driver.java:1610)
 at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:1796)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1966)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1710)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1704)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
 at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:217)
 at org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:87)
 at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:309)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
 at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:322)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748){code}
 

 

I think ZooKeeperHiveLockManager should not swallow unexpected exceptions. It 
should retry only for expected ones.
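
One possible shape for such a retry loop is sketched below. This is not the actual ZooKeeperHiveLockManager code or a proposed patch: the class, the lockWithRetry method, the tryLock callback and the LockAcquisitionException type are all hypothetical names used for illustration. The idea is that a failed attempt is treated as expected and retried, while an InterruptedException during the sleep restores the interrupt flag and aborts the loop instead of being logged and ignored.

```java
import java.util.function.BooleanSupplier;

public class InterruptAwareRetry {

    /** Hypothetical exception type for this sketch; not Hive's LockException. */
    static class LockAcquisitionException extends RuntimeException {
        LockAcquisitionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Makes one initial attempt plus up to numRetries retries.
     * Returns true once tryLock succeeds, false when retries are exhausted.
     * An interrupt during the sleep aborts the loop immediately.
     */
    static boolean lockWithRetry(BooleanSupplier tryLock, int numRetries, long sleepMillis) {
        for (int attempt = 0; attempt <= numRetries; attempt++) {
            if (tryLock.getAsBoolean()) {
                return true;
            }
            if (attempt == numRetries) {
                break; // no point sleeping after the final attempt
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the flag so callers can still observe the cancellation,
                // then stop retrying instead of swallowing the interrupt.
                Thread.currentThread().interrupt();
                throw new LockAcquisitionException("Interrupted while waiting for lock", e);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                // The lock is never available; defaults would mean 100 x 60s of retries.
                lockWithRetry(() -> false, 100, 60_000L);
            } catch (LockAcquisitionException e) {
                System.out.println("gave up: " + e.getMessage());
            }
        });
        worker.start();
        worker.interrupt();   // simulates the query being cancelled
        worker.join(5_000L);
        System.out.println("worker alive: " + worker.isAlive());
    }
}
```

With this shape, cancelling the query ends the worker thread promptly, and the caller gets a chance to release any locks that were already acquired.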



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
