Sergey Shelukhin created HIVE-12904:
---------------------------------------

             Summary: LLAP: deadlock in task scheduling
                 Key: HIVE-12904
                 URL: https://issues.apache.org/jira/browse/HIVE-12904
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin


{noformat}
Thread 34107: (state = BLOCKED)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.isInWaitQueue() @bci=0, line=690 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.finishableStateUpdated(org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper, boolean) @bci=8, line=485 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.access$1500(org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService, org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper, boolean) @bci=3, line=78 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.finishableStateUpdated(boolean) @bci=27, line=733 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.sourceStateUpdated(java.lang.String) @bci=76, line=210 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.sourceStateUpdated(java.lang.String) @bci=5, line=164 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.registerSourceStateChange(java.lang.String, java.lang.String, org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateProto) @bci=34, line=228 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.sourceStateUpdated(org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=47, line=255 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.sourceStateUpdated(org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=5, line=328 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.LlapDaemonProtocolServerImpl.sourceStateUpdated(com.google.protobuf.RpcController, org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$SourceStateUpdatedRequestProto) @bci=5, line=105 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=80, line=13067 (Compiled frame)
 - org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(org.apache.hadoop.ipc.RPC$Server, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=246, line=616 (Compiled frame)
 - org.apache.hadoop.ipc.RPC$Server.call(org.apache.hadoop.ipc.RPC$RpcKind, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=9, line=969 (Compiled frame)
 - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=38, line=2151 (Compiled frame)
 - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=1, line=2147 (Compiled frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
 - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=422 (Compiled frame)
 - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1657 (Compiled frame)
 - org.apache.hadoop.ipc.Server$Handler.run() @bci=315, line=2145 (Interpreted frame)


and 


Thread 34500: (state = BLOCKED)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.unregisterForUpdates(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=0, line=195 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.unregisterFinishableStateUpdate(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=5, line=160 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.QueryFragmentInfo.unregisterForFinishableStateUpdates(org.apache.hadoop.hive.llap.daemon.FinishableStateUpdateHandler) @bci=5, line=143 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.maybeUnregisterForFinishedStateNotifications() @bci=20, line=681 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(org.apache.tez.runtime.task.TaskRunner2Result) @bci=32, line=548 (Compiled frame)
 - org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(java.lang.Object) @bci=5, line=535 (Compiled frame)
 - com.google.common.util.concurrent.Futures$4.run() @bci=55, line=1149 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1142 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

"IPC Server handler 0 on 15001":
  waiting to lock Monitor@0x00007f5d322ecb08 (Object@0x00007f67032cd2c0, a org/apache/hadoop/hive/llap/daemon/impl/TaskExecutorService$TaskWrapper),
  which is held by "ExecutionCompletionThread #0"
"ExecutionCompletionThread #0":
  waiting to lock Monitor@0x00007f6066b9e8c8 (Object@0x00007f66b6570200, a org/apache/hadoop/hive/llap/daemon/impl/QueryInfo$FinishableStateTracker),
  which is held by "IPC Server handler 0 on 15001"

Found a total of 1 deadlock.

{noformat}

Looks like it's caused by synchronized methods on two different objects being entered in opposite order:
{noformat}
TaskWrapper:
public synchronized void maybeUnregisterForFinishedStateNotifications() {
{noformat}
eventually calls
{noformat}
FinishableStateTracker:
synchronized void unregisterForUpdates(FinishableStateUpdateHandler handler) {
{noformat}

and
{noformat}
FinishableStateTracker:
synchronized void sourceStateUpdated(String sourceName) {
{noformat}
eventually calls
{noformat}
TaskWrapper:
public synchronized boolean isInWaitQueue() {
{noformat}
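
For illustration only, here is a minimal standalone sketch of the same lock-order inversion. DemoTracker, DemoTask and LockInversionDemo are hypothetical stand-ins, not the Hive classes, and the latch exists only to force the bad interleaving every time. It uses the JDK's ThreadMXBean to confirm that the two threads end up deadlocked on each other's monitor.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for FinishableStateTracker (not the Hive class).
class DemoTracker {
  private final CountDownLatch bothLocked;
  DemoTask task; // wired up after construction, before any thread starts

  DemoTracker(CountDownLatch bothLocked) { this.bothLocked = bothLocked; }

  // Takes the tracker monitor first, then needs the task monitor.
  synchronized void sourceStateUpdated() throws InterruptedException {
    bothLocked.countDown();
    bothLocked.await(); // wait until the other thread holds the task monitor
    task.isInWaitQueue();
  }

  synchronized void unregisterForUpdates() { }
}

// Hypothetical stand-in for TaskWrapper (not the Hive class).
class DemoTask {
  private final DemoTracker tracker;
  private final CountDownLatch bothLocked;

  DemoTask(DemoTracker tracker, CountDownLatch bothLocked) {
    this.tracker = tracker;
    this.bothLocked = bothLocked;
  }

  synchronized boolean isInWaitQueue() { return false; }

  // Takes the task monitor first, then needs the tracker monitor.
  synchronized void maybeUnregister() throws InterruptedException {
    bothLocked.countDown();
    bothLocked.await(); // wait until the other thread holds the tracker monitor
    tracker.unregisterForUpdates();
  }
}

class LockInversionDemo {
  // Returns the ids of the deadlocked threads, or an empty array if none were found in time.
  static long[] reproduce() {
    CountDownLatch bothLocked = new CountDownLatch(2);
    DemoTracker tracker = new DemoTracker(bothLocked);
    DemoTask task = new DemoTask(tracker, bothLocked);
    tracker.task = task;

    Thread ipc = new Thread(() -> {
      try { tracker.sourceStateUpdated(); } catch (InterruptedException ignored) { }
    }, "demo-ipc-handler");
    Thread completion = new Thread(() -> {
      try { task.maybeUnregister(); } catch (InterruptedException ignored) { }
    }, "demo-completion");
    // Daemon threads, so the stuck pair does not keep the JVM alive.
    ipc.setDaemon(true);
    completion.setDaemon(true);
    ipc.start();
    completion.start();

    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (int i = 0; i < 100; i++) {
      long[] ids = mx.findMonitorDeadlockedThreads(); // null while no deadlock exists
      if (ids != null) {
        return ids;
      }
      try { Thread.sleep(50); } catch (InterruptedException e) { break; }
    }
    return new long[0];
  }
}
```

Calling LockInversionDemo.reproduce() should report two thread ids, one per side of the cycle, which mirrors the two-thread cycle in the dump above.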

The latter just returns a boolean, so it most likely doesn't need to be synchronized 
(a volatile field would keep the value visible across threads). However, I don't know 
whether there are other similar issues or what actually needs to be inside the sync 
blocks; perhaps there's a better fix.
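
As a rough sketch of what dropping synchronized from the getter could look like, assuming the flag is a plain boolean that is only ever set and read (not updated together with other state): WaitQueueFlag below is a hypothetical class, not the actual TaskWrapper code.

```java
// Hypothetical holder for the flag; the real TaskWrapper has more state than this.
class WaitQueueFlag {
  // volatile gives cross-thread visibility without taking the object's monitor.
  private volatile boolean inWaitQueue;

  boolean isInWaitQueue() {
    return inWaitQueue; // no lock taken, so it cannot take part in a lock cycle
  }

  void setInWaitQueue(boolean inWaitQueue) {
    this.inWaitQueue = inWaitQueue;
  }
}
```

This only holds up if nothing relies on the flag changing atomically with other fields; if it does, the read still needs the lock.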

Overall I'd say synchronized methods should not be used on objects that call into 
any other non-trivial objects. Perhaps for now it would be good to replace all 
sync methods with sync blocks that cover the entire method, and to remove the 
unnecessary ones like the isWait... one. The scope of the blocks can then be 
adjusted based on the logic in the future.
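
To make the "adjust the scope later" step concrete, here is a hedged sketch of where that could end up. SketchTracker and UpdateHandler are hypothetical names, not the existing FinishableStateTracker API. The idea is to hold the lock only while touching the tracker's own state and to call out to other objects after releasing it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface, standing in for the real update handler.
interface UpdateHandler {
  void finishableStateUpdated(boolean finishable);
}

// Hypothetical tracker that never calls foreign code while holding its lock.
class SketchTracker {
  private final Object lock = new Object();
  private final List<UpdateHandler> handlers = new ArrayList<>();

  void registerForUpdates(UpdateHandler handler) {
    synchronized (lock) {
      handlers.add(handler);
    }
  }

  void unregisterForUpdates(UpdateHandler handler) {
    synchronized (lock) {
      handlers.remove(handler);
    }
  }

  void sourceStateUpdated(boolean finishable) {
    List<UpdateHandler> snapshot;
    synchronized (lock) {
      // Copy under the lock; this is the only part that needs mutual exclusion.
      snapshot = new ArrayList<>(handlers);
    }
    // Callbacks run with no tracker lock held, so a handler may take its own
    // lock or call back into unregisterForUpdates without creating a cycle.
    for (UpdateHandler handler : snapshot) {
      handler.finishableStateUpdated(finishable);
    }
  }
}
```

The trade-off is that a handler can still receive one more callback shortly after another thread has unregistered it, so handlers would need to tolerate a late notification.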




