[ https://issues.apache.org/jira/browse/HIVE-22687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053187#comment-17053187 ]
Prasanth Jayachandran commented on HIVE-22687:
----------------------------------------------

I was able to repro this issue recently and tested this patch, but the patch doesn't seem to help. Even if the order of the slot znode and the worker znode is changed, the notification for the slot znode is not received by the AM, and hence it does not populate the map.

For the query that failed to get scheduled I see this
{code:java}
No zookeeper data for path NODE_ADDED for zknode ..../workers/worker-0000000000
{code}
For the query that succeeded I see this
{code:java}
NODE_ADDED for zknode ..../workers/worker-0000000000
NODE_ADDED for zknode ..../workers/worker-0000000000
{code}
"No zookeeper data for path" seems to indicate there is a lost notification for the slot znode.

> Query hangs indefinitely if LLAP daemon registers after the query is submitted
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-22687
>                 URL: https://issues.apache.org/jira/browse/HIVE-22687
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>    Affects Versions: 3.1.0
>            Reporter: Himanshu Mishra
>            Assignee: Himanshu Mishra
>            Priority: Major
>         Attachments: HIVE-22687.01.patch, HIVE-22687.02.patch
>
>
> If a query is submitted and no LLAP daemon is running, it waits for 1 minute
> and times out with error {{SERVICE_UNAVAILABLE}}.
> While waiting, if a new LLAP Daemon starts, then the timeout is cancelled,
> and the tasks do not get scheduled as well. As a result, the query hangs
> indefinitely.
> This is due to the race condition where LLAP Daemon first registers the LLAP
> instance at {{.../workers/worker-0000}}, and afterwards registers
> {{.../workers/slot-0000}}. In the gap between two, Tez AM gets notified of
> worker zk node and while processing it checks if slot zk node is present, if
> not it rejects the LLAP Daemon.
> Error in Tez AM is:
> {code:java}
> [INFO] [LlapScheduler] |impl.LlapZookeeperRegistryImpl|: Unknown slot for
> 8ebfdc45-0382-4757-9416-52898885af90{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
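
The ordering race described in the issue, and the lost slot notification reported in the comment, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: it does not use ZooKeeper or any Hive class, and `SlotRaceSketch`, `Registry`, `replay`, the `deferUntilSlot` flag and the `worker:<id>` / `slot:<id>` event strings are all hypothetical names invented for illustration. The "defer" mode is one possible way to tolerate worker-before-slot ordering; it is not the approach taken by the attached patches.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of the AM-side bookkeeping.
// No ZooKeeper or Hive classes are used; every name here is an assumption.
class SlotRaceSketch {

    static class Registry {
        private final boolean deferUntilSlot;
        private final Set<String> knownSlots = new HashSet<>();      // daemons whose slot event was seen
        private final Set<String> pendingWorkers = new HashSet<>();  // workers parked until their slot shows up
        private final List<String> schedulable = new ArrayList<>();  // daemons accepted for scheduling

        Registry(boolean deferUntilSlot) {
            this.deferUntilSlot = deferUntilSlot;
        }

        void onWorkerAdded(String daemonId) {
            if (knownSlots.contains(daemonId)) {
                schedulable.add(daemonId);
            } else if (deferUntilSlot) {
                pendingWorkers.add(daemonId);
            }
            // Otherwise the worker is dropped, mirroring the "Unknown slot" rejection.
        }

        void onSlotAdded(String daemonId) {
            knownSlots.add(daemonId);
            if (pendingWorkers.remove(daemonId)) {
                schedulable.add(daemonId);
            }
        }
    }

    // Replays events such as "worker:d1" or "slot:d1" in the given order
    // and returns the daemons that ended up schedulable.
    static List<String> replay(boolean deferUntilSlot, String... events) {
        Registry registry = new Registry(deferUntilSlot);
        for (String event : events) {
            String[] parts = event.split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed event: " + event);
            }
            if (parts[0].equals("worker")) {
                registry.onWorkerAdded(parts[1]);
            } else if (parts[0].equals("slot")) {
                registry.onSlotAdded(parts[1]);
            } else {
                throw new IllegalArgumentException("Unknown event: " + event);
            }
        }
        return registry.schedulable;
    }

    public static void main(String[] args) {
        System.out.println(replay(false, "worker:d1", "slot:d1")); // prints []   (worker rejected)
        System.out.println(replay(false, "slot:d1", "worker:d1")); // prints [d1]
        System.out.println(replay(true, "worker:d1", "slot:d1"));  // prints [d1] (worker parked, then accepted)
        System.out.println(replay(true, "worker:d1"));             // prints []   (slot event never arrives)
    }
}
```

The last case is the one the comment points at: if the slot notification never reaches the AM, neither reordering the registrations nor deferring the worker makes the daemon schedulable in this model.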