[ https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872776#comment-15872776 ]
Sergey Shelukhin commented on HIVE-15255:
-----------------------------------------

Looking at the disable/enable code, this should have worked. However, I wonder if scheduling takes into account disabled nodes correctly? I cannot find special processing for service busy.

> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
>                 Key: HIVE-15255
>                 URL: https://issues.apache.org/jira/browse/HIVE-15255
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329,
> containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330,
> containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331,
> containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332,
> containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1,
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6,
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy,
> nodeHttpAddress=(node3), counters=Counters: 1,
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As you can see by the attempt number, this has been going on for a while. In
> fact I think other tasks could have been scheduled in the time (not sure),
> but the thread just kept at it for this one task until it was finally
> scheduled.
> There should be some fallback after initial failures; we should also make
> sure such retries do not take over all scheduling (not sure if they do, need
> to check).
> LLAP on the node was alive, just busy with other tasks. The task did
> eventually get scheduled.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)