[ 
https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872776#comment-15872776
 ] 

Sergey Shelukhin commented on HIVE-15255:
-----------------------------------------

Looking at the disable/enable code, this should have worked. However, I wonder 
whether scheduling correctly takes disabled nodes into account. I cannot find 
any special handling for SERVICE_BUSY.

> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
>                 Key: HIVE-15255
>                 URL: https://issues.apache.org/jira/browse/HIVE-15255
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, 
> containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, 
> containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, 
> containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, 
> containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, 
> taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, 
> status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, 
> nodeHttpAddress=(node3), counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As you can see from the attempt numbers, this has been going on for a while. 
> In fact, I think other tasks could have been scheduled in that time (not 
> sure), but the thread just kept retrying this one task until it was finally 
> scheduled.
> There should be some fallback after initial failures; we should also make 
> sure such retries do not take over all scheduling (not sure if they do, need 
> to check).
> LLAP on the node was alive, just busy with other tasks. The task did 
> eventually get scheduled.
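
One way to read the "fallback after initial failures" suggested in the description is a per-node backoff: after each SERVICE_BUSY rejection the node is held out of scheduling for a growing delay, so a busy node is not retried every few milliseconds while other nodes and tasks remain schedulable. The sketch below is illustrative only. The class BusyNodeBackoff and its methods are hypothetical names, not part of the Hive or Tez API, and the delay values are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Hive/Tez code): tracks, per node, how long to wait
 * before retrying after a SERVICE_BUSY rejection, doubling the wait each time.
 */
final class BusyNodeBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;
    // Consecutive busy rejections seen for each node.
    private final Map<String, Integer> busyCounts = new HashMap<>();
    // Earliest time (ms) at which each node may be tried again.
    private final Map<String, Long> retryAtMs = new HashMap<>();

    BusyNodeBackoff(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Records a busy rejection and returns the delay before the node may be retried. */
    long onServiceBusy(String node, long nowMs) {
        int count = busyCounts.merge(node, 1, Integer::sum);
        long delay = baseDelayMs;
        // Double once per previous consecutive rejection, capped at maxDelayMs.
        for (int i = 1; i < count && delay < maxDelayMs; i++) {
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        retryAtMs.put(node, nowMs + delay);
        return delay;
    }

    /** Clears the backoff state once the node accepts a task. */
    void onSuccess(String node) {
        busyCounts.remove(node);
        retryAtMs.remove(node);
    }

    /** True if the node is not currently held out of scheduling. */
    boolean isSchedulable(String node, long nowMs) {
        Long retryAt = retryAtMs.get(node);
        return retryAt == null || nowMs >= retryAt;
    }
}
```

A scheduler loop would call isSchedulable before assigning a task to a node and move on to other nodes or tasks while a busy node is held out, which addresses the concern that retries should not take over all scheduling. Time is passed in explicitly so the behaviour can be tested without sleeping.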



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
