[ https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-10649:
------------------------------------
    Description: 
See HIVE-10648.
When AM cannot connect to a node, that appears to cause it to stall. In the example log below, there are no other interleaved log lines even though this happens in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks scheduled.
From the "Assigning" messages I can also see that tasks are scheduled to all the nodes before and after the pause, not just to the problematic node.
LLAP daemons have corresponding gaps where, between two fragments, nothing is run for a long time on any daemon.
{noformat}
2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 22 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 23 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 24 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 25 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 26 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 27 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 28 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 29 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 30 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 31 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 32 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 33 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 34 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 35 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 36 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 37 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 38 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 40 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 41 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 42 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 43 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3] tezplugins.LlapTaskCommunicator: Unable to run task: attempt_1429683757595_0784_1_00_000017_0 on containerId: container_222212222_0784_01_000018, Communication Error
2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0, startTime=1431026014322, finishTime=1431026065838, timeTaken=51516, status=KILLED, errorEnum=COMMUNICATION_ERROR, diagnostics=Communication Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
{noformat}

  was:
See HIVE-10648.
When AM cannot connect to a node, that appears to cause it to stall.

> LLAP: AM gets stuck completely if one node is dead
> --------------------------------------------------
>
>                 Key: HIVE-10649
>                 URL: https://issues.apache.org/jira/browse/HIVE-10649
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> See HIVE-10648.
> When AM cannot connect to a node, that appears to cause it to stall. In the
> example log below, there are no other interleaved log lines even though this
> happens in the middle of Map 1 on TPCH q1, i.e. there are plenty of tasks
> scheduled.
> From the "Assigning" messages I can also see that tasks are scheduled to all
> the nodes before and after the pause, not just to the problematic node.
> LLAP daemons have corresponding gaps where, between two fragments, nothing is
> run for a long time on any daemon.
> {noformat}
> 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
> 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 22 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 23 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 24 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 25 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 26 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 27 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 28 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 29 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 30 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 31 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 32 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 33 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 34 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 35 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 36 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 37 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 38 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 40 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 41 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 42 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 43 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001.
> Already tried 45 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying > connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. > Already tried 46 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying > connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. > Already tried 47 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying > connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. > Already tried 48 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying > connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. 
> Already tried 49 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3] > tezplugins.LlapTaskCommunicator: Unable to run task: > attempt_1429683757595_0784_1_00_000017_0 on containerId: > container_222212222_0784_01_000018, Communication Error > 2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central] > history.HistoryEventHandler: > [HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]: > vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0, > startTime=1431026014322, finishTime=1431026065838, timeTaken=51516, > status=KILLED, errorEnum=COMMUNICATION_ERROR, diagnostics=Communication > Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, > DATA_LOCAL_TASKS=1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
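
As a rough sanity check on the log above (not part of the original report), the sketch below compares the retry policy's nominal sleep budget with the reported `timeTaken` and with the spacing of the retry messages. All numbers are copied from the log; the helper names `retry_sleep_budget_ms` and `elapsed_ms` are hypothetical, and the budget is only an approximation, since it ignores the time spent in each connect attempt and assumes one fixed sleep per retry.

```python
from datetime import datetime, timedelta

# Values copied from the log excerpt; nothing here is measured independently.
MAX_RETRIES = 50           # maxRetries in RetryUpToMaximumCountWithFixedSleep
SLEEP_MS = 1000            # sleepTime, in milliseconds
START_MS = 1431026014322   # startTime from the TASK_ATTEMPT_FINISHED event
FINISH_MS = 1431026065838  # finishTime from the same event


def retry_sleep_budget_ms(max_retries: int, sleep_ms: int) -> int:
    """Nominal time spent sleeping between attempts (approximation:
    assumes one fixed sleep per retry and ignores connect time)."""
    return max_retries * sleep_ms


def elapsed_ms(first: str, last: str) -> int:
    """Milliseconds between two log timestamps such as
    '2015-05-07 12:13:46,811'."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    delta = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return delta // timedelta(milliseconds=1)


budget = retry_sleep_budget_ms(MAX_RETRIES, SLEEP_MS)   # 50000
time_taken = FINISH_MS - START_MS                       # 51516, as logged

# Timestamps of the "Already tried 10" and "Already tried 49" messages.
window = elapsed_ms("2015-05-07 12:13:46,811",
                    "2015-05-07 12:14:25,834")          # 39023
per_retry = window / (49 - 10)                          # about 1000.6 ms

print(f"nominal retry budget: {budget} ms")
print(f"reported timeTaken:   {time_taken} ms")
print(f"observed per-retry:   {per_retry:.1f} ms")
```

Under these assumptions the task attempt's 51516 ms lifetime is almost entirely accounted for by the roughly 50 s the communicator thread spends retrying the connection, which is consistent with the stall described in the issue.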