[ 
https://issues.apache.org/jira/browse/FLINK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346620#comment-17346620
 ] 

Jinhong Liu commented on FLINK-3204:
------------------------------------

[~nvasilishin] I stopped a NodeManager while the job was running and then 
cancelled the job; that is when the issue occurs.

First, the issue occurs only if at least one TaskManager is running on the 
dead NodeManager.

Second, when the issue occurs, none of the containers can exit, including the 
AppMaster, not only the containers on the dead NodeManager.

Flink Version: 1.12.2

Hadoop Version: 2.7.3

> TaskManagers are not shutting down properly on YARN
> ---------------------------------------------------
>
>                 Key: FLINK-3204
>                 URL: https://issues.apache.org/jira/browse/FLINK-3204
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Robert Metzger
>            Assignee: Nikolay Vasilishin
>            Priority: Major
>              Labels: test-stability
>
> While running some experiments on a YARN cluster, I saw the following error
> {code}
> 10:15:24,741 INFO  org.apache.flink.yarn.YarnJobManager                       
>    - Stopping YARN JobManager with status SUCCEEDED and diagnostic Flink YARN 
> Client requested shutdown.
> 10:15:24,748 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl      
>    - Waiting for application to be successfully unregistered.
> 10:15:24,852 INFO  
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - 
> Interrupted while waiting for queue
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>       at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
> 10:15:24,875 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000010when 
> stopping NMClientImpl
> 10:15:24,899 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000007when 
> stopping NMClientImpl
> 10:15:24,954 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000006when 
> stopping NMClientImpl
> 10:15:24,982 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000009when 
> stopping NMClientImpl
> 10:15:25,013 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000011when 
> stopping NMClientImpl
> 10:15:25,037 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000008when 
> stopping NMClientImpl
> 10:15:25,041 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000012when 
> stopping NMClientImpl
> 10:15:25,072 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000005when 
> stopping NMClientImpl
> 10:15:25,075 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000003when 
> stopping NMClientImpl
> 10:15:25,077 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000004when 
> stopping NMClientImpl
> 10:15:25,079 ERROR org.apache.hadoop.yarn.client.api.impl.NMClientImpl        
>    - Failed to stop Container container_1452019681933_0002_01_000002when 
> stopping NMClientImpl
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-0.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-1.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-master.c.astral-sorter-757.internal:8041
> 10:15:25,080 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-4.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-2.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-3.c.astral-sorter-757.internal:8041
> 10:15:25,081 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Closing proxy : cdh544-worker-5.c.astral-sorter-757.internal:8041
> 10:15:25,085 INFO  org.apache.flink.yarn.YarnJobManager                       
>    - Stopping JobManager akka.tcp://flink@10.240.221.7:46845/user/jobmanager.
> 10:15:25,092 INFO  org.apache.flink.runtime.blob.BlobServer                   
>    - Stopped BLOB server at 0.0.0.0:35997
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
