[ https://issues.apache.org/jira/browse/FLINK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022879#comment-16022879 ]

Till Rohrmann commented on FLINK-6646:
--------------------------------------

This is actually not limited to the yarn session but also applies to the 
yarn-cluster mode. The only reason it hasn't surfaced there so far is that 
yarn-cluster mode uses a higher connection timeout ({{60 s}}) than the 
{{10 s}} used in session mode.

The underlying problem, however, is the wrong resource lifecycle management 
of the application files. In the current version, the {{ClusterClient}} 
decides when to delete the Flink cluster files even though this should be the 
responsibility of the {{YarnApplicationMaster}}. The {{YarnApplicationMaster}} 
should decide when the Yarn application has terminated and when the files can 
be deleted. This wrong separation of concerns is also the reason why the 
uploaded application files are never deleted in the case of a detached 
execution.
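
To make the intended responsibility concrete, here is a minimal sketch of an 
application-master-side cleanup, assuming the per-application staging 
directory layout seen in the log below; the class and method names are 
hypothetical and not part of Flink's actual API:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/**
 * Hypothetical sketch (not Flink code): the application master, not the
 * client, removes the uploaded application files once it has decided that
 * the YARN application has terminated.
 */
public class StagingDirectoryCleanup {

    /**
     * @param stagingDir per-application staging directory, e.g.
     *                   hdfs:///user/robert/.flink/application_<appId>
     */
    public static void deleteApplicationFiles(Path stagingDir) throws IOException {
        FileSystem fs = stagingDir.getFileSystem(new Configuration());
        if (fs.exists(stagingDir)) {
            // recursively delete the staging directory with the uploaded
            // jars and the generated taskmanager-conf.yaml
            fs.delete(stagingDir, true);
        }
    }
}
{code}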

Correcting the resource lifecycle management is actually a bigger task and 
should be properly implemented as part of the upcoming FLIP-6 work. 
Therefore, I propose to mitigate the current problem by also increasing the 
connection timeout for the yarn session, matching the yarn-cluster mode, 
along the lines of the sketch below.
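
As a rough illustration of the mitigation, a sketch that raises the session 
client's timeouts to the yarn-cluster value. {{akka.lookup.timeout}} and 
{{akka.client.timeout}} are the documented timeout options in the 1.2/1.3 
line, but whether the session client reads exactly these keys is an 
assumption here:

{code}
import org.apache.flink.configuration.Configuration;

/**
 * Sketch of the proposed mitigation, not the actual patch: give the YARN
 * session client the longer connection timeout that yarn-cluster mode
 * effectively uses. Which key the session client actually honors should be
 * verified against the Flink version in use.
 */
public class SessionTimeoutMitigation {

    public static Configuration withLongerTimeout(Configuration config) {
        // session mode currently waits only "10 s" for the connection
        config.setString("akka.lookup.timeout", "60 s");
        config.setString("akka.client.timeout", "60 s");
        return config;
    }
}
{code}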

> YARN session doesn't work with HA
> ---------------------------------
>
>                 Key: FLINK-6646
>                 URL: https://issues.apache.org/jira/browse/FLINK-6646
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>            Priority: Critical
>
> While testing Flink 1.3.0 RC1, I ran into the following issue on the 
> JobManager.
> {code}
> 2017-05-19 14:41:38,030 INFO  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader 
> reachable under 
> akka.tcp://flink@permanent-qa-cluster-i7c9.c.astral-sorter-757.internal:36528/user/jobmanager:6539dc04-d7fe-4f85-a0b6-09bfb0de8a58.
> 2017-05-19 14:41:38,033 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Resource Manager associating with leading JobManager 
> Actor[akka://flink/user/jobmanager#1602741108] - leader session 
> 6539dc04-d7fe-4f85-a0b6-09bfb0de8a58
> 2017-05-19 14:41:38,033 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Requesting new TaskManager container with 1024 megabytes 
> memory. Pending requests: 1
> 2017-05-19 14:41:38,781 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Received new container: 
> container_1494870922226_0061_02_000002 - Remaining pending container 
> requests: 0
> 2017-05-19 14:41:38,782 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Launching TaskManager in container ContainerInLaunch @ 
> 1495204898782: Container: [ContainerId: 
> container_1494870922226_0061_02_000002, NodeId: 
> permanent-qa-cluster-d3iz.c.astral-sorter-757.internal:8041, NodeHttpAddress: 
> permanent-qa-cluster-d3iz.c.astral-sorter-757.internal:8042, Resource: 
> <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, 
> service: 10.240.0.32:8041 }, ] on host 
> permanent-qa-cluster-d3iz.c.astral-sorter-757.internal
> 2017-05-19 14:41:38,788 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : permanent-qa-cluster-d3iz.c.astral-sorter-757.internal:8041
> 2017-05-19 14:41:44,284 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Container container_1494870922226_0061_02_000002 failed, with 
> a TaskManager in launch or registration. Exit status: -1000
> 2017-05-19 14:41:44,284 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Diagnostics for container 
> container_1494870922226_0061_02_000002 in state COMPLETE : exitStatus=-1000 
> diagnostics=File does not exist: 
> hdfs://nameservice1/user/robert/.flink/application_1494870922226_0061/cf9287fe-ac75-4066-a648-91787d946890-taskmanager-conf.yaml
> java.io.FileNotFoundException: File does not exist: 
> hdfs://nameservice1/user/robert/.flink/application_1494870922226_0061/cf9287fe-ac75-4066-a648-91787d946890-taskmanager-conf.yaml
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1219)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1211)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1211)
>       at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
>       at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> The problem is the following:
> - JobManager1 starts from a yarn-session.sh
> - Job1 gets submitted to JobManager1
> - JobManager1 dies
> - YARN starts a new JM: JobManager2
> - in the meantime, errors appear on the yarn-session.sh client, shutting 
> down the session; this includes deleting the YARN staging directory in HDFS
> - JobManager2 is unable to start a new TaskManager because the files in the 
> staging directory were deleted by the client (see the sketch after this 
> quote)
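
To make the race visible, a small hypothetical check (not Flink code) that 
reproduces the state JobManager2 ends up in: the staging directory, including 
the {{taskmanager-conf.yaml}} that YARN localizes into new containers, is 
already gone:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/**
 * Hypothetical diagnostic, not Flink code: after the session client has
 * deleted the staging directory, container localization for the restarted
 * JobManager fails with the FileNotFoundException shown in the log above.
 */
public class StagingRaceCheck {

    public static void main(String[] args) throws IOException {
        // the per-application staging directory from the log above
        Path stagingDir = new Path(
                "hdfs://nameservice1/user/robert/.flink/application_1494870922226_0061");

        FileSystem fs = stagingDir.getFileSystem(new Configuration());
        if (!fs.exists(stagingDir)) {
            // the state JobManager2 finds itself in: the taskmanager-conf.yaml
            // it ships to new containers no longer exists
            System.err.println("Staging directory already deleted by the client: "
                    + stagingDir);
        }
    }
}
{code}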



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
