[ https://issues.apache.org/jira/browse/FLINK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946785#comment-15946785 ]
Haohui Mai commented on FLINK-5668:
-----------------------------------

Sorry for the delayed response. Our main requirement is to allow Flink to support mission-critical, real-time applications. Our colleagues want to build such applications on top of Flink, and they are concerned that no jobs can be started when HDFS is down -- today there is no workaround that lets their applications keep their SLAs while HDFS is under maintenance.

As you pointed out, there are multiple issues (e.g., checkpoints) involved in keeping a Flink job running in the above scenario. To get started, we would like to be able to start the job when HDFS is down and address the other issues in later jiras. This essentially reduces to one requirement: Flink needs an option to bootstrap jobs without persisting data on {{default.FS}}.

I think https://github.com/apache/flink/pull/2796/files will work as long as (1) Flink persists everything to that path, and (2) the path can specify a file system other than {{default.FS}}. [~bill.liu8904], can you elaborate on why it won't work for you?

Below are some inlined answers.

{quote}
All the paths are programmatically generated and there are no configuration parameters for passing custom paths (correct me if I'm wrong). Are you planning to basically fork Flink and create a custom YARN client / Application Master implementation that allows using custom paths?
{quote}

It is sufficient to specify just the root of the path -- I believe something like {{yarn.deploy.fs}} or https://github.com/apache/flink/pull/2796/files will work.

{quote}
I think we didn't have your use case in mind when implementing the code. We assumed that one file system would be used for distributing all required files. Also, this approach works nicely with all the Hadoop vendors' versions.
{quote}

We originally shared the same line of thought that HDFS HA should be sufficient.
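As a rough sketch, the proposal above might look like the following {{flink-conf.yaml}} fragment. Note that {{yarn.deploy.fs}} is only a hypothetical option name floated in this discussion (it does not exist in Flink today), and the S3 URI is an arbitrary example of a file system other than {{default.FS}}:

{code:yaml}
# Hypothetical option from this discussion: root URI under which Flink
# would persist all YARN deployment artifacts (job jars,
# taskmanager-conf.yaml, etc.). Pointing it at a file system other than
# default.FS would remove the HDFS dependency at job startup.
yarn.deploy.fs: s3://example-bucket/flink-deploy/
{code}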
The problem is that mission-critical real-time applications have a much stricter SLA than HDFS, so they need to survive HDFS downtime.

{quote}
The general theme is: some persistent store is currently needed, at least for high-availability modes. Decoupling YARN from a persistent store pushes the responsibility to another layer.
{quote}

Totally agree. Whether in HA mode or not, having a distributed file system underneath simplifies things a lot. Passing state as configuration / environment variables is just one solution, but not necessarily the best one. I think we are good to go as long as Flink is able to bootstrap jobs from places other than {{default.FS}}. Thoughts?

> passing taskmanager configuration through taskManagerEnv instead of file
> ------------------------------------------------------------------------
>
>                 Key: FLINK-5668
>                 URL: https://issues.apache.org/jira/browse/FLINK-5668
>             Project: Flink
>          Issue Type: Improvement
>          Components: YARN
>            Reporter: Bill Liu
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When creating a Flink cluster on YARN, the JobManager depends on HDFS to share taskmanager-conf.yaml with the TaskManagers.
> It would be better to serve taskmanager-conf.yaml from the JobManager web server instead of HDFS, which would reduce the HDFS dependency at job startup.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)