[ 
https://issues.apache.org/jira/browse/FLINK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946785#comment-15946785
 ] 

Haohui Mai commented on FLINK-5668:
-----------------------------------

Sorry for the delayed response.

Our main requirement is for Flink to support mission-critical, real-time 
applications. Our colleagues want to build such applications on top of Flink, 
and they are concerned that no jobs can be started when HDFS is down -- today 
there is no workaround that lets their applications keep their SLAs while HDFS 
is under maintenance.

As you pointed out, there are multiple issues (e.g., checkpoints) to address 
before a Flink job can keep running in the above scenario. To get started, we 
would like to be able to start jobs while HDFS is down and tackle the other 
issues in later jiras.

As a result, this essentially reduces to one requirement -- Flink needs an 
option to bootstrap jobs without persisting data on {{default.FS}}.

I think https://github.com/apache/flink/pull/2796/files will work as long as 
(1) Flink persists everything to that path, and (2) the path can specify a file 
system other than {{default.FS}}. [~bill.liu8904], can you elaborate on why it 
won't work for you?

Below are some inlined answers.

{quote}
All the paths are programmatically generated and there are no configuration 
parameters for passing custom paths (correct me if I'm wrong).
Are you planning to basically fork Flink and create a custom YARN client / 
Application Master implementation that allows using custom paths?
{quote}

It is sufficient to just specify the root of the path -- I believe something 
like {{yarn.deploy.fs}} or https://github.com/apache/flink/pull/2796/files will 
work.
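
To make the proposal concrete, a hypothetical {{flink-conf.yaml}} sketch -- 
note that {{yarn.deploy.fs}} is only the name suggested above, not an existing 
Flink option, and the paths are made-up examples:

```yaml
# Hypothetical option (the key name "yarn.deploy.fs" is a proposal from this
# discussion, not an existing Flink setting): root path under which the YARN
# client would stage jars and configuration files. Pointing it at a file
# system other than default.FS would let jobs bootstrap while HDFS is down.
yarn.deploy.fs: s3://flink-staging/

# default.FS (HDFS) stays in place for checkpoints etc. once it is back up.
fs.default-scheme: hdfs://namenode:8020/
```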

{quote}
I think we didn't have your use case in mind when implementing the code. We 
assumed that one file system will be used for distributing all required files. 
Also, this approach works nicely with all the Hadoop vendors' versions.
{quote}

We originally shared the same line of thought -- that HDFS HA should be 
sufficient. The problem is that mission-critical real-time applications have a 
much stricter SLA than HDFS, so they need to survive HDFS downtime.

{quote}
The general theme is: Some persistent store is needed currently, at least for 
high-availability modes. Decoupling Yarn from a persistent store pushes the 
responsibility to another layer.
{quote}

Totally agree. Whether it is in HA mode or not, having a distributed file 
system underneath simplifies things a lot. Passing state as configuration / 
environment variables is just one solution but not necessarily the best one. I 
think we are good to go as long as Flink is able to bootstrap the jobs from 
places other than {{default.FS}}.

Thoughts?



> passing taskmanager configuration through taskManagerEnv instead of file
> ------------------------------------------------------------------------
>
>                 Key: FLINK-5668
>                 URL: https://issues.apache.org/jira/browse/FLINK-5668
>             Project: Flink
>          Issue Type: Improvement
>          Components: YARN
>            Reporter: Bill Liu
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When creating a Flink cluster on YARN, the JobManager depends on HDFS to 
> share taskmanager-conf.yaml with the TaskManagers.
> It would be better to serve taskmanager-conf.yaml from the JobManager web 
> server instead of HDFS, which would reduce the HDFS dependency at job startup.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
