Hi Chiang,

Some of the answers you can find in line:

> On Sep 17, 2018, at 3:47 PM, Chang Liu <fluency...@gmail.com> wrote:
> 
> Dear All,
> 
> I am helping my team setup a Flink cluster and we would like to have high 
> availability and easy to scale.
> 
> We would like to setup a minimal cluster environment but can be easily scaled 
> in the future. This is my simple proposal: 
> 2 nodes
> each node is running a Flink instance, a YARN, and a HDFS
> Flink, YARN and HDFS are all running in cluster mode.
> 
> <image.jpeg>
> 
> Based on it, my questions are:
> By using HDFS as the file system, we can achieve fault tolerant (by 
> recovering the checkpoint states when job fails). Question: so Flink itself 
> is not capable of keeping and maintaining distributed state persistence just 
> using local Linux file system, right?
> Then, my follow-up is: if you have a Flink cluster (multiple nodes), and you 
> use local Linux file system keeping the state checkpoints, what will happen 
> if Flink job failed and Flink start to restart the job and recover the state 
> from checkpoints?

For both the above:
When a task fails, the whole job (all the tasks) are restarted, and are 
rescheduled on different machines.
If you use a local FS and you try to fetch state remotely upon recovery, how 
would the new nodes be able to locate
the state on a remote machine?

This is why Flink uses a distributed file system.

> If the Flink is deployed and managed on YARN, does that mean: if YARN is 
> down, Flink is down?

Well, it depends on which component fails. And I am not sure about all of them, 
but you could try it and see.

> If we have Flink managed by YARN, what is the purpose of using ZooKeeper? I 
> did not really understand this part: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/jobmanager_high_availability.html
>  
> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/jobmanager_high_availability.html>.
>  My question is: why YARN cannot provide this JobManager HA, and we have to 
> add ZooKeeper?

YARN can make sure that a new job master starts, but that master will have to 
fetch the state of the previous job master in order to know which jobs are 
running, their
progress, etc.

> How do you think I can keep different components of the architecture in 
> different nodes (servers)? Do I keep every instance of Flink/YARN/HDFS on 
> every single server, or I put each of them on completely different servers. 
> Some of my considerations:
> if we put them on different servers, there will be many latency over the 
> network between Flink <-> HDFS, and YARN <-> HDFS
> But if I each all of the 3 components Flink/YARN/HDFS on every server, they 
> can also fight against each other for resources, right?

You are right that you have to consider the above before deciding on your setup.

> Correct me if i am wrong: one thing for sure is that, for every new where 
> there is a Flink instance running, there should be a YARN running right?
> 
> 
> Many thanks in advance!
> 
> Best regards/祝好,
> 
> Chang Liu 刘畅

I hope this helps,
Kostas

Reply via email to