Thanks for your answers :) Best regards/祝好,
Chang Liu 刘畅 > On 17 Sep 2018, at 17:25, Kostas Kloudas <k.klou...@data-artisans.com> wrote: > > Hi Chiang, > > Some of the answers you can find in line: > >> On Sep 17, 2018, at 3:47 PM, Chang Liu <fluency...@gmail.com >> <mailto:fluency...@gmail.com>> wrote: >> >> Dear All, >> >> I am helping my team setup a Flink cluster and we would like to have high >> availability and easy to scale. >> >> We would like to setup a minimal cluster environment but can be easily >> scaled in the future. This is my simple proposal: >> 2 nodes >> each node is running a Flink instance, a YARN, and a HDFS >> Flink, YARN and HDFS are all running in cluster mode. >> >> <image.jpeg> >> >> Based on it, my questions are: >> By using HDFS as the file system, we can achieve fault tolerant (by >> recovering the checkpoint states when job fails). Question: so Flink itself >> is not capable of keeping and maintaining distributed state persistence just >> using local Linux file system, right? >> Then, my follow-up is: if you have a Flink cluster (multiple nodes), and you >> use local Linux file system keeping the state checkpoints, what will happen >> if Flink job failed and Flink start to restart the job and recover the state >> from checkpoints? > > For both the above: > When a task fails, the whole job (all the tasks) are restarted, and are > rescheduled on different machines. > If you use a local FS and you try to fetch state remotely upon recovery, how > would the new nodes be able to locate > the state on a remote machine? > > This is why Flink uses a distributed file system. > >> If the Flink is deployed and managed on YARN, does that mean: if YARN is >> down, Flink is down? > > Well, it depends on which component fails. And I am not sure about all of > them, but you could try it and see. > >> If we have Flink managed by YARN, what is the purpose of using ZooKeeper? I >> did not really understand this part: >> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/jobmanager_high_availability.html >> >> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/jobmanager_high_availability.html>. >> My question is: why YARN cannot provide this JobManager HA, and we have to >> add ZooKeeper? > > YARN can make sure that a new job master starts, but that master will have to > fetch the state of the previous job master in order to know which jobs are > running, their > progress, etc. > >> How do you think I can keep different components of the architecture in >> different nodes (servers)? Do I keep every instance of Flink/YARN/HDFS on >> every single server, or I put each of them on completely different servers. >> Some of my considerations: >> if we put them on different servers, there will be many latency over the >> network between Flink <-> HDFS, and YARN <-> HDFS >> But if I each all of the 3 components Flink/YARN/HDFS on every server, they >> can also fight against each other for resources, right? > > You are right that you have to consider the above before deciding on your > setup. > >> Correct me if i am wrong: one thing for sure is that, for every new where >> there is a Flink instance running, there should be a YARN running right? >> >> >> Many thanks in advance! >> >> Best regards/祝好, >> >> Chang Liu 刘畅 > > I hope this helps, > Kostas