Hello!

I am having some difficulty with multiple job managers in an HA setup using
Flink 1.9.0.

I have 2 job managers, configured for HA with the following settings:

high-availability: zookeeper
high-availability.cluster-id: /imet-enhance
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum:
flink-state-hdfs-zookeeper-1.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181,flink-state-hdfs-zookeeper-2.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181,flink-state-hdfs-zookeeper-0.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50000-50025

The job managers sit behind a load balancer inside a Kubernetes cluster.

They work great except for one thing. When I use the web UI (or REST API) to
upload the jar file and start the job, the request sometimes lands on a
different job manager than the one that received the upload. That job manager
doesn't have the jar file in its temporary directory, so the job fails to
start.

In the 1.7 version of this setup the standby job manager would answer with a
redirect. I put HAProxy in front of the pair and only allowed traffic to the
job manager that wasn't returning a 3xx, and this worked well for everything.
In 1.9 it appears that both job managers are able to respond (via the
internal proxy mechanism I have seen in prior emails). However, it appears
each job manager still keeps its own local web file cache rather than a
shared one.
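The 1.7-era HAProxy health check looked roughly like this (server names and
ports are placeholders); it marked down any backend that answered the check
with anything other than a 200, which is how the redirecting standby got
excluded:

```
backend flink_jobmanagers
    # Probe the web UI root; the 1.7 standby replied with a 3xx
    # redirect, so only the leader passes this check.
    option httpchk GET /
    http-check expect status 200
    server jm1 flink-jobmanager-0:8081 check
    server jm2 flink-jobmanager-1:8081 check
```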

I also tried attaching a shared NFS folder between the two machines and
pointing both web.tmpdir properties at it, but it appears that each job
manager creates its own separate directory inside that folder.

My end goals are:
1) Provide a fault-tolerant Flink cluster
2) Provide a persistent storage directory for the jar file so I can perform
rescaling without needing to re-upload the jar file.

Thoughts?
-Steve
