Hello! I am having some difficulty with multiple job managers in an HA setup using Flink 1.9.0.
I have 2 job managers and have set up HA with the following config:

    high-availability: zookeeper
    high-availability.cluster-id: /imet-enhance
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.zookeeper.quorum: flink-state-hdfs-zookeeper-1.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181,flink-state-hdfs-zookeeper-2.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181,flink-state-hdfs-zookeeper-0.flink-state-hdfs-zookeeper-headless.default.svc.cluster.local:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.jobmanager.port: 50000-50025

The job managers sit behind a load balancer inside a Kubernetes cluster. They work great except for one thing: when I use the web UI (or the REST API) to upload the jar file and start the job, the request sometimes lands on a different job manager, which doesn't have the jar file in its temporary directory, so the job fails to start.

In the 1.7 version of this setup the standby job manager would return a redirect response. I put HAProxy in front of the pair, configured to route traffic only to the job manager that wasn't returning a 3xx, and that worked well for everything. In 1.9 it appears that both job managers can respond (via the internal proxy mechanism I have seen mentioned in prior emails); however, the web file cache still does not appear to be shared between them. I also tried mounting a shared NFS folder on both machines and pointing their web.tmpdir property at it, but each job manager creates a separate upload directory inside that folder.

My end goals are:
1) Provide a fault-tolerant Flink cluster.
2) Provide a persistent storage directory for the jar file, so I can rescale without needing to re-upload the jar.

Thoughts?
-Steve
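
P.S. For reference, the 1.7-era HAProxy trick looked roughly like the sketch below. This is illustrative only: the frontend/backend names, hostnames, and ports are placeholders, not my exact config. The idea is to health-check each job manager's web endpoint and mark any instance answering with a redirect (3xx) as down, so uploads only ever reach the leader:

```haproxy
# Sketch only -- names, hosts, and ports are assumptions.
frontend flink_rest
    bind *:8081
    default_backend flink_jobmanagers

backend flink_jobmanagers
    # Probe the web UI root on each job manager...
    option httpchk GET /
    # ...and require a 200; a standby answering 3xx is marked down,
    # so only the leader receives upload/run requests.
    http-check expect status 200
    server jm1 flink-jobmanager-1:8081 check
    server jm2 flink-jobmanager-2:8081 check
```

In 1.9 this stops working as a leader filter, because both job managers answer 200 and internally proxy to the leader.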