Hi community,

I would like to test the Flink k8s operator's HA capabilities for TM and JM failover.
The simple test I did for TM failover was as follows:
- run a Flink session cluster in native mode
- submit a FlinkSessionJob resource with SAVEPOINT upgrade mode
- kill the task manager pod

It turns out that after I killed the TM, the k8s operator did not create a new TM to replace the killed one. The job was canceled and ended up in Job Status -> Failed.

I was under the impression that no extra configuration is needed for TM HA. I have found [1] and [2], but I'm not sure whether they apply to JM failover only or to both TM and JM. It is also not clear to me whether I still need to configure [1] when using the Flink k8s operator.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#kubernetes-ha-services
[2] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability

Regards,
Krzysztof Chmielewski