Are there any special precautions to take when performing routine K8s maintenance such as migrating or upgrading clusters?
For the sake of concreteness: I'm running my jobs via the Flink K8s Operator, and I'm finding that when I roll out new nodes and migrate my jobs to them, in some cases the jobs get stuck and/or don't restart properly, or restart multiple times, causing more downtime than expected.

As of now, my migration/rollout process is as follows:

- Create new K8s nodes/instances
- Cordon the old ones to be replaced (where my jobs are running)
- Take savepoints
- Drain the old nodes
- Wait until all jobs show up as RUNNING and STABLE

Nothing special here, I would say. However, I wonder if there are any Flink-specific best practices that help minimize downtime and potential failures during these maintenance windows, such as tweaking PodDisruptionBudgets and/or pod affinities, or maybe moving to an HA setup with multiple jobmanagers instead of just one (I've sketched both options below).

To be clear, all my jobs are deployed like this:

```
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
...
spec:
  ...
  mode: native
```

and, for what it's worth, their HA setup is based on the native Kubernetes HA services (rather than ZooKeeper) with a single jobmanager.
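For reference, the kind of PodDisruptionBudget I have in mind is sketched below. The label selector is an assumption on my part (I believe the operator/native integration labels pods with `app: <deployment name>` and `component: jobmanager`/`taskmanager`, but I haven't verified this against my pods), and `my-flink-job` is just a placeholder name:

```
# Sketch of a PDB to stop a node drain from evicting the (single) jobmanager
# before savepoints are taken. Label values are assumptions about what the
# native K8s integration sets; check with `kubectl get pods --show-labels`.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-flink-job-jm-pdb    # placeholder name
spec:
  maxUnavailable: 0            # block voluntary evictions of the jobmanager pod
  selector:
    matchLabels:
      app: my-flink-job        # assumed: app=<FlinkDeployment name>
      component: jobmanager
```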
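And for the multi-jobmanager HA option, this is roughly how I'd extend my current FlinkDeployment. It's only a sketch: the storage path and resource values are placeholders, and I'm not sure whether the newer `high-availability.type` key or the older `high-availability` key applies to my Flink version:

```
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-job           # placeholder name
spec:
  mode: native
  flinkConfiguration:
    # Kubernetes HA services (no ZooKeeper); on older Flink versions the key
    # is "high-availability: kubernetes" instead of "high-availability.type"
    high-availability.type: kubernetes
    high-availability.storageDir: s3://my-bucket/flink-ha   # placeholder path
  jobManager:
    replicas: 2                # one active + one standby jobmanager
    resource:
      memory: "2048m"
      cpu: 1
  # remaining fields (image, flinkVersion, job, taskManager, ...) as in my
  # current deployments
```

Would something along these lines actually reduce the failed/repeated restarts during node rollouts, or is there a better-established pattern for this?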