Hi all, I’m looking for some help understanding the behaviour of the Flink Kubernetes operator when recovering a JobManager deployment.
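
For context, the FlinkDeployment spec looks roughly like this (a trimmed-down sketch; the image, Flink version and HA storage path are illustrative rather than the exact values I run – only the HA and job settings matter here):

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: sql-runner-deploy
    spec:
      image: <application image>            # illustrative
      flinkVersion: v1_17                   # illustrative
      # restartNonce: 1                     # bumped when I want to force a restart
      flinkConfiguration:
        high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
        high-availability.storageDir: file:///opt/flink/volume/flink-ha   # illustrative path
        state.checkpoints.dir: file:///opt/flink/volume/flink-cp
      jobManager:
        replicas: 1
      job:
        jarURI: local:///opt/flink/usrlib/sql-runner.jar   # illustrative
        upgradeMode: savepoint
        state: running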
The scenario: an ApplicationCluster with HA enabled and upgradeMode set to savepoint. The job is in a FAILED state, so it is no longer referenced in the HA config map.

* If I delete the JM pod, the leader information is removed from the HA config map (so it is now empty). k8s creates a new pod, and when it starts, the application job is submitted but rejected, because the persisted HA data still records that the job reached a globally-terminal state, even though the config map no longer holds any HA metadata:

    WARN org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Ignoring JobGraph submission 'sql-runner-deploy' (0bb6676ed5cf22dded08018f1aca7794) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.

* If I change the spec of a failed job or use restartNonce, the operator deletes the HA config map, then deletes and recreates the JM deployment. The ApplicationCluster then resumes from the checkpoint. My guess is that using the restart nonce tells the operator not to look for HA metadata, because it knows it deleted the config map itself. The JobManager clears out the persisted HA data on startup when the config map is missing, so the job can be re-submitted and is accepted:

    org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job 0bb6676ed5cf22dded08018f1aca7794 from savepoint file:/opt/flink/volume/flink-cp/0bb6676ed5cf22dded08018f1aca7794/chk-15 (allowing non restored state)
    INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting job 'sql-runner-deploy' (0bb6676ed5cf22dded08018f1aca7794).

* If I delete the underlying JobManager deployment (the config map is now empty), the operator goes into a failure state with:

    [ERROR][sp/sql-runner-deploy] Flink recovery failed
    org.apache.flink.kubernetes.operator.exception.RecoveryFailureException: HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted.

I can see that AbstractFlinkDeploymentObserver spots the missing JobManager deployment and puts the CR into an error state. That triggers a reconcile, and ApplicationReconciler.reconcileOtherChanges() realizes the deployment needs to be recovered, so it calls resubmitJob(), which requires HA metadata to exist because the ApplicationCluster is configured in HA mode. That eventually trickles down to AbstractFlinkService.submitApplicationCluster(), which validates the HA metadata. The validation appears to fail because the fabric8 client filters out the (now empty) HA config map.

I understand that when a JM pod restarts, that is business as usual for Flink HA, so the job does not restart (although it also disappears from the JM job list).

Am I correct in assuming that the operator looks for HA metadata when recovering the deployment to make sure it does not automatically redeploy a failed job, whereas with restartNonce you are explicitly asking for a restart? If that is the case, why does using restartNonce not recreate the missing deployment?

--
Nic Townsend
IBM Event Processing
Senior Engineer / Technical Lead

Slack: @nictownsend
Bluesky: @nict0wnsend.bsky.social

Unless otherwise stated above:
IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: Building C, IBM Hursley Office, Hursley Park Road, Winchester, Hampshire SO21 2JN