Hi all,

I’m looking for some help understanding a behaviour of the k8s operator when 
recovering a JobManager deployment.

The scenario: an ApplicationCluster, with HA enabled and an upgrade mode of 
savepoint. The job is in a FAILED state, so it is no longer listed in the HA 
config map.
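For reference, a trimmed sketch of the kind of FlinkDeployment spec I mean 
(the jar path, Flink version and HA storage dir are illustrative; the 
checkpoint dir matches the log output below):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sql-runner-deploy
spec:
  flinkVersion: v1_17                # illustrative version
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/volume/flink-ha   # illustrative path
    state.checkpoints.dir: file:///opt/flink/volume/flink-cp
  job:
    jarURI: local:///opt/flink/usrlib/sql-runner.jar                  # illustrative jar
    upgradeMode: savepoint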


  *   If I delete the JM pod, the leader information is removed from the HA 
config map (so it's now empty). k8s creates a new pod, and when it starts, the 
application job is re-submitted and rejected, because the persisted HA data 
still records that the job reached a terminal (failed) state, even though the 
leader metadata is gone from the config map:

WARN org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Ignoring 
JobGraph submission 'sql-runner-deploy' (0bb6676ed5cf22dded08018f1aca7794) 
because the job already reached a globally-terminal state (i.e. FAILED, 
CANCELED, FINISHED) in a previous execution.


  *   If I change the spec of a failed job or bump restartNonce (a minimal 
sketch of that change follows the log lines below), the operator deletes the 
HA config map, then deletes and recreates the JM deployment, and the 
ApplicationCluster resumes from the latest checkpoint. My guess is that using 
the restart nonce tells the operator not to look for HA metadata, because it 
knows it deleted the config map itself. The JobManager then clears out the 
persisted HA data on startup (since the config map is missing), and the job 
can be re-submitted and accepted:


org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
0bb6676ed5cf22dded08018f1aca7794 from savepoint 
file:/opt/flink/volume/flink-cp/0bb6676ed5cf22dded08018f1aca7794/chk-15 
(allowing non restored state

INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting 
job 'sql-runner-deploy' (0bb6676ed5cf22dded08018f1aca7794).
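For clarity, the restartNonce change above is just a bump to that field in the 
CR spec (a minimal sketch; the nonce value is arbitrary):

spec:
  restartNonce: 124   # any value different from the previous one triggers a restart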


  *   If instead I delete the underlying JobManager deployment (the config map 
is again left empty), the operator goes into an error state with:


[ERROR][sp/sql-runner-deploy] Flink recovery failed

org.apache.flink.kubernetes.operator.exception.RecoveryFailureException: HA 
metadata not available to restore from last state. It is possible that the job 
has finished or terminally failed, or the configmaps have been deleted.

I can see that the AbstractFlinkDeploymentObserver spots the missing JobManager 
deployment and puts the CR into an error state. This triggers a reconcile, and 
ApplicationReconciler.reconcileOtherChanges() realizes the deployment needs to 
be recovered, so it triggers resubmitJob(), which requires HA metadata to exist 
because the ApplicationCluster is configured in HA mode.

That eventually trickles down to 
AbstractFlinkService.submitApplicationCluster(), which validates that HA 
metadata exists. That validation appears to fail because the fabric8 client 
filters out the (now empty) HA config map.

I understand that when a JM pod restarts, that's business as usual for Flink HA, 
and the job therefore doesn't restart (though it also disappears from the JM 
job list). Am I correct in assuming that the operator looks for HA metadata 
when recovering the deployment to ensure it doesn't automatically redeploy a 
failed job, whereas with restartNonce you are explicitly asking for a restart?

If that is the case – why does using restartNonce not recreate the missing 
deployment?
--

Nic Townsend
IBM Event Processing
Senior Engineer / Technical Lead

Slack: @nictownsend
Bluesky: @nict0wnsend.bsky.social


Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: Building C, IBM Hursley Office, Hursley Park Road, 
Winchester, Hampshire SO21 2JN
