[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643488#comment-17643488 ]
Gyula Fora commented on FLINK-30305:
------------------------------------

I think there is a mix of things happening here, and it may not be a bug.

When you use the savepoint upgrade mode, the job actually completes and the HA metadata is automatically deleted by Flink itself. So the first upgrading log message is fine.

The problem with submitting an incorrect pod template is that if the job never starts, the operator does not know why the HA metadata is not available and cannot tell whether it is safe to retry.

> Operator deletes HA metadata during stateful upgrade, preventing potential manual rollback
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade
> mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the
> operator. After the job was running, I purposely created an erroneous spec by
> changing my pod template to include an entry in {{envFrom -> secretRef}} with
> a name that doesn't exist. Schema validation passed, so the operator tried to
> upgrade the job, but the new pod hangs with {{CreateContainerConfigError}},
> and I see this in the operator logs:
> {noformat}
> >>> Status | Info | UPGRADING | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even if I remove the non-existing entry from my pod template, the
> operator can no longer propagate the new spec because "Job is not running yet
> and HA metadata is not available, waiting for upgradeable state".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)