[ 
https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Sarda-Espinosa updated FLINK-30305:
------------------------------------------
    Description: 
I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade mode 
= {{savepoint}}, and with _automatic_ rollback _disabled_ in the operator. 
After the job was running, I purposely created an erroneous spec by changing my 
pod template to include an entry in {{envFrom -> secretRef}} with a name that 
doesn't exist. Schema validation passed, so the operator tried to upgrade the 
job, but the new pod hangs with {{CreateContainerConfigError}}, and I see this 
in the operator logs:

{noformat}
>>> Status | Info    | UPGRADING       | The resource is being upgraded
Deleting deployment with terminated application before new deployment
Deleting JobManager deployment and HA metadata.
{noformat}

Afterwards, even if I remove the non-existing entry from my pod template, the 
operator can no longer propagate the new spec because "Job is not running yet 
and HA metadata is not available, waiting for upgradeable state".
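
For illustration, the kind of spec change described above might look like the following fragment. This is only a sketch, not the actual manifest from this report: the resource name and the secret name are hypothetical, and required fields such as the image, Flink version, job jar and HA storage directory are omitted.

{noformat}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                      # hypothetical name
spec:
  flinkConfiguration:
    # Kubernetes-based HA
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # automatic rollback disabled in the operator
    kubernetes.operator.deployment.rollback.enabled: "false"
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          envFrom:
            - secretRef:
                name: missing-secret     # hypothetical; no Secret with this name exists
  job:
    upgradeMode: savepoint
{noformat}

A reference to a missing Secret like this is structurally valid, which would be consistent with schema validation passing while the pod itself cannot start.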


  was:
I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade mode 
= {{savepoint}}, and with _automatic_ rollback _disabled_ in the operator. 
After the job was running, I purposely created an erroneous spec by changing my 
pod template to include an entry in {{envFrom -> secretRef}} with a name that 
doesn't exist. Schema validation passed, so the operator tried to upgrade the 
job, and I see this in the logs:

{noformat}
>>> Status | Info    | UPGRADING       | The resource is being upgraded
Deleting deployment with terminated application before new deployment
Deleting JobManager deployment and HA metadata.
{noformat}

Afterwards, even if I remove the non-existing entry from my pod template, the 
operator can no longer propagate the new spec because "Job is not running yet 
and HA metadata is not available, waiting for upgradeable state".



> Operator deletes HA metadata during stateful upgrade, preventing potential 
> manual rollback
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade 
> mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the 
> operator. After the job was running, I purposely created an erroneous spec by 
> changing my pod template to include an entry in {{envFrom -> secretRef}} with 
> a name that doesn't exist. Schema validation passed, so the operator tried to 
> upgrade the job, but the new pod hangs with {{CreateContainerConfigError}}, 
> and I see this in the operator logs:
> {noformat}
> >>> Status | Info    | UPGRADING       | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even if I remove the non-existing entry from my pod template, the 
> operator can no longer propagate the new spec because "Job is not running yet 
> and HA metadata is not available, waiting for upgradeable state".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
