Mika Naylor created FLINK-26772:
-----------------------------------

             Summary: Kubernetes Native in HA Application Mode does not retry 
resource cleanup
                 Key: FLINK-26772
                 URL: https://issues.apache.org/jira/browse/FLINK-26772
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.15.0
            Reporter: Mika Naylor


I set up a scenario in which a k8s native cluster running in Application Mode 
used an S3 bucket for its high availability storage directory, with the Hadoop 
plugin. The credentials the cluster used give it permission to write to the 
bucket, but not to delete from it, so cleaning up the blob/jobgraph will fail.

I expected that when trying to clean up the HA resources, it would attempt to 
retry the cleanup. I even configured this explicitly:

{{cleanup-strategy: fixed-delay
cleanup-strategy.fixed-delay.attempts: 100
cleanup-strategy.fixed-delay.delay: 10 s}}
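
For reference, the behaviour expected from that configuration can be sketched 
roughly as below. This is only an illustration of fixed-delay retry semantics, 
not Flink's actual implementation: the function name retry_with_fixed_delay and 
its parameters are hypothetical, and the sketch assumes that "attempts" counts 
the total number of tries, including the first one.

```python
import time
from typing import Callable


def retry_with_fixed_delay(
    cleanup: Callable[[], None],
    attempts: int,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call cleanup until it succeeds or the attempts are used up.

    Returns the number of the attempt that succeeded. If every attempt
    fails, the exception from the last attempt is re-raised.
    Hypothetical helper for illustration only; not part of Flink.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            cleanup()
            return attempt
        except Exception:
            # Give up only after the final attempt; otherwise wait the
            # same fixed delay before trying again.
            if attempt == attempts:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

With attempts=100 and delay_seconds=10, a cleanup that always fails (as with 
the missing delete permission here) would be tried 100 times, with a 10 second 
pause between tries, before the error is surfaced.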

However, the behaviour I observed is that the blob and jobgraph cleanup is 
attempted only once. After this failure, the logs show:

{{2022-03-21 09:34:40,634 INFO  
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap 
[] - Application completed SUCCESSFULLY
2022-03-21 09:34:40,635 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Shutting 
KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. 
Diagnostics null.}}

After this, the cluster receives a SIGTERM and exits.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
