[ 
https://issues.apache.org/jira/browse/FLINK-26772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mika Naylor updated FLINK-26772:
--------------------------------
    Description: 
We discovered that in Application Mode, when the application has completed, the 
cluster is shut down even if there are ongoing resource cleanup events 
happening in the background. For example, if HA cleanup fails, further retries 
are not attempted as the cluster is shut down before this can happen.

 

We should also add a flag for the shutdown that will prevent further jobs from 
being submitted.

  was:
I set up a scenario in which a k8s native cluster running in Application Mode 
used an S3 bucket for its high availability storage directory, with the Hadoop 
plugin. The credentials the cluster used give it permission to write to the 
bucket, but not delete, so cleaning up the blob/jobgraph will fail.

I expected that when trying to clean up the HA resources, it would attempt to 
retry the cleanup. I even configured this explicitly:

{{cleanup-strategy: fixed-delay}}
{{cleanup-strategy.fixed-delay.attempts: 100}}
{{cleanup-strategy.fixed-delay.delay: 10 s}}

However, the behaviour I observed is that the blob and jobgraph cleanup is 
attempted only once. After this failure, the logs show:

{{2022-03-21 09:34:40,634 INFO 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap 
[] - Application completed SUCCESSFULLY}}
{{2022-03-21 09:34:40,635 INFO 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting 
KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. 
Diagnostics null.}}

After which, the cluster receives a SIGTERM and exits.
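
To make the expected behaviour concrete, here is a minimal sketch of a shutdown 
that waits for a fixed-delay cleanup retry loop to finish and rejects new job 
submissions in the meantime. This is not Flink's actual code: the class 
CleanupAwareShutdown and the methods cleanupWithRetry, submitJob and shutDown 
are hypothetical names used only for illustration, and the cleanup action is a 
stand-in that fails twice before succeeding.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Flink.
public class CleanupAwareShutdown {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    /** Retries the cleanup with a fixed delay until it succeeds or attempts run out. */
    public CompletableFuture<Void> cleanupWithRetry(
            Supplier<Boolean> cleanup, int attempts, long delayMs) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(cleanup, attempts, delayMs, result);
        return result;
    }

    private void attempt(Supplier<Boolean> cleanup, int attemptsLeft, long delayMs,
                         CompletableFuture<Void> result) {
        boolean ok;
        try {
            ok = cleanup.get();
        } catch (RuntimeException e) {
            ok = false; // treat an exception like a failed attempt
        }
        if (ok) {
            result.complete(null);
        } else if (attemptsLeft <= 1) {
            result.completeExceptionally(new IllegalStateException("cleanup failed"));
        } else {
            // Schedule the next attempt after the fixed delay.
            scheduler.schedule(
                    () -> attempt(cleanup, attemptsLeft - 1, delayMs, result),
                    delayMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Returns false once shutdown has started, so no further jobs are accepted. */
    public boolean submitJob(String jobName) {
        return !shuttingDown.get();
    }

    /** Sets the shutdown flag, then stops the scheduler only after cleanup has finished. */
    public CompletableFuture<Void> shutDown(CompletableFuture<Void> cleanupFuture) {
        shuttingDown.set(true);
        return cleanupFuture
                .handle((ignored, error) -> null) // proceed whether cleanup succeeded or not
                .thenRun(scheduler::shutdown);
    }

    public static void main(String[] args) throws Exception {
        CleanupAwareShutdown cluster = new CleanupAwareShutdown();
        AtomicInteger calls = new AtomicInteger();
        // Stand-in cleanup that fails on the first two attempts.
        CompletableFuture<Void> cleanup =
                cluster.cleanupWithRetry(() -> calls.incrementAndGet() >= 3, 100, 10);
        cluster.shutDown(cleanup).get(5, TimeUnit.SECONDS);
        System.out.println("cleanup attempts before shutdown: " + calls.get());
    }
}
```

The point of the sketch is the ordering: the shutdown future is chained onto 
the cleanup future, so retries are not cut short, while the flag keeps new jobs 
out during that window.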


> Application Mode does not wait for job cleanup during shutdown
> --------------------------------------------------------------
>
>                 Key: FLINK-26772
>                 URL: https://issues.apache.org/jira/browse/FLINK-26772
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Mika Naylor
>            Priority: Critical
>         Attachments: testcluster-599f4d476b-bghw5_log.txt
>
>
> We discovered that in Application Mode, when the application has completed, 
> the cluster is shut down even if there are ongoing resource cleanup events 
> happening in the background. For example, if HA cleanup fails, further 
> retries are not attempted as the cluster is shut down before this can happen.
>  
> We should also add a flag for the shutdown that will prevent further jobs 
> from being submitted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
