Simply put, HA metadata will only be deleted when the job reaches a terminal state (either failed or cancelled). The reference doc is https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-strategies
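
For example, an exponential-delay setup along the following lines keeps the job restarting with growing backoff instead of running out of a fixed attempt budget. This is only a sketch — the values are placeholders to tune for your workload, not recommendations:

restart-strategy.type: exponential-delay
# placeholder values below — tune for your workload
restart-strategy.exponential-delay.initial-backoff: 10s
restart-strategy.exponential-delay.max-backoff: 2min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
restart-strategy.exponential-delay.reset-backoff-threshold: 10min
restart-strategy.exponential-delay.jitter-factor: 0.1

Also note that with your current fixed-delay setting, restart-strategy.fixed-delay.attempts defaults to 1 when not set, so the job only gets a single restart attempt before it is declared terminally failed and the HA metadata is cleaned up.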
Best,
Zhanghao Chen

________________________________
From: Chen Yang <chen.y...@doordash.com>
Sent: Thursday, February 6, 2025 6:55
To: Zhanghao Chen <zhanghao.c...@outlook.com>
Cc: user@flink.apache.org <user@flink.apache.org>; Vignesh Chandramohan <vignesh.chandramo...@doordash.com>; Allison Cheng <chen.ch...@doordash.com>
Subject: Re: Flink High Availability Data Cleanup

Hi Zhanghao,

Thanks for the quick response! My current restart strategy is fixed-delay with a 10-second delay, as shown below. I used the default restart strategy exponential-delay before, but saw high pressure on the Kafka cluster during incidents. Do you know how long Flink will retain the HA metadata? Is there any reference or configuration for this? I will try to use the default exponential-delay. Understanding the HA metadata lifecycle will enable me to fine-tune the backoff time, ensuring the job restarts promptly before reaching a terminal state.

Current setting:
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.delay: 10s

Thanks,
Chen

On Tue, Feb 4, 2025 at 5:14 PM Zhanghao Chen <zhanghao.c...@outlook.com> wrote:

Hi Yang,

When the job fails temporarily, e.g. due to a single machine failure, Flink will retain the HA metadata and try to recover. However, when the job has already reached the terminal failed status (controlled by the restart strategy [1]), Flink will delete all metadata and exit. In your case, you might want to revise the restart strategy of the job to avoid entering the terminal failed status too quickly.

The two options are apocryphal. Don't trust LLMs too much :)

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-strategies

Best,
Zhanghao Chen

________________________________
From: Chen Yang via user <user@flink.apache.org>
Sent: Wednesday, February 5, 2025 7:17
To: user@flink.apache.org <user@flink.apache.org>
Cc: Vignesh Chandramohan <vignesh.chandramo...@doordash.com>
Subject: Flink High Availability Data Cleanup

Hi Flink Community,

I'm running Flink jobs (standalone mode) with high availability in Kubernetes (Flink version 1.17.2). The job is deployed with two job managers. I noticed that the leader job manager would delete the ConfigMap when the job failed and restarted, so the standby job manager couldn't recover the jobId and checkpoint from the ConfigMap, and the job started with a fresh state. However, the Flink docs at https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up mention that the HA-related ConfigMaps would be retained and the job would recover from the checkpoints stored in the ConfigMaps. It looks like Flink doesn't work as described. Are there some configs to persist the ConfigMap when the job fails or restarts?

During my search via Google and ChatGPT, the following 2 configs were recommended to keep the ConfigMap during job cleanup. But I can't find these configurations mentioned in any Flink docs, nor in the Flink code. Please advise!
high-availability.cleanup-on-shutdown
kubernetes.jobmanager.cleanup-ha-metadata

Thanks,
Chen

--
Chen Yang
Software Engineer, Data Infrastructure
DoorDash.com<http://www.doordash.com/>
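
For reference, the Kubernetes HA setup described in this thread is typically enabled with settings along the following lines. This is only a minimal sketch: the storage path and cluster id are placeholder values, and older Flink versions use the key high-availability instead of high-availability.type.

high-availability.type: kubernetes
high-availability.storageDir: s3://my-bucket/flink/ha   # placeholder path
kubernetes.cluster-id: my-flink-cluster                 # placeholder id

The HA ConfigMaps are named after kubernetes.cluster-id and, as discussed above, are only removed once the job reaches a terminal state.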