Hi Zhanghao,

Thanks for the quick response! My current restart strategy is fixed-delay with a 10-second delay, as shown below. I had used the default exponential-delay restart strategy before, but saw high pressure on the Kafka cluster during incidents. Do you know how long Flink will retain the HA metadata? Is there any reference or configuration for this? I will try the default exponential-delay again; understanding the HA metadata lifecycle will let me fine-tune the backoff times so that the job restarts promptly before reaching a terminal state.
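For reference, here is a sketch of the exponential-delay settings I'd be tuning (key names taken from the Flink restart-strategy docs; the values below are illustrative placeholders, not recommendations):

```yaml
# flink-conf.yaml -- exponential-delay restart strategy (illustrative values)
restart-strategy.type: exponential-delay
# first restart happens quickly after a failure...
restart-strategy.exponential-delay.initial-backoff: 1s
# ...then the delay grows by this factor per consecutive failure, up to the cap
restart-strategy.exponential-delay.backoff-multiplier: 2.0
restart-strategy.exponential-delay.max-backoff: 5min
# a stable run of this length resets the backoff to initial-backoff again
restart-strategy.exponential-delay.reset-backoff-threshold: 1h
# adds random jitter to each delay to avoid synchronized restart storms
restart-strategy.exponential-delay.jitter-factor: 0.1
```

My thinking is that a longer max-backoff would ease the Kafka pressure during incidents, as long as the job never hits a terminal state before the HA metadata is cleaned up.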
Current setting:

restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.delay: 10s

Thanks,
Chen

On Tue, Feb 4, 2025 at 5:14 PM Zhanghao Chen <zhanghao.c...@outlook.com> wrote:

> Hi Yang,
>
> When the job fails temporarily, e.g. due to a single-machine failure, Flink
> will retain the HA metadata and try to recover. However, once the job has
> reached the terminal failed status (controlled by the restart strategy [1]),
> Flink will delete all HA metadata and exit. In your case, you might want to
> revise the job's restart strategy so that it does not enter the terminal
> failed status too quickly.
>
> The two options are apocryphal. Don't trust LLMs too much :)
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-strategies
>
> Best,
> Zhanghao Chen
> ------------------------------
> *From:* Chen Yang via user <user@flink.apache.org>
> *Sent:* Wednesday, February 5, 2025 7:17
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Cc:* Vignesh Chandramohan <vignesh.chandramo...@doordash.com>
> *Subject:* Flink High Availability Data Cleanup
>
> Hi Flink Community,
>
> I'm running Flink jobs (standalone mode) with high availability in
> Kubernetes (Flink version 1.17.2). The job is deployed with two job
> managers. I noticed that the leader job manager deletes the ConfigMap when
> the job fails and restarts, so the standby job manager cannot recover the
> jobId and checkpoint from the ConfigMap, and the job starts with a fresh
> state. However, the Flink docs
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up
> state that HA-related ConfigMaps are retained and the job is recovered from
> the checkpoints stored in them. It looks like Flink does not work as
> described. Are there any configs to persist the ConfigMap when the job
> fails or restarts?
> During my search via Google and ChatGPT, they recommended the following two
> configs to keep the ConfigMap during job cleanup. But I can't find these
> configurations mentioned anywhere in the Flink docs or in the Flink code.
> Please advise!
>
> high-availability.cleanup-on-shutdown
>
> or
>
> kubernetes.jobmanager.cleanup-ha-metadata
>
> Thanks,
> Chen
>
> --
> Chen Yang
> Software Engineer, Data Infrastructure
> DoorDash.com <http://www.doordash.com/>

--
Chen Yang
Software Engineer, Data Infrastructure
DoorDash.com <http://www.doordash.com/>