Hi Zhanghao,

Thanks for the quick response! My current restart strategy is fixed-delay
with a 10-second delay, as shown below. I previously used the default
exponential-delay restart strategy, but saw high pressure on the Kafka
cluster during incidents. Do you know how long Flink will retain the HA
metadata? Is there any reference or configuration for this? I will try the
default exponential-delay again. Understanding the HA metadata lifecycle
will help me fine-tune the backoff time so that the job restarts promptly
before reaching a terminal state.

Current setting:
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.delay: 10s
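
For reference, here is the exponential-delay configuration I plan to try
(the option names are from the Flink restart-strategy docs; the values are
illustrative, not tuned):

restart-strategy.type: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 1s
restart-strategy.exponential-delay.max-backoff: 1min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
restart-strategy.exponential-delay.reset-backoff-threshold: 1h
restart-strategy.exponential-delay.jitter-factor: 0.1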

Thanks,
Chen

On Tue, Feb 4, 2025 at 5:14 PM Zhanghao Chen <zhanghao.c...@outlook.com>
wrote:

> Hi Yang,
>
> When the job failed temporarily, e.g. due to single machine failure, Flink
> will retain the HA metadata and try to recover. However, when the job has
> already reached the terminal failed status (controlled by the restart
> strategy [1]), Flink will delete all metadata and exit. In your case, you
> might want to revise the restart strategy of the job to avoid entering the
> terminal failed status too quickly.
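>
> For example (illustrative values; see [1] for the exact option names),
> with fixed-delay the number of restart attempts before the job reaches
> the terminal failed status is bounded:
>
> restart-strategy.type: fixed-delay
> restart-strategy.fixed-delay.attempts: 3
> restart-strategy.fixed-delay.delay: 10s
>
> Once those 3 attempts are exhausted, the job transitions to FAILED and
> the HA metadata is cleaned up. Raising the attempt count (or using
> exponential-delay) keeps the job out of the terminal state for longer.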
>
> The two options you found do not exist in Flink; they appear to be
> hallucinated. Don't trust LLMs too much :)
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-strategies
>
> Best,
> Zhanghao Chen
> ------------------------------
> *From:* Chen Yang via user <user@flink.apache.org>
> *Sent:* Wednesday, February 5, 2025 7:17
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Cc:* Vignesh Chandramohan <vignesh.chandramo...@doordash.com>
> *Subject:* Flink High Availability Data Cleanup
>
> Hi Flink Community,
>
> I'm running the Flink jobs (standalone mode) with high availability in
> Kubernetes (Flink version 1.17.2). The job is deployed with two job
> managers. I noticed that the leader job manager would delete the ConfigMap
> when the job failed and restarted. Thus the standby job manager couldn't
> recover the jobId and checkpoint from the ConfigMap. And the job started
> with a fresh state. While from the Flink docs
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up,
> it mentions that HA related ConfigMaps would be retained and job would
> recovered from the checkpoints stored in the ConfigMaps. Looks like the
> Flink doesn't work as described. Are there some configs to persist the
> configmap when the job fails or restarts?
>
> During my searches via Google and ChatGPT, the following two configs
> were recommended for keeping the ConfigMap during job cleanup. But I
> can't find these configurations mentioned in any Flink docs or in the
> Flink code. Please advise!
>
> high-availability.cleanup-on-shutdown
>
> or
>
> kubernetes.jobmanager.cleanup-ha-metadata
>
> Thanks,
> Chen
>
>
>
> --
>
> Chen Yang
> Software Engineer, Data Infrastructure
>
> DoorDash.com <http://www.doordash.com/>
>


