[ https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777757#comment-17777757 ]
dongwoo.kim commented on FLINK-33324:
-------------------------------------

Hi, [~roman]

Thanks for considering my suggestion.
For some background, we use the failure-rate restart strategy and a monitoring cronJob to manage our Flink application.
If a long-hanging restore operation is marked as a failure, a retry can be initiated.
And if the retries don't resolve the issue, the job should ultimately fail.
This way, the cronJob monitoring the Flink application can quickly detect the failure and redeploy the job from its last state.
(In this scenario a new task manager pod is created, so this specific issue could be solved.)

As you pointed out, introducing a timeout might have various side effects.
Whether one can tolerate a lengthy restore or prefers quicker retries and redeployments might vary based on operational needs.
What if we make this an optional feature?
By default, there would be no timeout, but developers could configure it if desired.

Thanks in advance.

> Add flink managed timeout mechanism for backend restore operation
> -----------------------------------------------------------------
>
> Key: FLINK-33324
> URL: https://issues.apache.org/jira/browse/FLINK-33324
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / State Backends
> Reporter: dongwoo.kim
> Priority: Minor
> Attachments: image-2023-10-20-15-16-53-324.png, image-2023-10-20-17-42-11-504.png
>
>
> Hello community, I would like to share an issue our team recently faced and propose a feature to mitigate similar problems in the future.
> h2. Issue
> Our Flink streaming job encountered consecutive checkpoint failures and subsequently attempted a restart.
> This failure occurred due to timeouts in two subtasks located within the same task manager.
> The restore operation for this particular task manager also got stuck, resulting in an "initializing" state lasting over an hour.
> Once we realized the restore operation had hung, we terminated the task manager pod, which resolved the issue.
> !image-2023-10-20-15-16-53-324.png|width=683,height=604!
> The sequence of events was as follows:
> 1. Checkpoint timed out for subtasks within the task manager, referred to as tm-32.
> 2. The Flink job failed and initiated a restart.
> 3. Restoration was successful for 282 subtasks, but got stuck for the 2 subtasks in tm-32.
> 4. While the Flink tasks weren't fully in the running state, checkpointing was still being triggered, leading to consecutive checkpoint failures.
> 5. These checkpoint failures seemed to be ignored and did not count toward the execution.checkpointing.tolerable-failed-checkpoints configuration.
> As a result, the job remained in the initialization phase for a very long period.
> 6. Once we found this, we terminated the tm-32 pod, leading to a successful Flink job restart.
> h2. Suggestion
> I feel that a Flink job remaining in the initializing state indefinitely is not ideal.
> To enhance resilience, I think it would be helpful if we could add a timeout feature for the restore operation.
> If the restore operation exceeds a specified duration, an exception should be thrown, causing the job to fail.
> This way, we can address restore-related issues similarly to how we handle checkpoint failures.
> h2. Notes
> Just to add, I've made a basic version of this feature to see if it works as expected.
> I've attached a picture from the Flink UI that shows the timeout exception raised during the restore operation.
> It's just a start, but I hope it helps with our discussion.
> (I've simulated network chaos using the [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts] chaos engineering tool.)
> !image-2023-10-20-17-42-11-504.png|width=940,height=317!
>
> Thank you for considering my proposal. I'm looking forward to hearing your thoughts.
> If there's agreement on this, I'd be happy to work on implementing this feature.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
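
For readers of this thread, the optional restore timeout proposed above might look roughly like the following. This is only a minimal sketch in Python for brevity. It is not Flink code and not the prototype mentioned in the issue, and every name in it (`run_restore`, `RestoreTimeoutError`, `timeout_seconds`, `slow_restore`) is hypothetical. It assumes the restore can be modelled as a plain callable, that no timeout (`None`) is the default, and that exceeding the configured duration raises an exception the caller can treat as a failure.

```python
import concurrent.futures
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RestoreTimeoutError(Exception):
    """Hypothetical exception: the restore exceeded its configured duration."""


def run_restore(restore: Callable[[], T],
                timeout_seconds: Optional[float] = None) -> T:
    """Run `restore` and fail if it takes longer than `timeout_seconds`.

    `timeout_seconds=None` (the default) waits indefinitely, which mirrors
    the "no timeout unless configured" behaviour suggested in the comment.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(restore)
    try:
        # Blocks until the restore finishes or the timeout elapses.
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError as exc:
        raise RestoreTimeoutError(
            f"restore did not finish within {timeout_seconds} s"
        ) from exc
    finally:
        # Do not wait for a worker that may still be stuck.
        executor.shutdown(wait=False)


def slow_restore() -> str:
    """Stand-in for a restore that hangs (here: only for half a second)."""
    time.sleep(0.5)
    return "restored"
```

With these assumptions, `run_restore(slow_restore)` waits and returns normally, while `run_restore(slow_restore, timeout_seconds=0.05)` raises `RestoreTimeoutError`, which a restart strategy could then count as a failure. Note that the sketch only stops waiting: the worker thread keeps running after the timeout, so a real implementation would also have to cancel or interrupt the stuck restore.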