[ https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780439#comment-17780439 ]
dongwoo.kim commented on FLINK-33324: ------------------------------------- Thanks for the opnion [~roman] , and sorry for the late reply. I've been thinking about ways to detect the slow restore issue in Flink, and here are a couple of methods I've considered: *1. Alerts Based on Checkpoint Failures.* Since flink tries to trigger checkpoint even though all the subtasks are not in running state it always fails. While this indicates a possible issue, it doesn't specifically tell us if it's due to slow recovery *2. Using the REST API* Another approach could be to leveraging Flink's REST API to check the status of subtasks. I found that even if some subtasks are initializing state the *"/jobs/:jobid/status"* endpoint often shows them as {*}running{*}. This was unexpected. However, we can look at the *"/jobs/:jobid/vertices/:vertexid/taskmanagers"* endpoint to identify subtasks that are in the *initializing* state. A limitation here is that this endpoint just shows the current state and doesn't provide the duration for which a subtask has been in that state. So, some additional logic might be needed to track how long subtasks have been 'stuck'. Considering the above, if we're looking to enhance Flink's observability, what about introducing a metric that shows restore time? Perhaps something named {*}lastDurationOfRestore{*}? WDYT? Thanks. > Add flink managed timeout mechanism for backend restore operation > ----------------------------------------------------------------- > > Key: FLINK-33324 > URL: https://issues.apache.org/jira/browse/FLINK-33324 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing, Runtime / State Backends > Reporter: dongwoo.kim > Priority: Minor > Attachments: image-2023-10-20-15-16-53-324.png, > image-2023-10-20-17-42-11-504.png > > > Hello community, I would like to share an issue our team recently faced and > propose a feature to mitigate similar problems in the future. > h2. Issue > Our Flink streaming job encountered consecutive checkpoint failures and > subsequently attempted a restart. > This failure occurred due to timeouts in two subtasks located within the same > task manager. > The restore operation for this particular task manager also got stuck, > resulting in an "initializing" state lasting over an hour. > Once we realized the hang during the restore operation, we terminated the > task manager pod, resolving the issue. > !image-2023-10-20-15-16-53-324.png|width=683,height=604! > The sequence of events was as follows: > 1. Checkpoint timed out for subtasks within the task manager, referred to as > tm-32. > 2. The Flink job failed and initiated a restart. > 3. Restoration was successful for 282 subtasks, but got stuck for the 2 > subtasks in tm-32. > 4. While the Flink tasks weren't fully in running state, checkpointing was > still being triggered, leading to consecutive checkpoint failures. > 5. These checkpoint failures seemed to be ignored, and did not count to the > execution.checkpointing.tolerable-failed-checkpoints configuration. > As a result, the job remained in the initialization phase for very long > period. > 6. Once we found this, we terminated the tm-32 pod, leading to a successful > Flink job restart. > h2. Suggestion > I feel that, a Flink job remaining in the initializing state indefinitely is > not ideal. > To enhance resilience, I think it would be helpful if we could add timeout > feature for restore operation. > If the restore operation exceeds a specified duration, an exception should be > thrown, causing the job to fail. > This way, we can address restore-related issues similarly to how we handle > checkpoint failures. > h2. Notes > Just to add, I've made a basic version of this feature to see if it works as > expected. > I've attached a picture from the Flink UI that shows the timeout exception > happened during restore operation. > It's just a start, but I hope it helps with our discussion. > (I've simulated network chaos, using > [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts] > chaos engineering tool.) > !image-2023-10-20-17-42-11-504.png|width=940,height=317! > > Thank you for considering my proposal. I'm looking forward to hear your > thoughts. > If there's agreement on this, I'd be happy to work on implementing this > feature. -- This message was sent by Atlassian Jira (v8.20.10#820010)