[ 
https://issues.apache.org/jira/browse/FLINK-36451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895936#comment-17895936
 ] 

Matthias Pohl edited comment on FLINK-36451 at 11/27/24 2:33 PM:
-----------------------------------------------------------------

The problem is that we have two code paths running concurrently (in the 
contender's main thread and in the main thread of the 
{{DefaultLeaderElectionService}}) that acquire the same locks in reverse 
order (a minimal sketch follows the list):
* Certain operations in the contender (the DispatcherLeaderProcess in 
FLINK-36451 and the JobMasterServiceLeadershipRunner) need to be called only 
while being the leader. These operations are executed from within the 
contender and acquire the contender's lock first before acquiring the 
LeaderElectionService lock (to check for leadership).
* Meanwhile, any leadership change event acquires the same locks in reverse 
order: first, the DefaultLeaderElectionService lock is taken, and then the 
registered contenders are informed. Notifying the contenders requires 
acquiring their locks as part of the process.
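Below is a minimal, self-contained sketch of that lock-order inversion. The 
class, field, and method names ({{contenderLock}}, {{serviceLock}}, 
{{runIfLeader}}, ...) are placeholders for illustration only, not the actual 
Flink implementation; the point is merely that the two monitors are nested in 
opposite order on the two threads, so once each thread holds its first lock, 
neither can make progress.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified stand-in for the contender/DefaultLeaderElectionService pair.
public class LockInversionSketch {
    private final Object contenderLock = new Object(); // e.g. the DispatcherLeaderProcess lock
    private final Object serviceLock = new Object();   // the DefaultLeaderElectionService lock

    // Path 1 (contender thread): contender lock first, then the service lock
    // for the leadership check.
    void runIfLeader(Runnable operation) {
        synchronized (contenderLock) {
            synchronized (serviceLock) { // "hasLeadership" check
                operation.run();
            }
        }
    }

    // Path 2 (leader election thread): service lock first, then the contender
    // lock to notify the contender about the leadership change.
    void onLeadershipRevoked() {
        synchronized (serviceLock) {
            synchronized (contenderLock) {
                // inform the contender about the lost leadership
            }
        }
    }

    public static void main(String[] args) {
        LockInversionSketch sketch = new LockInversionSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> { for (int i = 0; i < 100_000; i++) sketch.runIfLeader(() -> {}); });
        pool.execute(() -> { for (int i = 0; i < 100_000; i++) sketch.onLeadershipRevoked(); });
        pool.shutdown(); // may never terminate once both threads block on their second lock
    }
}
{code}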

With unstable leadership happening quite often, the chance of running into a 
deadlock increases. This issue affects not only the Kubernetes but also the 
ZooKeeper leader election.

The solution should be to run any leader-election-related operations solely on 
the {{DefaultLeaderElectionService}} main thread to ensure sequential 
execution. This allows us to remove the lock within the 
{{DefaultLeaderElectionService}}. We have to ensure, though, that the 
operations performed on the {{DefaultLeaderElectionService}} main thread are 
lightweight (similar to the requirements for Flink's {{RpcEndpoint}} main 
thread).
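As an illustration of that direction (assumed and simplified API, not the 
actual {{DefaultLeaderElectionService}} code), the idea is to funnel all 
leadership-related state changes and checks through a single-threaded 
executor, which removes the need for the service-side lock and with it the 
lock-order inversion:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: names and structure are placeholders, not the Flink classes.
public class SingleThreadedLeaderElectionSketch {
    // The "main thread" of the leader election service.
    private final ExecutorService leaderOperationsExecutor =
            Executors.newSingleThreadExecutor();

    // Only ever read/written on the executor thread, so no lock is needed.
    private boolean hasLeadership;

    void onGrantLeadership() {
        leaderOperationsExecutor.execute(() -> {
            hasLeadership = true;
            // contender.grantLeadership(...) - must be lightweight/non-blocking
        });
    }

    void onRevokeLeadership() {
        leaderOperationsExecutor.execute(() -> {
            hasLeadership = false;
            // contender.revokeLeadership() - must be lightweight/non-blocking
        });
    }

    // Leadership checks are answered on the same thread instead of under a lock.
    CompletableFuture<Boolean> hasLeadershipAsync() {
        return CompletableFuture.supplyAsync(() -> hasLeadership, leaderOperationsExecutor);
    }
}
{code}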


> Kubernetes Application JobManager Potential Deadlock and TaskManager Pod 
> Residuals
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-36451
>                 URL: https://issues.apache.org/jira/browse/FLINK-36451
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.1
>         Environment: * Flink version: 1.19.1
>  * Deployment mode: Flink Kubernetes Application Mode
>  * JVM version: OpenJDK 17
>  
>            Reporter: xiechenling
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 1.png, 2.png, jobmanager.log, jstack.txt
>
>
> In Kubernetes Application Mode, when there is significant etcd latency or 
> instability, the Flink JobManager may enter a deadlock situation. 
> Additionally, TaskManager pods are not cleaned up properly, resulting in 
> stale resources that prevent the Flink job from recovering correctly. This 
> issue occurs during frequent service restarts or network instability.


