[ 
https://issues.apache.org/jira/browse/KAFKA-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624155#comment-15624155
 ] 

huxi commented on KAFKA-4360:
-----------------------------

Excellent analysis! What intrigues me is whether this is a deadlock or a 
liveness issue. Here is my analysis:
1. Say at time T1 the ZooKeeper session expires, so the 'handleNewSession' 
method of SessionExpirationListener is executed and acquires the controller 
lock (controllerContext.controllerLock).
2. It then invokes 'onControllerResignation' to make the current controller 
resign, which shuts down the leader rebalance scheduler by calling 
KafkaScheduler.shutdown.
3. 'shutdown' shuts down the ScheduledThreadPoolExecutor and then blocks 
until all tasks have completed execution.
4. If any task was submitted before shutdown was called, the check-imbalance 
thread starts by checking isActive, which acquires the controller lock at the 
very beginning, so it blocks almost immediately because the lock is already 
held by the main thread.
5. In that case the main thread blocks in onControllerResignation until the 
wait times out (one day by default) or the check thread is interrupted.
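
The five steps above can be sketched in plain Java. This is only an 
illustration under stated assumptions, not Kafka's actual code: the class 
name, the method name, the stand-in lock and the short timeout are all 
hypothetical, and the empty task body merely imitates the isActive check.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names come from the Kafka code base.
class ControllerLockSketch {
    // Stand-in for controllerContext.controllerLock.
    static final ReentrantLock controllerLock = new ReentrantLock();

    // Returns true if the scheduler terminated within timeoutMs while the
    // calling thread was still holding the lock.
    static boolean resignWhileHoldingLock(long timeoutMs) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        CountDownLatch taskStarted = new CountDownLatch(1);
        // Imitates the check-imbalance task: it needs the lock first (step 4).
        Runnable checkTask = () -> {
            taskStarted.countDown();
            controllerLock.lock();
            controllerLock.unlock();
        };
        controllerLock.lock(); // step 1: the session-expiry handler takes the lock
        try {
            scheduler.schedule(checkTask, 0, TimeUnit.MILLISECONDS);
            taskStarted.await(); // make sure the task is running before shutdown
            scheduler.shutdown(); // steps 2 and 3: shut down, then wait for tasks
            return scheduler.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            controllerLock.unlock(); // only now can the task make progress
        }
    }

    public static void main(String[] args) {
        boolean terminated = resignWhileHoldingLock(200);
        System.out.println("terminated while lock was held: " + terminated);
    }
}
```

In this sketch the wait is bounded, so the caller eventually gives up, 
releases the lock and the task completes. With a wait as long as the one 
described in step 5, the caller would appear hung for that whole period.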

Does that make sense?


> Controller may deadLock when autoLeaderRebalance encounter zk expired
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-4360
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4360
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Json Tu
>              Labels: bugfix
>         Attachments: yf-mafka2-common02_jstack.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the controller has a checkAndTriggerPartitionRebalance task in 
> autoRebalanceScheduler and the zk session expires at that time, it will 
> run into a deadlock.
> The scenario can be reconstructed as follows: when the zk session expires, 
> the zk thread calls handleNewSession, which is defined in 
> SessionExpirationListener, and acquires controllerContext.controllerLock. 
> It then calls autoRebalanceScheduler.shutdown(), which needs to complete 
> all the tasks in autoRebalanceScheduler, but that thread pool also needs 
> to acquire controllerContext.controllerLock, which is already held by the 
> zk callback thread, so it runs into a deadlock.
> This causes at least two problems. First, the broker's id cannot be 
> registered in zookeeper, so the broker will be considered dead by the new 
> controller. Second, the process cannot be stopped by kafka-server-stop.sh, 
> because the shutdown function also cannot acquire 
> controllerContext.controllerLock; we cannot shut down kafka except by 
> using kill -9.
> In my attachment I uploaded a jstack file, which was created when my kafka 
> process could not be shut down by kafka-server-stop.sh.
> I have hit this scenario several times; I think this may be an unresolved 
> bug in kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
