Peter Nahas created KAFKA-12679:
-----------------------------------

             Summary: Rebalancing a restoring or running task may cause 
directory livelocking with newly created task
                 Key: KAFKA-12679
                 URL: https://issues.apache.org/jira/browse/KAFKA-12679
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.6.1
         Environment: Broker and client version 2.6.1
Multi-node broker cluster
Multi-node, auto scaling streams app instances

            Reporter: Peter Nahas
         Attachments: Backoff-between-directory-lock-attempts.patch

If a task that uses a state store is in the restoring state or in a running 
state and the task gets rebalanced to a separate thread on the same instance, 
the newly created task will attempt to lock the state store director while the 
first thread is continuing to use it. This is totally normal and expected 
behavior when the first thread is not yet aware of the rebalance. However, that 
newly created task is effectively running a while loop with no backoff waiting 
to lock the directory:
 # TaskManager tells the task to restore in `tryToCompleteRestoration`
 # The task attempts to lock the directory
 # The lock attempt fails and throws a 
`org.apache.kafka.streams.errors.LockException`
 # TaskManager catches the exception, stops further processing on the task and 
reports that not all tasks have restored
 # The StreamThread `runLoop` continues to run.

I've seen some documentation indicate that there is supposed to be a backoff 
when this condition occurs, but there does not appear to be any in the code. 
The result is that if this goes on for long enough, the lock-loop may dominate 
CPU usage in the process and starve out the old stream thread task processing.

 

When in this state, the DEBUG level logging for TaskManager will produce a 
steady stream of messages like the following:
{noformat}
2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager     
            : stream-thread [StreamThread-10] Could not initialize 0_34 due to 
the following exception; will retry
org.apache.kafka.streams.errors.LockException: stream-thread [StreamThread-10] 
standby-task [0_34] Failed to lock the state directory for task 0_34
{noformat}
 

 

I've attached a git formatted patch to resolve the issue. Simply detect the 
scenario and sleep for the backoff time in the appropriate StreamThread.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to