[ 
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630754#comment-17630754
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12679:
------------------------------------------------

Ah ok, thanks for the update. Regarding the retry.backoff.ms config: that's, 
technically speaking, a client config, meaning it's intended to give Streams 
users control over the backoff between requests in the embedded consumer, 
producer, or admin client(s). This is what the Streams docs for that config 
mean by "The amount of time in milliseconds, before a request is retried" – 
it's not a universal config for backing off any and every operation within 
Streams.
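
To make that concrete, here's a minimal sketch of what I mean – the app id and 
bootstrap server are just placeholders. retry.backoff.ms set on the StreamsConfig 
is forwarded to the embedded clients, and the client prefix helpers can scope it 
to a single client type if you only want to affect, say, the consumer:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.streams.StreamsConfig;

public class RetryBackoffExample {
    public static void main(String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // forwarded to the embedded consumer, producer, and admin clients:
        // controls the delay between retried *client requests*, not Streams-internal loops
        props.put(CommonClientConfigs.RETRY_BACKOFF_MS_CONFIG, 500L);

        // or scope the override to just one client type via the config prefixes
        props.put(StreamsConfig.consumerPrefix(CommonClientConfigs.RETRY_BACKOFF_MS_CONFIG), 1000L);
    }
}
{code}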

There is one case, however, where it's currently used by Streams itself rather 
than just passed through to the client configuration: when the assignor is 
validating and/or setting up internal topics during a rebalance. This is still 
kind of a client/request config since it controls how long to wait between 
various admin calls, but my point is that it's not entirely out of the question 
to reuse this config for other things within Streams. Still, I would probably 
advocate for adding a new config for backing off something like this that's 
completely unrelated to any client requests.

All that is to say, we could probably improve the docs to clarify what this 
config actually controls, and possibly introduce a new config to address things 
like the busy loop while initializing/locking a task. I'll take a look at the 
current code on trunk to see if we've improved things since 2.6, the version 
[~divijvaidya] was experiencing this on. Honestly, it's probably sufficient to 
just hard-code a short sleep rather than bother with a configurable backoff.
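
Just to illustrate the idea – this is a rough sketch, not the actual 
TaskManager/StreamThread code, and the method names and the 100 ms value are 
placeholders:

{code:java}
import org.apache.kafka.streams.errors.LockException;

public class LockBackoffSketch {
    private static final long LOCK_BACKOFF_MS = 100L; // hard-coded short sleep, as suggested above

    // tryToLockTaskDirectory stands in for whatever initialization step acquires the state dir lock
    boolean tryInitialize(final Runnable tryToLockTaskDirectory) throws InterruptedException {
        try {
            tryToLockTaskDirectory.run();
            return true;
        } catch (final LockException e) {
            // another StreamThread on the same instance still holds the directory lock;
            // back off briefly so the retry loop doesn't dominate CPU while we wait for it
            Thread.sleep(LOCK_BACKOFF_MS);
            return false;
        }
    }
}
{code}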

 

By the way, there is actually ongoing work to move the restoration process to a 
separate thread, which will presumably include this part where we try to lock 
the task. cc [~cadonna] – in case this isn't on your radar already, we should 
try to do this in a better way in the new restoration architecture.

> Rebalancing a restoring or running task may cause directory livelocking with 
> newly created task
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12679
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.1
>         Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
>            Reporter: Peter Nahas
>            Priority: Major
>             Fix For: 3.4.0
>
>         Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running 
> state and the task gets rebalanced to a separate thread on the same instance, 
> the newly created task will attempt to lock the state store directory while 
> the first thread is continuing to use it. This is totally normal and expected 
> behavior when the first thread is not yet aware of the rebalance. However, 
> that newly created task is effectively running a while loop with no backoff 
> waiting to lock the directory:
>  # TaskManager tells the task to restore in `tryToCompleteRestoration`
>  # The task attempts to lock the directory
>  # The lock attempt fails and throws a 
> `org.apache.kafka.streams.errors.LockException`
>  # TaskManager catches the exception, stops further processing on the task 
> and reports that not all tasks have restored
>  # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicate that there is supposed to be a backoff 
> when this condition occurs, but there does not appear to be any in the code. 
> The result is that if this goes on for long enough, the lock-loop may 
> dominate CPU usage in the process and starve out the old stream thread task 
> processing.
>  
> When in this state, the DEBUG level logging for TaskManager will produce a 
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager   
>               : stream-thread [StreamThread-10] Could not initialize 0_34 due 
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread 
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for 
> task 0_34
> {noformat}
>  
>  
> I've attached a git formatted patch to resolve the issue. Simply detect the 
> scenario and sleep for the backoff time in the appropriate StreamThread.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
