Hi John,

I see in https://github.com/apache/kafka/pull/3653 that there is discussion around the swallowing of LockException and the missing retry, but dguy replied saying that "The retry doesn't happen in this block of code. It will happen the next time the runLoop executes."

However, the state of the thread is changed to RUNNING, so updateNewAndRestoringTasks won't be called again inside runOnce of StreamThread. At the end of TaskManager#updateNewAndRestoringTasks there is an if condition which checks whether all active tasks are running. Do you think we should change it from if (active.allTasksRunning()) { ... } to if (active.allTasksRunning() && standby.allTasksRunning()) { ... }?
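To make the suggestion concrete, here is roughly what I have in mind (just a sketch based on my reading of the 2.3 code, not the actual source, and the comments are my understanding of the behaviour):

    boolean updateNewAndRestoringTasks() {
        active.initializeNewTasks();   // the LockException gets swallowed in here,
        standby.initializeNewTasks();  // leaving the standby task stuck in CREATED
        // ... restore changelogs for the active tasks ...

        // today only the active tasks gate the transition; the proposal is to also
        // wait for the standbys, so the thread stays out of RUNNING and the next
        // runLoop iteration retries the standby initialization and re-takes the lock
        if (active.allTasksRunning() && standby.allTasksRunning()) {
            return true;   // StreamThread#runOnce then moves the thread to RUNNING
        }
        return false;
    }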
Thanks,
Giridhar.

On 2019/11/15 03:09:17, Navinder Brar <navinder_b...@yahoo.com.INVALID> wrote:
> Hi John,
> Thanks for the response. Yeah, by "marked for deletion" I meant the unlocking
> of the store (by which, in a way, it is marked for deletion). From what I have
> seen, the standby task gets stuck in the Created state, doesn't move to Running,
> and is not able to recreate the directory. Also, the point is not just that.
> With the new KIP to support serving from replicas we want very little downtime
> on replicas, and in this case we already have a completely built state directory
> which is getting deleted just because of the assignment change on the thread
> (the host is still the same). We have StreamsMetadataState#allMetadata(), which
> would be common for all threads of all instances. Can't we have a conditional
> check during unlocking which checks allMetadata and finds out that the partition
> we are about to unlock is assigned to this host (we don't care which thread of
> this host), and then we don't unlock the task; meanwhile Stream Thread-2 will
> take the lock on its own when it moves to Running.
> Thanks,
> Navinder
>
> On Friday, 15 November, 2019, 02:55:40 am IST, John Roesler
> <j...@confluent.io> wrote:
>
> Hey Navinder,
>
> I think what's happening is a little different. Let's see if my
> worldview also explains your experiences.
>
> There is no such thing as "mark for deletion". When a thread loses a
> task, it simply releases its lock on the directory. If no one else on
> the instance claims that lock within `state.cleanup.delay.ms`
> milliseconds, then the state cleaner will itself grab the lock and
> delete the directory. On the other hand, if another thread (or the
> same thread) gets the task back and claims the lock before the
> cleaner, it will be able to re-open the store and use it.
>
> The default for `state.cleanup.delay.ms` is 10 minutes, which is
> actually short enough that it could pass during a single rebalance (if
> Streams starts recovering a lot of state). I recommend you increase
> `state.cleanup.delay.ms` by a lot, like maybe set it to one hour.
>
> One thing I'm curious about... You didn't mention if Thread-2
> eventually is able to re-create the state directory (after the cleaner
> is done) and transition to RUNNING. This should be the case. If not, I
> would consider it a bug.
>
> Thanks,
> -John
>
> On Thu, Nov 14, 2019 at 3:02 PM Navinder Brar
> <navinder_b...@yahoo.com.invalid> wrote:
> >
> > Hi,
> > We are facing a peculiar situation in the 2.3 version of Kafka Streams.
> > First of all, I want to clarify whether it is possible that a Stream Thread
> > (say Stream Thread-1) which had got an assignment for a standby task (say 0_0)
> > can change to Stream Thread-2 on the same host post rebalancing. This is
> > happening for us, and post rebalancing, since Stream Thread-1 had 0_0 and it
> > is not assigned back to it, it closes that task and marks it for deletion
> > (after the cleanup delay time), and meanwhile the task gets assigned to
> > Stream Thread-2. When Stream Thread-2 tries to transition this task to
> > Running, it gets a LockException which is caught in
> > AssignedTasks#initializeNewTasks(). This makes 0_0 stay in the Created state
> > on Stream Thread-2, and after the cleanup delay is over, the task directories
> > for 0_0 get deleted.
> > Can someone please comment on this behavior?
> > Thanks,
> > Navinder
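P.S. For anyone else who hits this before a proper fix is in: John's suggestion of raising state.cleanup.delay.ms is just a config change. Something like the following (one hour, as he suggested; the rest of the Streams setup is omitted) keeps the state cleaner away from the task directory for much longer than a typical rebalance:

    final Properties props = new Properties();
    // ... application.id, bootstrap.servers and the rest of the Streams config ...
    // default is 600000 (10 minutes); raise it so the cleaner does not grab the
    // lock and delete the standby's directory while the task is stuck in CREATED
    props.put("state.cleanup.delay.ms", "3600000"); // 1 hour
    final KafkaStreams streams = new KafkaStreams(topology, props);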