Hi,
So far in our experiment we have encountered 2 critical bugs.
1. If a thread takes more that MAX_POLL_INTERVAL_MS_CONFIG to compute a
cycle it gets evicted from group and rebalance takes place and it gets new
assignment.
However when this thread tries to commit offsets for the revoked partitions
in
onPartitionsRevoked it will again throw the CommitFailedException.

This gets handled by ConsumerCoordinatorso there is no point to assign this
exception to
rebalanceException in StreamThread and stop it. It has already been
assigned new partitions and it can continue.

So as fix in case on CommitFailedException I am not killing the StreamThrea.

2. Next we see a deadlock state when to process a task it takes longer
than MAX_POLL_INTERVAL_MS_CONFIG
time. Then this threads partitions are assigned to some other thread
including rocksdb lock. When it tries to process the next task it cannot
get rocks db lock and simply keeps waiting for that lock forever.

in retryWithBackoff for AbstractTaskCreator we have a backoffTimeMs = 50L.
If it does not get lock the we simply increase the time by 10x and keep
trying inside the while true loop.

We need to have a upper bound for this backoffTimeM. If the time is greater
than  MAX_POLL_INTERVAL_MS_CONFIG and it still hasn't got the lock means
this thread's partitions are moved somewhere else and it may not get the
lock again.

So I have added an upper bound check in that while loop.

The commits are here:
https://github.com/sjmittal/kafka/commit/6f04327c890c58cab9b1ae108af4ce5c4e3b89a1

please review and if you feel they make sense, please merge it to main
branch.

Thanks
Sachin

Reply via email to