[ https://issues.apache.org/jira/browse/KAFKA-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
A. Sophie Blee-Goldman resolved KAFKA-10793.
--------------------------------------------
    Fix Version/s: 2.7.1
       Resolution: Fixed

> Race condition in FindCoordinatorFuture permanently severs connection to group coordinator
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10793
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10793
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, streams
>    Affects Versions: 2.5.0
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Critical
>             Fix For: 2.8.0, 2.7.1
>
>
> Pretty much as soon as we started actively monitoring the _last-rebalance-seconds-ago_ metric in our Kafka Streams test environment, we started seeing something weird. Every so often one of the StreamThreads (i.e. a single Consumer instance) would appear to permanently fall out of the group, as evidenced by a monotonically increasing _last-rebalance-seconds-ago_. We inject artificial network failures every few hours at most, so the group rebalances quite often. But the one consumer never rejoined, with no other symptoms (besides a slight drop in throughput, since the remaining threads had to take over this member's work). We're confident that the problem exists in the client layer, since the logs confirmed that the unhealthy consumer was still calling poll. It was also calling Consumer#committed in its main poll loop, which was consistently failing with a TimeoutException.
>
> When I attached a remote debugger to an instance experiencing this issue, the network client's connection to the group coordinator (the one that uses Integer.MAX_VALUE - node.id as the coordinator id) was in the DISCONNECTED state. But for some reason it never tried to re-establish this connection, although it did successfully connect to that same broker through the "normal" connection (i.e. the one that just uses node.id).
>
> The tl;dr is that the AbstractCoordinator's FindCoordinatorRequest has failed (presumably due to a disconnect), but the _findCoordinatorFuture_ is non-null, so a new request is never sent. This shouldn't be possible, since the FindCoordinatorResponseHandler is supposed to clear the _findCoordinatorFuture_ when the future is completed. But somehow that didn't happen, so the consumer continues to assume there's still a FindCoordinator request in flight and never even notices that it's dropped out of the group.
>
> These are the only confirmed findings so far; however, we have some guesses, which I'll leave in the comments. Note that we only noticed this due to the newly added _last-rebalance-seconds-ago_ metric, and there's no reason to believe this bug hasn't been flying under the radar since the Consumer's inception.
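Since the report hinges on the _last-rebalance-seconds-ago_ metric, here is a minimal sketch of how one might watch for this wedge programmatically. The metric group and name below are the consumer's real coordinator metrics; the class name, helper methods, and threshold logic are hypothetical illustration, not anything from the Kafka codebase.

{code:java}
// Sketch only: reading the last-rebalance-seconds-ago metric described above.
// RebalanceWedgeCheck and its methods are made-up names for illustration.
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class RebalanceWedgeCheck {

    // Returns the current value of last-rebalance-seconds-ago, or NaN if the
    // metric is not (yet) registered on this consumer.
    static double lastRebalanceSecondsAgo(KafkaConsumer<?, ?> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if ("consumer-coordinator-metrics".equals(name.group())
                    && "last-rebalance-seconds-ago".equals(name.name())) {
                return ((Number) entry.getValue().metricValue()).doubleValue();
            }
        }
        return Double.NaN;
    }

    // A consumer that is still polling but hasn't rebalanced for far longer
    // than the expected failure/rebalance cadence is a candidate for this bug.
    static boolean looksWedged(KafkaConsumer<?, ?> consumer, double maxExpectedSeconds) {
        return lastRebalanceSecondsAgo(consumer) > maxExpectedSeconds;
    }
}
{code}

The key signal, per the report, is the combination: poll is still being called (the application is alive) while this value grows without bound across events that rebalance the rest of the group.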
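To make the tl;dr concrete, the sketch below strips the caching pattern down to its essentials. It is not the actual AbstractCoordinator source: CoordinatorLookup and sendFindCoordinatorRequest are hypothetical stand-ins, and CompletableFuture replaces Kafka's internal RequestFuture. The point is the guard clause: if any completion path fails to null out the cached future, lookupCoordinator() never sends another FindCoordinator request, and the consumer is permanently cut off from its coordinator.

{code:java}
// Stripped-down sketch of the guard pattern described in the tl;dr above.
// Not the real AbstractCoordinator code.
import java.util.concurrent.CompletableFuture;

public class CoordinatorLookup {

    // Non-null while a FindCoordinator request is believed to be in flight.
    private CompletableFuture<Void> findCoordinatorFuture;

    public synchronized CompletableFuture<Void> lookupCoordinator() {
        // The guard: while the cached future is non-null, no new request is
        // ever sent. A stale future here permanently severs coordinator
        // discovery, which is exactly the symptom reported above.
        if (findCoordinatorFuture == null) {
            findCoordinatorFuture = sendFindCoordinatorRequest()
                    // The clear is supposed to run on every completion,
                    // mirroring what the FindCoordinatorResponseHandler is
                    // expected to do. If some completion path bypasses it
                    // (the suspected race), the field stays non-null and the
                    // consumer silently stops looking for its coordinator.
                    .whenComplete((response, error) -> clearFindCoordinatorFuture());
        }
        return findCoordinatorFuture;
    }

    private synchronized void clearFindCoordinatorFuture() {
        findCoordinatorFuture = null;
    }

    private CompletableFuture<Void> sendFindCoordinatorRequest() {
        // Placeholder for the real network round trip.
        return new CompletableFuture<>();
    }
}
{code}

Tying the clear directly to the future's completion (as whenComplete does in this sketch) is the defensive shape; clearing from inside a separate response handler leaves room for a completion path that skips it, which is the failure mode this ticket describes.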