[ https://issues.apache.org/jira/browse/KAFKA-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187992#comment-17187992 ]
John Roesler commented on KAFKA-10429:
--------------------------------------

Hi Navinder,

My first thought is that version 1.1.1 is extremely old, and a lot has changed in the consumers since then. Is there any chance you can try with a newer version of Streams and see if you still observe the issue?

Aside from that, from the logs you posted, it looks like it only took that instance a few seconds to re-acquire the connection to the coordinator, but the next paragraph implies that disconnections have lasted hours. Can you clarify?

A few other notes:
* Disconnecting from the coordinator shouldn't interrupt processing, since you can still fetch from the leader and followers of the topic partitions you're assigned
* If an instance is disconnected for longer than the session timeout (session.timeout.ms), you would actually see a rebalance caused by that instance having dropped out of the group
* If the log cleaner removes some offsets after the consumer's current position, there would be an InvalidOffsetException (unless there's an auto-reset policy configured), so you wouldn't silently miss data

> Group Coordinator unavailability leads to missing events
> --------------------------------------------------------
>
> Key: KAFKA-10429
> URL: https://issues.apache.org/jira/browse/KAFKA-10429
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 1.1.1
> Reporter: Navinder Brar
> Priority: Major
>
> We are regularly getting this Exception in logs.
> [2020-08-25 03:24:59,214] INFO [Consumer
> clientId=appId-StreamThread-1-consumer, groupId=dashavatara] Group
> coordinator ip:9092 (id: 1452096777 rack: null) is *unavailable* or invalid,
> will attempt rediscovery
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>
> And after sometime it becomes discoverable:
> [2020-08-25 03:25:02,218] INFO [Consumer
> clientId=appId-c3d1d186-e487-4993-ae3d-5fed75887e6b-StreamThread-1-consumer,
> groupId=appId] Discovered group coordinator ip:9092 (id: 1452096777 rack:
> null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>
> Now, the doubt I have is why this unavailability doesn't trigger a rebalance
> in the cluster. We have few hours of retention on the source Kafka Topics and
> sometimes this unavailability stays over for more than few hours and since it
> doesn't trigger a rebalance or stops processing on other nodes(which are
> connected to GC) we never come to know that some issue has happened and till
> then we lose events from our source topics.
>
> There are some resolutions mentioned on stackoverflow but those configs are
> already set in our kafka:
> default.replication.factor=3
> offsets.topic.replication.factor=3
>
> It would be great to understand why this issue is happening and why it
> doesn't trigger a rebalance and is there any known solution for it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)