[ 
https://issues.apache.org/jira/browse/KAFKA-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187992#comment-17187992
 ] 
John Roesler commented on KAFKA-10429:
--------------------------------------

Hi Navinder,

My first thought is that version 1.1.1 is extremely old, and a lot has actually 
changed in the consumers since then. Is there any chance you can try with a 
newer version of Streams and see if you still observe the issue?

Aside from that, from the logs you posted, it looks like it only took that 
instance a few seconds to re-acquire the connection to the coordinator, but the 
next paragraph implies that disconnections have lasted hours. Can you clarify?

A few other notes:
 * Disconnecting from the coordinator shouldn't interrupt processing, since you 
can still fetch from the leader and followers of the topic partitions you're 
assigned
 * If an instance is disconnected for longer than the session timeout 
(session.timeout.ms), you would actually see rebalances caused by that instance 
having dropped out of the group
 * If the log cleaner removes some offsets after the consumer's current 
position, there would be an InvalidOffsetException (unless there's an 
auto-reset policy configured), so you wouldn't silently miss data 
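
To make the last two notes concrete, here is a rough sketch of the consumer 
settings involved. The keys are the standard consumer config names; the class 
name, the helper method, and the values are only illustrative assumptions, not 
taken from your setup or a recommendation for it.

```java
import java.util.Properties;

public class ConsumerSettingsSketch {

    // Hypothetical helper: collects the consumer settings discussed above.
    // All values are illustrative and should be tuned for the real deployment.
    static Properties build() {
        Properties props = new Properties();
        // A member that sends no heartbeat for this long is removed from the
        // group, which is what triggers the rebalance.
        props.setProperty("session.timeout.ms", "10000");
        // Heartbeats are usually sent several times per session timeout.
        props.setProperty("heartbeat.interval.ms", "3000");
        // With "none" there is no automatic reset, so an out-of-range position
        // surfaces as an exception instead of being skipped over silently.
        props.setProperty("auto.offset.reset", "none");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be compared with the running config.
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Whether "none" is the right reset policy depends on whether you would rather 
fail loudly or keep processing from a reset position after data has been 
deleted.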

> Group Coordinator unavailability leads to missing events
> --------------------------------------------------------
>
>                 Key: KAFKA-10429
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10429
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.1.1
>            Reporter: Navinder Brar
>            Priority: Major
>
> We are regularly getting this Exception in logs.
> [2020-08-25 03:24:59,214] INFO [Consumer 
> clientId=appId-StreamThread-1-consumer, groupId=dashavatara] Group 
> coordinator ip:9092 (id: 1452096777 rack: null) is *unavailable* or invalid, 
> will attempt rediscovery 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>  
> And after sometime it becomes discoverable:
> [2020-08-25 03:25:02,218] INFO [Consumer 
> clientId=appId-c3d1d186-e487-4993-ae3d-5fed75887e6b-StreamThread-1-consumer, 
> groupId=appId] Discovered group coordinator ip:9092 (id: 1452096777 rack: 
> null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>  
> Now, the doubt I have is why this unavailability doesn't trigger a rebalance 
> in the cluster. We have few hours of retention on the source Kafka Topics and 
> sometimes this unavailability stays over for more than few hours and since it 
> doesn't trigger a rebalance or stops processing on other nodes(which are 
> connected to GC) we never come to know that some issue has happened and till 
> then we lose events from our source topics. 
>  
> There are some resolutions mentioned on stackoverflow but those configs are 
> already set in our kafka:
> default.replication.factor=3
> offsets.topic.replication.factor=3
>  
> It would be great to understand why this issue is happening and why it 
> doesn't trigger a rebalance and is there any known solution for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
