Hi,

We have been running into the same issue over and over since we started using Kafka. It boils down to network glitches (which we are trying to resolve) that temporarily prevent consumers from seeing the brokers, and sometimes even prevent the brokers from seeing each other.
The scenario is the following:

1) 3 brokers up and running with several topics, where each topic (5 partitions) is consumed by a single consumer group (with 3 consumers on average).

2) Everything works fine during the working day and we experience no issues whatsoever.

3) However, sometimes when we return to the office in the morning, we find that some consumers in some consumer groups are no longer consuming, while others in the same consumer group keep running normally. For example, in a consumer group named "absolutegrounds.helper.processor.datapipeline" we saw that, out of 3 consumers, 2 of them had stopped consuming, whereas 1 of them was able to "recover" and continue consuming. These are their last respective logs:

One consumer in consumer group "absolutegrounds.helper.processor.datapipeline":

2018-03-26 01:01:04,070 INFO -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) dead for group absolutegrounds.helper.processor.datapipeline
2018-03-26 01:01:12,026 INFO -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) for group absolutegrounds.helper.processor.datapipeline.

Another consumer in consumer group "absolutegrounds.helper.processor.datapipeline":

2018-03-26 01:01:04,157 INFO -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) dead for group absolutegrounds.helper.processor.datapipeline
2018-03-26 01:01:12,040 INFO -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) for group absolutegrounds.helper.processor.datapipeline.

Last consumer in consumer group "absolutegrounds.helper.processor.datapipeline" (the one that recovered):

March 26th 2018, 03:01:07.757 -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) dead for group absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:11.561 -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator 10.141.36.18:9092 (id: 2147483647 rack: null) for group absolutegrounds.helper.processor.datapipeline.
March 26th 2018, 03:01:16.216 -kafka-consumer-1 o.a.k.c.c.i.ConsumerCoordinator:292 - Revoking previously assigned partitions [AG_TASK_SOURCE_DP-4] for group absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:16.948 -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:326 - (Re-)joining group absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:18.478 -kafka-consumer-1 o.a.k.c.c.i.AbstractCoordinator:434 - Successfully joined group absolutegrounds.helper.processor.datapipeline with generation 1
March 26th 2018, 03:01:18.478 -kafka-consumer-1 o.a.k.c.c.i.ConsumerCoordinator:231 - Setting newly assigned partitions [AG_TASK_SOURCE_DP-0, AG_TASK_SOURCE_DP-1, AG_TASK_SOURCE_DP-2, AG_TASK_SOURCE_DP-3, AG_TASK_SOURCE_DP-4] for group absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:18.780 -kafka-listener-5 e.e.t.d.a.h.p.TaskProcessor:203 - Published Event CREATED.
Task: {dossiertype=1, tasktype=current, taskdate=18/04/2018 00:00:00, examiner=, dossierid=017879332, tyoper=1, outcome={description=Pending, code=-1}, taskid=133838532, status={description=Completed, code=2}, logo=null, owner={ownerid=711016, ownername=Jaguar Land Rover Limited}, firstlang=EN, gsclasses=1;2;7;10;11;13;15;17;19;20;22;23;29;30;31;33;34;43;44;45, acl=f4b794ffba01d3c8d68d21e98f6d7f75, markdate=24/03/2018 19:18:31, kdmark=1, denomination=, milestone=EXAMINATION, marktype=2, clazz=f4b794ffba01d3c8d68d21e98f6d7f75, lct=false}
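For reference, here is a minimal sketch of the consumer-side settings that drive the "coordinator dead" / rejoin cycle seen above. It is written against the plain KafkaConsumer API rather than our real setup (the actual consumers are created by the Spring Cloud Stream Kafka binder), the class name ConsumerConfigSketch is just illustrative, and the timeout values are the 0.10.0.0 defaults as we understand them:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigSketch {

    // Roughly equivalent, for the settings relevant here, to what the binder builds for us.
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.141.36.18:9092"); // one of our 3 brokers
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "absolutegrounds.helper.processor.datapipeline");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // session.timeout.ms: how long the group coordinator waits without heartbeats
        // before it considers a consumer dead and rebalances its partitions away.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        // heartbeat.interval.ms: how often the consumer sends heartbeats to the coordinator.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        // request.timeout.ms: client-side timeout for requests; as we understand it, a failed
        // request to the coordinator is what produces the "Marking the coordinator ... dead" line.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "40000");

        return new KafkaConsumer<>(props);
    }
}

(If any of these timeouts is known to interact badly with a network outage of the length our backups could cause, that alone would be a useful hint.)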
So for the same consumer group "absolutegrounds.helper.processor.datapipeline", 2 out of 3 consumers stopped consuming, while the remaining one was able to recover and continue consuming, apparently picking up all of the partitions in the topic (probably because the other two consumers were stuck). All of them logged the "Marking the coordinator ... dead" message for the same broker (10.141.36.18:9092).

4) Checking with the network admins, they swear they are not aware of any issues with the network; the only thing that might be related is that some backup processes are triggered around that time. (We have not invested much time in pinning down the root cause of the network glitches, because whatever it is, in the end it will impact our brokers equally anyway.)

As mentioned in the subject of this message, we are using Kafka 0.10.0.0 for both the brokers and the clients/consumers, and our consumers use the high-level consumer API through Spring Kafka (actually Spring Cloud Stream with the Kafka binder). We have also tried to reproduce the issue, without success: in a "controlled" environment the consumers always recover properly.

Not sure whether this could be related to this issue --> https://issues.apache.org/jira/browse/KAFKA-6671

Is there anything we can try in order to spot the issue?

Thanks.
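P.S. One thing we are thinking of adding, just to make the stuck state easier to spot in our log aggregator, is our own rebalance listener that logs every revocation/assignment under our application's logger (we know ConsumerCoordinator already logs this at INFO, but a logger of our own is easier to alert on). After an incident like the one above, the instances with no subsequent "Partitions assigned" entry would be exactly the stuck ones. A rough sketch follows; LoggingRebalanceListener is just a name we made up, with plain clients it would be registered via consumer.subscribe(topics, listener), and we would still have to figure out how to hook it into the Spring Cloud Stream binder:

import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every rebalance callback so that, per consumer instance, we can see whether a
// rejoin ever completed after the coordinator was marked dead.
public class LoggingRebalanceListener implements ConsumerRebalanceListener {

    private static final Logger log = LoggerFactory.getLogger(LoggingRebalanceListener.class);

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        log.info("Partitions revoked: {}", partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        log.info("Partitions assigned: {}", partitions);
    }
}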