Hi all,

we are fighting with offset rewinds of seemingly random size, hitting
seemingly random partitions, whenever we restart any node in our Kafka
cluster. We are running out of ideas - any help or pointers to things to
investigate would be highly appreciated.
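
To give a concrete example of what we see: describing an affected group
with the stock tooling shows CURRENT-OFFSET jumping backwards on a few
partitions right after a broker restart (broker host and group name below
are placeholders, not our real ones):

    bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
        --describe --group some-consumer-group
    # after a restart, CURRENT-OFFSET drops back on seemingly random
    # partitions and LAG spikes by the corresponding amount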

Our Kafka setup is dual data center: two local broker clusters (3 nodes
each) and two aggregate broker clusters (5 nodes each), with MirrorMaker
running alongside the aggregate clusters to consume messages from the
local clusters.
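
For reference, the mirroring is the stock MirrorMaker shipped with Kafka,
started roughly like this (config file names are placeholders and the
whitelist is simplified):

    bin/kafka-mirror-maker.sh \
        --consumer.config mm-consumer.properties \
        --producer.config mm-producer.properties \
        --whitelist '.*'
    # mm-consumer.properties points at the local cluster,
    # mm-producer.properties at the aggregate cluster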

The issues seem to have appeared since we upgraded from 0.10.1.0 to 0.11,
but we are not entirely sure that is related.

Our first theory was that an overly large consumer offsets topic (it is
log-compacted) was causing the issues on restart, and indeed we found that
the log cleaner threads had died after the upgrade. But restarting the
brokers and getting the topic compacted again did not fix the issue.
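
In case it helps, this is roughly how we spotted the dead cleaner threads
and verified the offsets topic is set up for compaction (log path and
ZooKeeper host are placeholders):

    # log-cleaner.log showed the kafka-log-cleaner-thread dying with an
    # uncaught exception shortly after the upgrade
    grep -B1 -A5 'kafka-log-cleaner-thread' /var/log/kafka/log-cleaner.log

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
        --topic __consumer_offsets
    # shows cleanup.policy=compact in the topic configs, as expected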

The logs are pretty silent when it happens. Before we cleaned the consumer
offsets topic we got a few of these every time it happened, but we no
longer do:

[2017-08-04 11:19:25,970] ERROR [Group Metadata Manager on Broker 472]: Error loading offsets from __consumer_offsets-14 (kafka.coordinator.group.GroupMetadataManager)
java.lang.IllegalStateException: Unexpected unload of active group tns-ticket-store-b144c9d1-425a-4b90-8310-f6e886741494 while loading partition __consumer_offsets-14
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:600)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsAndOffsets$6.apply(GroupMetadataManager.scala:595)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
        at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:595)
        at kafka.coordinator.group.GroupMetadataManager.kafka$coordinator$group$GroupMetadataManager$$doLoadGroupsAndOffsets$1(GroupMetadataManager.scala:455)
        at kafka.coordinator.group.GroupMetadataManager$$anonfun$loadGroupsForPartition$1.apply$mcV$sp(GroupMetadataManager.scala:441)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Does this look familiar to anyone? Any suggestions as to what we should
look into more closely to investigate this issue? Happy to provide more
details about anything that might be helpful.

Thanks a lot in advance,

Christiane
