On `kafka_2.11-1.0.1-d04daf570` we are upgrading the log message format from 0.9.0.1 to 0.11.0.1, and after the upgrade we have set `inter.broker.protocol.version=1.0` and `log.message.format.version=0.11.0.1`.
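For completeness, this is roughly what the relevant part of each broker's `server.properties` looks like after the upgrade; only the two version settings are our real values, the comments are just annotation:

```properties
# Broker settings after the rolling upgrade (all other entries omitted).
# Inter-broker protocol is already on 1.0; the on-disk message format
# is held at 0.11.0.1, having been upgraded from 0.9.0.1.
inter.broker.protocol.version=1.0
log.message.format.version=0.11.0.1
```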
We have applied this upgrade to 5 clusters by upgrading broker 1, leaving it for a day, then coming back, when happy, to upgrade the remaining brokers. 4 of those upgrades went without issue. However, in one cluster, after we upgraded the remaining brokers we started seeing these errors on broker 1, for 4 consumer offset partitions, all of which happen to be led by broker 1:

kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.380+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-21 offset 200349244
kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-11 offset 188709568
kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-1 offset 2045483676
kafka-request-handler-5 74 ERROR kafka.server.ReplicaManager 2017-12-15T07:39:41.672+0000 [ReplicaManager broker=1] Error processing fetch operation on partition __consumer_offsets-31 offset 235294887

These appear every second or so. If we stop that broker, the errors simply shift to the next leader for those 4 partitions, and moving the partitions to completely new brokers just moves the errors with them. We only see this on kafka1, not on the other 9 brokers, which had the log message format upgraded a day or two later.

Any suggestion on how to proceed? I'm not even sure yet whether this is isolated to the cluster, or whether it's related to a consumer misbehaving. Since our multiple clusters /should/ have the same set of producers/consumers working on them, I'm doubtful that it's a misbehaving client.
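For what it's worth, the leadership observation above can be double-checked with the stock tooling; something along these lines (the ZooKeeper address is a placeholder, not our actual host) lists the leader, replicas and ISR for every `__consumer_offsets` partition:

```sh
# Show leader/replicas/ISR per partition of the offsets topic;
# in our case partitions 1, 11, 21 and 31 all report Leader: 1.
bin/kafka-topics.sh --zookeeper zk1:2181 \
  --describe --topic __consumer_offsets
```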
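And for reference, a partition move of the kind mentioned above is typically driven by `kafka-reassign-partitions.sh`; a sketch of what moving one of the affected partitions looks like (the broker IDs, ZooKeeper address and file name below are illustrative, not our actual values):

```sh
# reassign.json (illustrative): move __consumer_offsets-21 onto brokers 7, 8, 9.
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "__consumer_offsets", "partition": 21, "replicas": [7, 8, 9] }
  ]
}
EOF

# Start the reassignment.
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassign.json --execute

# Re-run with --verify until it reports the reassignment as completed.
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassign.json --verify
```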