We were ultimately able to solve this issue - mainly by sitting and waiting.
The issue was indeed that at some point, somehow, the data on the leader of this __consumer_offsets-18 partition got corrupted. This probably happened during the upgrade from Kafka 2.2 -> 2.6. We did that upgrade in a rather dangerous way, as we know now: we simply stopped all brokers, updated the software on all of them, and restarted them. We thought that since we could afford this outage on the weekend, it would be a safe way to do it. But we will certainly never do that again, at least not unless we know 100% that all producers and all consumers are really stopped. This was NOT the case during that upgrade: we overlooked one consumer and left it running, and that consumer group was storing its offsets in the __consumer_offsets-18 partition. So that action - taking all brokers down and upgrading them while consumers/producers are still running - probably caused the corruption. Lesson learnt: never do that again, always do rolling upgrades, even if they take a lot longer.

The issue was then actually solved through the nature of the compacted topic. With the default settings, compaction only runs about once a week (or when a segment grows beyond 1GB, which will not happen in our case). Compaction of that __consumer_offsets-18 partition kicked in yesterday evening, and with that the 2 corrupted offsets were purged away. After that it was smooth sailing: the reassignment of that partition to the new brokers worked like a charm.

We could certainly have sped up this recovery by setting the topic parameters so that compaction would kick in earlier (a rough sketch of how that could be done is at the end of this mail). But it was only on Wednesday that we reached the understanding that the problem could actually be resolved that way, so we decided to leave everything as it was and wait another day.

On 26.10.20 18:41, Joe Ammann wrote:
> We did an upgrade from Kafka 2.2 to 2.6, followed by a migration
> (through reassign-partitions) from old to new brokers.
>
> As described in
> https://stackoverflow.com/questions/64514851/apache-kafka-kafka-common-offsetsoutoforderexception-when-reassigning-consume,
> all but 1 partition (__consumer_offsets-18) were successfully migrated.
> But that one stubbornly refuses to migrate, because the new replicas
> complain about OutOfOrderOffset. When we looked at the exception
> message more closely today, we discovered that in the really huge list
> of offsets there are 2 non-monotonic offsets:
>
> [2020-10-24 15:04:54,528] ERROR [ReplicaFetcher replicaId=10,
> leaderId=3, fetcherId=0] Unexpected error occurred while processing data
> for partition __consumer_offsets-18 at offset 1545264631
> (kafka.server.ReplicaFetcherThread)
> kafka.common.OffsetsOutOfOrderException: Out of order offsets found in
> append to __consumer_offsets-18: ArrayBuffer(1545264631, 1545264632,
> ..., 1545271418, 1545271419, 1, 1, 1545271422, 1545271423, ...
> 1545272005, 1545272006, 1545272007)
>
> Apparently, the leader - when asked for the list of offsets to replicate
> - returns that list with the two weird '1' offsets. One would expect
> 1545271420 and 1545271421 in place of the two '1' entries.
>
> Does this mean that the data is really somehow corrupted? That offset
> range (probably not by accident) is about when we started the upgrade
> from 2.2 to 2.6. The last snapshot file written on the leader with the
> old 2.2 version had offset 1545271411, only 8 messages before the one
> that is now apparently causing issues.
>
> Any idea what could have happened here? And more importantly, any idea
> how to get out of this mess :-) We would be able to accept the loss of
> all data on __consumer_offsets-18.
> It happens to contain only
> "non-essential" consumer groups. So that would also be an option.
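
PS: for anyone who hits the same situation and cannot simply wait for the
regular weekly cleaner run, it should be possible to speed things up by
temporarily overriding the compaction-related configs on the
__consumer_offsets topic, so that the log cleaner picks the partition up
much sooner. Below is a rough, untested sketch using the Java AdminClient;
the bootstrap address and the concrete values are only placeholders, and
the overrides should of course be removed again once compaction has run.
(We did not actually do this ourselves, we just waited.)

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SpeedUpOffsetsCompaction {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource offsetsTopic =
                new ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets");

            // Temporary overrides (example values, not recommendations):
            // - segment.ms: roll the active segment soon; the cleaner never
            //   touches the active segment, so nothing happens until it rolls
            // - min.cleanable.dirty.ratio: make the partition eligible for
            //   cleaning almost immediately
            // - max.compaction.lag.ms: upper bound on how long records may
            //   stay uncompacted
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("segment.ms", "600000"),
                        AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("min.cleanable.dirty.ratio", "0.01"),
                        AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("max.compaction.lag.ms", "600000"),
                        AlterConfigOp.OpType.SET));

            Map<ConfigResource, Collection<AlterConfigOp>> changes =
                Map.of(offsetsTopic, ops);
            admin.incrementalAlterConfigs(changes).all().get();

            // After compaction has run, the same call with OpType.DELETE
            // removes the overrides again.
        }
    }
}

Once the cleaner has rewritten the segment that contains the bogus offsets,
the reassignment should go through, just as it did for us after the regular
weekly run.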