Hello! We are facing a really weird issue with our production Kafka cluster (60 AWS EC2 instances split evenly across 3 "racks" / Availability Zones, with EBS storage). Both the Kafka brokers and the clients run version 3.3.2.
Around 10 days ago, one of the brokers suddenly became very slow while consuming almost all available CPU on its instance. This led to all sorts of ISR shrinks and leadership switches over the course of ~2 hours. The JVM process was busy on a number of "data-plane-kafka-..." threads (unfortunately no further detail was visible, as the thread names are truncated). The incident was ultimately resolved by recycling the EC2 instance, and everything went back to normal (or so we thought).

A few days after the outage we got internal reports that some consumers were receiving unexpected messages from some topics. In our setup, every topic is expected to contain only messages conforming to a JSON schema specific to that topic, which is enforced by our validation API [1] before publishing to Kafka. During validation, each message's metadata is additionally enriched with the topic name, which in this case helped us find a number of messages that unexpectedly landed in a different topic (see the check sketched below). In total we found that ~100 messages from about two dozen different topics (~40 different partitions) were sent to the wrong topics (15 topics, 18 partitions). All such messages were sent during the outage, and we have never seen reports of such an issue before.

On top of that, for all 18 partitions where the wrong messages landed, the broker that experienced the high CPU utilization is the leader according to the replica assignment. What is even more odd is that for most of the intended target partitions there is _no overlap_ in the replica assignment, so it is hard to see how the messages could have reached the problematic broker at all. We are using replication factor 3 (equal to the number of racks) and rack awareness is enabled. For example, a message landed on a topic-partition with the assignment

    b495a44f-c673-4d33-8104-eb0357dd8596 partition 27: xxx23062, xxx23047, xxx23066

but was destined for a completely different topic-partition with the assignment

    f9f10d38-cffb-4931-abf2-0574a8207c54 partition 2: xxx30373, xxx23613, xxx23614

Lastly, from what we can see, the wrongly published messages were in all cases _also successfully published_ to the correct topic.

We have reviewed our publishing API code and found no way it could have caused the issue by sending the messages to the wrong topic. The fact that it also delivered the messages to the correct topic supports our view that it is working correctly. This API has been battle-tested in our organization at very large scale over the past 7 years, and there have been no recent changes to the publishing path that could have directly caused this completely new issue. With that in mind, we also reviewed the Kafka Java client code and likewise found no way the topic-partition and payload could have been mixed up (once packed into a ProducerRecord, this immutable structure does not appear to be modified any further). There also does not seem to be a way for the error handling triggered by the slow broker to have taken some rarely exercised code path.

We are using the following non-default Kafka client configs (a rough producer setup is sketched below):

    retries=0
    enable.idempotence=false

We have found no pre-existing bug reports that seem relevant to our problem, and we do not know how to reproduce the issue. AWS support did not find any hardware issues with the EC2 instance that we had to replace.
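In case the exact client setup matters, this is roughly how our producers are configured. It is only a minimal sketch: the bootstrap servers, serializers, topic and payload below are placeholders, and the two properties at the end are the only non-default settings we actually use. We mention them because with retries=0 the client never resends a batch on its own, so producer-side retries should not be a factor here.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder bootstrap servers and serializers -- our real setup differs.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // The only non-default settings we use:
            props.put(ProducerConfig.RETRIES_CONFIG, 0);
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, false);

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Topic and payload are fixed when the ProducerRecord is constructed;
                // the record is immutable afterwards.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("example-topic", "example-key", "{\"example\":true}");
                producer.send(record);
            }
        }
    }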
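And this is, in essence, how we found the misplaced messages: by comparing the topic name recorded in the enriched metadata with the topic a message was actually read from. Again only a sketch: the field name "published_to_topic", the topic, and the naive JSON handling are made up for illustration; our real tooling and the actual metadata structure differ.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class MisplacedMessageCheck {

        // Hypothetical name of the field our validation layer adds; the real one differs.
        private static final String FIELD = "\"published_to_topic\":\"";

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "misplaced-message-check");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("example-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        String declared = declaredTopic(record.value());
                        // A message is "misplaced" if the topic recorded by the validation
                        // layer does not match the topic we actually read it from.
                        if (declared != null && !declared.equals(record.topic())) {
                            System.out.printf("misplaced: found in %s-%d@%d, metadata says %s%n",
                                    record.topic(), record.partition(), record.offset(), declared);
                        }
                    }
                }
            }
        }

        // Naive extraction of the enriched topic name from the JSON payload,
        // just to keep the sketch dependency-free.
        private static String declaredTopic(String json) {
            int start = json.indexOf(FIELD);
            if (start < 0) {
                return null;
            }
            start += FIELD.length();
            int end = json.indexOf('"', start);
            return end < 0 ? null : json.substring(start, end);
        }
    }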
What else have we missed? What may have caused the observed issue?

Thank you for reading up to this point! Any hints / pointers / weird ideas to try are greatly appreciated! :-)

Kind regards,
-- Alex

[1]: https://github.com/zalando/nakadi