Hello!

We are facing a really weird issue with our production Kafka cluster (60
AWS EC2 instances split evenly across 3 "racks" / Availability Zones, with
EBS storage).  Both the Kafka brokers and the clients are on version 3.3.2.

Around 10 days ago one of the brokers all of a sudden became very slow as
it was consuming almost all available CPU on the instance.  This led to all
sorts of ISR shrinks and leadership switches over the course of ~2 hours.
The JVM process was busy on a number of "data-plane-kafka-..." threads (no
further details could be seen unfortunately, as the thread name is
truncated).  It was ultimately resolved by recycling the EC2 instance and
everything went back to normal (we thought).

A few days after the outage we got internal reports that some consumers
were receiving unexpected messages from some topics.  In our setup every
topic is expected to only contain messages conforming to a JSON schema
specific to that topic, which is enforced by our validation API[1] before
publishing to Kafka.  During validation, each message's metadata is
additionally enriched with the topic name, which in this case helped us
find a number of messages that unexpectedly landed in a different topic.
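
The check itself boils down to something like the following sketch:
consume a suspect topic and compare the topic name written into the
metadata during enrichment with the topic the record was actually read
from.  The metadata field name ("published_to_topic"), bootstrap address
and group id below are placeholders, not our real names:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CrossTopicScan {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder
            props.put("group.id", "cross-topic-scan");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("suspect-topic"));  // topic under investigation
                for (int i = 0; i < 1000; i++) {  // bounded scan for the sketch
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> rec : batch) {
                        JsonNode meta = mapper.readTree(rec.value()).path("metadata");
                        // "published_to_topic" stands in for whatever field the
                        // validation API writes the topic name into.
                        String enriched = meta.path("published_to_topic").asText();
                        if (!enriched.isEmpty() && !enriched.equals(rec.topic())) {
                            System.out.printf("mismatch: enriched=%s actual=%s-%d@%d%n",
                                enriched, rec.topic(), rec.partition(), rec.offset());
                        }
                    }
                }
            }
        }
    }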

In total we have found that ~100 messages from about two dozen different
topics (~40 different partitions) were sent to the wrong topics (15
topics, 18 partitions).  All such messages were sent during the course of
the outage and we have never seen reports of such an issue before.  On top
of that, for all 18 partitions where the wrong messages landed, the broker
that experienced the high CPU utilization is the leader according to the
replica assignment.  What is even more odd is that for most of the actual
target partitions there is _no overlap_ in the replica assignment, so it's
hard to see how the messages could have reached the problematic broker at
all.  We are using replication factor 3 (equal to the number of racks) and
rack awareness is enabled.

For example, a message landed on a topic-partition with assignment:
b495a44f-c673-4d33-8104-eb0357dd8596 partition 27: xxx23062, xxx23047,
xxx23066

but was destined for a completely different topic and partition with the
assignment:
f9f10d38-cffb-4931-abf2-0574a8207c54 partition  2: xxx30373, xxx23613,
xxx23614.
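
For reference, the assignments above can be dumped for comparison with a
small AdminClient snippet along these lines (the bootstrap address is a
placeholder; the topic names are the two from the example):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class AssignmentDump {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                List<String> topics = List.of(
                    "b495a44f-c673-4d33-8104-eb0357dd8596",   // topic the message landed on
                    "f9f10d38-cffb-4931-abf2-0574a8207c54");  // topic it was destined for
                Map<String, TopicDescription> descs =
                    admin.describeTopics(topics).allTopicNames().get();
                for (TopicDescription desc : descs.values()) {
                    for (TopicPartitionInfo p : desc.partitions()) {
                        // Node.toString() includes the broker id and rack, which is
                        // enough to spot (missing) overlap between replica sets.
                        System.out.printf("%s-%d: leader=%s replicas=%s%n",
                            desc.name(), p.partition(),
                            p.leader() == null ? "none" : p.leader().idString(),
                            p.replicas());
                    }
                }
            }
        }
    }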


Lastly, from what we can see, the wrongly published messages were _also
successfully published_ to the correct topic in all cases.

We have reviewed our publishing API code and found no way the issue could
be caused by it sending the messages to the wrong topic.  The fact that it
also sent the messages to the correct topic supports our thinking that it
is working correctly.  This API has been battle-tested at a very large
scale in our organization over the past 7 years, and there were no recent
changes to the publishing path that could have directly caused this
completely new issue.

With that in mind, we reviewed the Kafka Java client code and also found
no way the topic-partition and payload could have been mixed up (once
packed into a ProducerRecord, there seems to be no further modification of
this immutable structure; a stripped-down sketch of the send path follows
after the config list below).  There also doesn't seem to be a way that
the error handling triggered by the slow broker could have taken some
rarely exercised code path.

We are using the following non-default Kafka client configs:
retries=0
enable.idempotence=false
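
To make that concrete, here is a stripped-down sketch of what the send
path looks like from the client's point of view, with only those two
settings overridden (bootstrap address, topic name and payload are
placeholders, not our actual values):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PublishPath {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            // The only non-default client settings we use.
            props.put(ProducerConfig.RETRIES_CONFIG, 0);
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, false);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Topic and validated/enriched payload are bound together here;
                // the record is immutable and not touched again before send().
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("example-topic", "validated-and-enriched payload");
                producer.send(record, (meta, err) -> {
                    if (err != null) {
                        err.printStackTrace();
                    } else {
                        System.out.printf("acked %s-%d@%d%n",
                            meta.topic(), meta.partition(), meta.offset());
                    }
                });
                producer.flush();
            }
        }
    }

With retries=0 a produce request that fails against the slow broker simply
fails the send (no client-side retry), and with idempotence disabled no
producer sequence numbers are involved.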

We have found no pre-existing bug reports that may be relevant to our
problem.
We also don't know how the issue can be reproduced.
AWS support did not find any hardware issues with the EC2 instance that we
had to replace.

What else have we missed?  What may have caused the observed issue?

Thank you for reading up to this point!
Any hints / pointers / weird ideas to try are greatly appreciated! :-)


Kind regards,
--
Alex

[1]: https://github.com/zalando/nakadi
