[ https://issues.apache.org/jira/browse/KAFKA-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939731#comment-17939731 ]
Donny Nadolny commented on KAFKA-19012:
---------------------------------------

* what are the versions of the clients that don't have this problem?
** In our setup almost all traffic goes through our proxy, which only runs a single client version (right now 3.3.1). We previously used 2.5.1 for quite a long time and never experienced this, but we also didn't experience it for many months after upgrading the client from 2.5.1 -> 3.3.1. There is another system which doesn't use our proxy and uses client 3.4.0, and it hasn't experienced these misroutings, though I'm not sure how much to read into the version difference because it's much lower volume than our proxy and also doesn't use linger.ms, which is what I strongly suspect is involved. We also run multiple variations of our proxy; the one which does not use linger.ms has not experienced any misrouting and runs the same Kafka client version (and the same proxy code as well), 3.3.1 currently.
* when you see misrouted messages, are they all in the same batch (i.e. the whole batch misrouted) or a batch contains both correct and misrouted messages?
** We don't have enough logging to tell for sure, but what I can say is that we did have one occurrence where two messages to the same topic were both misrouted at the same time (the same proxy instance published them, and they were sequential in the same partition, i.e. an offset difference of 1), so presumably they came from the same batch.
** We haven't seen multiple messages misrouted at the exact same time (except for the occurrence above).
** These taken together lead me to believe that, when this bug occurs, only messages within a batch that are destined for the same topic are misrouted, not all messages published within the linger time.
* when you see bursts, are they always between same topics or different messages in the same burst are misrouted between different topics?
** It's the latter, different messages to seemingly random topics.
E.g. one occurrence might be a message from high volume topicA -> medium volume topicB, as well as one from topicC -> topicD. Then another occurrence could be a single misrouted message from topicA -> topicE.
* could you check that you don't have "interceptor.classes" config defined in any of the clients?
** We do have a rarely used API path which uses a client that is configured with interceptor.classes (a single class which records metrics in {{onAcknowledgement}}, and which adds a header to the event and records metrics in {{onSend}}); however, it is never used in the variant of our proxy that has experienced these misroutings.
* any chance you can share the custom partitioner code?
** Possibly - let me check. I'll also see if I can share the interceptor class above.

> Messages ending up on the wrong topic
> -------------------------------------
>
>                 Key: KAFKA-19012
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19012
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer
>    Affects Versions: 3.2.3, 3.8.1
>            Reporter: Donny Nadolny
>            Assignee: Kirk True
>            Priority: Major
>
> We're experiencing messages very occasionally ending up on a different topic
> than what they were published to. That is, we publish a message to topicA and
> consumers of topicB see it and fail to parse it because the message contents
> are meant for topicA. This has happened for various topics.
> We've begun adding a header with the intended topic (which we get just by
> reading the topic from the record that we're about to pass to the OSS client)
> right before we call producer.send; this header shows the correct topic
> (which also matches up with the message contents itself). Similarly, we're
> able to use this header and compare it to the actual topic to prevent
> consuming these misrouted messages, but this is still concerning.
> Some details:
> - This happens rarely: it happened approximately once per 10 trillion
> messages for a few months, though there was a period of a week or so where it
> happened more frequently (once per 1 trillion messages or so)
> - It often happens in a small burst, eg 2 or 3 messages very close in time
> (but from different hosts) will be misrouted
> - It often but not always coincides with some sort of event in the cluster
> (a broker restarting or being replaced, network issues causing errors, etc).
> Also these cluster events happen quite often with no misrouted messages
> - We run many clusters, it has happened for several of them
> - There is no pattern between intended and actual topic, other than the
> intended topic tends to be higher volume ones (but I'd attribute that to
> there being more messages published -> more occurrences affecting it rather
> than it being more likely per-message)
> - It only occurs with clients that are using a non-zero linger
> - Once it happened with two sequential messages, both were intended for
> topicA but both ended up on topicB, published by the same host (presumably
> within the same linger batch)
> - Most of our clients are 3.2.3 and it has only affected those, most of our
> brokers are 3.2.3 but it has also happened with a cluster that's running
> 3.8.1 (but I suspect a client rather than broker problem because of it never
> happening with clients that use 0 linger)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
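
For readers following the issue, the intended-topic header check described in the report could look roughly like the sketch below. It is only an illustration under stated assumptions, not the reporter's code: the class name {{IntendedTopicGuard}}, the header name {{intended-topic}}, and the choice to let records without the header through are all hypothetical, and headers are modelled as a plain map so the example runs without the Kafka client on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper; not part of the Kafka client API or of the reporter's proxy. */
public final class IntendedTopicGuard {

    /** Assumed header name; the issue does not say what the real header is called. */
    public static final String HEADER = "intended-topic";

    private IntendedTopicGuard() {}

    /** Producer side: encode the topic read from the record that is about to be sent. */
    public static byte[] encode(String topic) {
        return topic.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Consumer side: returns true when the header is present and names a topic
     * other than the one the record was actually consumed from.
     */
    public static boolean isMisrouted(String actualTopic, Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER);
        if (raw == null) {
            // Assumption: a record without the header cannot be judged, so it is let through.
            return false;
        }
        String intended = new String(raw, StandardCharsets.UTF_8);
        return !intended.equals(actualTopic);
    }
}
```

With the real client, the producer side would presumably call {{record.headers().add(IntendedTopicGuard.HEADER, IntendedTopicGuard.encode(record.topic()))}} right before {{producer.send(record)}}, and the consumer side would read the value via {{record.headers().lastHeader(IntendedTopicGuard.HEADER)}} and skip the record when the check returns true.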