On Fri, Jan 17, 2025 at 8:31 PM Donny Nadolny <do...@stripe.com.invalid> wrote:
> We're experiencing messages very occasionally ending up on a different
> topic than what they were published to. That is, we publish a message to
> topicA and consumers of topicB see it and fail to parse it because the
> message contents are meant for topicA. This has happened for various
> topics. Searching existing bug reports hasn't shown anything, has anyone
> seen anything like this?
>
> We've begun adding a header with the intended topic (which we get just by
> reading the topic from the record that we're about to pass to the OSS
> client) right before we call producer.send, this header shows the correct
> topic (which also matches up with the message contents itself). Similarly
> we're able to use this header and compare it to the actual topic to prevent
> consuming these misrouted messages, but it causes work for us to replay
> these messages to the right topic and is also pretty concerning.
>
> Some details:
> - This happens rarely: approximately once per 10 trillion messages
> - It often happens in a small burst, eg 2 or 3 messages very close in time
>   (but from different hosts) will be misrouted
> - It often but not always coincides with some sort of event in the cluster
>   (a broker restarting or being replaced, network issues causing errors,
>   etc). Also these cluster events happen quite often with no misrouted
>   messages
> - We run many clusters, it has happened for several of them
> - There is no pattern between intended and actual topic, other than the
>   intended topic tends to be higher volume ones (but I'd attribute that to
>   there being more messages published -> more occurrences affecting it rather
>   than it being more likely per-message)
> - It only occurs with clients that are using a non-zero linger
> - Once it happened with two sequential messages, both were intended for
>   topicA but both ended up on topicB, published by the same host (presumably
>   within the same linger batch)
> - Most of our clients are 3.2.3 and it has only affected those, our
>   brokers are 3.2.3 as well (but I suspect a client rather than broker
>   problem because of it never happening with clients that use 0 linger)
>

Hi Donny,

We have observed a very similar problem in late 2023, reported here:
https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy

And here (for the related high CPU issue, resolved by upgrading Linux kernel):
https://issues.apache.org/jira/browse/KAFKA-16054

Some details from our case:

- The messages were routed *both* to the intended topic and to some other
  topic
- The problem also appeared in short bursts, also correlating with Kafka
  broker maintenance (or outage, see JIRA)

We still don't have a plausible theory to explain the observed problem.

Cheers,
--
Alex