On Fri, Jan 17, 2025 at 8:31 PM Donny Nadolny <do...@stripe.com.invalid>
wrote:

> We're experiencing messages very occasionally ending up on a different
> topic than what they were published to. That is, we publish a message to
> topicA and consumers of topicB see it and fail to parse it because the
> message contents are meant for topicA. This has happened for various
> topics. Searching existing bug reports hasn't shown anything, has anyone
> seen anything like this?
>
> We've begun adding a header with the intended topic (which we get just by
> reading the topic from the record that we're about to pass to the OSS
> client) right before we call producer.send; this header shows the correct
> topic (which also matches up with the message contents itself). Similarly
> we're able to use this header and compare it to the actual topic to prevent
> consuming these misrouted messages, but it causes work for us to replay
> these messages to the right topic and is also pretty concerning.
>
> Some details:
>  - This happens rarely: approximately once per 10 trillion messages
>  - It often happens in a small burst, e.g. 2 or 3 messages very close in time
> (but from different hosts) will be misrouted
>  - It often but not always coincides with some sort of event in the cluster
> (a broker restarting or being replaced, network issues causing errors,
> etc). Also these cluster events happen quite often with no misrouted
> messages
>  - We run many clusters, it has happened for several of them
>  - There is no pattern between intended and actual topic, other than the
> intended topic tends to be higher volume ones (but I'd attribute that to
> there being more messages published -> more occurrences affecting it rather
> than it being more likely per-message)
>  - It only occurs with clients that are using a non-zero linger
>  - Once it happened with two sequential messages, both were intended for
> topicA but both ended up on topicB, published by the same host (presumably
> within the same linger batch)
>  - Most of our clients are 3.2.3 and it has only affected those, our
> brokers are 3.2.3 as well (but I suspect a client rather than broker
> problem because of it never happening with clients that use 0 linger)
>

Hi Donny,

We observed a very similar problem in late 2023, reported here:
https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy
And here (for the related high CPU issue, which was resolved by upgrading
the Linux kernel): https://issues.apache.org/jira/browse/KAFKA-16054

Some details from our case:
- The messages were routed *both* to the intended topic and to some other
topic
- The problem also appeared in short bursts, which likewise correlated with
Kafka broker maintenance (or an outage, see the JIRA)

We still don't have a plausible theory to explain the observed problem.
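
Until the cause is found, the guard described in the quoted message (stamp
the intended topic into a header just before send, then compare it with the
actual topic on consume) can be sketched roughly as below. This is only an
illustration under assumptions, not the code anyone in this thread runs: the
header name "x-intended-topic" and the class and method names are
hypothetical, and headers are modelled as a plain Map so the snippet runs
without the Kafka client. With the real client, the same idea would use
record.headers().add(name, bytes) on the ProducerRecord and
record.headers().lastHeader(name) on the ConsumerRecord.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class IntendedTopicGuard {
    // Hypothetical header name; any key agreed on by producers and consumers works.
    static final String INTENDED_TOPIC_HEADER = "x-intended-topic";

    // Producer side: record the topic the application meant to publish to.
    // Call this right before handing the record to the client.
    static void stamp(Map<String, byte[]> headers, String intendedTopic) {
        headers.put(INTENDED_TOPIC_HEADER,
                intendedTopic.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: returns true when the header is present and names a
    // different topic than the one the record was actually read from.
    static boolean isMisrouted(Map<String, byte[]> headers, String actualTopic) {
        byte[] raw = headers.get(INTENDED_TOPIC_HEADER);
        if (raw == null) {
            // Record from a producer that does not stamp the header:
            // nothing to compare against, so do not flag it.
            return false;
        }
        String intended = new String(raw, StandardCharsets.UTF_8);
        return !intended.equals(actualTopic);
    }

    public static void main(String[] args) {
        Map<String, byte[]> headers = new HashMap<>();
        stamp(headers, "topicA");
        System.out.println(isMisrouted(headers, "topicA")); // prints "false"
        System.out.println(isMisrouted(headers, "topicB")); // prints "true"
    }
}
```

A consumer that sees isMisrouted return true can skip the record and queue it
for replay to the topic named in the header, which is the manual recovery the
quoted message describes.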

Cheers,
--
Alex
