Hi folks, I'm currently digging into a customer issue where the client retry logic isn't coping well under load. They have a large number of producers and partitions, and when they all connect at the same time, they overwhelm the brokers for a few minutes. I'm looking at this from the consumer side. The consumer is a Flink pipeline, so one consumer failing brings down the whole pipeline, which means all of the consumers flood back and try to connect again at once.
Anyhow, the bit I'm confused about is the logic in this commit: https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8 It treats TooManyRequest as a non-retriable error. I think it should instead be retried with exponential backoff and jitter, but I'm wondering whether there was a strong reason to exclude it from retry specifically. Does anyone have any insight? -Ivan
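
For concreteness, here is a minimal sketch of exponential backoff with full jitter, in Java. It is not Pulsar's actual backoff code: the class name `JitteredBackoff`, its methods, and the 100 ms / 60 s bounds are all hypothetical and chosen only for illustration. The idea is that each retry waits a random time between 0 and a ceiling that doubles per attempt, so clients that failed together do not reconnect together.

```java
import java.util.Random;

// Hypothetical helper for illustration only; not part of the Pulsar client API.
final class JitteredBackoff {
    private final long initialMs;
    private final long maxMs;
    private final Random random;

    JitteredBackoff(long initialMs, long maxMs, Random random) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= maxMs");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.random = random;
    }

    // Upper bound for a 0-based attempt: initialMs * 2^attempt, capped at maxMs.
    long ceilingMs(int attempt) {
        long delay = initialMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            // Compare against maxMs / 2 so the doubling can never overflow.
            delay = (delay > maxMs / 2) ? maxMs : delay * 2;
        }
        return Math.min(delay, maxMs);
    }

    // "Full jitter": a uniformly random delay in [0, ceiling) for this attempt.
    long nextDelayMs(int attempt) {
        return (long) (random.nextDouble() * ceilingMs(attempt));
    }

    public static void main(String[] args) {
        // Assumed bounds: start at 100 ms, never wait longer than 60 s.
        JitteredBackoff backoff = new JitteredBackoff(100, 60_000, new Random());
        for (int attempt = 0; attempt < 12; attempt++) {
            System.out.printf("attempt %d: ceiling=%d ms, sleep=%d ms%n",
                    attempt, backoff.ceilingMs(attempt), backoff.nextDelayMs(attempt));
        }
    }
}
```

With many consumers restarting at once, the random spread matters at least as much as the exponential growth: without jitter, every consumer would retry at the same 100 ms, 200 ms, 400 ms marks and hit the brokers in synchronized waves.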