Hi folks,

I'm currently digging into a customer issue where the retry logic
isn't working well. They have a large number of producers and
partitions, and when all of them connect at the same time, they
overwhelm the brokers for a few minutes. I'm looking at this from the
consumer side. The consumer is a Flink pipeline, so one consumer
failing brings down the whole pipeline, meaning all of the consumers
flood back and try to connect again.

Anyhow, the bit that I'm confused about is the logic here:
https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
Basically, TooManyRequest is treated as a non-retriable error. I
think it should be retried with exponential backoff and jitter, but
I'm wondering if there's a strong reason it was specifically excluded
from retry.
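
To make "exponential backoff with jitter" concrete, here is a rough
sketch of the delay calculation I have in mind. This is not the Pulsar
client's actual backoff code; the class name, method names, and the
100 ms / 10 s values are made up for illustration. It uses the "full
jitter" variant, where each retry waits a random time between 0 and an
exponentially growing cap, so that clients that failed together do not
all reconnect at the same instant.

```java
import java.util.Random;

// Hypothetical helper, not part of the Pulsar client API.
class JitterBackoffSketch {

    // Upper bound of the delay window for a 0-based attempt number:
    // min(maxMs, baseMs * 2^attempt). Doubling stops once the cap is
    // reached (assumes maxMs is small enough that doubling cannot
    // overflow a long).
    static long ceilingMillis(int attempt, long baseMs, long maxMs) {
        long ceiling = baseMs;
        for (int i = 0; i < attempt && ceiling < maxMs; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, maxMs);
    }

    // Full jitter: pick a delay uniformly from [0, ceiling].
    static long nextDelayMillis(int attempt, long baseMs, long maxMs, Random rng) {
        long ceiling = ceilingMillis(attempt, baseMs, maxMs);
        // nextDouble() is in [0, 1), so the product is in [0, ceiling + 1)
        // and the cast truncates it to an integer in [0, ceiling].
        return (long) (rng.nextDouble() * (ceiling + 1));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int attempt = 0; attempt < 8; attempt++) {
            long cap = ceilingMillis(attempt, 100, 10_000);
            long delay = nextDelayMillis(attempt, 100, 10_000, rng);
            System.out.println("attempt " + attempt + ": cap=" + cap
                    + "ms, delay=" + delay + "ms");
        }
    }
}
```

With a 100 ms base and a 10 s maximum, the cap grows 100, 200, 400, ...
until it hits 10000 ms, and the actual delay is spread randomly below
that cap. Whether something like this is appropriate for
TooManyRequest specifically is exactly the question above.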

Anyone have any insights?

-Ivan
