Lack of retries on TooManyRequests

Ivan Kelly Thu, 05 Aug 2021 11:07:44 -0700

Hi folks,

I'm currently digging into a customer issue we've seen where the retry
logic isn't working well. Basically, they have a ton of producers and
a load of partitions, and when they all connect at the same time, they
bury the brokers for a few minutes. I'm looking at this from the
consumer point of view. The consumer is a Flink pipeline, so one
consumer failing brings down the whole pipeline, meaning all of the
consumers flood back to try to connect again.


Anyhow, the bit that I'm confused about is the logic here:
https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
Basically, TooManyRequests is being treated as a non-retriable error. I
think it should be exponential backoff with jitter, but I'm wondering
if there's any strong reason it was specifically excluded from retry.
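
For reference, a minimal sketch of exponential backoff with jitter (the
"full jitter" variant: each delay is drawn uniformly between zero and an
exponentially growing, capped ceiling). The class name JitteredBackoff,
its methods, and the 100 ms / 30 s values are hypothetical and chosen
only for illustration; this is not the Pulsar client's own backoff code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the Pulsar client API.
final class JitteredBackoff {
    private final long initialMs; // ceiling for the first retry
    private final long maxMs;     // cap on the ceiling for later retries

    JitteredBackoff(long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs < initialMs || maxMs == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 < initialMs <= maxMs < Long.MAX_VALUE");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
    }

    // Ceiling for a 0-based attempt: min(maxMs, initialMs * 2^attempt).
    long ceilingMs(int attempt) {
        int n = Math.max(attempt, 0);
        // Return the cap directly if the shift below would overflow a long.
        if (n >= Long.numberOfLeadingZeros(initialMs) - 1) {
            return maxMs;
        }
        return Math.min(maxMs, initialMs << n);
    }

    // Full jitter: a uniform random delay in [0, ceilingMs(attempt)].
    long nextDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt) + 1);
    }
}
```

With new JitteredBackoff(100, 30_000), the ceilings run 100, 200, 400,
800, ... ms until they reach 30000 ms. Because each client picks its
delay at random below the ceiling, clients that were rejected at the
same moment do not all reconnect at the same moment.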

Anyone have any insights?

-Ivan
