Here's another oddity I've seen with partition metadata requests.

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L886

It takes a backoff and it does retries, but the opTimeoutMs variable
is only decremented by the retry delay, not by the time spent making
the request. So if the request is timing out after the read timeout
(30s) or the connect timeout (10s), that time isn't taken out of
opTimeoutMs. I would expect opTimeout to be the maximum time spent
waiting for a request to complete; it defaults to 30s. In this case,
however, it can end up being much, much more. I suspect that fixing
this will result in a lot more people seeing timeouts though :/
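To make the accounting concrete, here's a sketch of what I'd expect
(names and numbers are mine, not Pulsar's actual code): each attempt's
wall-clock duration should be charged against the timeout budget, in
addition to the backoff delay between attempts.

```java
import java.util.List;

public class RetryBudgetSketch {
    // Hypothetical sketch: charge both the time spent waiting on the
    // request itself and the retry delay against the operation budget.
    static int attemptsWithinBudget(long opTimeoutMs, List<Long> attemptDurationsMs) {
        long backoffMs = 100; // initial retry delay, doubled each attempt
        int attempts = 0;
        for (long elapsedMs : attemptDurationsMs) {
            if (opTimeoutMs <= 0) break;
            attempts++;
            opTimeoutMs -= elapsedMs;   // time the request attempt actually took
            opTimeoutMs -= backoffMs;   // plus the delay before the next retry
            backoffMs = Math.min(backoffMs * 2, 10_000);
        }
        return attempts;
    }

    public static void main(String[] args) {
        // If only the delay were deducted, three 30s read-timeout hangs would
        // all "fit" in a 30s budget; charging the hangs stops after one attempt.
        System.out.println(attemptsWithinBudget(30_000, List.of(30_000L, 30_000L, 30_000L))); // prints 1
    }
}
```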

-Ivan

On Thu, Aug 5, 2021 at 7:07 PM Ivan Kelly <iv...@apache.org> wrote:
>
> Hi folks,
>
> I'm currently digging into a customer issue we've seen where the retry
> logic isn't working well. Basically, they have a ton of producers and
> a load of partitions, and when they all connect at the same time, they
> bury the brokers for a few minutes. I'm looking at this from the
> consumer point of view. The consumer is a flink pipeline, so one
> consumer failing brings down the whole pipeline, meaning all of the
> consumers flood back and try to connect again.
>
> Anyhow, the bit that I'm confused about is the logic here:
> https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
> Basically, TooManyRequest is being considered a non-retriable error. I
> think it should be exponential backoff with jitter, but I'm wondering
> if there's any strong reason it was specifically excluded from retry.
>
> Anyone have any insights?
>
> -Ivan
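
For reference, the exponential backoff with jitter suggested above
could look something like the sketch below (a full-jitter policy; the
names and constants are illustrative, not Pulsar's API). Randomizing
the delay is what keeps a herd of clients that all hit
TooManyRequest at once from retrying in lockstep.

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitterBackoff {
    // Hypothetical sketch: "full jitter" backoff, i.e. a uniform random
    // delay in [0, min(cap, base * 2^attempt)], so clients that failed
    // at the same moment spread their retries out over time.
    static long nextDelayMs(int attempt, long baseMs, long capMs) {
        long ceiling = Math.min(capMs, baseMs << Math.min(attempt, 30));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        // Print a sample schedule: the ceiling doubles each attempt,
        // the actual delay is uniform below it.
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println(nextDelayMs(attempt, 100, 60_000));
        }
    }
}
```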