Currently, there are two ways to get a TooManyRequest error in the client:

1) The client enforces a maximum number of pending lookups:

https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L733

The maximum can be set when creating the Pulsar client via "maxLookupRequests()"; it defaults to 50000. The client internally throws a "TooManyRequest" exception if that limit is exceeded.

2) The broker also enforces a limit on pending lookup requests:

https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429

If the number of pending lookup requests for the broker exceeds the configured limit, a "TooManyRequest" error is sent back to the client.

I agree. I am not sure why we are not automatically retrying with backoff on TooManyRequest errors in the client, especially if the error is sent from the broker. However, what should the behavior of the client be if it hits the max pending lookup limit set in the client itself? In that scenario, should we block, or should we fail fast and let the application decide, which is what we do today?

Also, should we distinguish between the two scenarios, i.e. the broker sends the error vs. the client throws the error internally?

Best,

Jerry

On Fri, Aug 6, 2021 at 10:50 AM Ivan Kelly <iv...@apache.org> wrote:

> Another strangeness I've seen with partition metadata requests.
>
> https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L886
>
> It takes a backoff, and it does retries, but the opTimeoutMs variable
> is only decremented by the retry delay, not the time spent making the
> request. So if the request is timing out after the read timeout (30s)
> or the connect timeout (10s), this time isn't taken from the
> opTimeoutMs. I would expect opTimeout to be the maximum time spent
> waiting for a request to complete. This is set to 30s by default.
> However, in this case, it can be much much more. I suspect that fixing
> this will result in a lot more people seeing timeouts though :/
>
> -Ivan
>
> On Thu, Aug 5, 2021 at 7:07 PM Ivan Kelly <iv...@apache.org> wrote:
> >
> > Hi folks,
> >
> > I'm currently digging into a customer issue we've seen where the retry
> > logic isn't working well. Basically, they have a ton of producers and
> > a load of partitions, and when they all connect at the same time, they
> > bury the brokers for a few minutes. I'm looking at this from the
> > consumer point of view. The consumer is a flink pipeline, so one
> > consumer failing brings down the whole pipeline, meaning all of them
> > flood back to try to connect again.
> >
> > Anyhow, the bit that I'm confused about is the logic here:
> > https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
> > Basically, TooManyRequest is being considered a non-retriable error. I
> > think it should be exponential backoff with jitter, but I'm wondering
> > if there's any strong reason it was specifically excluded from retry.
> >
> > Anyone have any insights?
> >
> > -Ivan
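
For illustration, the two ideas raised in this thread (retrying TooManyRequest with exponential backoff and jitter, and having the operation timeout cover time spent in the request as well as the retry delay) could be sketched in Java roughly as follows. This is not the Pulsar client's code. The class RetryBudgetSketch, the exception type TooManyRequestsException, and the methods nextDelayMs and callWithRetry are all hypothetical names made up for this sketch, and the "full jitter" formula is one common choice rather than anything the thread settled on.

```java
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names exist in the Pulsar client.
final class RetryBudgetSketch {

    /** Hypothetical stand-in for the client's TooManyRequest error. */
    static final class TooManyRequestsException extends Exception {
        TooManyRequestsException(String message) {
            super(message);
        }
    }

    /**
     * Exponential backoff with "full jitter": a uniformly random delay in
     * [0, min(maxMs, baseMs * 2^attempt)] milliseconds.
     */
    static long nextDelayMs(int attempt, long baseMs, long maxMs, Random rng) {
        // Cap the shift so the multiplication cannot overflow for sane inputs.
        long cap = Math.min(maxMs, baseMs << Math.min(attempt, 20));
        return (long) (rng.nextDouble() * (cap + 1));
    }

    /**
     * Retries op on TooManyRequestsException until opTimeoutMs is used up.
     * The budget is a wall-clock deadline, so time spent inside op.call()
     * counts against it, not only the time spent sleeping between attempts.
     * Note that an attempt already in flight is not interrupted; the budget
     * is only checked between attempts.
     */
    static <T> T callWithRetry(Callable<T> op, long opTimeoutMs,
                               long baseMs, long maxMs, Random rng) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(opTimeoutMs);
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (TooManyRequestsException e) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                long delayMs = nextDelayMs(attempt, baseMs, maxMs, rng);
                if (remainingMs <= delayMs) {
                    throw e; // budget exhausted: surface the error to the caller
                }
                Thread.sleep(delayMs);
            }
        }
    }
}
```

A caller would wrap the lookup in a Callable and pass the operation timeout as the budget, for example callWithRetry(lookup, 30_000, 100, 5_000, new Random()). Whether a client-side limit hit should go through the same path as a broker-sent error is the open question above; the sketch treats both the same only for brevity.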