Hi folks,

Rajan's opinion would be particularly useful here, as they appear to have done most of the TooManyRequests work in the past.
So, the original TooManyRequests logic works well in the case where you have a lot of topics and loads of lookups and partition metadata requests (PMRs) coming in to the broker. In that case, most requests need to be served from ZooKeeper. These requests end up being async, so the broker keeps accepting requests and sending them to ZooKeeper until it hits TooManyRequests.

We have been seeing a slightly different scenario. There are not many topics (1 topic with 1k partitions), but there are millions of publishers. So when a rolling restart or something similar happens, some of the broker's IO threads get saturated with lookup requests and don't work through the backlog before the client's timeout period expires (which, in the current code, results in the client crashing and retrying). This is the scenario targeted by the lookup timeout change.

However, we have also seen a case where the brokers were not overloaded at all, but one broker was just acting up (or the GCP load balancer was blackholing the connection; we never got to the root cause). From the client's point of view, this is indistinguishable from the overloaded-broker scenario. With the lookup timeout change, the lookup will be retried. However, it will be retried on the same connection, so it will fail again. This is the same reason max rejected was added in https://github.com/apache/pulsar/pull/274.

My proposal is to extend that, so that a timeout on a lookup-type request also closes the connection (and hopefully allows a connection to be established to a working broker).
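To make the proposal concrete, here's a rough sketch of the client-side behaviour I have in mind. Note the names (LookupConnection, lookupAsync, lookupWithTimeout) are illustrative placeholders, not the actual Pulsar client internals:

    import java.util.concurrent.*;

    public class LookupTimeoutSketch {

        // Hypothetical stand-in for a per-broker client connection.
        interface LookupConnection {
            CompletableFuture<String> lookupAsync(String topic); // resolves to a broker URL
            void close();                                        // tears down the TCP connection
        }

        // On timeout, close the connection rather than leaving it open for the
        // retry, so the next attempt has to establish a fresh connection
        // (ideally landing on a healthy broker).
        static CompletableFuture<String> lookupWithTimeout(
                LookupConnection cnx, String topic, long timeoutMs,
                ScheduledExecutorService timer) {
            CompletableFuture<String> result = cnx.lookupAsync(topic);
            timer.schedule(() -> {
                if (!result.isDone()) {
                    // The broker (or the path to it) may be unhealthy; dropping
                    // the connection lets the retry avoid the bad one.
                    cnx.close();
                    result.completeExceptionally(
                            new TimeoutException("lookup timed out"));
                }
            }, timeoutMs, TimeUnit.MILLISECONDS);
            return result;
        }
    }

-Ivan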