> so that timeouts on lookup-type requests also close the connection (and hopefully allow establishment of a connection to a working broker).
+1. Thanks,
Rajan

On Thu, Aug 12, 2021 at 11:30 AM Ivan Kelly <iv...@apache.org> wrote:
> Hi folks,
>
> Rajan's opinion would be particularly useful here, as they appear to
> have done most of the TooManyRequests work in the past.
>
> So, the original TooManyRequests works well in the case where you have
> a lot of topics and loads of lookups and PMR come in to the broker. In
> this case, most requests need to be served from ZooKeeper. These
> requests will end up being async, so the broker will keep accepting
> requests and sending them to ZooKeeper until it hits TooManyRequests.
>
> We have been seeing a slightly different scenario. There are not many
> topics (1 topic with 1k partitions), but there are millions of
> publishers. So when a rolling restart or something similar happens,
> some of the broker's IO threads will get saturated with lookup
> requests and won't work through the backlog before the client's
> timeout period (which results in the client crashing and retrying
> again in the current code). This is the scenario targeted by the
> lookup timeout change.
>
> However, we have also seen a case where the brokers were not
> overloaded at all, but one broker was just acting up (or the GCP load
> balancer was blackholing the connection; we never got to the root
> cause). From the client's PoV, this is indistinguishable from the
> overloaded broker scenario. With the lookup timeout change, the lookup
> will be retried. However, it will be retried on the same connection,
> so it will fail again.
>
> This is the same reason max rejected was added in
> https://github.com/apache/pulsar/pull/274. My proposal is to extend
> it so that timeouts on lookup-type requests also close the
> connection (and hopefully allow establishment of a connection to a
> working broker).
>
> -Ivan
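The proposal above — fail the lookup on timeout *and* close the connection so the retry can land on a different broker — can be sketched roughly as below. This is a hypothetical illustration, not the actual Pulsar client code; `LookupConnection`, its `closed` flag, and the topic name are all made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the proposed behavior: when a lookup-type
// request times out, complete its future exceptionally AND close the
// connection, so the next attempt re-resolves (hopefully to a healthy
// broker) instead of retrying on the same dead connection.
class LookupConnection {
    volatile boolean closed = false;

    CompletableFuture<String> lookup(String topic, long timeoutMs) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        // In a real client the broker's response would complete `pending`;
        // here nothing ever completes it, so the timeout always fires.
        return pending
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                .whenComplete((result, err) -> {
                    if (err instanceof TimeoutException
                            || (err != null && err.getCause() instanceof TimeoutException)) {
                        close(); // the key change: timeout also tears down the connection
                    }
                });
    }

    void close() {
        closed = true; // stand-in for closing the underlying channel
    }
}

public class Demo {
    public static void main(String[] args) {
        LookupConnection conn = new LookupConnection();
        try {
            conn.lookup("persistent://tenant/ns/topic-partition-0", 50).get();
        } catch (ExecutionException | InterruptedException e) {
            System.out.println("lookup failed: " + e.getCause().getClass().getSimpleName());
        }
        System.out.println("connection closed: " + conn.closed);
    }
}
```

Without the `close()` in the timeout path, the caller would retry the lookup on the same connection and (as described in the thread) fail again against the same misbehaving broker or blackholed load-balancer path.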