Hi folks,

Rajan's opinion would be particularly useful here, as they appear to
have done most of the TooManyRequests work in the past.

So, the original TooManyRequests mechanism works well in the case
where you have a lot of topics and loads of lookups and
partitioned-metadata requests (PMRs) coming in to the broker. In this
case, most requests need to be served from ZooKeeper. These requests
end up being async, so the broker keeps accepting requests and sending
them to ZooKeeper until it hits TooManyRequests.
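
To make that concrete, here's a rough sketch of the kind of throttle
I'm describing. To be clear, this is not the actual broker code:
LookupThrottle, MAX_PENDING_LOOKUPS and fetchFromZooKeeper are all
made-up names for illustration.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Semaphore;

    // Illustrative sketch only; not the actual broker implementation.
    class LookupThrottle {
        // Made-up cap on concurrent metadata reads in flight.
        private static final int MAX_PENDING_LOOKUPS = 50_000;
        private final Semaphore pending = new Semaphore(MAX_PENDING_LOOKUPS);

        CompletableFuture<String> lookup(String topic) {
            // The broker accepts each request and fires an async metadata
            // read; the permits cap how many can be outstanding at once.
            if (!pending.tryAcquire()) {
                // Over the cap: reject with a TooManyRequests-style error.
                CompletableFuture<String> failed = new CompletableFuture<>();
                failed.completeExceptionally(
                        new IllegalStateException("TooManyRequests"));
                return failed;
            }
            return fetchFromZooKeeper(topic)
                    .whenComplete((owner, ex) -> pending.release());
        }

        private CompletableFuture<String> fetchFromZooKeeper(String topic) {
            // Stand-in for the real async ZooKeeper read.
            return CompletableFuture.completedFuture("broker-1");
        }
    }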

We have been seeing a slightly different scenario. There aren't many
topics (1 topic with 1k partitions), but there are millions of
publishers. So when a rolling restart or something similar happens,
some of the broker's IO threads get saturated with lookup requests,
and they don't work through the backlog before the client's timeout
period expires (which, in the current code, results in the client
crashing and retrying). This is the scenario targeted by the lookup
timeout change.
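
For reference, the shape of that change is roughly the following.
Again a sketch, not the real client code: sendLookup is a stand-in for
the client's lookup path, and the single-retry policy is deliberately
simplistic.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Illustrative sketch only; names and retry policy are placeholders.
    class TimedLookup {
        CompletableFuture<String> lookup(String topic, long timeoutMs) {
            // Fail the pending request after timeoutMs instead of letting
            // it sit in a saturated broker's backlog until the client
            // gives up entirely...
            return sendLookup(topic)
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                    .exceptionallyCompose(ex ->
                            isTimeout(ex)
                                    // ...and retry rather than crashing.
                                    ? sendLookup(topic)
                                          .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                                    : CompletableFuture.failedFuture(ex));
        }

        private static boolean isTimeout(Throwable ex) {
            return ex instanceof TimeoutException
                    || ex.getCause() instanceof TimeoutException;
        }

        private CompletableFuture<String> sendLookup(String topic) {
            return new CompletableFuture<>(); // left pending in this sketch
        }
    }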

However, we have also seen a case where the brokers were not
overloaded at all, but one broker was just acting up (or the GCP load
balancer was blackholing the connection; we never got to the root
cause). From the client's PoV, this is indistinguishable from the
overloaded-broker scenario. With the lookup timeout change, the lookup
will be retried. However, it will be retried on the same connection,
so it will fail again.

This is the same reason max rejected was added in
https://github.com/apache/pulsar/pull/274. My proposal is to extend
it, so that a timeout on a lookup-type request also closes the
connection (and hopefully allows a connection to be established to a
working broker).
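
Concretely, I'm imagining something along the lines of the sketch
below on the client side. The names and the threshold are made up for
illustration; the real change would hook into the existing
max-rejected accounting. The effect is that a blackholed connection
gets torn down after a few timed-out lookups instead of being retried
on forever.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch only; names and threshold are made up.
    class ConnectionHealth {
        // Assumed threshold, by analogy with the max-rejected limit.
        private static final int MAX_TIMED_OUT_LOOKUPS = 3;
        private final AtomicInteger timedOut = new AtomicInteger();

        // Called when a lookup-type request times out on this connection.
        void onLookupTimeout(Runnable closeConnection) {
            // After enough consecutive timeouts, drop the connection so
            // the retry reconnects, hopefully to a working broker.
            if (timedOut.incrementAndGet() >= MAX_TIMED_OUT_LOOKUPS) {
                timedOut.set(0);
                closeConnection.run();
            }
        }

        // A successful response means the connection is healthy again.
        void onLookupSuccess() {
            timedOut.set(0);
        }
    }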

-Ivan
