> so that timeouts on lookup-type requests also close the connection (and
hopefully allow establishment of a connection on a working broker).

+1.

Thanks,
Rajan

On Thu, Aug 12, 2021 at 11:30 AM Ivan Kelly <iv...@apache.org> wrote:

> Hi folks,
>
> Rajan's opinion would be particularly useful here, as they appear to
> have done most of the TooManyRequests work in the past.
>
> So, the original TooManyRequests works well in the case where you have
> a lot of topics and loads of lookups and PMRs coming into the broker. In
> this case, most requests need to be served from ZooKeeper. These
> requests will end up being async, so the broker will keep accepting
> requests and sending them to ZooKeeper until it hits TooManyRequests.
>
> We have been seeing a slightly different scenario. There aren't many
> topics (1 topic with 1k partitions), but there are millions of
> publishers. So when a rolling restart or something similar happens,
> some of the broker's IO threads get saturated with lookup requests and
> can't work through the backlog within the client's timeout period
> (which results in the client crashing and retrying again in the
> current code). This is the scenario targeted by the lookup timeout
> change.
>
> However, we have also seen a case where the brokers were not
> overloaded at all, but one broker was just acting up (or the GCP load
> balancer was blackholing the connection; we never got to the root cause).
> From the client PoV, this is indistinguishable from the overloaded
> broker scenario. With the lookup timeout change, the lookup will be
> retried. However, it will be retried on the same connection, so will
> fail again.
>
> This is the same reason max rejected was added in
> https://github.com/apache/pulsar/pull/274. My proposal is to extend
> it so that timeouts on lookup-type requests also close the
> connection (and hopefully allow establishment of a connection on a
> working broker).
>
> -Ivan
>
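For what it's worth, the proposed behavior can be sketched roughly as below. This is a minimal, hypothetical model, not Pulsar's actual client connection code; the class and method names are invented for illustration. The broker's response is simulated by a future, and a lookup that times out both fails the pending request and closes the connection, so the retry can go to a different broker rather than the same blackholed one.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of "timeout on a lookup-type request closes the
// connection". A broker that never completes `respond` models the
// blackholed-connection case described above.
class LookupConnection {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    CompletableFuture<String> lookup(String topic, long timeoutMs,
                                     CompletableFuture<String> respond) {
        CompletableFuture<String> result = new CompletableFuture<>();
        ScheduledFuture<?> timeout = timer.schedule(() -> {
            // Only the first completion wins; if the timeout fires first,
            // fail the request AND close the whole connection (the proposal).
            if (result.completeExceptionally(
                    new TimeoutException("lookup timed out"))) {
                close();
            }
        }, timeoutMs, TimeUnit.MILLISECONDS);
        respond.whenComplete((value, error) -> {
            timeout.cancel(false);
            if (error != null) result.completeExceptionally(error);
            else result.complete(value);
        });
        return result;
    }

    void close() {
        closed.set(true);
        timer.shutdown();
    }

    boolean isClosed() {
        return closed.get();
    }
}
```

With the connection closed, the client's normal reconnect logic would pick a broker again, which is what gives the lookup a chance to land somewhere working.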
