Hey all,

We're trying to improve load balancing across multiple services that handle a significant amount of traffic (up to 2M requests per second for the biggest ones), with each server handling roughly 5.5K requests per second. Our services autoscale heavily throughout the day, so clients need to discover new servers within seconds. For performance and cost efficiency, each client is aware of all of its servers, so traffic does not flow through an AWS load balancer. So far, to discover new servers, we've been using the max connection age setting as recommended, but this led to traffic imbalance that hurt both performance and cost.
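For a sense of scale, a rough back-of-envelope sketch from the figures quoted above (the fleet size and the 10% imbalance figure are illustrative assumptions, not measured numbers):

```python
import math

# Figures quoted above: ~2M requests/second at peak for the biggest
# service, ~5.5K requests/second per server.
peak_rps = 2_000_000
per_server_rps = 5_500

# Implied fleet size at peak (assumption derived from the quoted numbers).
servers_at_peak = math.ceil(peak_rps / per_server_rps)
print(servers_at_peak)  # 364 servers

# Hypothetical 10% imbalance: the hottest servers run well past the
# per-server budget, which is the performance/cost problem described.
hot_server_rps = per_server_rps * 1.10
print(round(hot_server_rps))  # 6050 rps on the hottest servers
```

At this fleet size even a small relative imbalance concentrates meaningful absolute load on the hottest instances, which is why age-based connection recycling alone was not enough.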
To improve on the current setup we decided to:

- Implement our own custom name resolver that periodically polls for the available servers.
- Leverage the round robin load balancing policy to distribute the traffic evenly.

All of this is working well, but we observe latency spikes whenever we autoscale (both scale-in and scale-out). Our understanding is that round robin first establishes a connection before using the sub-channel, so we don't understand the spike when new servers are discovered. The spike when a server is removed is even less clear to us.

We've tried many things without luck, so any explanations or suggestions from your side would be really appreciated.

Thanks
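To make the round-robin expectation above concrete, here is a minimal model in plain Python (not actual gRPC code; the `Subchannel` and `RoundRobinPicker` names are hypothetical) of the property being relied on: only READY sub-channels are picked, so a newly discovered address should not receive traffic until its connection is established, and a removed address simply drops out of the rotation.

```python
from enum import Enum

class ConnState(Enum):
    CONNECTING = 1
    READY = 2

class Subchannel:
    """Models one client->server connection and its state."""
    def __init__(self, addr):
        self.addr = addr
        self.state = ConnState.CONNECTING  # new connections start here

class RoundRobinPicker:
    """Models round robin over READY sub-channels only: addresses
    still CONNECTING (just discovered) never receive picks."""
    def __init__(self, subchannels):
        self._ready = [sc for sc in subchannels if sc.state is ConnState.READY]
        self._next = 0

    def pick(self):
        if not self._ready:
            return None  # RPCs would queue here until something is READY
        sc = self._ready[self._next % len(self._ready)]
        self._next += 1
        return sc.addr

# Two established servers, plus one just discovered by the resolver.
a, b = Subchannel("10.0.0.1"), Subchannel("10.0.0.2")
a.state = b.state = ConnState.READY
c = Subchannel("10.0.0.3")  # still CONNECTING after a scale-out

picker = RoundRobinPicker([a, b, c])
picks = [picker.pick() for _ in range(4)]
print(picks)  # only the two READY addresses alternate

# Once the new connection is READY, a fresh picker includes it.
c.state = ConnState.READY
picker = RoundRobinPicker([a, b, c])
picks2 = sorted({picker.pick() for _ in range(3)})
print(picks2)  # all three addresses now in rotation
```

If this matches what the round_robin policy actually guarantees, a correct picker alone would not produce the spikes, so the sketch is only meant to pin down the expectation that makes the observed behavior surprising.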