Denis Chudov created IGNITE-25421:
-------------------------------------

             Summary: Add the requests throttling to raft client
                 Key: IGNITE-25421
                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov

In RaftGroupServiceImpl we have the following parameters for retrying requests:
* request timeout: the timeout of a single request to the raft group; the completable future fails on the client side if the timeout is exceeded;
* retry timeout: the total timeout to get a successful response; it includes all retry attempts;
* retry delay: the delay before the next retry attempt is scheduled in the case of failure.

The problem is that the retry model is too simple:
* in the case of an overloaded raft group, it throws "TimeoutException: Send with retry timed out", giving no useful information to the user;
* it performs retries after a short delay, producing more repeated requests to the overloaded group while the old requests are still somewhere in the queue;
* it doesn't limit the number of requests to the raft group.

At the same time, retries are useful:
* the raft leader can change at any moment;
* a network failure, a GC pause on the leader, or anything else may happen that will be seen on the client side as a TimeoutException, and the request should be retried. This is also the reason why the request timeout is less than the retry timeout.

*Proposal*

The simplest solution would be:
- dividing the requests into two groups: those that are being retried and those that are incoming into the client. The former should be retried until the retry timeout is exceeded; the latter may be rejected with an exception instantly. To achieve this, we may maintain a "request capacity" per remote node;
- increasing the request timeout, up to the retry timeout, in the case of a TimeoutException. This will give requests that are being retried a chance to be processed by the raft group within the timeout.

The increased timeout should apply to any request sent by the client to the overloaded node; ideally, it should apply to any request to the same node, because striped disruptors are shared between groups. So there can be some shared context between clients that keeps the remote nodes' capacities and the request timeouts for each of them.
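
A minimal sketch of the shared context described above, to make the two mechanisms concrete. Everything in it is hypothetical: the class name RaftThrottlingContext, its methods, and the string node IDs do not exist in Ignite, and doubling the timeout is only one possible growth policy (the proposal only says "increasing"). It assumes a request takes a capacity slot when it first enters the client and keeps that slot across its retries, so only incoming requests can be rejected.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical context shared between raft clients; not an existing Ignite class. */
class RaftThrottlingContext {
    /** Per-node state: number of admitted requests and the current request timeout. */
    private static final class NodeState {
        final AtomicInteger inFlight = new AtomicInteger();
        final AtomicLong requestTimeoutMs;

        NodeState(long initialTimeoutMs) {
            requestTimeoutMs = new AtomicLong(initialTimeoutMs);
        }
    }

    private final ConcurrentMap<String, NodeState> nodes = new ConcurrentHashMap<>();
    private final int capacityPerNode;
    private final long initialRequestTimeoutMs;
    private final long retryTimeoutMs;

    RaftThrottlingContext(int capacityPerNode, long initialRequestTimeoutMs, long retryTimeoutMs) {
        this.capacityPerNode = capacityPerNode;
        this.initialRequestTimeoutMs = initialRequestTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    private NodeState state(String nodeId) {
        return nodes.computeIfAbsent(nodeId, id -> new NodeState(initialRequestTimeoutMs));
    }

    /**
     * Called once for an incoming request. Returns false if the node has no
     * spare capacity, in which case the caller rejects the request instantly.
     * Retries of an admitted request reuse its slot and do not call this again.
     */
    boolean tryAcquire(String nodeId) {
        AtomicInteger inFlight = state(nodeId).inFlight;
        while (true) {
            int current = inFlight.get();
            if (current >= capacityPerNode) {
                return false;
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Called when an admitted request completes or its retry timeout is exceeded. */
    void release(String nodeId) {
        state(nodeId).inFlight.decrementAndGet();
    }

    /** Request timeout to use for the next attempt sent to this node. */
    long requestTimeoutMs(String nodeId) {
        return state(nodeId).requestTimeoutMs.get();
    }

    /**
     * Called on a TimeoutException from this node. Doubles the request timeout
     * (an assumed policy), capped at the retry timeout, and returns the new value.
     */
    long onTimeout(String nodeId) {
        return state(nodeId).requestTimeoutMs.updateAndGet(t -> Math.min(t * 2, retryTimeoutMs));
    }
}
```

Because the state is keyed by remote node rather than by raft group, every client that shares the context sees the same capacity and the same increased timeout for an overloaded node, which is the behaviour the last paragraph of the proposal asks for. When and how the timeout should shrink again after the node recovers is left open here, as it is in the proposal.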
-- This message was sent by Atlassian Jira (v8.20.10#820010)