Denis Chudov created IGNITE-25421:
-------------------------------------

             Summary: Add the requests throttling to raft client
                 Key: IGNITE-25421
                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov


In RaftGroupServiceImpl we have the following parameters for retrying requests:
 * request timeout: the timeout of a single request to the raft group; the 
completable future fails on the client side if the timeout is exceeded;
 * retry timeout: the total timeout to get a successful response; includes all 
retry attempts;
 * retry delay: the delay before the next retry attempt is scheduled in the case 
of failure.
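
To make the interaction of the three parameters concrete, here is a minimal 
sketch that counts how many attempts a single request produces when every 
attempt times out. It is not the RaftGroupServiceImpl code: the class name, the 
method name and the values in the usage note are hypothetical, and it assumes 
each failed attempt costs the full request timeout plus the retry delay.

```java
/** Hypothetical model of the retry loop, for illustration only. */
final class RetryModelSketch {
    /**
     * Counts the attempts started before the retry timeout expires,
     * assuming every attempt fails with a timeout (worst case).
     */
    static int attemptsWhenEveryRequestTimesOut(
            long requestTimeoutMs, long retryTimeoutMs, long retryDelayMs) {
        int attempts = 0;
        long elapsedMs = 0;

        while (elapsedMs < retryTimeoutMs) {
            attempts++;
            elapsedMs += requestTimeoutMs; // The attempt fails on request timeout.
            elapsedMs += retryDelayMs;     // Wait before scheduling the next attempt.
        }

        return attempts;
    }
}
```

For example, with a request timeout of 3000 ms, a retry timeout of 10000 ms and 
a retry delay of 200 ms (made-up values), the method returns 4, so one logical 
request reaches the raft group four times.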

The problem is that the retry model is too simple:
 * when the raft group is overloaded, it throws "TimeoutException: Send with 
retry timed out", which gives the user no useful information;
 * it performs retries after a short delay, sending more repeated requests to 
the overloaded group while the old requests are still somewhere in the queue;
 * it doesn't limit the number of requests to the raft group.

At the same time, the retries are useful:
 * the raft leader can change at any moment;
 * a network failure, a GC pause on the leader, or anything else may happen that 
is seen on the client side as a TimeoutException, and the request should then 
be retried. This is also the reason why the request timeout is less than the 
retry timeout.

*Proposal*

The simplest solution would be:

- dividing the requests into two groups: those that are being retried and those 
that are newly arriving at the client. The former should be retried until the 
retry timeout is exceeded; the latter may be rejected instantly with an 
exception. To achieve this, we may maintain a "request capacity" per remote node;

- increasing the request timeout, up to the retry timeout, in the case of 
TimeoutException. This will give requests that are being retried a chance to be 
processed by the raft group within the timeout. The increased timeout should 
apply to any request sent by the client to the overloaded node; ideally, it 
should apply to any request to the same node, because striped disruptors are 
shared between groups.
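
The two points above could be sketched roughly as follows. This is only an 
illustration under assumptions, not the proposed implementation: the class 
names NodeRequestCapacity and AdaptiveRequestTimeout are hypothetical, the 
capacity is modelled as a plain semaphore, and doubling the timeout is an 
assumed growth rule (the description only says "increasing").

```java
import java.util.concurrent.Semaphore;

/** Hypothetical per-node limiter: a request is admitted once and keeps its permit across retries. */
final class NodeRequestCapacity {
    private final Semaphore permits;

    NodeRequestCapacity(int capacity) {
        permits = new Semaphore(capacity);
    }

    /** Called once for an incoming request; false means "reject instantly with an exception". */
    boolean tryAdmit() {
        return permits.tryAcquire();
    }

    /** Called when the request finally completes: success, failure, or retry timeout exceeded. */
    void release() {
        permits.release();
    }
}

/** Hypothetical request timeout that grows on TimeoutException but never exceeds the retry timeout. */
final class AdaptiveRequestTimeout {
    private final long retryTimeoutMs;
    private long currentMs;

    AdaptiveRequestTimeout(long initialRequestTimeoutMs, long retryTimeoutMs) {
        this.currentMs = Math.min(initialRequestTimeoutMs, retryTimeoutMs);
        this.retryTimeoutMs = retryTimeoutMs;
    }

    synchronized long currentMs() {
        return currentMs;
    }

    /** Assumed growth rule: double the timeout, capped at the retry timeout. */
    synchronized void onTimeout() {
        currentMs = Math.min(currentMs * 2, retryTimeoutMs);
    }
}
```

Retried requests never call tryAdmit() again, so only newly incoming requests 
can be rejected. When the timeout should shrink back is not covered by the 
description, so the sketch leaves it out.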

So, there could be a shared context between clients that keeps the capacity and 
the request timeout of each remote node.
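
A minimal sketch of such a shared context, assuming remote nodes are 
identified by a string name. The names ThrottlingContext, RemoteNodeState and 
stateFor are hypothetical and do not exist in the code base; the capacity and 
timeout values are placeholders.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-node state shared by all raft clients that talk to the node. */
final class RemoteNodeState {
    final Semaphore capacity;
    final AtomicLong requestTimeoutMs;

    RemoteNodeState(int capacity, long initialRequestTimeoutMs) {
        this.capacity = new Semaphore(capacity);
        this.requestTimeoutMs = new AtomicLong(initialRequestTimeoutMs);
    }
}

/** Hypothetical context: one instance is shared between the raft clients. */
final class ThrottlingContext {
    private final ConcurrentMap<String, RemoteNodeState> nodes = new ConcurrentHashMap<>();
    private final int capacityPerNode;
    private final long initialRequestTimeoutMs;

    ThrottlingContext(int capacityPerNode, long initialRequestTimeoutMs) {
        this.capacityPerNode = capacityPerNode;
        this.initialRequestTimeoutMs = initialRequestTimeoutMs;
    }

    /** Returns the state for the node, creating it on first use. */
    RemoteNodeState stateFor(String nodeName) {
        return nodes.computeIfAbsent(
                nodeName, name -> new RemoteNodeState(capacityPerNode, initialRequestTimeoutMs));
    }
}
```

Because every client resolves the same RemoteNodeState for a given node, a 
permit taken or a timeout raised by one raft group's client is visible to the 
clients of other groups hosted on that node.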


