[ https://issues.apache.org/jira/browse/IGNITE-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Pligin reassigned IGNITE-25421:
----------------------------------------

    Assignee: Denis Chudov

> Add the requests throttling to raft client
> ------------------------------------------
>
>                 Key: IGNITE-25421
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> In RaftGroupServiceImpl we have the following parameters for retrying
> requests:
> * request timeout: the timeout of a single request to the raft group; the
> completable future fails on the client side if the timeout is exceeded;
> * retry timeout: the total timeout for getting a successful response; it
> includes all retry attempts;
> * retry delay: the delay before scheduling the next retry attempt after a
> failure.
> The problem is that the retry model is too simple:
> * when a raft group is overloaded, it throws "TimeoutException: Send
> with retry timed out", which gives the user no useful information;
> * it performs retries after a short delay, producing more repeated requests
> to the overloaded group while the old requests are still somewhere in the
> queue;
> * it doesn't limit the number of requests to the raft group.
> At the same time, the retries are useful:
> * the raft leader can change at any moment;
> * a network failure, a GC pause on the leader, or anything else may happen
> that is seen on the client side as a TimeoutException, and the request
> should be retried. This is also the reason why the request timeout is less
> than the retry timeout.
> *Proposal*
> The simplest solution would be:
> - dividing the requests into two groups: those that are being retried and
> those that are incoming into the client. The former should be retried until
> the retry timeout is exceeded; the latter may be rejected with an exception
> instantly. To achieve this, we may maintain a "request capacity" per remote
> node;
> - increasing the request timeout, until it reaches the retry timeout, in the
> case of a TimeoutException. This will give the requests that are being
> retried a chance to be processed by the raft group within the timeout. The
> increased timeout should apply to any request sent by the client to the
> overloaded node; ideally it should apply to any request for the same node,
> because striped disruptors are shared between groups;
> - decreasing the request timeout back when the response time over the last N
> seconds drops below some threshold.
> So, there can be some shared context between clients that keeps the remote
> nodes' capacities and the request timeout for each of them.
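
To make the proposal more concrete, here is a minimal sketch of the per-node state that such a shared context could keep. It is only an illustration under stated assumptions: the class NodeThrottle and all of its methods are hypothetical and do not exist in Ignite, doubling and halving the timeout is an arbitrary choice, and the caller is assumed to compute the recent response time itself.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical per-node throttling state; not part of the Ignite code base.
 * One instance would be shared by all raft clients talking to the same remote node.
 */
final class NodeThrottle {
    private final long baseTimeoutMs;           // initial request timeout
    private final long retryTimeoutMs;          // upper bound: the total retry timeout
    private final AtomicInteger freeSlots;      // remaining "request capacity" of the node
    private final AtomicLong requestTimeoutMs;  // current adaptive request timeout

    NodeThrottle(int capacity, long baseTimeoutMs, long retryTimeoutMs) {
        this.baseTimeoutMs = baseTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
        this.freeSlots = new AtomicInteger(capacity);
        this.requestTimeoutMs = new AtomicLong(baseTimeoutMs);
    }

    /**
     * Called once for a new incoming request. Returns false when the node has
     * no capacity left, so the caller can reject the request instantly.
     * A request that is being retried keeps the slot it already holds.
     */
    boolean tryAcquire() {
        while (true) {
            int free = freeSlots.get();
            if (free <= 0) {
                return false;
            }
            if (freeSlots.compareAndSet(free, free - 1)) {
                return true;
            }
        }
    }

    /** Called when a request finally completes (success, failure, or retry timeout). */
    void release() {
        freeSlots.incrementAndGet();
    }

    /** Timeout to use for the next single request to this node. */
    long requestTimeoutMs() {
        return requestTimeoutMs.get();
    }

    /** On TimeoutException: grow the request timeout (doubling is an assumption), capped by the retry timeout. */
    void onTimeout() {
        requestTimeoutMs.updateAndGet(t -> Math.min(t * 2, retryTimeoutMs));
    }

    /** When the recent response time is below the threshold: shrink back toward the base timeout. */
    void onRecentLatency(long recentLatencyMs, long thresholdMs) {
        if (recentLatencyMs < thresholdMs) {
            requestTimeoutMs.updateAndGet(t -> Math.max(t / 2, baseTimeoutMs));
        }
    }
}
```

In this sketch a client would call tryAcquire() only for requests entering the client and release() when the request finally completes, so retries of an already admitted request are never rejected. The shared context itself could be as simple as a concurrent map from node id to NodeThrottle, though that part is left out here.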