[ 
https://issues.apache.org/jira/browse/IGNITE-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Pligin reassigned IGNITE-25421:
----------------------------------------

    Assignee: Denis Chudov

> Add the requests throttling to raft client
> ------------------------------------------
>
>                 Key: IGNITE-25421
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> In RaftGroupServiceImpl we have the following parameters for retrying 
> requests:
>  * request timeout: the timeout of a single request to the raft group; the 
> completable future fails on the client side if the timeout is exceeded;
>  * retry timeout: the total time allowed to get a successful response, 
> including all retry attempts;
>  * retry delay: the delay before the next retry attempt is scheduled after a 
> failure.
> The problem is that the retry model is too simple:
>  * if the raft group is overloaded, it throws "TimeoutException: Send 
> with retry timed out", which gives the user no useful information;
>  * it performs retries after a short delay, sending more repeated requests 
> to the overloaded group while the old requests are still somewhere in the 
> queue;
>  * it doesn't limit the number of requests to the raft group.
> At the same time, the retries are useful:
>  * the raft leader can change at any moment;
>  * a network failure, a GC pause on the leader, or anything else may happen 
> that will be seen on the client side as a TimeoutException, and the request 
> should be retried. This is also why the request timeout is less than the 
> retry timeout.
> *Proposal*
> The simplest solution would be:
>  - dividing the requests into two groups: those that are being retried and 
> those that are incoming into the client. The former should be retried until 
> the retry timeout expires; the latter may be rejected with an exception 
> instantly. To achieve this, we may maintain a "request capacity" per remote 
> node;
>  - increasing the request timeout, until it reaches the retry timeout, in the 
> case of a TimeoutException. This gives requests that are being retried a 
> chance to be processed by the raft group within the timeout. The increased 
> timeout should apply to any request sent by the client to the overloaded 
> node; ideally it should apply to any request to the same node, because 
> striped disruptors are shared between groups;
>  - decreasing the request timeout back when the response time in the last N 
> seconds drops below some threshold.
> So, there can be some shared context between clients that keeps remote nodes' 
> capacities and request timeouts for each of them.
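
A minimal Java sketch of the shared context described in the proposal, for illustration only. All names here (RaftThrottlingContext, NodeState, tryAcquire, onTimeout, onFastResponses) are hypothetical and are not part of Ignite. The sketch assumes that a request keeps its capacity slot for as long as it is being retried, that the timeout is doubled on each TimeoutException (the issue only says "increased"), and that the caller measures the response time over the last N seconds itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical context shared between raft clients; not an Ignite API. */
final class RaftThrottlingContext {
    /** Per-node state: free request slots and the current request timeout. */
    private static final class NodeState {
        final Semaphore capacity;
        final AtomicLong requestTimeoutMs;

        NodeState(int maxInFlight, long baseTimeoutMs) {
            capacity = new Semaphore(maxInFlight);
            requestTimeoutMs = new AtomicLong(baseTimeoutMs);
        }
    }

    private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();
    private final int maxInFlight;
    private final long baseTimeoutMs;
    private final long retryTimeoutMs;

    RaftThrottlingContext(int maxInFlight, long baseTimeoutMs, long retryTimeoutMs) {
        this.maxInFlight = maxInFlight;
        this.baseTimeoutMs = baseTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    private NodeState state(String node) {
        return nodes.computeIfAbsent(node, n -> new NodeState(maxInFlight, baseTimeoutMs));
    }

    /**
     * Called once for a new incoming request. Returns false when the node has
     * no free capacity, so the caller can reject the request instantly.
     * A request that got a slot keeps it across all of its retries.
     */
    boolean tryAcquire(String node) {
        return state(node).capacity.tryAcquire();
    }

    /** Called when a request finally completes, successfully or not. */
    void release(String node) {
        state(node).capacity.release();
    }

    /** Timeout to use for the next single request to the given node. */
    long requestTimeoutMs(String node) {
        return state(node).requestTimeoutMs.get();
    }

    /** On TimeoutException: double the timeout (assumed factor), capped by the retry timeout. */
    long onTimeout(String node) {
        return state(node).requestTimeoutMs.updateAndGet(t -> Math.min(t * 2, retryTimeoutMs));
    }

    /** When recent response times are below the threshold: halve the timeout, not below the base value. */
    long onFastResponses(String node) {
        return state(node).requestTimeoutMs.updateAndGet(t -> Math.max(t / 2, baseTimeoutMs));
    }
}
```

With a capacity of 2, a third concurrent request to the same node is rejected until one of the first two calls release, while requests to other nodes are unaffected. Keying the state by node rather than by raft group follows the remark in the issue that striped disruptors are shared between groups.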



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
