[jira] [Updated] (IGNITE-15655) Random tests fail due to TimeoutException while waiting for a response

Alexander Lapin (Jira) Mon, 11 Oct 2021 07:30:15 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Lapin updated IGNITE-15655:
-------------------------------------
    Description: 
h3. Problem
Sometimes tests fails on main branch with an abstract "{{Caused by: 
java.util.concurrent.TimeoutException: Did not observe any item or terminal 
signal within 1000ms in 'source(MonoDefer)' (and no fallback has been 
configured)"}} reason. 

Need to investigate and fix the root cause of this fails.

 

Test 
{{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testRenameColumn()}}
 in [^_Integration_Tests_Integration_Tests_3305.log.zip]

Test 
{{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testMergeChangesAddDropAdd()}}
 in [^_Integration_Tests_Integration_Tests_3322.log.zip]

h3. Upd 1
Quick research revealed that the main reason for mentioned above 
_TimeoutException_ is a poor consistency of raft timeouts .

Most of jraft timeouts are based on electionTimeoutMs.
{code:java}
    // A follower would become a candidate if it doesn't receive any message
    // from the leader in |election_timeout_ms| milliseconds
    // Default: 1000 (1s)
    private int electionTimeoutMs = 1000; // follower to candidate timeout
{code}

For example both voteTime and electionTime use exact value of 
getElectionTimeoutMs (1000 ms):

{code:java}
        String name = "JRaft-VoteTimer-" + suffix;
        this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

        name = "JRaft-ElectionTimer-" + suffix;
        electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}

It actually means that Loza#TIMEOUT that was also initially set to 1000 ms just 
won't be enough in some cases. Seems that "operation timeout (LOZA#TIMEOUT)" 
should be an order of magnitude more than election timeout.
During the implementation of [naive data 
rebalance|https://issues.apache.org/jira/browse/IGNITE-15491] LOZA#TIMEOUT was 
set to 10_000 to satisfy this need. Also a 
[ticket|https://issues.apache.org/jira/browse/IGNITE-15705] for timeouts 
research and consolidation was created.

So, taking into consideration that raft operation timeout was already increased 
from 1 second to 10 it worth to just enable all tests that were disabled under 
IGNITE-15655 and check whether they are still fail or not.



  was:
h3. Problem
Sometimes tests fails on main branch with an abstract "{{Caused by: 
java.util.concurrent.TimeoutException: Did not observe any item or terminal 
signal within 1000ms in 'source(MonoDefer)' (and no fallback has been 
configured)"}} reason. 

Need to investigate and fix the root cause of this fails.

 

Test 
{{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testRenameColumn()}}
 in [^_Integration_Tests_Integration_Tests_3305.log.zip]

Test 
{{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testMergeChangesAddDropAdd()}}
 in [^_Integration_Tests_Integration_Tests_3322.log.zip]

h3. Upd 1
Quick research revealed that the main reason for mentioned above 
_TimeoutException_ is a poor consistency of raft timeouts .

h4. Poor consistency of raft timeouts
Most of jraft timeouts are based on electionTimeoutMs.
{code:java}
    // A follower would become a candidate if it doesn't receive any message
    // from the leader in |election_timeout_ms| milliseconds
    // Default: 1000 (1s)
    private int electionTimeoutMs = 1000; // follower to candidate timeout
{code}

For example both voteTime and electionTime use exact value of 
getElectionTimeoutMs (1000 ms):

{code:java}
        String name = "JRaft-VoteTimer-" + suffix;
        this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

        name = "JRaft-ElectionTimer-" + suffix;
        electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}

It actually means that Loza#TIMEOUT that was also initially set to 1000 ms just 
won't be enough in some cases. Seems that "operation timeout (LOZA#TIMEOUT)" 
should be an order of magnitude more than election timeout.
During the implementation of [naive data 
rebalance|https://issues.apache.org/jira/browse/IGNITE-15491] LOZA#TIMEOUT was 
set to 10_000 to satisfy this need. Also a 
[ticket|https://issues.apache.org/jira/browse/IGNITE-15705] for timeouts 
research and consolidation was created.

So, taking into consideration that raft operation timeout was already increased 
from 1 second to 10 it worth to just enable all tests that were disabled under 
IGNITE-15655 and check whether they are still fail or not.




> Random tests fail due to TimeoutException while waiting for a response
> ----------------------------------------------------------------------
>
>                 Key: IGNITE-15655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15655
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Konstantin Orlov
>            Priority: Blocker
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Integration_Tests_3305.log.zip, 
> _Integration_Tests_Integration_Tests_3322.log.zip
>
>
> h3. Problem
> Sometimes tests fails on main branch with an abstract "{{Caused by: 
> java.util.concurrent.TimeoutException: Did not observe any item or terminal 
> signal within 1000ms in 'source(MonoDefer)' (and no fallback has been 
> configured)"}} reason. 
> Need to investigate and fix the root cause of this fails.
>  
> Test 
> {{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testRenameColumn()}}
>  in [^_Integration_Tests_Integration_Tests_3305.log.zip]
> Test 
> {{org.apache.ignite.internal.runner.app.ITSchemaChangeKVViewTest.testMergeChangesAddDropAdd()}}
>  in [^_Integration_Tests_Integration_Tests_3322.log.zip]
> h3. Upd 1
> Quick research revealed that the main reason for mentioned above 
> _TimeoutException_ is a poor consistency of raft timeouts .
> Most of jraft timeouts are based on electionTimeoutMs.
> {code:java}
>     // A follower would become a candidate if it doesn't receive any message
>     // from the leader in |election_timeout_ms| milliseconds
>     // Default: 1000 (1s)
>     private int electionTimeoutMs = 1000; // follower to candidate timeout
> {code}
> For example both voteTime and electionTime use exact value of 
> getElectionTimeoutMs (1000 ms):
> {code:java}
>         String name = "JRaft-VoteTimer-" + suffix;
>         this.voteTimer = new RepeatedTimer(name, 
> options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};
>         name = "JRaft-ElectionTimer-" + suffix;
>         electionTimer = new RepeatedTimer(name, 
> options.getElectionTimeoutMs(), timerFactory.getElectionTimer(name)) {...};
> {code}
> It actually means that Loza#TIMEOUT that was also initially set to 1000 ms 
> just won't be enough in some cases. Seems that "operation timeout 
> (LOZA#TIMEOUT)" should be an order of magnitude more than election timeout.
> During the implementation of [naive data 
> rebalance|https://issues.apache.org/jira/browse/IGNITE-15491] LOZA#TIMEOUT 
> was set to 10_000 to satisfy this need. Also a 
> [ticket|https://issues.apache.org/jira/browse/IGNITE-15705] for timeouts 
> research and consolidation was created.
> So, taking into consideration that raft operation timeout was already 
> increased from 1 second to 10 it worth to just enable all tests that were 
> disabled under IGNITE-15655 and check whether they are still fail or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (IGNITE-15655) Random tests fail due to TimeoutException while waiting for a response

Reply via email to