[ https://issues.apache.org/jira/browse/IGNITE-24513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mirza Aliev updated IGNITE-24513:
---------------------------------
    Description: 
See 
{{ItHighAvailablePartitionsRecoveryByFilterUpdateTest#testSeveralHaResetsAndSomeNodeRestart}}
 - the test that covers this scenario.

*Precondition*
     *   Create a zone in HA mode (7 nodes: A, B, C, D, E, F, G) - phase 1.
     *   Insert data and wait for replication to all nodes.
     *   Stop a majority of the nodes (4 nodes: A, B, C, D).
     *   Wait for the partition to become available on (E, F, G), no new writes - phase 2.
     *   Stop a majority of the nodes once again (E, F).
     *   Wait for the partition to become available on (G), no new writes - phase 3.
     *   Stop the last node, G.
     *   Start one node from phase 1: A.
     *   Start one node from phase 3: G.
     *   Start one node from phase 2: E.
     *   No data should be lost (reads from the partition on A and E must be consistent with G).
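
A condensed illustration of the scenario above. This is a sketch only: the class below is not the code of {{ItHighAvailablePartitionsRecoveryByFilterUpdateTest}}, it just models how the stable assignment is expected to evolve through the phases.

{code:java}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustration only: models how the stable assignment of the HA partition is
 * expected to evolve through the phases described above.
 */
final class HaResetScenarioSketch {
    public static void main(String[] args) {
        // Phase 1: HA zone over 7 nodes, data replicated to all of them.
        Set<String> stable = new LinkedHashSet<>(List.of("A", "B", "C", "D", "E", "F", "G"));

        // A, B, C, D stop (majority lost); the HA reset shrinks stable to the alive nodes.
        stable.retainAll(Set.of("E", "F", "G"));   // phase 2: stable = (E, F, G), no new writes

        // E, F stop (majority lost again); another HA reset shrinks stable further.
        stable.retainAll(Set.of("G"));             // phase 3: stable = (G), no new writes

        // G stops, then A (phase 1), G (phase 3) and E (phase 2) are restarted.
        // Expected: scale-up brings the restarted nodes back, i.e. stable = (A, G, E),
        // and reads on A and E are consistent with G.
        // Observed: the wait for stable = (A, G, E) times out and stable stays (G).
        System.out.println("stable after the second HA reset: " + stable);   // prints [G]
    }
}
{code}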

*Result*
Before the last step we check that the stable assignment is (A, G, E), but the wait times out with the stable assignment equal to (G).

 
*Expected result*

The stable assignment is (A, G, E) after A, G, E are restarted.

h3. Implementation notes
First of all, for debugging purposes, I would simplify the test to restart only A and G and assert that the stable assignment is (A, G).
The second thought is to check whether scale up is scheduled after A and G are restarted. We should also check that there are no redundant partition reset actions: I suspect we perform a reset after the nodes are restarted because the majority is checked against the replica factor rather than the actual stable assignment size (see the sketch below).
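
A minimal sketch of the suspected discrepancy. The method names are made up for illustration and this is not the actual recovery code; the point is the difference between deriving the majority from the replica factor and deriving it from the current stable assignment size.

{code:java}
import java.util.Set;

/** Illustration only: two ways to decide whether an HA partition has lost its majority. */
final class MajorityCheckSketch {
    /**
     * Majority derived from the configured replica factor (7 in this scenario).
     * With stable = (G) and only G alive: 1 < 7 / 2 + 1, so a (redundant) reset
     * would be triggered even though the one-node stable assignment is healthy.
     */
    static boolean majorityLostByReplicaFactor(Set<String> aliveStableNodes, int replicaFactor) {
        return aliveStableNodes.size() < replicaFactor / 2 + 1;
    }

    /**
     * Majority derived from the actual stable assignment size. With stable = (G)
     * and G alive: 1 >= 1 / 2 + 1, so no reset is scheduled and the subsequent
     * scale-up to (A, G, E) is not disturbed.
     */
    static boolean majorityLostByStableSize(Set<String> aliveStableNodes, Set<String> stableAssignment) {
        return aliveStableNodes.size() < stableAssignment.size() / 2 + 1;
    }
}
{code}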
 


> HA: stable is not expected after recovered availability and node restarts 
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-24513
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24513
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>



