[jira] [Commented] (IGNITE-19655) Distributed Sql keeps mapping query fragments to a node that has already left

Roman Puchkovskiy (Jira) Wed, 07 Jun 2023 08:09:03 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730155#comment-17730155
 ]


Roman Puchkovskiy commented on IGNITE-19655:
--------------------------------------------

There are 2 weirdnesses:
 1. Why does it take 10 seconds for a node to be removed from the physical 
topology if it left gracefully?
 2. Why didn't the size of the topology shrink to 2 nodes when the node 
actually left the topology?

The second weirdness is harmless; it just looks weird in the logs. The thing is 
that the node re-joined before it was considered to have left, so the 'Node left' 
message was about an outdated version of the node (one that had already been 
replaced with the new one on the second join). Logically, it's all OK: the node 
was 'detected' to be missing too late, so there was never a time window in which 
the cluster knew that the node was gone.
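
To illustrate why the late 'Node left' is harmless, here is a minimal sketch. It is not Ignite's actual code: the class `TopologySketch`, its method names, and the use of a `UUID` per node start are all assumptions made for illustration. The idea is that a leave notification only takes effect if it refers to the node version currently in the topology.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical topology keyed by node name; every (re)start of a node gets a fresh id. */
class TopologySketch {
    private final Map<String, UUID> idByName = new HashMap<>();

    /** A join replaces any previous version of the node with the same name. */
    void onJoined(String name, UUID id) {
        idByName.put(name, id);
    }

    /**
     * A leave is applied only if it refers to the current version of the node.
     * Returns true if the node was actually removed.
     */
    boolean onLeft(String name, UUID id) {
        return idByName.remove(name, id);
    }

    int size() {
        return idByName.size();
    }
}
```

With this shape, a node that restarts and re-joins before its old version is declared gone keeps the topology size at 3: the late `onLeft` for the old id returns `false` and changes nothing.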

The first weirdness is really weird. I'm digging into this, but we should 
probably remove the node from the topology as soon as it says 'good bye' 
(LEAVING event); otherwise, its removal has to wait for the suspicion check 
mechanism.
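
A rough sketch of that idea, under stated assumptions: the `EventType` enum and the `LeaveAwareTopology` class below are hypothetical and do not mirror any real membership API. For brevity, the sketch also ignores restarts of a node under the same name. The point is only that LEAVING is handled like a removal instead of being left to the failure detector.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical listener that drops a node as soon as it announces a graceful leave. */
class LeaveAwareTopology {
    /** Hypothetical event types; REMOVED stands for the suspicion-based verdict. */
    enum EventType { ADDED, LEAVING, REMOVED }

    private final Set<String> members = new LinkedHashSet<>();

    void onEvent(EventType type, String nodeName) {
        switch (type) {
            case ADDED:
                members.add(nodeName);
                break;
            case LEAVING: // graceful 'good bye': remove right away
            case REMOVED: // late verdict from the suspicion check: no-op if already removed
                members.remove(nodeName);
                break;
        }
    }

    /** Returns a copy so that callers cannot modify the internal state. */
    Set<String> members() {
        return new LinkedHashSet<>(members);
    }
}
```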

> Distributed Sql keeps mapping query fragments to a node that has already left
> -----------------------------------------------------------------------------
>
>                 Key: IGNITE-19655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19655
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Maksim Zhuravkov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> There are two test failures: 
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7271211?expandCode+Inspection=true&expandBuildProblemsSection=true&hideProblemsFromDependencies=false&expandBuildTestsSection=true&hideTestsFromDependencies=false]
>  and 
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7272905?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandCode+Inspection=true&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true]
>  
> (org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.entriesKeepAppendedAfterSnapshotInstallation
>  and 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.snapshotInstallTimeoutDoesNotBreakSubsequentInstallsWhenSecondAttemptIsIdenticalToFirst,
>  respectively).
> In both cases, the test code creates a table with 3 replicas on a cluster of 
> 3 nodes, then stops the last node and tries to make an insert using one of 
> the 2 remaining nodes. The RAFT majority (2 of 3) is still preserved, so the 
> insert should succeed. It's understood that the insert might be issued before 
> the remaining nodes learn that the third node has left, so we have a 
> retry mechanism in place: it makes up to 5 attempts over almost 8 seconds (in 
> total).
> But in both failed runs, each of the 5 attempts failed because a fragment of 
> the INSERT query was mapped to the missing node. This seems to be bad luck 
> (the tests pass most of the time; the failure rate is about 2.5%), but anyway: 
> the SQL engine does not seem to care about the fact that the node has already 
> left.
> Probably, the SQL engine should track the Logical Topology events and avoid 
> mapping query fragments to missing nodes.
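
The suggestion at the end of the description could look roughly like the sketch below. Everything in it is an assumption for illustration: `FragmentMapperSketch`, its callbacks, and the "first live candidate" policy are hypothetical and are not the SQL engine's actual mapping logic. It only shows a mapper that keeps a set of live nodes updated from topology events and never picks a node outside that set.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical mapper that only assigns query fragments to nodes known to be alive. */
class FragmentMapperSketch {
    // Updated from (hypothetical) logical topology callbacks.
    private final Set<String> liveNodes = ConcurrentHashMap.newKeySet();

    void onNodeJoined(String nodeName) {
        liveNodes.add(nodeName);
    }

    void onNodeLeft(String nodeName) {
        liveNodes.remove(nodeName);
    }

    /**
     * Picks the first candidate that is still in the topology.
     * An empty result means that no candidate is alive and the caller has to retry or fail.
     */
    Optional<String> mapFragment(List<String> candidates) {
        return candidates.stream().filter(liveNodes::contains).findFirst();
    }
}
```

In the scenario from the description, once the mapper has seen the leave event for the third node, a fragment whose candidates include that node is mapped to one of the two remaining nodes instead.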



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
