[ https://issues.apache.org/jira/browse/IGNITE-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730155#comment-17730155 ]

Roman Puchkovskiy commented on IGNITE-19655:
--------------------------------------------

There are 2 weirdnesses:
# Why does it take 10 seconds for a node to be removed from the physical topology if it left gracefully?
# Why didn't the size of the topology shrink to 2 nodes when the node actually left the topology?

The second weirdness is harmless, it just looks weird in the logs. The thing is that the node re-joined before it was considered as left, so the 'Node left' was about an outdated version of the node (that was already replaced with the new one on the second join). Logically, it's all ok: the node was detected as missing too late, so there was no time window in which the cluster knew that the node was gone.

The first weirdness is really weird. I'm digging into this, but we should probably remove the node from the topology as soon as it says 'good bye' (LEAVING event), as otherwise the suspicion check mechanism is used for it.

> Distributed Sql keeps mapping query fragments to a node that has already left
> -----------------------------------------------------------------------------
>
>                 Key: IGNITE-19655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19655
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Maksim Zhuravkov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> There are two test failures:
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7271211?expandCode+Inspection=true&expandBuildProblemsSection=true&hideProblemsFromDependencies=false&expandBuildTestsSection=true&hideTestsFromDependencies=false]
> and
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7272905?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandCode+Inspection=true&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true]
>
> (org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.entriesKeepAppendedAfterSnapshotInstallation
> and
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.snapshotInstallTimeoutDoesNotBreakSubsequentInstallsWhenSecondAttemptIsIdenticalToFirst,
> respectively).
> In both cases, the test code creates a table with 3 replicas on a cluster of
> 3 nodes, then it stops the last node and tries to make an insert using one of
> the 2 remaining nodes. The RAFT majority (2 of 3) is still preserved, so the
> insert should succeed. It's understood that the insert might be issued before
> the remaining nodes understand that the third node has left, so we have a
> retry mechanism in place; it makes up to 5 attempts over almost 8 seconds (in
> total).
> But in both failed runs, each of the 5 attempts failed because a fragment of
> the INSERT query was mapped to the missing node. This seems to be bad luck
> (the tests pass most of the time; the fail rate is about 2.5%), but anyway:
> the SQL engine does not seem to care about the fact that the node has already
> left.
> Probably, the SQL engine should track the Logical Topology events and avoid
> mapping query fragments to the missing nodes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
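
For illustration only, here is a minimal sketch of the two ideas discussed in this thread: ignoring a 'Node left' event that refers to an outdated version of a node, and keeping query fragments away from nodes that are not currently alive. The class `LiveNodeTracker` and its methods are hypothetical names made up for this sketch; they are not the Ignite API, and the real topology and SQL mapping code may work quite differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
 * Hypothetical tracker of live nodes (an assumption for this sketch, not an Ignite class).
 * A node keeps a stable name across restarts and gets a new id on every start.
 */
class LiveNodeTracker {
    /** Node name -> id of the node version currently considered alive. */
    private final Map<String, UUID> liveByName = new HashMap<>();

    /** A (re)joined node version replaces any older version with the same name. */
    synchronized void onNodeJoined(String name, UUID versionId) {
        liveByName.put(name, versionId);
    }

    /**
     * Removes the node only if the event is about the version that is currently alive.
     * Returns false for a stale event about a version that was already replaced.
     */
    synchronized boolean onNodeLeft(String name, UUID versionId) {
        return liveByName.remove(name, versionId);
    }

    /** Number of nodes currently considered alive. */
    synchronized int size() {
        return liveByName.size();
    }

    /** Keeps only the candidates that are currently alive, preserving their order. */
    synchronized List<String> filterAlive(List<String> candidateNames) {
        List<String> alive = new ArrayList<>();
        for (String name : candidateNames) {
            if (liveByName.containsKey(name)) {
                alive.add(name);
            }
        }
        return alive;
    }
}
```

In this sketch, a late 'Node left' for the old version of a re-joined node leaves the topology size unchanged, which matches the harmless behaviour described in the comment. A mapper that calls `filterAlive` before choosing targets would skip a node once its departure has been recorded.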