[jira] [Commented] (IGNITE-19655) Distributed Sql keeps mapping query fragments to a node that has already left

Roman Puchkovskiy (Jira) Thu, 08 Jun 2023 07:48:06 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730596#comment-17730596
 ]


Roman Puchkovskiy commented on IGNITE-19655:
--------------------------------------------

The first weirdness was fixed in IGNITE-19685. To make the removal of an 
outdated node less confusing, I've adjusted the logging so that this case 
stands out from the normal 'node left' case.

> Distributed Sql keeps mapping query fragments to a node that has already left
> -----------------------------------------------------------------------------
>
>                 Key: IGNITE-19655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19655
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Maksim Zhuravkov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> There are two test failures: 
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7271211?expandCode+Inspection=true&expandBuildProblemsSection=true&hideProblemsFromDependencies=false&expandBuildTestsSection=true&hideTestsFromDependencies=false]
>  and 
> [https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_RunAllTests/7272905?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandCode+Inspection=true&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true]
>  
> (org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.entriesKeepAppendedAfterSnapshotInstallation
>  and 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.snapshotInstallTimeoutDoesNotBreakSubsequentInstallsWhenSecondAttemptIsIdenticalToFirst,
>  respectively).
> In both cases, the test code creates a table with 3 replicas on a cluster of 
> 3 nodes, then stops the last node and tries to perform an insert via one of 
> the 2 remaining nodes. The RAFT majority (2 of 3) is still preserved, so the 
> insert should succeed. The insert might be issued before the remaining nodes 
> learn that the third node has left, so a retry mechanism is in place: it 
> makes up to 5 attempts over almost 8 seconds in total.
> But in both failed runs, each of the 5 attempts failed because a fragment of 
> the INSERT query was mapped to the missing node. This seems to be bad luck 
> (the tests pass most of the time; the failure rate is about 2.5%), but 
> either way, the SQL engine does not seem to take into account that the node 
> has already left.
> The SQL engine should probably track Logical Topology events and avoid 
> mapping query fragments to nodes that have left.
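
To illustrate the suggestion in the description, here is a minimal sketch of 
a mapper that tracks topology events and filters out nodes that have left 
before fragments are mapped. It is not the Ignite code: the class 
TopologyAwareMapper, its methods onNodeJoined, onNodeLeft and mappableNodes, 
and the node names are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not the actual Ignite SQL engine API. */
class TopologyAwareMapper {
    /** Names of nodes currently believed to be in the logical topology. */
    private final Set<String> liveNodes = ConcurrentHashMap.newKeySet();

    /** Assumed to be invoked when a node joins the logical topology. */
    void onNodeJoined(String nodeName) {
        liveNodes.add(nodeName);
    }

    /** Assumed to be invoked when a node leaves the logical topology. */
    void onNodeLeft(String nodeName) {
        liveNodes.remove(nodeName);
    }

    /**
     * Returns the candidates that are still in the topology, in their
     * original order, so fragments are never mapped to a departed node.
     */
    List<String> mappableNodes(List<String> candidates) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (liveNodes.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With three nodes joined and the third one reported as left, mappableNodes 
would return only the first two, so a retry would not keep hitting the 
missing node once the 'node left' event has been processed.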



