[ https://issues.apache.org/jira/browse/IGNITE-25137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vadim Pakhnushev updated IGNITE-25137:
--------------------------------------
    Description: 
Even after IGNITE-24910 is fixed, the test still fails.

This happens when a remote compute job is submitted from node1 to node2 and the cancel query is executed on node3. In this case the {{ExecutionManager}} on node3 doesn't contain the execution, so the {{ComputeComponentImpl}} broadcasts the cancel request to node1 and node2.

node1 holds a {{RemoteJobExecution}}, which doesn't participate in the cancellation after IGNITE-24910.
node1 also holds a {{FailSafeJobExecution}}, which sends a message to node2.
node2 holds a {{DelegatingJobExecution}}.

One of the requests succeeds and cancels the job, returning {{true}}; the other returns {{false}}.
The broadcast method in {{ComputeMessaging}} completes the result future with the first response received, which may be {{false}}. When the {{true}} response arrives, the future is already complete, so the result of the cancel on node3 is {{false}}.

The solution is to assign a unique job id to the {{FailSafeJobExecution}} rather than copying it from the underlying job.
This ensures that only a locally running {{DelegatingJobExecution}} is cancelled, and that it is cancelled only once.

  was:Even after IGNITE-24910 is fixed, the test still fails sometimes.

> ItSqlKillCommandTest#killComputeJobFromRemote is flaky
> ------------------------------------------------------
>
>                 Key: IGNITE-25137
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25137
>             Project: Ignite
>          Issue Type: Bug
>          Components: compute
>            Reporter: Vadim Pakhnushev
>            Assignee: Vadim Pakhnushev
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
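
The race in the description can be illustrated with a minimal Java sketch. It is not the Ignite code: {{CancelBroadcastSketch}}, {{firstResponseWins}} and {{matching}} are hypothetical names, and the sketch only assumes that the broadcast result behaves like a {{CompletableFuture}} completed by whichever response arrives first, and that each node looks up executions by job id.

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

public class CancelBroadcastSketch {
    /**
     * Hypothetical model of a "first response wins" broadcast:
     * the result future takes the first delivered response and ignores the rest.
     */
    static boolean firstResponseWins(List<Boolean> responsesInArrivalOrder) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        for (Boolean response : responsesInArrivalOrder) {
            // complete() has no effect once the future is already complete.
            result.complete(response);
        }
        // Falls back to false if no response was delivered at all.
        return result.getNow(false);
    }

    /** Counts executions registered under the requested job id (hypothetical registry model). */
    static long matching(UUID requested, List<UUID> registeredIds) {
        return registeredIds.stream().filter(requested::equals).count();
    }

    public static void main(String[] args) {
        // The false response arrives first, so the later true response is lost.
        System.out.println(firstResponseWins(List.of(false, true))); // false
        System.out.println(firstResponseWins(List.of(true, false))); // true

        UUID jobId = UUID.randomUUID();
        // Id copied from the underlying job: two executions answer to the same id.
        System.out.println(matching(jobId, List.of(jobId, jobId))); // 2
        // Unique id for the wrapper: only the locally running execution matches.
        System.out.println(matching(jobId, List.of(UUID.randomUUID(), jobId))); // 1
    }
}
```

With a copied id, two executions respond to the same cancel request and the result depends on arrival order. With a unique id for the wrapper, only one execution matches, so there is a single response to wait for.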