[ 
https://issues.apache.org/jira/browse/IGNITE-25137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vadim Pakhnushev updated IGNITE-25137:
--------------------------------------
    Description: 
Even after IGNITE-24910 is fixed, the test still fails.

This happens when a remote compute job is submitted from node1 to node2 and 
the cancel query is executed on node3. In this case the {{ExecutionManager}} on 
node3 doesn't contain the execution, so {{ComputeComponentImpl}} broadcasts the 
cancel request to node1 and node2.
node1 holds a {{RemoteJobExecution}}, which doesn't participate in the 
cancellation after IGNITE-24910.
In addition, node1 holds a {{FailSafeJobExecution}}, which sends a message 
to node2.
node2 holds a {{DelegatingJobExecution}}.
One of the requests succeeds and cancels the job, returning {{true}}; the other 
returns {{false}}.
The broadcast method in {{ComputeMessaging}} completes the result future with 
the first response received, which may be {{false}}. When the {{true}} 
response arrives, the future is already complete, so the result of the 
cancel on node3 is {{false}}.
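
The first-response-wins behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the actual {{ComputeMessaging}} code: the class and method names are hypothetical, and the two cancel responses are modelled as a list in arrival order.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the broadcast; not the real ComputeMessaging API.
final class BroadcastCancelSketch {
    /** Completes the result future with whichever response is delivered first. */
    static boolean firstResponseWins(List<Boolean> responsesInArrivalOrder) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        for (Boolean response : responsesInArrivalOrder) {
            // complete() has no effect once the future is already completed,
            // so every response after the first one is silently dropped.
            result.complete(response);
        }
        return result.join();
    }

    public static void main(String[] args) {
        // The unsuccessful response (false) arrives before the successful one (true).
        System.out.println(firstResponseWins(List.of(false, true))); // prints false
        // Same two responses in the opposite order.
        System.out.println(firstResponseWins(List.of(true, false))); // prints true
    }
}
```

The job is cancelled in both runs; only the reported result differs with arrival order, which matches the flakiness of the test.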

The solution is to assign a unique job ID to the {{FailSafeJobExecution}} 
rather than copying it from the underlying job.
This ensures that only the locally running {{DelegatingJobExecution}} is 
cancelled, and that it is cancelled only once.
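
As a rough illustration of why a distinct ID helps, the sketch below models each node as a local registry keyed by job ID. Everything here is an assumption made for illustration: the names are hypothetical, the {{FailSafeJobExecution}} wrapper is reduced to a second registry entry that forwards to the same job, and a node with no entry for the ID is assumed to contribute no response.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model; none of these types exist in Ignite under these names.
final class UniqueJobIdSketch {
    /** Stand-in for the job running on node2: only the first cancel succeeds. */
    static final class Job {
        private final AtomicBoolean cancelled = new AtomicBoolean();

        boolean cancel() {
            return cancelled.compareAndSet(false, true);
        }
    }

    /** Delivers a cancel for jobId to every node's registry and collects the responses. */
    static List<Boolean> broadcastCancel(UUID jobId, List<Map<UUID, Job>> nodes) {
        List<Boolean> responses = new ArrayList<>();
        for (Map<UUID, Job> registry : nodes) {
            Job match = registry.get(jobId);
            if (match != null) {
                responses.add(match.cancel());
            }
        }
        return responses;
    }

    /** Runs one broadcast; the flag selects whether node1's wrapper reuses the job's ID. */
    static List<Boolean> run(boolean wrapperCopiesJobId) {
        Job job = new Job();
        UUID jobId = UUID.randomUUID();
        Map<UUID, Job> node1 = new HashMap<>();
        Map<UUID, Job> node2 = new HashMap<>();
        node2.put(jobId, job);
        // node1's wrapper forwards to the same underlying job.
        node1.put(wrapperCopiesJobId ? jobId : UUID.randomUUID(), job);
        return broadcastCancel(jobId, List.of(node1, node2));
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints [true, false]
        System.out.println(run(false)); // prints [true]
    }
}
```

With a copied ID the same job is reached twice and the second attempt reports {{false}}; with a unique ID only node2's entry matches, so there is a single {{true}} response and nothing to race against.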

  was:Even after IGNITE-24910 is fixed, the test still fails sometimes.


> ItSqlKillCommandTest#killComputeJobFromRemote is flaky
> ------------------------------------------------------
>
>                 Key: IGNITE-25137
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25137
>             Project: Ignite
>          Issue Type: Bug
>          Components: compute
>            Reporter: Vadim Pakhnushev
>            Assignee: Vadim Pakhnushev
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
