Houston Putman created SOLR-17709:
-------------------------------------

             Summary: Fix race condition when checking distrib async cmd status
                 Key: SOLR-17709
                 URL: https://issues.apache.org/jira/browse/SOLR-17709
             Project: Solr
          Issue Type: Bug
            Reporter: Houston Putman
            Assignee: Houston Putman


The {{DistributedApiAsyncTracker}} mentions that there can be a race
condition between completing an asynchronous request and checking its status.
This race causes very infrequent test failures, such as in
{{ReindexCollectionTest.testAbort}}.

The solution is to check the ZK paths in the reverse of the order in which
they are updated.

When a task is completed or canceled, the paths are updated in the following order:
 # {{trackedAsyncTasks.put(asyncId, ...)}} or 
{{trackedAsyncTasks.remove(asyncId)}}
 # {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}

Therefore, in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check
{{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can get
a false-positive "Submitted" or "Running" result (race condition described
below). That only leads to the client checking again at a later time. By the
next call, {{inFlightAsyncTasks}} will have been updated, and we will get
the actual response from {{trackedAsyncTasks}}.
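
To illustrate the read order described above, here is a minimal in-memory sketch. It is not the actual Solr implementation: the class name {{AsyncStatusSketch}}, the {{State}} enum, {{submit}}, {{simulateNodeDeath}}, and the use of plain Java collections in place of the ZK-backed structures are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, in-memory stand-in for the tracker. Hypothetical: the real class
// keeps both structures in ZooKeeper and has a richer API.
class AsyncStatusSketch {
    enum State { RUNNING, COMPLETED, FAILED, NOT_FOUND }

    // asyncId -> response; Optional.empty() means "no response stored yet".
    private final Map<String, Optional<String>> trackedAsyncTasks = new ConcurrentHashMap<>();
    // asyncIds whose in-flight (ephemeral) marker still exists.
    private final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

    void submit(String asyncId) {
        trackedAsyncTasks.put(asyncId, Optional.empty());
        inFlightAsyncTasks.add(asyncId);
    }

    // Same update order as in the issue: tracked entry first, in-flight marker second.
    void setTaskCompleted(String asyncId, String response) {
        trackedAsyncTasks.put(asyncId, Optional.of(response));
        inFlightAsyncTasks.remove(asyncId);
    }

    // Simulates the owning node dying: only the ephemeral marker disappears.
    void simulateNodeDeath(String asyncId) {
        inFlightAsyncTasks.remove(asyncId);
    }

    // Reads in reverse of the update order: in-flight marker first, tracked entry second.
    State getAsyncTaskRequestStatus(String asyncId) {
        if (inFlightAsyncTasks.contains(asyncId)) {
            return State.RUNNING; // may be slightly stale; the client simply polls again
        }
        Optional<String> response = trackedAsyncTasks.get(asyncId);
        if (response == null) {
            return State.NOT_FOUND;
        }
        // Marker gone and still no response stored: the task did not finish.
        return response.isPresent() ? State.COMPLETED : State.FAILED;
    }
}
```

In this sketch, once the reader sees that the in-flight marker is gone, any response written by {{setTaskCompleted}} is already in the tracked map, so a completed task cannot be reported as failed.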

Before this PR, the race condition would give us a false-negative "Operation
failed. Please resubmit" result (race condition described below). This would
tell the client to try again, when in fact the task could have been successful.
This false-negative is much worse than the false-positive described above.

Race condition before this PR (false-negative):
 # {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no
response is found
 # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is
put into ZK
 # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is
deleted from ZK
 # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
asyncId is not found
 ** Return a failure - Assume node died because {{inFlightAsyncTasks}}
ephemeral node is gone

Race condition after this PR (false-positive):
 # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is
put into ZK
 # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
asyncId is found
 ** Return that the task is in progress
 # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is
deleted from ZK
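
The two interleavings above can be replayed deterministically in a single thread. This is a rough sketch under stated assumptions rather than a reproduction of the real test failure: {{RaceReplaySketch}}, the {{PENDING}} placeholder, the id {{id-1}}, and the string results are hypothetical, and plain Java collections stand in for the ZK paths.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Single-threaded replay of the two interleavings. Names and result strings are
// illustrative stand-ins, not Solr's actual API.
class RaceReplaySketch {
    static final String PENDING = "PENDING"; // placeholder stored before a response exists

    // Old read order: tracked entry is read before the writer runs,
    // the in-flight marker is read after the writer has finished.
    static String replayOldOrder() {
        Map<String, String> tracked = new HashMap<>();
        Set<String> inFlight = new HashSet<>();
        tracked.put("id-1", PENDING);
        inFlight.add("id-1");

        String seen = tracked.get("id-1");                  // reader: no response found
        tracked.put("id-1", "success");                     // writer: response stored
        inFlight.remove("id-1");                            // writer: marker deleted
        boolean stillInFlight = inFlight.contains("id-1");  // reader: marker not found

        if (!PENDING.equals(seen)) {
            return "COMPLETED";
        }
        return stillInFlight ? "RUNNING" : "FAILED";
    }

    // New read order: in-flight marker first, tracked entry second.
    static String checkNewOrder(Map<String, String> tracked, Set<String> inFlight, String id) {
        if (inFlight.contains(id)) {
            return "RUNNING";
        }
        String value = tracked.get(id);
        if (value == null) {
            return "NOT_FOUND";
        }
        return PENDING.equals(value) ? "FAILED" : "COMPLETED";
    }

    // The reader polls once between the two writer steps and once afterwards.
    static List<String> replayNewOrder() {
        Map<String, String> tracked = new HashMap<>();
        Set<String> inFlight = new HashSet<>();
        tracked.put("id-1", PENDING);
        inFlight.add("id-1");

        tracked.put("id-1", "success");                              // writer: response stored
        String firstPoll = checkNewOrder(tracked, inFlight, "id-1"); // reader: marker found
        inFlight.remove("id-1");                                     // writer: marker deleted
        String secondPoll = checkNewOrder(tracked, inFlight, "id-1");
        return List.of(firstPoll, secondPoll);
    }
}
```

In this replay, {{replayOldOrder()}} reports a failure even though the response was stored, while {{replayNewOrder()}} reports the task as running on the first poll and as completed on the second.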



