[ 
https://issues.apache.org/jira/browse/SOLR-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936920#comment-17936920
 ] 

ASF subversion and git services commented on SOLR-17709:
--------------------------------------------------------

Commit e51dd47d88445259d57fc63dc655aecaafecf265 in solr's branch 
refs/heads/branch_9x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=e51dd47d884 ]

SOLR-17709: Fix race condition when checking distrib async cmd status (#3268)

(cherry picked from commit d0d4f280b6410d8996fa998620d9b6661848d1f0)


> Fix race condition when checking distrib async cmd status
> ---------------------------------------------------------
>
>                 Key: SOLR-17709
>                 URL: https://issues.apache.org/jira/browse/SOLR-17709
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{DistributedApiAsyncTracker}} mentions that there can be a race 
> condition between completing an asynchronous request and checking its status. 
> This causes very infrequent test failures, such as 
> {{ReindexCollectionTest.testAbort}}.
> The fix is to check the ZK paths in the reverse of the order in which they 
> are updated.
> When a task is completed or canceled, the paths are updated in the following 
> order:
>  # {{trackedAsyncTasks.put(asyncId, ...)}} or 
> {{trackedAsyncTasks.remove(asyncId)}}
>  # {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}
> Therefore, {{getAsyncTaskRequestStatus(asyncId)}} needs to check 
> {{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can 
> get a false-positive "Submitted" or "Running" result (see the race condition 
> described below). But that only leads the client to check again at a 
> later time, and by the next call {{inFlightAsyncTasks}} will have 
> been updated and we will get the actual response from 
> {{trackedAsyncTasks}}.
> Before this PR, the race condition would give us a false-negative "Operation 
> failed. Please resubmit" result (also described below). This would 
> tell the client to resubmit when in fact the task could have 
> succeeded. This false negative is much worse than the false positive 
> described above.
> Race condition before this PR (false negative):
>  # {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no 
> response is found
>  # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response 
> is put into ZK
>  # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId 
> is deleted from ZK
>  # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked -- 
> asyncId is not found
>  ** Return a failure - Assume node died because {{inFlightAsyncTasks}} 
> ephemeral node is gone
> Race condition after this PR (false positive):
>  # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response 
> is put into ZK
>  # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked -- 
> asyncId is found
>  ** Return that the task is in progress
>  # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId 
> is deleted from ZK
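
The two interleavings above can be replayed deterministically with a small in-memory model. This is only a sketch, not the Solr implementation: the class AsyncStatusOrderSketch, its methods, the Status enum, and the use of an empty string for "no response yet" are all hypothetical stand-ins for the ZK-backed trackedAsyncTasks and inFlightAsyncTasks structures. The "between" hook lets the completing side run between the two reads of a status check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these names come from Solr.
final class AsyncStatusOrderSketch {
    enum Status { NOT_FOUND, RUNNING, COMPLETED, FAILED }

    // Stand-in for trackedAsyncTasks: asyncId -> response ("" until one is stored).
    private final Map<String, String> tracked = new HashMap<>();
    // Stand-in for inFlightAsyncTasks: ids whose in-flight marker still exists.
    private final Set<String> inFlight = new HashSet<>();

    void submit(String id) { tracked.put(id, ""); inFlight.add(id); }
    // Completion step 1: store the response.
    void putResponse(String id, String response) { tracked.put(id, response); }
    // Completion step 2: delete the in-flight marker.
    void deleteInFlight(String id) { inFlight.remove(id); }

    // Old read order: tracked first, then in-flight.
    Status checkTrackedFirst(String id, Runnable between) {
        String response = tracked.get(id);
        between.run(); // the completing side may run here
        boolean running = inFlight.contains(id);
        if (response == null) return Status.NOT_FOUND;
        if (!response.isEmpty()) return Status.COMPLETED;
        return running ? Status.RUNNING : Status.FAILED;
    }

    // New read order: in-flight first, then tracked (reverse of the update order).
    Status checkInFlightFirst(String id, Runnable between) {
        boolean running = inFlight.contains(id);
        between.run(); // the completing side may run here
        String response = tracked.get(id);
        if (response == null) return Status.NOT_FOUND;
        if (running) return Status.RUNNING;
        return response.isEmpty() ? Status.FAILED : Status.COMPLETED;
    }

    // Both completion steps land between the two reads of the old check.
    static Status raceBeforeFix() {
        AsyncStatusOrderSketch s = new AsyncStatusOrderSketch();
        s.submit("a1");
        return s.checkTrackedFirst("a1", () -> {
            s.putResponse("a1", "ok");
            s.deleteInFlight("a1");
        });
    }

    // The response is stored, the check runs, then the marker is deleted;
    // a second poll follows once completion has finished.
    static Status[] raceAfterFix() {
        AsyncStatusOrderSketch s = new AsyncStatusOrderSketch();
        s.submit("a1");
        s.putResponse("a1", "ok");
        Status first = s.checkInFlightFirst("a1", () -> s.deleteInFlight("a1"));
        Status second = s.checkInFlightFirst("a1", () -> { });
        return new Status[] { first, second };
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + raceBeforeFix());
        System.out.println("after fix:  " + Arrays.toString(raceAfterFix()));
    }
}
```

In this model the old read order reports FAILED for a task that actually succeeded, while the new order reports RUNNING once and COMPLETED on the next poll, which matches the false-negative and false-positive cases described in the issue.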



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
