[ https://issues.apache.org/jira/browse/SOLR-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936920#comment-17936920 ]
ASF subversion and git services commented on SOLR-17709:
--------------------------------------------------------

Commit e51dd47d88445259d57fc63dc655aecaafecf265 in solr's branch refs/heads/branch_9x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=e51dd47d884 ]

SOLR-17709: Fix race condition when checking distrib async cmd status (#3268)

(cherry picked from commit d0d4f280b6410d8996fa998620d9b6661848d1f0)

> Fix race condition when checking distrib async cmd status
> ---------------------------------------------------------
>
>                 Key: SOLR-17709
>                 URL: https://issues.apache.org/jira/browse/SOLR-17709
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{DistributedApiAsyncTracker}} mentioned that there could be a race
> condition between completing an asynchronous request and checking its status.
> This is causing very infrequent test failures, such as
> {{ReindexCollectionTest.testAbort}}.
>
> The solution is to check the ZK paths in the reverse of the order in which
> they are updated.
>
> When completing or canceling tasks, the paths are updated in the following
> order:
> # {{trackedAsyncTasks.put(asyncId, ...)}} or
> {{trackedAsyncTasks.remove(asyncId)}}
> # {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}
>
> Therefore in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check
> {{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can
> get a false-positive "Submitted" or "Running" result (race condition
> described below). But that will just lead to the client checking again at a
> later time, and the next time they call, {{inFlightAsyncTasks}} will have
> been updated and we will get the actual response from
> {{trackedAsyncTasks}}.
>
> Before this PR, the race condition would give us a false-negative "Operation
> failed. Please resubmit" result (race condition described below). This would
> tell the client to try again, when in fact the task could have been
> successful. This false-negative is much worse than the false-positive
> described above.
>
> Race condition before this PR: (false-negative)
> # {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no
> response is found
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is not found
> ** Return a failure - Assume node died because {{inFlightAsyncTasks}}
> ephemeral node is gone
>
> Race condition after this PR: (false-positive)
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is found
> ** Return that the task is in progress
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
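
The two interleavings described in the issue can be reproduced deterministically with a small in-memory model. This is only a sketch under stated assumptions, not the code from the commit: the class `AsyncStatusSketch`, its method names, the `Runnable` hook, and the status strings are all hypothetical, and plain Java collections stand in for the ZK-backed {{trackedAsyncTasks}} and {{inFlightAsyncTasks}}. The hook runs between the two reads so that a concurrent completion can be simulated at exactly the point where the race occurs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of the status check; not Solr's actual implementation. */
class AsyncStatusSketch {
  // Stand-ins for the ZK-backed structures named in the issue.
  // Optional.empty() means "submitted, no response stored yet".
  final Map<String, Optional<String>> trackedAsyncTasks = new ConcurrentHashMap<>();
  final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

  void submit(String asyncId) {
    trackedAsyncTasks.put(asyncId, Optional.empty());
    inFlightAsyncTasks.add(asyncId);
  }

  // Completion updates the two structures in the order given in the issue.
  void completeStep1(String asyncId, String response) {
    trackedAsyncTasks.put(asyncId, Optional.of(response));
  }

  void completeStep2(String asyncId) {
    inFlightAsyncTasks.remove(asyncId);
  }

  // Old order: tracked first, then in-flight. betweenReads simulates a concurrent writer.
  String statusOldOrder(String asyncId, Runnable betweenReads) {
    Optional<String> tracked = trackedAsyncTasks.get(asyncId);
    if (tracked == null) return "NOT_FOUND";
    if (tracked.isPresent()) return "COMPLETED: " + tracked.get();
    betweenReads.run();
    if (inFlightAsyncTasks.contains(asyncId)) return "RUNNING";
    return "FAILED"; // in-flight marker gone and no response seen: assume the node died
  }

  // New order: in-flight first, then tracked (the reverse of the update order).
  String statusNewOrder(String asyncId, Runnable betweenReads) {
    if (inFlightAsyncTasks.contains(asyncId)) return "RUNNING";
    betweenReads.run();
    Optional<String> tracked = trackedAsyncTasks.get(asyncId);
    if (tracked == null) return "NOT_FOUND";
    if (tracked.isPresent()) return "COMPLETED: " + tracked.get();
    return "FAILED";
  }

  public static void main(String[] args) {
    // False negative: the task completes between the two reads of the old order.
    AsyncStatusSketch before = new AsyncStatusSketch();
    before.submit("a1");
    System.out.println(before.statusOldOrder("a1", () -> {
      before.completeStep1("a1", "ok");
      before.completeStep2("a1");
    }));

    // False positive: the status check lands between the two completion steps.
    AsyncStatusSketch after = new AsyncStatusSketch();
    after.submit("a2");
    after.completeStep1("a2", "ok");
    System.out.println(after.statusNewOrder("a2", () -> {}));
    after.completeStep2("a2");
    System.out.println(after.statusNewOrder("a2", () -> {}));
  }
}
```

In this model the old order reports a failure for a task that actually succeeded, while the new order at worst reports "RUNNING" once and returns the stored response on the next poll. The new order is safe because the reader only reaches {{trackedAsyncTasks}} after seeing that the in-flight marker is gone, and the writer removes that marker only after the response has been stored.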