Houston Putman created SOLR-17709:
-------------------------------------

             Summary: Fix race condition when checking distrib async cmd status
                 Key: SOLR-17709
                 URL: https://issues.apache.org/jira/browse/SOLR-17709
             Project: Solr
          Issue Type: Bug
            Reporter: Houston Putman
            Assignee: Houston Putman

The {{DistributedApiAsyncTracker}} mentions that there can be a race condition between completing an asynchronous request and checking its status. This is causing very infrequent test failures, such as {{ReindexCollectionTest.testAbort}}.

The solution is to check the ZK paths in the reverse of the order in which they are updated. When completing or canceling a task, they are updated in the following order:

# {{trackedAsyncTasks.put(asyncId, ...)}} or {{trackedAsyncTasks.remove(asyncId)}}
# {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}

Therefore, in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check {{inFlightAsyncTasks}} before {{trackedAsyncTasks}}.

This means we can get a false-positive "Submitted" or "Running" result (race condition described below). But that will just lead to the client checking again at a later time, and by the next call {{inFlightAsyncTasks}} will have been updated and we will get the actual response from {{trackedAsyncTasks}}.

Before this PR, the race condition would give us a false-negative "Operation failed. Please resubmit" result (race condition described below). This would tell the client to try again, when in fact the task could have been successful. This false negative is much worse than the false positive described above.

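The effect of the read order can be sketched with a small in-memory model. This is only an illustration, not the actual {{DistributedApiAsyncTracker}} code: the class {{AsyncStatusOrderSketch}}, the methods {{submit}}, {{putResponse}}, {{statusOldOrder}} and {{statusNewOrder}}, the status strings, and the {{betweenReads}} hook are all hypothetical, and a plain map and set stand in for the two ZK-backed structures. The hook runs between the two reads so that a specific interleaving can be replayed deterministically.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model; the real tracker stores this state in ZooKeeper.
final class AsyncStatusOrderSketch {
    // Stand-in for trackedAsyncTasks: asyncId -> response.
    final Map<String, String> trackedAsyncTasks = new ConcurrentHashMap<>();
    // Stand-in for inFlightAsyncTasks: ids of tasks still being worked on.
    final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

    void submit(String asyncId) {
        inFlightAsyncTasks.add(asyncId);
    }

    // Completion, step 1: the response is stored.
    void putResponse(String asyncId, String response) {
        trackedAsyncTasks.put(asyncId, response);
    }

    // Completion, step 2: the in-flight marker is deleted.
    void deleteInFlightTask(String asyncId) {
        inFlightAsyncTasks.remove(asyncId);
    }

    // Both completion steps, in the update order described in the issue.
    void setTaskCompleted(String asyncId, String response) {
        putResponse(asyncId, response);
        deleteInFlightTask(asyncId);
    }

    // Old read order: trackedAsyncTasks first, then inFlightAsyncTasks.
    String statusOldOrder(String asyncId, Runnable betweenReads) {
        String response = trackedAsyncTasks.get(asyncId);
        betweenReads.run(); // simulated concurrent activity
        boolean inFlight = inFlightAsyncTasks.contains(asyncId);
        if (response != null) {
            return "COMPLETED: " + response;
        }
        return inFlight ? "RUNNING" : "FAILED";
    }

    // New read order: inFlightAsyncTasks first, then trackedAsyncTasks.
    String statusNewOrder(String asyncId, Runnable betweenReads) {
        boolean inFlight = inFlightAsyncTasks.contains(asyncId);
        betweenReads.run(); // simulated concurrent activity
        if (inFlight) {
            return "RUNNING";
        }
        String response = trackedAsyncTasks.get(asyncId);
        return response != null ? "COMPLETED: " + response : "FAILED";
    }

    public static void main(String[] args) {
        // Old order: the task completes between the two reads.
        AsyncStatusOrderSketch before = new AsyncStatusOrderSketch();
        before.submit("a1");
        System.out.println(
            before.statusOldOrder("a1", () -> before.setTaskCompleted("a1", "ok")));

        // New order: the response is stored, the status is read, then the marker is deleted.
        AsyncStatusOrderSketch after = new AsyncStatusOrderSketch();
        after.submit("a2");
        after.putResponse("a2", "ok");
        System.out.println(
            after.statusNewOrder("a2", () -> after.deleteInFlightTask("a2")));
        // A later poll sees the stored response.
        System.out.println(after.statusNewOrder("a2", () -> {}));
    }
}
```

In this model the old order reports {{FAILED}} for a task that actually succeeded, while the new order reports {{RUNNING}} once and then the stored response on the next poll. With the new order, a missing in-flight marker is only observed after the response was written, so a completed task cannot be misreported as failed in this simplified setting.
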
Race condition before this PR (false negative):

# {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no response is found
# {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is put into ZK
# {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is deleted from ZK
# {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked -- asyncId is not found
** Return a failure - assume the node died because the {{inFlightAsyncTasks}} ephemeral node is gone

Race condition after this PR (false positive):

# {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is put into ZK
# {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked -- asyncId is found
** Return that the task is in progress
# {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is deleted from ZK