Pierre Salagnac created SOLR-17754:
--------------------------------------

             Summary: Race condition in overseer main loop leaves 
collection/shard/replica locked
                 Key: SOLR-17754
                 URL: https://issues.apache.org/jira/browse/SOLR-17754
             Project: Solr
          Issue Type: Task
    Affects Versions: 9.8
            Reporter: Pierre Salagnac


We hit a rare race condition in the overseer main loop, and it led to a fully 
stuck overseer. All operations submitted to the collection API were then never 
dequeued and never processed until the overseer was restarted.

For the race condition to occur, at least 100 concurrent tasks must be running 
in the collection API, so the bug reproduces only when the overseer is under 
high load.

 

 *Note on overseer threading*

The overseer main loop in class {{OverseerTaskProcessor}} processes tasks that 
were previously enqueued to the overseer distributed queue (stored in 
Zookeeper). The class follows a producer/consumer pattern (a simplified sketch 
is given after the list):
 - a single thread dequeues tasks from the distributed queue (referred to below 
as the _main loop_ thread)

 - before submitting a task, the _main loop_ thread tries to lock the 
collection/shard/replica (depending on the task type). If the lock is already 
held because another task is running on the same resource, the task is not 
submitted and will be reconsidered in a later iteration.

 - a pool of up to 100 threads then executes the tasks (referred to below as 
the _runner_ threads).

 - at the end of the task, the _runner_ thread unlocks the 
collection/shard/replica.
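For readers less familiar with this class, here is a heavily simplified sketch 
of the pattern above. It is not the actual {{OverseerTaskProcessor}} code: 
{{Task}}, {{DistributedQueue}}, {{LockTree}} and {{Lock}} are placeholder types 
standing in for the real Solr classes.
{code:java}
import java.util.concurrent.*;

// Heavily simplified sketch of the overseer producer/consumer pattern.
// Task, DistributedQueue, LockTree and Lock are placeholders, not the real Solr classes.
class OverseerLoopSketch {
  interface Task { void run() throws Exception; }
  interface DistributedQueue { Task peek(); }          // backed by Zookeeper in the real code
  interface LockTree { Lock tryLock(Task t); }         // returns null if already locked
  interface Lock { void unlock(); }

  void mainLoop(DistributedQueue queue, LockTree locks) {
    // Up to 100 runner threads; a SynchronousQueue never buffers submitted tasks.
    ExecutorService runnerPool =
        new ThreadPoolExecutor(0, 100, 60, TimeUnit.SECONDS, new SynchronousQueue<>());

    while (true) {
      Task task = queue.peek();                        // main loop thread dequeues a task
      if (task == null) continue;

      Lock lock = locks.tryLock(task);                 // lock collection/shard/replica
      if (lock == null) continue;                      // already locked: retry in a later iteration

      runnerPool.execute(() -> {                       // a runner thread executes the task
        try {
          task.run();
        } catch (Exception ignored) {
          // error handling omitted in this sketch
        } finally {
          lock.unlock();                               // runner thread releases the lock at the end
        }
      });
    }
  }
}
{code}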

 
Since the pool uses a {{SynchronousQueue}}, it is not possible to submit more 
than 100 concurrent tasks to the pool; any additional task is rejected by the 
pool and an exception is raised (see the exception below). To avoid this 
exception, the _main loop_ thread makes sure it never submits more than 100 
concurrent tasks by maintaining the IDs of the currently running tasks in an 
in-memory collection. Each time a task is submitted, its ID is added to the 
{{runningTasks}} collection. Once the _runner_ thread has completed the task 
(including in case of failure), it removes the task ID from this collection. 
Before submitting a task to the _runner_ pool, the _main loop_ thread checks 
the size of {{runningTasks}}: if there are already 100 entries, the task is not 
submitted but added to another collection ({{blockedTasks}}) so it can be 
submitted to the pool in a later iteration, once the size of {{runningTasks}} 
drops below 100 (full mechanism not detailed here). A simplified sketch of this 
bookkeeping is shown below.
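The sketch below illustrates this bookkeeping; the class, method and parameter 
names are hypothetical, and only {{runningTasks}} and {{blockedTasks}} match 
the real field names. It is a simplified illustration, not the actual Solr 
code.
{code:java}
import java.util.*;
import java.util.concurrent.*;

// Sketch of the bookkeeping described above (not the actual Solr code).
class RunningTasksSketch {
  final Set<String> runningTasks = ConcurrentHashMap.newKeySet(); // IDs of currently submitted tasks
  final List<Runnable> blockedTasks = new ArrayList<>();          // tasks deferred to a later iteration

  void maybeSubmit(String taskId, Runnable task, Runnable unlock, ExecutorService runnerPool) {
    if (runningTasks.size() >= 100) {    // pool considered full
      blockedTasks.add(task);            // reconsider in a later iteration
      return;
    }
    runningTasks.add(taskId);
    runnerPool.execute(() -> {
      try {
        task.run();
      } finally {
        runningTasks.remove(taskId);     // removed on completion, including on failure
        unlock.run();                    // release the collection/shard/replica lock
      }
    });
  }
}
{code}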
 

*Root cause: the race condition*

The bug is that the set of IDs in the {{runningTasks}} collection is not fully 
aligned with the set of _runner_ threads currently executing a task. Right 
after a _runner_ thread removes its task ID, the thread is still running and 
has not yet returned to the pool. But since the task ID is no longer in the 
collection, the _main loop_ thread believes there is room available to submit a 
new task to the pool. There is therefore a time window during which the _main 
loop_ thread may submit a 101st task to the pool, and the following exception 
is raised:
{code:java}
java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
        at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
        at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
{code}
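The rejection itself is easy to reproduce in isolation. The following 
standalone snippet (not Solr code; class name is hypothetical) shows a 
{{ThreadPoolExecutor}} backed by a {{SynchronousQueue}} rejecting the 101st 
concurrent task, which is the failure mode in the stack trace above.
{code:java}
import java.util.concurrent.*;

// Standalone demo (not Solr code): with a SynchronousQueue and a maximum pool
// size of 100, the 101st concurrent submission is rejected.
public class SynchronousQueueRejectionDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool =
        new ThreadPoolExecutor(0, 100, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
    CountDownLatch release = new CountDownLatch(1);

    for (int i = 0; i < 100; i++) {            // occupy all 100 pool threads
      pool.execute(() -> {
        try {
          release.await();
        } catch (InterruptedException ignored) {
        }
      });
    }
    try {
      pool.execute(() -> {});                  // 101st concurrent task
    } catch (RejectedExecutionException e) {
      System.out.println("Rejected: " + e);    // same failure mode as the stack trace above
    }
    release.countDown();                       // let the blocked tasks finish
    pool.shutdown();
  }
}
{code}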
 

*Impact*

Once we hit the above exception, the lock on the collection/shard/replica is 
never released (it was acquired before submitting the task to the _runner_ 
thread pool). This means that all future tasks that need to lock the same 
collection/shard/replica will not be processed and will remain in the Zookeeper 
queue forever, until the overseer is restarted.

 
Another impact is that the overseer never reads more than 1000 tasks from the 
queue. Consequently, if more than 1000 tasks are blocked by the unreleased 
lock, later tasks will not be read (and therefore not processed) even if they 
do not require the same lock. Once 1000 tasks are waiting for the lock, the 
overseer is fully blocked. It may take a long time between the initial race 
condition and a fully blocked overseer (days or weeks, depending on the number 
of submitted API calls).


