Pierre Salagnac created SOLR-17754:
--------------------------------------

             Summary: Race condition in overseer main loop leaves collection/shard/replica locked
                 Key: SOLR-17754
                 URL: https://issues.apache.org/jira/browse/SOLR-17754
             Project: Solr
          Issue Type: Task
    Affects Versions: 9.8
            Reporter: Pierre Salagnac
We hit a rare race condition in the overseer main loop that led to a fully stuck overseer. Operations submitted to the collection API were no longer dequeued or processed until the overseer was restarted. Reproducing the race requires at least 100 concurrent collection API tasks, so the bug only shows up when the overseer is under high load.

*Note on overseer threading*

The overseer main loop in class {{OverseerTaskProcessor}} processes tasks that were previously enqueued to the overseer distributed queue (stored in Zookeeper). The class follows a producer/consumer pattern:
- A single thread dequeues tasks from the distributed queue (referred to below as the _main loop_ thread).
- Before submitting a task, the _main loop_ thread tries to lock the collection/shard/replica (depending on the task type). If the lock is already held because another task is running on the same resource, the task is not submitted and is reconsidered in a later iteration.
- A pool of up to 100 threads executes the tasks (referred to below as the _runner_ threads).
- At the end of a task, the _runner_ thread unlocks the collection/shard/replica.

Since the pool uses a {{SynchronousQueue}}, no more than 100 concurrent tasks may be submitted to the pool; otherwise the task is rejected by the pool and an exception is raised (see below). To avoid this exception, the _main loop_ thread makes sure it never has more than 100 tasks in flight by keeping the IDs of the currently running tasks in an in-memory collection: each time a task is submitted, its ID is added to the {{runningTasks}} collection, and once the _runner_ thread has completed the task (also in case of failure), it removes the task ID from that collection.

Before submitting a task to the _runner_ pool, the _main loop_ thread checks the size of {{runningTasks}}. If there are already 100 entries, the task is not submitted; it is instead added to another collection ({{blockedTasks}}) so it can be submitted to the pool in a later iteration, once the size of {{runningTasks}} drops below 100 (full mechanism not detailed here).

*Root cause race condition*

The bug is that the set of IDs in {{runningTasks}} is not fully aligned with the set of _runner_ threads currently executing a task. Right after a _runner_ thread removes its task ID, that thread is still running and has not yet returned to the pool. Since the task ID is no longer in the collection, the _main loop_ thread believes a slot is free, so there is a time window in which it may submit a 101st task to the pool. The following exception is then raised:
{code:java}
java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
	at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
{code}
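A minimal, self-contained sketch of the pattern described above (class and helper names are illustrative, not the actual {{OverseerTaskProcessor}} code): the task ID is removed inside the submitted {{Runnable}}, before the worker thread is handed back to the pool, so {{runningTasks.size()}} can drop below 100 while all 100 pool threads are still busy and the next {{execute()}} call is rejected.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class OverseerRaceSketch {

  static final int MAX_PARALLEL_TASKS = 100;

  // Mirrors the runner pool: no task queue, at most 100 threads,
  // execute() throws RejectedExecutionException when every thread is busy.
  static final ThreadPoolExecutor runnerPool =
      new ThreadPoolExecutor(0, MAX_PARALLEL_TASKS, 5, TimeUnit.SECONDS, new SynchronousQueue<>());

  // In-memory view of the tasks currently being executed (the runningTasks collection).
  static final Set<String> runningTasks = ConcurrentHashMap.newKeySet();

  public static void main(String[] args) {
    for (int i = 0; i < 100_000; i++) {
      String taskId = "task-" + i;

      // Main loop: only submit when we believe a runner slot is free.
      if (runningTasks.size() >= MAX_PARALLEL_TASKS) {
        continue; // the real code parks the task in blockedTasks instead of dropping it
      }
      runningTasks.add(taskId);
      try {
        runnerPool.execute(() -> {
          try {
            doWork(taskId);
          } finally {
            // The ID is removed *before* the Runnable returns and the worker thread
            // becomes available again. In that window runningTasks.size() < 100 while
            // all 100 threads are still busy, so the next execute() can be rejected.
            runningTasks.remove(taskId);
          }
        });
      } catch (RejectedExecutionException e) {
        runningTasks.remove(taskId);
        System.err.println("101st task rejected: " + taskId);
      }
    }
    runnerPool.shutdown();
  }

  static void doWork(String taskId) {
    try {
      Thread.sleep(1); // stand-in for a collection API task
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}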
*Impact*

Once we hit the above exception, the lock on the collection/shard/replica is never released (it was acquired before the task was submitted to the _runner_ thread pool). All future tasks that need to lock the same collection/shard/replica are therefore never processed and remain in the Zookeeper queue forever, until the overseer is restarted.

Another impact is that the overseer does not read more than 1000 tasks ahead from the queue. Consequently, once more than 1000 tasks are blocked behind the unreleased lock, later tasks are not read (and not processed) even if they do not require that lock. Past that point, the overseer is fully blocked.

It may take a long time between the initial race condition and a fully blocked overseer (days or weeks, depending on the volume of submitted API calls).
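A short sketch of why the lock leaks, under the sequence described in this issue: the lock is taken by the _main loop_ thread before submission and released by the _runner_ thread at the end of the task. The {{Semaphore}} and method names are only stand-ins here (Solr uses its own locking structure), chosen because a semaphore can be released from a different thread than the one that acquired it.
{code:java}
import java.util.concurrent.Semaphore;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LockLeakSketch {

  // Stand-in for the per-collection/shard/replica lock held by the overseer.
  static final Semaphore collectionLock = new Semaphore(1);

  static final ThreadPoolExecutor runnerPool =
      new ThreadPoolExecutor(0, 100, 5, TimeUnit.SECONDS, new SynchronousQueue<>());

  static void submit(Runnable task) throws InterruptedException {
    // Main loop thread: the lock is acquired *before* the task reaches the pool...
    collectionLock.acquire();
    runnerPool.execute(() -> {
      try {
        task.run();
      } finally {
        // ...and released only by the runner thread once the task has finished.
        collectionLock.release();
      }
    });
    // If execute() throws RejectedExecutionException, the Runnable above never runs,
    // its finally block never executes, and the permit is never released: every later
    // task that needs the same lock stays queued until the overseer is restarted.
  }
}
{code}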