Pierre Salagnac created SOLR-17421:
--------------------------------------

             Summary: With overseer node role enabled, overseer may be stopped without giving-up leadership
                 Key: SOLR-17421
                 URL: https://issues.apache.org/jira/browse/SOLR-17421
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 9.6, 8.11
            Reporter: Pierre Salagnac

The overseer may retain leadership even though the thread pool that is supposed to consume the collection state mutator queue has already been shut down.

Occurrences of this bug are probably infrequent, but when it happens the impact is severe. The overseer cluster state updater is stuck and all collection admin requests are very likely to fail. Because the overseer is stuck, all enqueued operations (collection creation, deletion...) fail and remain in the collection API queue.

h2. Root cause

The {{QUIT}} command does not cancel the overseer election if any error happens while shutting down the state updater thread pool.

{code:java}
level: ERROR
logger: org.apache.solr.cloud.Overseer
message: Overseer could not process the current clusterstate state update message, skipping the message: { "operation":"quit", "id":"72073405485023239-<host>_solr-n_0000000948"}
node_name: <host>:8983_solr
threadId: 281272
threadName: OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
thrown: java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
    at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
    at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
    at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
    at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
    at java.base/java.lang.Thread.run(Thread.java:1583)
{code}

h2. Proximate cause

It seems to me that long-running operations in the collection API could trigger the bug more frequently. The thread pool shutdown has a 60-second timeout; when a long-running operation is still in progress, the shutdown throws an exception.
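
To make the failure mode concrete, here is a minimal standalone sketch of the pattern described above. It is not Solr code: {{QuitSketch}}, {{Election}}, {{shutdownAndAwait}}, {{quitUnsafe}} and {{quitSafe}} are hypothetical names, and a 100 ms timeout stands in for the real 60-second one. It only illustrates how an exception thrown during pool shutdown can skip the step that gives up leadership, and one possible way (a {{finally}} block) to avoid that. The actual fix in {{Overseer}} may well look different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names are Solr classes. */
public class QuitSketch {

  /** Stand-in for the election context; it only records whether we are still leader. */
  static final class Election {
    final AtomicBoolean leader = new AtomicBoolean(true);

    void cancel() {
      leader.set(false);
    }
  }

  /** Mirrors the failure in the log: throws if the pool does not terminate in time. */
  static void shutdownAndAwait(ExecutorService pool, long timeoutMs) {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new RuntimeException("Timeout waiting for pool " + pool + " to shutdown.");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  /** Behavior described in the report: a shutdown error skips the election cancel. */
  static void quitUnsafe(ExecutorService pool, Election election, long timeoutMs) {
    shutdownAndAwait(pool, timeoutMs);
    election.cancel(); // never reached when the line above throws
  }

  /** One possible remedy (an assumption): always give up leadership in a finally block. */
  static void quitSafe(ExecutorService pool, Election election, long timeoutMs) {
    try {
      shutdownAndAwait(pool, timeoutMs);
    } finally {
      election.cancel();
    }
  }

  /** Runs one quit variant while a long task keeps the pool busy; returns the leader flag. */
  static boolean stillLeaderAfterQuit(boolean safe) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Election election = new Election();
    // Simulated long-running collection API operation (2 s, well beyond the 100 ms timeout).
    pool.submit(() -> {
      try {
        Thread.sleep(2000);
      } catch (InterruptedException ignored) {
        // interrupted by shutdownNow() below
      }
    });
    try {
      if (safe) {
        quitSafe(pool, election, 100);
      } else {
        quitUnsafe(pool, election, 100);
      }
    } catch (RuntimeException e) {
      // The caller logs the error and skips the message, as in the report.
    } finally {
      pool.shutdownNow(); // interrupt the simulated task so the JVM can exit
    }
    return election.leader.get();
  }

  public static void main(String[] args) {
    System.out.println("unsafe quit, still leader: " + stillLeaderAfterQuit(false));
    System.out.println("safe quit, still leader: " + stillLeaderAfterQuit(true));
  }
}
```

In this sketch the unsafe variant leaves the leader flag set after the timeout, which corresponds to an overseer that keeps leadership with no working thread pool, while the safe variant clears it. Whether unconditionally cancelling the election is the right behavior for Solr is a design question for the fix itself.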