Pierre Salagnac created SOLR-17421:
--------------------------------------

             Summary: With overseer node role enabled, overseer may be stopped 
without giving-up leadership
                 Key: SOLR-17421
                 URL: https://issues.apache.org/jira/browse/SOLR-17421
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 9.6, 8.11
            Reporter: Pierre Salagnac

The overseer may retain leadership even though the thread pool that is 
supposed to consume the collection state mutator queue has already been shut 
down.

Occurrences of this bug are probably infrequent, but when it happens the 
impact is severe. The overseer cluster state updater is stuck, and all 
collection admin requests are very likely to fail. Because of the stuck 
overseer, all enqueued operations (collection creation, deletion...) fail and 
remain in the collection API queue.
h2. Root cause

The root cause is that the {{QUIT}} command does not cancel the overseer 
election if any error happens while shutting down the state updater thread 
pool.
{code:java}
level:  ERROR
logger:  org.apache.solr.cloud.Overseer
message:  Overseer could not process the current clusterstate state update message, skipping the message: {
"operation":"quit",
"id":"72073405485023239-<host>_solr-n_0000000948"}
node_name:  <host>:8983_solr
threadId:  281272
threadName:  OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
thrown:  java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
    at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
    at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
    at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
    at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
    at java.base/java.lang.Thread.run(Thread.java:1583)
{code}
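
The failure mode can be illustrated in isolation. The following is a minimal sketch, not Solr's actual code: {{QuitSketch}}, {{handleQuitUnsafe}}, {{handleQuitSafe}} and the {{leader}} flag are hypothetical names, a plain {{ExecutorService}} stands in for the state updater pool, and the timeout is shortened to 100 ms. It shows how an exception thrown while awaiting pool termination can skip the step that gives up leadership, and how moving that step into a {{finally}} block avoids this. This is one possible remedy, not necessarily the fix adopted for this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the overseer; none of these names come from Solr.
class QuitSketch {
  final AtomicBoolean leader = new AtomicBoolean(true);
  final ExecutorService pool = Executors.newSingleThreadExecutor();
  final CountDownLatch release = new CountDownLatch(1);

  QuitSketch() {
    // Simulates a long-running collection API operation occupying the pool.
    pool.submit(() -> {
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
  }

  // Shuts the pool down and throws if it does not terminate within the timeout.
  void closePool(long timeoutMs) {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new RuntimeException("Timeout waiting for pool to shutdown.");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  // Problematic ordering: the exception from closePool skips the leadership step.
  void handleQuitUnsafe(long timeoutMs) {
    closePool(timeoutMs);
    leader.set(false);
  }

  // Possible remedy: leadership is given up even if closing the pool fails.
  void handleQuitSafe(long timeoutMs) {
    try {
      closePool(timeoutMs);
    } finally {
      leader.set(false);
    }
  }

  // Runs one QUIT attempt against a busy pool and reports the leadership flag.
  static boolean stillLeaderAfterQuit(boolean safe) {
    QuitSketch s = new QuitSketch();
    try {
      if (safe) {
        s.handleQuitSafe(100);
      } else {
        s.handleQuitUnsafe(100);
      }
    } catch (RuntimeException expected) {
      // Mirrors the ERROR log above: the message is skipped and processing goes on.
    } finally {
      s.release.countDown(); // let the simulated operation finish so the JVM can exit
    }
    return s.leader.get();
  }

  public static void main(String[] args) {
    System.out.println("unsafe quit, still leader: " + stillLeaderAfterQuit(false));
    System.out.println("safe quit, still leader: " + stillLeaderAfterQuit(true));
  }
}
```

With the pool kept busy past the timeout, the unsafe variant should leave the flag set while the safe variant clears it.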
h2. Proximate cause

It seems to me that long-running operations in the collection API could 
trigger the bug more frequently. Shutting down the thread pool waits with a 
60-second timeout; if a long-running operation is still in progress when that 
timeout expires, we get an exception.
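
The timeout behaviour can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration rather than Solr code: {{ShutdownTimeoutSketch}} and {{run}} are hypothetical names, the blocked task stands in for a long-running collection API operation, and a short timeout replaces the 60 seconds. It shows that {{awaitTermination}} returns {{false}} while a task is still active after {{shutdown()}}, and that the pool then reports the same "Shutting down" state seen in the log.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Solr.
class ShutdownTimeoutSketch {

  // Returns {timedOutWhileBusy, terminatedAfterRelease}.
  static boolean[] run(long timeoutMs) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // Stand-in for a long-running operation that outlives the shutdown timeout.
    pool.execute(() -> {
      started.countDown();
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    try {
      started.await(); // make sure the task is running before shutting down
      pool.shutdown(); // no new tasks accepted; the running task is not interrupted
      boolean timedOut = !pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
      System.out.println(pool); // state summary of the pool while it is still busy

      release.countDown(); // the operation finally completes
      boolean terminated = pool.awaitTermination(5, TimeUnit.SECONDS);
      return new boolean[] {timedOut, terminated};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] result = run(200);
    System.out.println("timed out while busy: " + result[0]);
    System.out.println("terminated after release: " + result[1]);
  }
}
```

Note that {{shutdown()}} does not interrupt running tasks, so the wait can only succeed if the operation finishes on its own within the timeout.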


