[ 
https://issues.apache.org/jira/browse/SOLR-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved SOLR-17421.
---------------------------------
    Fix Version/s: 9.8
       Resolution: Fixed

Merged.

Thanks for contributing!

> With overseer node role enabled, overseer may be stopped without giving-up 
> leadership
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-17421
>                 URL: https://issues.apache.org/jira/browse/SOLR-17421
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.11, 9.6
>            Reporter: Pierre Salagnac
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.8
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Overseer may retain the leadership status even though the thread pool that is 
> supposed to consume the collection state mutator queue has already been shut 
> down. Occurrences of this bug are probably infrequent, but when it happens the 
> impact is severe: the overseer cluster state updater is stuck, and all 
> collection admin requests are very likely to fail. Because of the stuck 
> overseer, all enqueued operations (collection creation, deletion...) fail 
> and remain in the collection API queue.
> h2. Root cause
> The root cause is that the {{QUIT}} command does not cancel the overseer 
> election if any error occurs while shutting down the state updater thread pool.
> {code:java}
> level:  ERROR
>     logger:  org.apache.solr.cloud.Overseer
>     message:  Overseer could not process the current clusterstate state update message, skipping the message: {
> "operation":"quit",
> "id":"72073405485023239-<host>_solr-n_0000000948"}
>     node_name:  <host>:8983_solr
>     threadId:  281272
>     threadName:  OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
>     thrown:  java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
> at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
> at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> h2. Proximate cause
> It seems to me that long-running operations in the collection API make the 
> bug more likely to trigger. Shutting down the thread pool waits up to 60 
> seconds for running tasks; if a long-running operation is still in progress 
> when that timeout expires, the shutdown throws an exception.
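
The failure mode described in the issue can be reproduced outside Solr with plain java.util.concurrent. The following is only a minimal sketch under stated assumptions: `OverseerQuitSketch`, `Leadership`, `quitUnsafe`, `quitSafe` and `busyPool` are hypothetical names, not Solr classes or methods, the timeout is shortened to 100 ms instead of 60 seconds, and the try/finally shape is one possible defensive pattern, not necessarily what the merged fix does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class OverseerQuitSketch {

  /** Hypothetical stand-in for the overseer's leadership status; not a Solr class. */
  static final class Leadership {
    private final AtomicBoolean leader = new AtomicBoolean(true);
    boolean isLeader() { return leader.get(); }
    void giveUp() { leader.set(false); }
  }

  /** Mimics the error in the log: throws if the pool does not terminate in time. */
  static void shutdownAndAwait(ExecutorService pool, long timeoutMs) {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new RuntimeException("Timeout waiting for pool " + pool + " to shutdown.");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  /** Problematic shape: an exception from the shutdown skips giveUp(). */
  static void quitUnsafe(ExecutorService pool, long timeoutMs, Leadership leadership) {
    shutdownAndAwait(pool, timeoutMs);
    leadership.giveUp();
  }

  /** Defensive shape: leadership is released even when the shutdown throws. */
  static void quitSafe(ExecutorService pool, long timeoutMs, Leadership leadership) {
    try {
      shutdownAndAwait(pool, timeoutMs);
    } finally {
      leadership.giveUp();
    }
  }

  /** Single-thread pool whose only task blocks until the latch is released. */
  static ExecutorService busyPool(CountDownLatch release) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.execute(() -> {
      try {
        release.await(); // simulates a long-running collection API operation
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    return pool;
  }

  public static void main(String[] args) {
    for (boolean safe : new boolean[] {false, true}) {
      CountDownLatch release = new CountDownLatch(1);
      ExecutorService pool = busyPool(release);
      Leadership leadership = new Leadership();
      try {
        if (safe) {
          quitSafe(pool, 100, leadership);
        } else {
          quitUnsafe(pool, 100, leadership);
        }
      } catch (RuntimeException e) {
        System.out.println("shutdown failed: " + e.getClass().getSimpleName());
      } finally {
        release.countDown(); // let the blocked task finish so the JVM can exit
      }
      System.out.println((safe ? "safe" : "unsafe") + " quit, still leader: " + leadership.isLeader());
    }
  }
}
```

In this sketch the unsafe variant leaves `isLeader()` returning true after the timeout, which corresponds to the stuck overseer described in the issue, while the safe variant releases leadership and still propagates the exception to the caller.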



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
