[ 
https://issues.apache.org/jira/browse/SOLR-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787081#comment-17787081
 ] 

Michael Dürr commented on SOLR-17060:
-------------------------------------

Not sure whether it is related, but at least I wanted to mention it here:

We have a similar issue since a while, but I was not able to reproduce this 
problem on a testing environment.
The problem occurs when we restart our 2 solr servers (cloud mode) while a 
metrics exporter (running as a separate instance at another server) tries to 
retrieve metrics from the restarting nodes. The result is that sometimes a core 
of a collection cannot be loaded (state "down" in admin ui). The reason also 
seems to be the call to SolrCore.getSearcher() but in our case it is called by 
MetricsHandler.handleRequestBody(). As the metrics exporter will continue 
causing that call to happen, we see an increasing number of threads that get 
stuck waiting for SolrCore.getSearcher() to return.

Our "solution" to this problem yet is to stop the mertics exporter before we 
restart a solr node. 

The corresponding mail thread can be found here:

[https://www.mail-archive.com/users@solr.apache.org/msg06133.html]

> CoreContainer#create may deadlock with concurrent requests for metrics
> ----------------------------------------------------------------------
>
>                 Key: SOLR-17060
>                 URL: https://issues.apache.org/jira/browse/SOLR-17060
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 9.2.1, 9.4
>            Reporter: Andreas Hubold
>            Priority: Major
>              Labels: deadlock, metrics, monitoring
>
> CoreContainer#create registers metrics for the created core quite early, and 
> metrics can be requested for a core that is still being created. If a metrics 
> request calls SolrCore#getSearcher with unlucky timing (race condition), then 
> CoreContainer#create can deadlock.
> This problem was described on the users lists: 
> [https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]
> We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and 
> a periodic thread that requests all available JMX metrics.
> A CoreAdmin CREATE request was received, but its thread waits forever because 
> onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):
> {code:none}
> "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
> tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>     - waiting on <no object reference available>
>     at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>     - locked <0x00000000e25dd2c8> (a java.lang.Object)
>     at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
>     at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
>  Source)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
> {code}
>  
> There's no way for clients to retry the CREATE request, because CoreContainer 
> rejects additional requests for good reason ("Already creating a core with 
> name ...", see SOLR-14969).
> The only solution is to restart Solr.
>  
> A thread for metrics was also waiting in the same way:
> {code:none}
> prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
> elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
> [0x00007fe6c680b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>        - waiting on <no object reference available>
>        at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>        - locked <0x00000000e25dd2c8> (a java.lang.Object)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
>        at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
>        at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
>        at 
> org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
> Source)
>        at 
> org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
>        at 
> org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
> {code}
> This thread gathers multiple metrics one after the other. We can assume, that 
> it had called SolrCore#getSearcher successfully before, and that the previous 
> call has incremented onDeckSearchers to 1.
> A previous call to #getSearcher should eventually decrement onDeckSearchers 
> again, and also set the _searcher field, but that's not happening. Normally, 
> this should happen in SolrCore#registerSearcher, but that method is executed 
> on the searcherExecutor, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681]
> The problem is, that searcherExecutor will never execute #registerSearcher 
> because the executor is still blocked by another job. This is because the 
> core is still being created, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160]
> {code:none}
> "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
> cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
> condition  [0x00007fe6c637a000]
>    java.lang.Thread.State: WAITING (parking)
>        at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
>        - parking to wait for  <0x00000000e738c1a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>        at 
> java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  
> Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
> Source)
>        at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
> Source) 
> {code}
> In summary, we have
>  - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, 
> which is waiting for
>  - #registerSearcher job in the searcherExecutor work queue, which is waiting 
> to be executed by
>  - thread "searcherExecutor-14-thread-1-processing-studio" which waits for 
> the first thread "qtp1800649922-28" to unblock it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to