[ 
https://issues.apache.org/jira/browse/SOLR-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Hubold updated SOLR-17060:
----------------------------------
    Description: 
CoreContainer#create registers metrics for the created core quite early, and 
metrics can be requested for a core that is still being created. If a metrics 
request calls SolrCore#getSearcher with unlucky timing (race condition), then 
CoreContainer#create can deadlock.

This problem was described on the users lists: 
[https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]

We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and a 
periodic thread that requests all available JMX metrics.

A CoreAdmin CREATE request was received, but its thread waits forever because 
onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):
{code:none}
"qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
    - locked <0x00000000e25dd2c8> (a java.lang.Object)
    at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
    at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
    at 
org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
 Source)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
    at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
    at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
{code}
 
There's no way for clients to retry the CREATE request, because CoreContainer 
rejects additional requests for good reason ("Already creating a core with name 
...", see SOLR-14969).
The only solution is to restart Solr.
 
A thread for metrics was also waiting in the same way:
{code:none}
prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
[0x00007fe6c680b000]
   java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
       - waiting on <no object reference available>
       at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
       - locked <0x00000000e25dd2c8> (a java.lang.Object)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
       at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
       at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
       at 
org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
       at 
org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
Source)
       at 
org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
       at 
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
{code}
This thread gathers multiple metrics one after the other. We can assume, that 
it had called SolrCore#getSearcher successfully before, and that the previous 
call has incremented onDeckSearchers to 1.

A previous call to #getSearcher should eventually decrement onDeckSearchers 
again, and also set the _searcher field, but that's not happening. Normally, 
this should happen in SolrCore#registerSearcher, but that method is executed on 
the searcherExecutor, see
[https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681]

The problem is, that searcherExecutor will never execute #registerSearcher 
because the executor is still blocked by another job. This is because the core 
is still being created, see
[https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160]
{code:none}
"searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
condition  [0x00007fe6c637a000]
   java.lang.Thread.State: WAITING (parking)
       at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
       - parking to wait for  <0x00000000e738c1a8> (a 
java.util.concurrent.CountDownLatch$Sync)
       at 
java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  Source)
       at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
  Source)
       at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
  Source)
       at java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  
Source)
       at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
       at 
org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
Source)
       at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
Source) 
{code}
In summary, we have
 - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core,
and waiting for
 - #registerSearcher job in the searcherExecutor work queue, and waiting
to be executed by
 - thread "searcherExecutor-14-thread-1-processing-studio" which waits
for the first thread "qtp1800649922-28" to unblock it

  was:
CoreContainer#create registers metrics for the created core quite early, and 
metrics can be requested for a core that is still being created. If a metrics 
request calls SolrCore#getSearcher with unlucky timing (race condition), then 
CoreContainer#create can deadlock.

This problem was described on the users lists: 
[https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]

We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and a 
periodic thread that requests all available JMX metrics.

A CoreAdmin CREATE request was received, but its thread waits forever because 
onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):

{code:none}
"qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
    - locked <0x00000000e25dd2c8> (a java.lang.Object)
    at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
    at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
    at 
org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
 Source)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
    at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
    at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
{code}
 
There's no way for clients to retry the CREATE request, because CoreContainer 
rejects additional requests for good reason ("Already creating a core with name 
...", see SOLR-14969).
The only solution is to restart Solr.
 
A thread for metrics was also waiting in the same way:

{code:none}
prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
[0x00007fe6c680b000]
   java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
       - waiting on <no object reference available>
       at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
       - locked <0x00000000e25dd2c8> (a java.lang.Object)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
       at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
       at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
       at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
       at 
org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
       at 
org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
Source)
       at 
org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
       at 
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
{code}

This thread gathers multiple metrics one after the other. We can assume, that 
it had called 
SolrCore#getSearcher successfully before, and that the previous call has 
incremented onDeckSearchers to 1. 

A previous call to #getSearcher should eventually decrement onDeckSearchers 
again, and also set the _searcher field, but that's not happening. Normally, 
this should happen in SolrCore#registerSearcher, but that method is executed on 
the searcherExecutor, see
https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681

The problem is, that searcherExecutor will never execute
#registerSearcher because the executor is still blocked by another job.
This is because the core is still being created, see
https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160

{code:none}
"searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
condition  [0x00007fe6c637a000]
   java.lang.Thread.State: WAITING (parking)
       at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
       - parking to wait for  <0x00000000e738c1a8> (a 
java.util.concurrent.CountDownLatch$Sync)
       at 
java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  Source)
       at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
  Source)
       at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
  Source)
       at java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  
Source)
       at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
       at 
org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
Source)
       at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
Source) 
{code}


In summary, we have

- thread "qtp1800649922-28" in SolrCore.<init> trying to create a core,
and waiting for
- #registerSearcher job in the searcherExecutor work queue, and waiting
to be executed by
- thread "searcherExecutor-14-thread-1-processing-studio" which waits
for the first thread "qtp1800649922-28" to unblock it



> CoreContainer#create may deadlock with concurrent requests for metrics
> ----------------------------------------------------------------------
>
>                 Key: SOLR-17060
>                 URL: https://issues.apache.org/jira/browse/SOLR-17060
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: multicore
>    Affects Versions: 9.2.1
>            Reporter: Andreas Hubold
>            Priority: Major
>              Labels: deadlock
>
> CoreContainer#create registers metrics for the created core quite early, and 
> metrics can be requested for a core that is still being created. If a metrics 
> request calls SolrCore#getSearcher with unlucky timing (race condition), then 
> CoreContainer#create can deadlock.
> This problem was described on the users lists: 
> [https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw]
> We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and 
> a periodic thread that requests all available JMX metrics.
> A CoreAdmin CREATE request was received, but its thread waits forever because 
> onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):
> {code:none}
> "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
> tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>     - waiting on <no object reference available>
>     at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>     - locked <0x00000000e25dd2c8> (a java.lang.Object)
>     at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
>     at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
>  Source)
>     at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
>     at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
> {code}
>  
> There's no way for clients to retry the CREATE request, because CoreContainer 
> rejects additional requests for good reason ("Already creating a core with 
> name ...", see SOLR-14969).
> The only solution is to restart Solr.
>  
> A thread for metrics was also waiting in the same way:
> {code:none}
> prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
> elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
> [0x00007fe6c680b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>        - waiting on <no object reference available>
>        at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>        - locked <0x00000000e25dd2c8> (a java.lang.Object)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
>        at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
>        at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
>        at 
> org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
> Source)
>        at 
> org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
>        at 
> org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
> {code}
> This thread gathers multiple metrics one after the other. We can assume, that 
> it had called SolrCore#getSearcher successfully before, and that the previous 
> call has incremented onDeckSearchers to 1.
> A previous call to #getSearcher should eventually decrement onDeckSearchers 
> again, and also set the _searcher field, but that's not happening. Normally, 
> this should happen in SolrCore#registerSearcher, but that method is executed 
> on the searcherExecutor, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681]
> The problem is, that searcherExecutor will never execute #registerSearcher 
> because the executor is still blocked by another job. This is because the 
> core is still being created, see
> [https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160]
> {code:none}
> "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
> cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on 
> condition  [0x00007fe6c637a000]
>    java.lang.Thread.State: WAITING (parking)
>        at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
>        - parking to wait for  <0x00000000e738c1a8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>        at 
> java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  
> Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
>   Source)
>        at 
> java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  Source)
>        at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
>        at 
> org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
> Source)
>        at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
> Source) 
> {code}
> In summary, we have
>  - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core,
> and waiting for
>  - #registerSearcher job in the searcherExecutor work queue, and waiting
> to be executed by
>  - thread "searcherExecutor-14-thread-1-processing-studio" which waits
> for the first thread "qtp1800649922-28" to unblock it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to