Hi Solr devs,

I've seen the same problem with Solr 9.2.1 now, and I think we have a bug here: Core creation may deadlock if there are concurrent requests for metrics

The situation is a complex race condition. I might be wrong, but think I can explain what happened:

A CoreAdmin CREATE request was received, but its thread waits forever at

"qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s 
tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
    - locked <0x00000000e25dd2c8> (a java.lang.Object)
    at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
    at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
    at 
org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
 Source)
    at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
    at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
    at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)


I've looked into a heap dump: the thread is waiting because onDeckSearchers is 1 and _searcher is null.

Note, that it's not possible for clients to retry the CREATE request while this thread is waiting, because CoreContainer rejects additional requests for good reason ("Already creating a core with name ...")

Interestingly, there's another thread that waits at the same place. It's a bit specific to our setup: it repeatedly collects metrics via JMX. But it could be any kind of concurrent metrics request for that core. We can see that it's getting metrics for the same core (<0x00000000e25dd2c8>), which is currently being created but already visible in the metrics:

prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms 
elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  
[0x00007fe6c680b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
        - waiting on <no object reference available>
        at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
        - locked <0x00000000e25dd2c8> (a java.lang.Object)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
        at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
        at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
        at 
org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
        at 
org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown 
Source)
        at 
org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
        at 
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207)
...
        at 
com.sun.jmx.mbeanserver.JmxMBeanServer.getAttributes(java.management@17.0.8.1/Unknown
  Source)
        at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:153)


This second thread is waiting just like the former. It collects multiple metrics per core, one after the other. To that end, it also calls SolrCore#getSearcher multiple times. I assume that it has called SolrCore#getSearcher successfully before, and that the previous call has incremented onDeckSearchers to 1.

Of course, such a previous call to #getSearcher should eventually decrement onDeckSearchers again, and also set the _searcher field, but that's not happening. Normally, this should happen in SolrCore#registerSearcher, but that method is executed on the searcherExecutor and #getSearcher returns successfully before that, see https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681

The problem is, that searcherExecutor will never execute #registerSearcher because the executor is still blocked by another job. This is because the core is still being created, see https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160

"searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 
cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on condition  
[0x00007fe6c637a000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
        - parking to wait for  <0x00000000e738c1a8> (a 
java.util.concurrent.CountDownLatch$Sync)
        at 
java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown
  Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown
  Source)
        at java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown 
 Source)
        at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
        at 
org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown 
Source)
        at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  
Source)
        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289)
        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$418/0x00007fe6603f7048.run(Unknown
 Source)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.8.1/Unknown  
Source)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.8.1/Unknown  
Source)
        at java.lang.Thread.run(java.base@17.0.8.1/Unknown  Source)


In summary, we have

- thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, and waiting for - #registerSearcher job in the searcherExecutor work queue, and waiting to be executed by - thread "searcherExecutor-14-thread-1-processing-studio" which waits for the first thread "qtp1800649922-28" to unblock it

We haven't seen this problem with 8.11.2 or previous versions so far. But maybe that was pure luck.

Maybe someone can confirm this? I could also write a JIRA ticket, if you like.

Thanks,
Andreas


Pieper, Stefan schrieb am 10.10.23 um 16:44:

Hello,

We experienced a problem when creating a core via a request on Solr 9.2.1. It seems that the core creation for “blueprint_helios_comments“ just “hangs”. Consequently, with subsequent external requests to create the same core (since it does not seem to be there yet), Solr detects that the core creation is already in progress and returns an error.

This is what Solr’s log tells me:

[ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=0 QTime=1

[ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create command name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data

[ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene MatchVersion: 9.4.0

[ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic

[ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema elastic/1.6 with uniqueid field id

[ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore 'blueprint_helios_comments' using configuration from configset /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true

[ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory solr.data.home = /var/solr/data

[ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at [/var/solr/data/blueprint_helios_comments], dataDir=[/var/solr/data/blueprint_helios_comments/data/]

[ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a new SolrRequestAuthorizer

[ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog implementation: org.apache.solr.update.UpdateLog

[ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536

[ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if uncommitted for 15000ms;

[ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if uncommitted for 3000ms;

The creation stops after “Soft AutoCommit”. During the same run of Solr and our test suite, other cores are created successfully.

With the next run of the same software on empty directories and attempt to create the core, everything succeeds:

[ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=0 QTime=1

[ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create command name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data

[ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene MatchVersion: 9.4.0

[ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic

[ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema elastic/1.6 with uniqueid field id

[ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore 'blueprint_helios_comments' using configuration from configset /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true

[ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory solr.data.home = /var/solr/data

[ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at [/var/solr/data/blueprint_helios_comments], dataDir=[/var/solr/data/blueprint_helios_comments/data/]

[ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a new SolrRequestAuthorizer

[ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog implementation: org.apache.solr.update.UpdateLog

[ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536

[ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if uncommitted for 15000ms;

[ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if uncommitted for 3000ms;

[ blueprint_helios_comments] o.a.s.r.ManagedResourceStorage Cannot write to config directory /opt/solr-9.2.1/server/solr/configsets/elastic/conf ; switching to use InMemory storage instead.

[ blueprint_helios_comments] o.a.s.s.s.Suggester init: {threshold=5.0E-4, name=text, classname=org.apache.solr.spelling.suggest.Suggester, lookupImpl=org.apache.solr.spelling.suggest.fst.WFSTLookupFactory, field=text}

[ blueprint_helios_comments] o.a.s.h.ReplicationHandler Commits will be reserved for 10000 ms

[ blueprint_helios_comments] o.a.s.u.UpdateLog Could not find max version in index or recent updates, using new clock 1779334761811017728

I can’t recall having seen this issue before nor can I reproduce it, but I’d like to ask whether this behaviour is known (couldn’t find a Jira issue or mail thread connected to this) or anybody has an idea what might be going wrong.

Thanks!

Stefan

Reply via email to