Thanks Ishan, I've created https://issues.apache.org/jira/browse/SOLR-17060

Looks serious, a JIRA would be very helpful. Thanks!

On Fri, 27 Oct, 2023, 9:53 pm Andreas Hubold,
<andreas.hub...@coremedia.com.invalid> wrote:

Hi Solr devs,

I've seen the same problem with Solr 9.2.1 now, and I think we have a
bug here: Core creation may deadlock if there are concurrent requests
for metrics

The situation is a complex race condition. I might be wrong, but think I
can explain what happened:

A CoreAdmin CREATE request was received, but its thread waits forever at

"qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s
tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
     java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
      - waiting on <no object reference available>
      at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
      at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
      - locked <0x00000000e25dd2c8> (a java.lang.Object)
      at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
      at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
      at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
      at
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
      at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
      at
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
      at
org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
Source)
      at
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
      at
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
      at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)


I've looked into a heap dump: the thread is waiting because
onDeckSearchers is 1 and _searcher is null.

Note, that it's not possible for clients to retry the CREATE request
while this thread is waiting, because CoreContainer rejects additional
requests for good reason ("Already creating a core with name ...")

Interestingly, there's another thread that waits at the same place. It's
a bit specific to our setup: it repeatedly collects metrics via JMX. But
it could be any kind of concurrent metrics request for that core. We can
see that it's getting metrics for the same core (<0x00000000e25dd2c8>),
which is currently being created but already visible in the metrics:

prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms
elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()
[0x00007fe6c680b000]
     java.lang.Thread.State: WAITING (on object monitor)
         at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
         - waiting on <no object reference available>
         at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
         - locked <0x00000000e25dd2c8> (a java.lang.Object)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
         at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
         at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
         at
org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
         at
org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown
Source)
         at
org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
         at
org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207)
...
          at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttributes(
java.management@17.0.8.1/Unknown  Source)
         at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:153)


This second thread is waiting just like the former. It collects multiple
metrics per core, one after the other. To that end, it also calls
SolrCore#getSearcher multiple times. I assume that it has called
SolrCore#getSearcher successfully before, and that the previous call has
incremented onDeckSearchers to 1.

Of course, such a previous call to #getSearcher should eventually
decrement onDeckSearchers again, and also set the _searcher field, but
that's not happening.
Normally, this should happen in SolrCore#registerSearcher, but that
method is executed on the searcherExecutor and #getSearcher returns
successfully before that, see

https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681

The problem is, that searcherExecutor will never execute
#registerSearcher because the executor is still blocked by another job.
This is because the core is still being created, see

https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160

"searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0
cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on
condition  [0x00007fe6c637a000]
     java.lang.Thread.State: WAITING (parking)
         at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native
Method)
         - parking to wait for  <0x00000000e738c1a8> (a
java.util.concurrent.CountDownLatch$Sync)
         at java.util.concurrent.locks.LockSupport.park(
java.base@17.0.8.1/Unknown  Source)
         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(
java.base@17.0.8.1/Unknown  Source)
         at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
java.base@17.0.8.1/Unknown  Source)
         at java.util.concurrent.CountDownLatch.await(
java.base@17.0.8.1/Unknown  Source)
         at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
         at
org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown
Source)
         at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown
Source)
         at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289)
         at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$418/0x00007fe6603f7048.run(Unknown
Source)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(
java.base@17.0.8.1/Unknown  Source)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
java.base@17.0.8.1/Unknown  Source)
         at java.lang.Thread.run(java.base@17.0.8.1/Unknown  Source)


In summary, we have

- thread "qtp1800649922-28" in SolrCore.<init> trying to create a core,
and waiting for
- #registerSearcher job in the searcherExecutor work queue, and waiting
to be executed by
- thread "searcherExecutor-14-thread-1-processing-studio" which waits
for the first thread "qtp1800649922-28" to unblock it

We haven't seen this problem with 8.11.2 or previous versions so far.
But maybe that was pure luck.

Maybe someone can confirm this? I could also write a JIRA ticket, if you
like.

Thanks,
Andreas


Pieper, Stefan schrieb am 10.10.23 um 16:44:
Hello,

We experienced a problem when creating a core via a request on Solr
9.2.1. It seems that the core creation for “blueprint_helios_comments“
just “hangs”. Consequently, with subsequent external requests to
create the same core (since it does not seem to be there yet), Solr
detects that the core creation is already in progress and returns an
error.

This is what Solr’s log tells me:

[ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null
path=/admin/cores

params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2}

status=0 QTime=1

[ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create
command

name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data
[ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene
MatchVersion: 9.4.0

[ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic

[ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema
elastic/1.6 with uniqueid field id

[ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore
'blueprint_helios_comments' using configuration from configset
/opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true

[ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory
solr.data.home = /var/solr/data

[ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at
[/var/solr/data/blueprint_helios_comments],
dataDir=[/var/solr/data/blueprint_helios_comments/data/]

[ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a
new SolrRequestAuthorizer

[ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog
implementation: org.apache.solr.update.UpdateLog

[ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog:
dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100
maxNumLogsToKeep=10 numVersionBuckets=65536

[ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if
uncommitted for 15000ms;

[ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if
uncommitted for 3000ms;

The creation stops after “Soft AutoCommit”. During the same run of
Solr and our test suite, other cores are created successfully.

With the next run of the same software on empty directories and
attempt to create the core, everything succeeds:

[ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null
path=/admin/cores

params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2}

status=0 QTime=1

[ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create
command

name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data
[ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene
MatchVersion: 9.4.0

[ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic

[ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema
elastic/1.6 with uniqueid field id

[ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore
'blueprint_helios_comments' using configuration from configset
/opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true

[ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory
solr.data.home = /var/solr/data

[ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at
[/var/solr/data/blueprint_helios_comments],
dataDir=[/var/solr/data/blueprint_helios_comments/data/]

[ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a
new SolrRequestAuthorizer

[ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog
implementation: org.apache.solr.update.UpdateLog

[ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog:
dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100
maxNumLogsToKeep=10 numVersionBuckets=65536

[ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if
uncommitted for 15000ms;

[ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if
uncommitted for 3000ms;

[ blueprint_helios_comments] o.a.s.r.ManagedResourceStorage Cannot
write to config directory
/opt/solr-9.2.1/server/solr/configsets/elastic/conf ; switching to use
InMemory storage instead.

[ blueprint_helios_comments] o.a.s.s.s.Suggester init:
{threshold=5.0E-4, name=text,
classname=org.apache.solr.spelling.suggest.Suggester,
lookupImpl=org.apache.solr.spelling.suggest.fst.WFSTLookupFactory,
field=text}

[ blueprint_helios_comments] o.a.s.h.ReplicationHandler Commits will
be reserved for 10000 ms

[ blueprint_helios_comments] o.a.s.u.UpdateLog Could not find max
version in index or recent updates, using new clock 1779334761811017728

I can’t recall having seen this issue before nor can I reproduce it,
but I’d like to ask whether this behaviour is known (couldn’t find a
Jira issue or mail thread connected to this) or anybody has an idea
what might be going wrong.

Thanks!

Stefan


Reply via email to