Looks serious, a JIRA would be very helpful. Thanks!

On Fri, 27 Oct, 2023, 9:53 pm Andreas Hubold,
<andreas.hub...@coremedia.com.invalid> wrote:

> Hi Solr devs,
>
> I've seen the same problem with Solr 9.2.1 now, and I think we have a
> bug here: Core creation may deadlock if there are concurrent requests
> for metrics
>
> The situation is a complex race condition. I might be wrong, but think I
> can explain what happened:
>
> A CoreAdmin CREATE request was received, but its thread waits forever at
>
> "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s
> tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
>     java.lang.Thread.State: WAITING (on object monitor)
>      at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>      - waiting on <no object reference available>
>      at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>      at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>      - locked <0x00000000e25dd2c8> (a java.lang.Object)
>      at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
>      at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
>      at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
>      at
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
>      at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
>      at
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
>      at
> org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown
> Source)
>      at
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
>      at
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
>      at
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
>
>
> I've looked into a heap dump: the thread is waiting because
> onDeckSearchers is 1 and _searcher is null.
>
> Note, that it's not possible for clients to retry the CREATE request
> while this thread is waiting, because CoreContainer rejects additional
> requests for good reason ("Already creating a core with name ...")
>
> Interestingly, there's another thread that waits at the same place. It's
> a bit specific to our setup: it repeatedly collects metrics via JMX. But
> it could be any kind of concurrent metrics request for that core. We can
> see that it's getting metrics for the same core (<0x00000000e25dd2c8>),
> which is currently being created but already visible in the metrics:
>
> prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms
> elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()
> [0x00007fe6c680b000]
>     java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
>         - waiting on <no object reference available>
>         at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
>         - locked <0x00000000e25dd2c8> (a java.lang.Object)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
>         at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
>         at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
>         at
> org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
>         at
> org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown
> Source)
>         at
> org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
>         at
> org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207)
> ...
>          at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttributes(
> java.management@17.0.8.1/Unknown  Source)
>         at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:153)
>
>
> This second thread is waiting just like the former. It collects multiple
> metrics per core, one after the other. To that end, it also calls
> SolrCore#getSearcher multiple times. I assume that it has called
> SolrCore#getSearcher successfully before, and that the previous call has
> incremented onDeckSearchers to 1.
>
> Of course, such a previous call to #getSearcher should eventually
> decrement onDeckSearchers again, and also set the _searcher field, but
> that's not happening.
> Normally, this should happen in SolrCore#registerSearcher, but that
> method is executed on the searcherExecutor and #getSearcher returns
> successfully before that, see
>
> https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681
>
> The problem is, that searcherExecutor will never execute
> #registerSearcher because the executor is still blocked by another job.
> This is because the core is still being created, see
>
> https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160
>
> "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0
> cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on
> condition  [0x00007fe6c637a000]
>     java.lang.Thread.State: WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native
> Method)
>         - parking to wait for  <0x00000000e738c1a8> (a
> java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(
> java.base@17.0.8.1/Unknown  Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(
> java.base@17.0.8.1/Unknown  Source)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
> java.base@17.0.8.1/Unknown  Source)
>         at java.util.concurrent.CountDownLatch.await(
> java.base@17.0.8.1/Unknown  Source)
>         at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
>         at
> org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown
> Source)
>         at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown
> Source)
>         at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289)
>         at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$418/0x00007fe6603f7048.run(Unknown
> Source)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> java.base@17.0.8.1/Unknown  Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> java.base@17.0.8.1/Unknown  Source)
>         at java.lang.Thread.run(java.base@17.0.8.1/Unknown  Source)
>
>
> In summary, we have
>
> - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core,
> and waiting for
> - #registerSearcher job in the searcherExecutor work queue, and waiting
> to be executed by
> - thread "searcherExecutor-14-thread-1-processing-studio" which waits
> for the first thread "qtp1800649922-28" to unblock it
>
> We haven't seen this problem with 8.11.2 or previous versions so far.
> But maybe that was pure luck.
>
> Maybe someone can confirm this? I could also write a JIRA ticket, if you
> like.
>
> Thanks,
> Andreas
>
>
> Pieper, Stefan schrieb am 10.10.23 um 16:44:
> >
> > Hello,
> >
> > We experienced a problem when creating a core via a request on Solr
> > 9.2.1. It seems that the core creation for “blueprint_helios_comments“
> > just “hangs”. Consequently, with subsequent external requests to
> > create the same core (since it does not seem to be there yet), Solr
> > detects that the core creation is already in progress and returns an
> > error.
> >
> > This is what Solr’s log tells me:
> >
> > [ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null
> > path=/admin/cores
> >
> params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2}
>
> > status=0 QTime=1
> >
> > [ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create
> > command
> >
> name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data
> >
> > [ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene
> > MatchVersion: 9.4.0
> >
> > [ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic
> >
> > [ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema
> > elastic/1.6 with uniqueid field id
> >
> > [ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore
> > 'blueprint_helios_comments' using configuration from configset
> > /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true
> >
> > [ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory
> > solr.data.home = /var/solr/data
> >
> > [ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at
> > [/var/solr/data/blueprint_helios_comments],
> > dataDir=[/var/solr/data/blueprint_helios_comments/data/]
> >
> > [ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a
> > new SolrRequestAuthorizer
> >
> > [ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog
> > implementation: org.apache.solr.update.UpdateLog
> >
> > [ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog:
> > dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100
> > maxNumLogsToKeep=10 numVersionBuckets=65536
> >
> > [ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if
> > uncommitted for 15000ms;
> >
> > [ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if
> > uncommitted for 3000ms;
> >
> > The creation stops after “Soft AutoCommit”. During the same run of
> > Solr and our test suite, other cores are created successfully.
> >
> > With the next run of the same software on empty directories and
> > attempt to create the core, everything succeeds:
> >
> > [ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null
> > path=/admin/cores
> >
> params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2}
>
> > status=0 QTime=1
> >
> > [ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create
> > command
> >
> name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data
> >
> > [ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene
> > MatchVersion: 9.4.0
> >
> > [ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic
> >
> > [ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema
> > elastic/1.6 with uniqueid field id
> >
> > [ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore
> > 'blueprint_helios_comments' using configuration from configset
> > /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true
> >
> > [ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory
> > solr.data.home = /var/solr/data
> >
> > [ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at
> > [/var/solr/data/blueprint_helios_comments],
> > dataDir=[/var/solr/data/blueprint_helios_comments/data/]
> >
> > [ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a
> > new SolrRequestAuthorizer
> >
> > [ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog
> > implementation: org.apache.solr.update.UpdateLog
> >
> > [ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog:
> > dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100
> > maxNumLogsToKeep=10 numVersionBuckets=65536
> >
> > [ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if
> > uncommitted for 15000ms;
> >
> > [ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if
> > uncommitted for 3000ms;
> >
> > [ blueprint_helios_comments] o.a.s.r.ManagedResourceStorage Cannot
> > write to config directory
> > /opt/solr-9.2.1/server/solr/configsets/elastic/conf ; switching to use
> > InMemory storage instead.
> >
> > [ blueprint_helios_comments] o.a.s.s.s.Suggester init:
> > {threshold=5.0E-4, name=text,
> > classname=org.apache.solr.spelling.suggest.Suggester,
> > lookupImpl=org.apache.solr.spelling.suggest.fst.WFSTLookupFactory,
> > field=text}
> >
> > [ blueprint_helios_comments] o.a.s.h.ReplicationHandler Commits will
> > be reserved for 10000 ms
> >
> > [ blueprint_helios_comments] o.a.s.u.UpdateLog Could not find max
> > version in index or recent updates, using new clock 1779334761811017728
> >
> > I can’t recall having seen this issue before nor can I reproduce it,
> > but I’d like to ask whether this behaviour is known (couldn’t find a
> > Jira issue or mail thread connected to this) or anybody has an idea
> > what might be going wrong.
> >
> > Thanks!
> >
> > Stefan
> >
>

Reply via email to