Looks serious, a JIRA would be very helpful. Thanks! On Fri, 27 Oct, 2023, 9:53 pm Andreas Hubold, <andreas.hub...@coremedia.com.invalid> wrote:
> Hi Solr devs, > > I've seen the same problem with Solr 9.2.1 now, and I think we have a > bug here: Core creation may deadlock if there are concurrent requests > for metrics > > The situation is a complex race condition. I might be wrong, but think I > can explain what happened: > > A CoreAdmin CREATE request was received, but its thread waits forever at > > "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s > tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait() [0x00007fe6c68ca000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@17.0.8.1/Native Method) > - waiting on <no object reference available> > at java.lang.Object.wait(java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528) > - locked <0x00000000e25dd2c8> (a java.lang.Object) > at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283) > at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168) > at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051) > at > org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666) > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532) > at > org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111) > at > org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown > Source) > at > org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398) > at > org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354) > at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219) > > > I've looked into a heap dump: the thread is waiting because > onDeckSearchers is 1 and _searcher is null. > > Note, that it's not possible for clients to retry the CREATE request > while this thread is waiting, because CoreContainer rejects additional > requests for good reason ("Already creating a core with name ...") > > Interestingly, there's another thread that waits at the same place. It's > a bit specific to our setup: it repeatedly collects metrics via JMX. But > it could be any kind of concurrent metrics request for that core. We can > see that it's getting metrics for the same core (<0x00000000e25dd2c8>), > which is currently being created but already visible in the metrics: > > prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms > elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait() > [0x00007fe6c680b000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@17.0.8.1/Native Method) > - waiting on <no object reference available> > at java.lang.Object.wait(java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528) > - locked <0x00000000e25dd2c8> (a java.lang.Object) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271) > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106) > at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124) > at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534) > at > org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360) > at > org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown > Source) > at > org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779) > at > org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207) > ... > at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttributes( > java.management@17.0.8.1/Unknown Source) > at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:153) > > > This second thread is waiting just like the former. It collects multiple > metrics per core, one after the other. To that end, it also calls > SolrCore#getSearcher multiple times. I assume that it has called > SolrCore#getSearcher successfully before, and that the previous call has > incremented onDeckSearchers to 1. > > Of course, such a previous call to #getSearcher should eventually > decrement onDeckSearchers again, and also set the _searcher field, but > that's not happening. > Normally, this should happen in SolrCore#registerSearcher, but that > method is executed on the searcherExecutor and #getSearcher returns > successfully before that, see > > https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681 > > The problem is, that searcherExecutor will never execute > #registerSearcher because the executor is still blocked by another job. > This is because the core is still being created, see > > https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160 > > "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 > cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on > condition [0x00007fe6c637a000] > java.lang.Thread.State: WAITING (parking) > at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native > Method) > - parking to wait for <0x00000000e738c1a8> (a > java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park( > java.base@17.0.8.1/Unknown Source) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire( > java.base@17.0.8.1/Unknown Source) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly( > java.base@17.0.8.1/Unknown Source) > at java.util.concurrent.CountDownLatch.await( > java.base@17.0.8.1/Unknown Source) > at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162) > at > org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown > Source) > at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown > Source) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$418/0x00007fe6603f7048.run(Unknown > Source) > at java.util.concurrent.ThreadPoolExecutor.runWorker( > java.base@17.0.8.1/Unknown Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run( > java.base@17.0.8.1/Unknown Source) > at java.lang.Thread.run(java.base@17.0.8.1/Unknown Source) > > > In summary, we have > > - thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, > and waiting for > - #registerSearcher job in the searcherExecutor work queue, and waiting > to be executed by > - thread "searcherExecutor-14-thread-1-processing-studio" which waits > for the first thread "qtp1800649922-28" to unblock it > > We haven't seen this problem with 8.11.2 or previous versions so far. > But maybe that was pure luck. > > Maybe someone can confirm this? I could also write a JIRA ticket, if you > like. > > Thanks, > Andreas > > > Pieper, Stefan schrieb am 10.10.23 um 16:44: > > > > Hello, > > > > We experienced a problem when creating a core via a request on Solr > > 9.2.1. It seems that the core creation for “blueprint_helios_comments“ > > just “hangs”. Consequently, with subsequent external requests to > > create the same core (since it does not seem to be there yet), Solr > > detects that the core creation is already in progress and returns an > > error. > > > > This is what Solr’s log tells me: > > > > [ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null > > path=/admin/cores > > > params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} > > > status=0 QTime=1 > > > > [ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create > > command > > > name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data > > > > [ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene > > MatchVersion: 9.4.0 > > > > [ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic > > > > [ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema > > elastic/1.6 with uniqueid field id > > > > [ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore > > 'blueprint_helios_comments' using configuration from configset > > /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true > > > > [ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory > > solr.data.home = /var/solr/data > > > > [ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at > > [/var/solr/data/blueprint_helios_comments], > > dataDir=[/var/solr/data/blueprint_helios_comments/data/] > > > > [ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a > > new SolrRequestAuthorizer > > > > [ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog > > implementation: org.apache.solr.update.UpdateLog > > > > [ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog: > > dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 > > maxNumLogsToKeep=10 numVersionBuckets=65536 > > > > [ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if > > uncommitted for 15000ms; > > > > [ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if > > uncommitted for 3000ms; > > > > The creation stops after “Soft AutoCommit”. During the same run of > > Solr and our test suite, other cores are created successfully. > > > > With the next run of the same software on empty directories and > > attempt to create the core, everything succeeds: > > > > [ blueprint_helios_comments] o.a.s.s.HttpSolrCall [admin] webapp=null > > path=/admin/cores > > > params={core=blueprint_helios_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} > > > status=0 QTime=1 > > > > [ blueprint_helios_comments] o.a.s.h.a.CoreAdminOperation core create > > command > > > name=blueprint_helios_comments&action=CREATE&configSet=elastic&wt=javabin&version=2&dataDir=data > > > > [ blueprint_helios_comments] o.a.s.c.SolrConfig Using Lucene > > MatchVersion: 9.4.0 > > > > [ blueprint_helios_comments] o.a.s.s.IndexSchema Schema name=elastic > > > > [ blueprint_helios_comments] o.a.s.s.IndexSchema Loaded schema > > elastic/1.6 with uniqueid field id > > > > [ blueprint_helios_comments] o.a.s.c.CoreContainer Creating SolrCore > > 'blueprint_helios_comments' using configuration from configset > > /opt/solr-9.2.1/server/solr/configsets/elastic, trusted=true > > > > [ blueprint_helios_comments] o.a.s.c.CachingDirectoryFactory > > solr.data.home = /var/solr/data > > > > [ blueprint_helios_comments] o.a.s.c.SolrCore Opening new SolrCore at > > [/var/solr/data/blueprint_helios_comments], > > dataDir=[/var/solr/data/blueprint_helios_comments/data/] > > > > [ blueprint_helios_comments] o.a.s.j.SolrRequestAuthorizer Creating a > > new SolrRequestAuthorizer > > > > [ blueprint_helios_comments] o.a.s.u.UpdateHandler Using UpdateLog > > implementation: org.apache.solr.update.UpdateLog > > > > [ blueprint_helios_comments] o.a.s.u.UpdateLog Initializing UpdateLog: > > dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 > > maxNumLogsToKeep=10 numVersionBuckets=65536 > > > > [ blueprint_helios_comments] o.a.s.u.CommitTracker Hard AutoCommit: if > > uncommitted for 15000ms; > > > > [ blueprint_helios_comments] o.a.s.u.CommitTracker Soft AutoCommit: if > > uncommitted for 3000ms; > > > > [ blueprint_helios_comments] o.a.s.r.ManagedResourceStorage Cannot > > write to config directory > > /opt/solr-9.2.1/server/solr/configsets/elastic/conf ; switching to use > > InMemory storage instead. > > > > [ blueprint_helios_comments] o.a.s.s.s.Suggester init: > > {threshold=5.0E-4, name=text, > > classname=org.apache.solr.spelling.suggest.Suggester, > > lookupImpl=org.apache.solr.spelling.suggest.fst.WFSTLookupFactory, > > field=text} > > > > [ blueprint_helios_comments] o.a.s.h.ReplicationHandler Commits will > > be reserved for 10000 ms > > > > [ blueprint_helios_comments] o.a.s.u.UpdateLog Could not find max > > version in index or recent updates, using new clock 1779334761811017728 > > > > I can’t recall having seen this issue before nor can I reproduce it, > > but I’d like to ask whether this behaviour is known (couldn’t find a > > Jira issue or mail thread connected to this) or anybody has an idea > > what might be going wrong. > > > > Thanks! > > > > Stefan > > >