[ https://issues.apache.org/jira/browse/SOLR-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735796#comment-17735796 ]
Alex Deparvu commented on SOLR-16848: ------------------------------------- I think I have an idea of why this happens, but unfortunately I was unsuccessful in reproducing this, even on docker with weak resources. will post a suggestion for review based on code review and we'll see if this helps or not. basically what I am seeing is a race inside the initialization method of the JettySolrRunner#lifeCycleStarted (https://github.com/apache/solr/blob/3e10f8b8901751de78e3c5b93538be133b6336ff/solr/test-framework/src/java/org/apache/solr/embedded/JettySolrRunner.java#LL394C38-L394C54): - on one hand: CoreContainerProvider will start init (line 407) and eventually (and async) will call into the test's `testing_beforeRegisterInZk` callback where the callback method will attempt to get the CoreContainer. trouble is the getCoreContainer method relies on 'dispatchFilter' being initialized (see below). - on the other the dispatchFilter will be init after (on line 423). my theory is: because the CoreContainerProvider is async it can call into the callback before the dispatchFilter is there. so my proposal is to add a wait for 'dispatchFilter != null' inside the callback only (this change will only affect the code itself, nothing else). I verified this reproduces by introducing an artificial wait between line 407 and line 423 and the NPE is present consistently, the wait-fix will provide the opportunity for the init to catchup to the callback if needed. if this precondition is not met, the flow should not progress anyway. > Flaky DeleteReplicaTest.raceConditionOnDeleteAndRegisterReplica > --------------------------------------------------------------- > > Key: SOLR-16848 > URL: https://issues.apache.org/jira/browse/SOLR-16848 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Alex Deparvu > Priority: Minor > > Some stats first: > - Past 7 days trend: > {noformat} > Class: org.apache.solr.cloud.DeleteReplicaTest > Method: raceConditionOnDeleteAndRegisterReplica > Failures: 11.22% (44 / 392) > {noformat} > - Test failure is caused by a NullPointerException: > {noformat} > ERROR (coreZkRegister-772-thread-1-processing-127.0.0.1:40471_solr) > [n:127.0.0.1:40471_solr c:raceDeleteReplicaCollection s:shard1 r:core_node4 > x:raceDeleteReplicaCollection_shard1_replica_n2] o.a.s.c.DeleteReplicaTest > Failed to delete replica > => java.lang.NullPointerException: Cannot invoke > "org.apache.solr.core.CoreContainer.getZkController()" because the return > value of "org.apache.solr.embedded.JettySolrRunner.getCoreContainer()" is null > {noformat} > a more complete trace > {noformat} > o.a.s.c.DeleteReplicaTest Failed to delete replica > 2> => java.lang.NullPointerException > 2> at > org.apache.solr.cloud.DeleteReplicaTest.lambda$raceConditionOnDeleteAndRegisterReplica$10(DeleteReplicaTest.java:350) > 2> java.lang.NullPointerException: null > 2> at > org.apache.solr.cloud.DeleteReplicaTest.lambda$raceConditionOnDeleteAndRegisterReplica$10(DeleteReplicaTest.java:350) > 2> at > org.apache.solr.core.ZkContainer.lambda$registerInZk$1(ZkContainer.java:211) > [main/:?] > 2> at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289) > 2> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > 2> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > 2> at java.lang.Thread.run(Thread.java:829) [?:?] > o.a.s.c.u.ExecutorUtil Uncaught exception java.lang.AssertionError: Failed to > delete replica thrown by thread: > coreZkRegister-1586-thread-1-processing-127.0.0.1:34497_solr > 2> => java.lang.Exception: Submitter stack trace > 2> at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256) > java.lang.Exception: Submitter stack trace > 2> at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256) > 2> at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:244) > ~[main/:?] > 2> at > org.apache.solr.core.CoreContainer.lambda$loadInternal$12(CoreContainer.java:1025) > 2> at > com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:234) > 2> at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > 2> at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:289) > 2> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > 2> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > 2> at java.lang.Thread.run(Thread.java:829) [?:?] > Caused by: java.lang.Exception: Submitter stack trace > 2> at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:256) > 2> at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140) > ~[?:?] > 2> at > com.codahale.metrics.InstrumentedExecutorService.submit(InstrumentedExecutorService.java:122) > 2> at > org.apache.solr.core.CoreContainer.loadInternal(CoreContainer.java:1009) > ~[main/:?] > 2> at org.apache.solr.core.CoreContainer.load(CoreContainer.java:753) > 2> at > org.apache.solr.servlet.CoreContainerProvider.createCoreContainer(CoreContainerProvider.java:411) > 2> at > org.apache.solr.servlet.CoreContainerProvider.init(CoreContainerProvider.java:230) > 2> at > org.apache.solr.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:407) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org