dsmiley commented on a change in pull request #580:
URL: https://github.com/apache/solr/pull/580#discussion_r797956081



##########
File path: solr/core/src/test/org/apache/solr/core/TestLazyCores.java
##########
@@ -899,4 +890,67 @@ private void check10(SolrCore core) {
         , "//result[@numFound='10']"
     );
   }
+
+  public void testDontEvictUsedCore() throws Exception {
+    // If a core is being used for a long time (say a long indexing batch) but nothing else for it,
+    // and if the transient cache has pressure and thus wants to unload a core, we should not
+    // unload it (yet).
+
+    CoreContainer cc = init();
+    String[] transientCoreNames = new String[]{
+        "collection2",
+        "collection3",
+        "collection6",
+        "collection7",
+        "collection8",
+        "collection9"
+    };
+
+    try (LogListener logs = LogListener.info(TransientSolrCoreCacheDefault.class.getName())
+        .substring("NOT evicting transient core [" + transientCoreNames[0] + "]")) {
+      cc.waitForLoadingCoresToFinish(1000);
+      var solr = new EmbeddedSolrServer(cc, null);
+      final var longReqTimeMs = 2000; // if lowered too much, the test will fail on a slow/busy CI
+
+      // First, start a long request on the first transient core
+      var thread = new Thread(() -> {
+        try {
+          // TODO Solr ought to have a query test "latch" mechanism so we don't sleep arbitrarily
+          solr.query(transientCoreNames[0], params("q", "{!func}sleep(" + longReqTimeMs + ",1)"));
+        } catch (SolrServerException | IOException e) {
+          fail(e.toString());
+        }
+      }, "longRequest");
+      thread.start();
+
+      System.out.println("Inducing pressure on cache by querying many cores...");
+      // Now hammer on other transient cores to create transient cache pressure
+      for (int round = 0; round < 5 && logs.getCount() == 0; round++) {
+        // note: we skip over the first; we want the first to remain non-busy
+        for (int i = 1; i < transientCoreNames.length; i++) {
+          solr.query(transientCoreNames[i], params("q", "*:*"));
+        }
+      }
+      // Show that the cache logs that it was asked to evict but did not.
+      // To show the bug behavior, comment this out and also comment out the corresponding logic
+      // that fixes it at the spot this message is logged.
+      assertTrue(logs.getCount() > 0);

Review comment:
       ;-) it's complicated.  
   
   In a realistic setting, I saw Solr lose track of the fact that the SolrCore was actually being used, despite the closer thread calling close() on it. The consequence was that the lock file remained on disk (while the core was still open), so a future request to the core would try to open a new SolrCore but fail because of the presence of the lock file. This failure is "sticky"; a node restart is needed to remedy the affected core.
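   
   The failure mode can be sketched generically (a hypothetical illustration, not Solr's actual SolrCore code): a reference-counted handle where the container's closer thread decrefs a core that is still in use, so the user's eventual close() drops the count to zero and performs the real release while the user believed it merely held a reference.
   
   ```java
   import java.util.concurrent.atomic.AtomicInteger;
   
   // Hypothetical sketch of reference-counted close semantics.
   class RefCountedResource {
     private final AtomicInteger refCount = new AtomicInteger(1); // creator holds one ref
     private boolean released = false;
   
     void acquire() { refCount.incrementAndGet(); }
   
     void close() {
       if (refCount.decrementAndGet() == 0) {
         released = true; // the "real" close, e.g. removing an index lock file
       }
     }
   
     boolean isReleased() { return released; }
   }
   
   public class RefCountDemo {
     public static void main(String[] args) {
       RefCountedResource core = new RefCountedResource();
       core.acquire(); // a long-running request takes a reference
       core.close();   // the container's closer thread decrefs anyway (count: 2 -> 1)
       // The request is still using the core, yet its own close now does the real release:
       core.close();   // count: 1 -> 0, so this performs the real close
       System.out.println("released=" + core.isReleased());
     }
   }
   ```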
   
   In this test setting, the long-running request hung indefinitely. The test fails because it joins the thread for a limited period of time and then asserts that the thread is no longer alive. When the request internally reaches the end and closes its SolrCore reference, this performs a real core close, not merely a decref, because the core closer thread already decref'ed it. For reasons I'm unsure of, I see it hanging in CachingDirectoryFactory.close; there is a twelve-second wait in there, "Timeout waiting for all directory ref counts to be released".
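   
   The join-then-assert pattern the test relies on can be sketched as follows (a minimal standalone illustration, not the PR's test code; the 100 ms sleep stands in for the long request completing normally):
   
   ```java
   // Bounded join: wait a limited time for the worker, then treat a still-alive
   // thread as a hang and fail. A hung request would exceed the join timeout.
   public class JoinTimeoutDemo {
     public static void main(String[] args) throws InterruptedException {
       Thread thread = new Thread(() -> {
         try {
           Thread.sleep(100); // stands in for the long request finishing
         } catch (InterruptedException ignored) {
         }
       }, "longRequest");
       thread.start();
       thread.join(5000); // join for a limited period only
       if (thread.isAlive()) {
         throw new AssertionError("request thread appears hung");
       }
       System.out.println("thread finished");
     }
   }
   ```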




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


