dsmiley commented on a change in pull request #580:
URL: https://github.com/apache/solr/pull/580#discussion_r797956081
##########
File path: solr/core/src/test/org/apache/solr/core/TestLazyCores.java
##########
@@ -899,4 +890,67 @@ private void check10(SolrCore core) {
         , "//result[@numFound='10']"
         );
   }
+
+  public void testDontEvictUsedCore() throws Exception {
+    // If a core is being used for a long time (say a long indexing batch) but nothing else for it,
+    // and if the transient cache has pressure and thus wants to unload a core, we should not
+    // unload it (yet).
+
+    CoreContainer cc = init();
+    String[] transientCoreNames = new String[]{
+        "collection2",
+        "collection3",
+        "collection6",
+        "collection7",
+        "collection8",
+        "collection9"
+    };
+
+    try (LogListener logs = LogListener.info(TransientSolrCoreCacheDefault.class.getName())
+        .substring("NOT evicting transient core [" + transientCoreNames[0] + "]")) {
+      cc.waitForLoadingCoresToFinish(1000);
+      var solr = new EmbeddedSolrServer(cc, null);
+      final var longReqTimeMs = 2000; // if lowered too much, the test will fail on a slow/busy CI
+
+      // First, start a long request on the first transient core
+      var thread = new Thread(() -> {
+        try {
+          // TODO Solr ought to have a query test "latch" mechanism so we don't sleep arbitrarily
+          solr.query(transientCoreNames[0], params("q", "{!func}sleep(" + longReqTimeMs + ",1)"));
+        } catch (SolrServerException | IOException e) {
+          fail(e.toString());
+        }
+      }, "longRequest");
+      thread.start();
+
+      System.out.println("Inducing pressure on cache by querying many cores...");
+      // Now hammer on other transient cores to create transient cache pressure
+      for (int round = 0; round < 5 && logs.getCount() == 0; round++) {
+        // note: we skip over the first; we want the first to remain non-busy
+        for (int i = 1; i < transientCoreNames.length; i++) {
+          solr.query(transientCoreNames[i], params("q", "*:*"));
+        }
+      }
+      // Show that the cache logs that it was asked to evict but did not.
+      // To show the bug behavior, comment this out and also comment out the corresponding logic
+      // that fixes it at the spot this message is logged.
+      assertTrue(logs.getCount() > 0);

Review comment:

;-) It's complicated. In a realistic setting, I saw Solr lose track of the fact that the SolrCore was actually being used, despite the closer thread calling close() on it. The consequence was that the lock file remained on disk (while the core was still open), so a future request for the core would try to open a new SolrCore but fail because the lock file was still present. This failure is "sticky"; a node restart is needed to recover the affected core.

In this test setting, the long-running request hung indefinitely. The test fails because it joins for a limited period of time and then asserts that the thread is no longer alive. When the request internally reaches the end and closes its SolrCore reference, this does a real core close, not merely a decref, because the core-closer thread had already decref'ed it. For reasons I'm unsure of, I see it hanging in CachingDirectoryFactory.close; there's a twelve-second wait in there: "Timeout waiting for all directory ref counts to be released".
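To illustrate the close()-as-decref semantics described above, here is a minimal sketch of the general ref-count pattern at play; it is illustrative only, not SolrCore's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a resource whose close() is merely a decref until the last holder.
class RefCountedResource implements AutoCloseable {
  private final AtomicInteger refCount = new AtomicInteger(1); // creator holds one ref

  /** Each additional user takes a reference before using the resource. */
  RefCountedResource incref() {
    refCount.incrementAndGet();
    return this;
  }

  /** close() is a decref; only the final decref performs the real close. */
  @Override
  public void close() {
    if (refCount.decrementAndGet() == 0) {
      doRealClose();
    }
  }

  private void doRealClose() {
    // Free underlying resources, e.g. release the on-disk index lock.
    // If a cache evicts and decrefs a core that a request still holds, the
    // request's own eventual close() becomes the real close, which is the
    // surprising situation described in the comment above.
  }
}
```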
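The "sticky" part is the Lucene index write lock: while a leaked, never-really-closed core still holds it, any attempt to open a new core over the same index fails. A small sketch using Lucene's stock locking API (the `IndexLockProbe.canLock` helper and its `indexDir` argument are hypothetical; Solr actually goes through its DirectoryFactory):

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;

class IndexLockProbe {
  /** Hypothetical helper: true if nothing currently holds the index's write lock. */
  static boolean canLock(Path indexDir) throws IOException {
    try (Directory dir = FSDirectory.open(indexDir);
         Lock lock = dir.obtainLock(IndexWriter.WRITE_LOCK_NAME)) {
      return true; // lock acquired; released when the try block closes it
    } catch (LockObtainFailedException e) {
      return false; // a still-open (leaked) core holds the lock, so reopening fails
    }
  }
}
```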
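And the hang surfaces as a test failure through a bounded join. A sketch of how the truncated test above could end; the timeout value is an assumption, since the hunk stops before the actual join:

```java
      // ... continuing testDontEvictUsedCore() after the cache-pressure loop:
      thread.join(15_000); // generous bound; a hung core close keeps the thread alive
      assertFalse("long request never completed", thread.isAlive());
    }
```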