magibney commented on PR #818: URL: https://github.com/apache/solr/pull/818#issuecomment-1112654117
From @dsmiley's [comment on PR 815](https://github.com/apache/solr/pull/815#issuecomment-1106880696): >I suspect the searcher and/or request being open after the core close will keep the core increment one higher and thus not actually closed until the searcher/request is. I dug into this a bit more and it's quite tricky. The problem manifests when the `core.close()` call happens before `searcher.decref()` ... but only if the `core.close()` call in question is the _last_ call to `close()` for the given core. Until `core.refCount` becomes `0`, `core.close()` is basically just bookkeeping; but the _last_ `core.close()` proceeds to `doClose()`, which calls `directoryFactory.close()` -- the `CachingDirectoryFactory` implementation of which blocks waiting for its cached `Directory` instances to be released. The `SolrIndexSearcher` whose closing was reordered to follow `core.close()` actually holds a reference to the cached `Directory` instance, not the `SolrCore` itself, so the thread indeed leaks/resource is not closed. Looking more closely at why this issue addressed by #815 was hard to reproduce: the bulk of the replication in `TestReplicationHandler` takes place via explicit `fetchindex` command -- almost always with the special test param `wait=true`, causing the associated replication fetch to take place (from start to finish) within the context/lifetime of the enclosing `SolrQueryRequest`, which itself caries a reference to the `SolrCore`, ensuring that the replication code will _not_ be the last call to `core.close()`. So `core.close()` amounts to just a decrement of a refcount -- it doesn't block, and the `searcher.decref()` happens ok, even though out-of-order. The problem happens when _polling_ replication ("indexFetcher" thread) happens to hit around the time of core close -- all the other core references have been decremented, and the affected code is left holding the only remaining reference to the core _and_ (via the `SolrIndexSearcher`) to the `Directory` instance. Polling replication is vulnerable (whereas explicit `fetchindex` command is not) because polling is a scheduled task that doesn't take place within the context of a `SolrQueryRequest` holding references to the `SolrCore`. Indeed, I verified that the problem addressed by #815 could be equally (though hackily) addressed by wrapping the polling `doFetch()` in a try-with-resources block artificially incrementing the refCount for the SolrCore! I'm not sure whether there might be other cases that would trigger this, especially with other areas of the code; but I think the prudent thing to do for now would be to take the instances identified in this PR and restore the original close order -- i.e. `SolrIndexSearcher.decref()` (or `SolrQueryRequest.close()`, which in turn may call `SolrIndexSearcher.decref()`) _before_ calling `close()` on the associated `SolrCore`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org