magibney commented on PR #818:
URL: https://github.com/apache/solr/pull/818#issuecomment-1112654117

   From @dsmiley's [comment on PR 
815](https://github.com/apache/solr/pull/815#issuecomment-1106880696):
   
   >I suspect the searcher and/or request being open after the core close will 
keep the core increment one higher and thus not actually closed until the 
searcher/request is.
   
   I dug into this a bit more and it's quite tricky.
   
   The problem manifests when the `core.close()` call happens before 
`searcher.decref()` ... but only if the `core.close()` call in question is the 
_last_ call to `close()` for the given core. Until `core.refCount` becomes `0`, 
`core.close()` is basically just bookkeeping; but the _last_ `core.close()` 
proceeds to `doClose()`, which calls `directoryFactory.close()` -- the 
`CachingDirectoryFactory` implementation of which blocks waiting for its cached 
`Directory` instances to be released. The `SolrIndexSearcher` whose closing was 
reordered to follow `core.close()` actually holds a reference to the cached 
`Directory` instance, not the `SolrCore` itself, so the thread indeed 
leaks/resource is not closed.
   
   Looking more closely at why this issue addressed by #815 was hard to 
reproduce: the bulk of the replication in `TestReplicationHandler` takes place 
via explicit `fetchindex` command -- almost always with the special test param 
`wait=true`, causing the associated replication fetch to take place (from start 
to finish) within the context/lifetime of the enclosing `SolrQueryRequest`, 
which itself caries a reference to the `SolrCore`, ensuring that the 
replication code will _not_ be the last call to `core.close()`. So 
`core.close()` amounts to just a decrement of a refcount -- it doesn't block, 
and the `searcher.decref()` happens ok, even though out-of-order.
   
   The problem happens when _polling_ replication ("indexFetcher" thread) 
happens to hit around the time of core close -- all the other core references 
have been decremented, and the affected code is left holding the only remaining 
reference to the core _and_ (via the `SolrIndexSearcher`) to the `Directory` 
instance. Polling replication is vulnerable (whereas explicit `fetchindex` 
command is not) because polling is a scheduled task that doesn't take place 
within the context of a `SolrQueryRequest` holding references to the 
`SolrCore`. Indeed, I verified that the problem addressed by #815 could be 
equally (though hackily) addressed by wrapping the polling `doFetch()` in a 
try-with-resources block artificially incrementing the refCount for the 
SolrCore!
   
   I'm not sure whether there might be other cases that would trigger this, 
especially with other areas of the code; but I think the prudent thing to do 
for now would be to take the instances identified in this PR and restore the 
original close order -- i.e. `SolrIndexSearcher.decref()` (or 
`SolrQueryRequest.close()`, which in turn may call 
`SolrIndexSearcher.decref()`) _before_ calling `close()` on the associated 
`SolrCore`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to