[
https://issues.apache.org/jira/browse/SOLR-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413936#comment-13413936
]
Mark Miller commented on SOLR-3620:
-----------------------------------
I've committed an attempted fix.
From what I can tell this is a shutdown deadlock issue that involves recovery
threads.
When a SolrCore hits a ref count of 0 (it's closed by everyone using it) it will
try to cancel any recovery thread.
This is bad if it happens while the lock on 'CoreContainer#cores' is held: two
threads can end up waiting on each other. As far as I know, this only happens in
CoreContainer#shutdown, which holds the cores lock and calls close on all the
cores, which in turn can cause a recovery to be canceled. That sequence is what
can lead to deadlock. We defend against this by canceling all recoveries
*before* getting the cores lock in CoreContainer#shutdown. We do that for just
this case.
Otherwise, the 'cores' lock will be held when calling SolrCore#close, which
could trigger a recovery cancel (because the ref count hits 0), which waits for
the recovery thread to finish ('#join'). But the recovery thread could be in
the middle of trying to recover, where it sometimes gets a core from the
CoreContainer, which takes the 'cores' lock. The recovery thread is then waiting
for #shutdown to give up that lock while #shutdown waits for the recovery thread
to finish its loop and die.
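To make the ordering concrete, here is a minimal sketch of the defense described
above: cancel and join the recovery threads first, and only then take the lock.
This is not the Solr code. The class, the 'coresLock' field and the method names
are all made up for illustration, and the fake recovery loop only mimics a
recovery that occasionally needs the 'cores' lock.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these names exist in Solr.
final class ShutdownOrderSketch {
    // Hypothetical stand-in for the lock guarding CoreContainer#cores.
    private final Object coresLock = new Object();
    private final List<Thread> recoveryThreads = new ArrayList<>();
    private volatile boolean cancelRequested = false;

    // Start a fake recovery loop that repeatedly takes the cores lock,
    // the way a real recovery sometimes looks up a core.
    void startRecovery() {
        Thread t = new Thread(() -> {
            while (!cancelRequested) {
                synchronized (coresLock) {
                    // pretend to fetch a core from the container
                }
                Thread.yield();
            }
        });
        t.setDaemon(true);
        recoveryThreads.add(t);
        t.start();
    }

    void shutdown() {
        // Step 1: cancel and join recoveries while NOT holding coresLock.
        // Joining here is safe because the recovery threads can still
        // acquire the lock they need in order to finish their loop.
        cancelRequested = true;
        for (Thread t : recoveryThreads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        // Step 2: only now take the lock and close the cores.
        synchronized (coresLock) {
            // close every core here; no recovery thread is left to wait on
        }
    }

    boolean allRecoveriesStopped() {
        for (Thread t : recoveryThreads) {
            if (t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ShutdownOrderSketch container = new ShutdownOrderSketch();
        container.startRecovery();
        container.startRecovery();
        container.shutdown();
        System.out.println("stopped: " + container.allRecoveriesStopped());
    }
}
```

If the join in step 1 were moved inside the synchronized block of step 2, a
recovery thread blocked on 'coresLock' could never exit, and shutdown would wait
on it forever, which is the deadlock described above.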
So how is it happening here? And why did a collections API commit to improve
tests and add a RELOAD command expose this?
First, the only way I can think of that this could be happening, even with our
little defensive cancelRecovery loop, is that a recovery is somehow getting
kicked off again before shutdown completes.
So the fix I have tried is to add a bit of code to make sure recoveries do not
start after the CoreContainer#shutdown method starts. Hopefully that plugs this.
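One way such a guard could look, as a rough sketch rather than the actual
commit: a flag that shutdown sets before doing anything else, checked under the
same lock that a recovery must take to start. All names here
('RecoveryGateSketch', 'tryStartRecovery', 'beginShutdown') are hypothetical.

```java
// Illustrative only: none of these names exist in Solr.
final class RecoveryGateSketch {
    private final Object recoveryLock = new Object();
    private boolean shuttingDown = false; // guarded by recoveryLock
    private int started = 0;              // guarded by recoveryLock

    // Returns true if a recovery was allowed to start, false if it was
    // refused because shutdown has already begun.
    boolean tryStartRecovery() {
        synchronized (recoveryLock) {
            if (shuttingDown) {
                return false;
            }
            started++; // a real implementation would launch the thread here
            return true;
        }
    }

    // Called first thing in shutdown, before the cancel loop runs.
    void beginShutdown() {
        synchronized (recoveryLock) {
            shuttingDown = true;
        }
    }

    int startedCount() {
        synchronized (recoveryLock) {
            return started;
        }
    }

    public static void main(String[] args) {
        RecoveryGateSketch gate = new RecoveryGateSketch();
        System.out.println(gate.tryStartRecovery()); // before shutdown
        gate.beginShutdown();
        System.out.println(gate.tryStartRecovery()); // after shutdown
    }
}
```

Because the check and the start happen under one lock, any recovery that got
through the gate did so before the flag was set, so a cancel loop that runs
after 'beginShutdown' would see it. Nothing new can slip in afterwards.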
If that is indeed the issue, this problem already existed, and the new beefed-up
collections API test exposed it because it's a test that uses more SolrCores in
a single instance than almost any other test. With more cores trying to recover
during shutdown, I think it's easier to expose this deadlock situation.
That's my initial guess and fix attempt.
> Almost every test relating to cloud hangs since July 12, 2012
> -------------------------------------------------------------
>
> Key: SOLR-3620
> URL: https://issues.apache.org/jira/browse/SOLR-3620
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Uwe Schindler
> Assignee: Mark Miller
> Priority: Blocker
>
> I have no idea, but please review the posts on the de@lao mailing list today!