[
https://issues.apache.org/jira/browse/SOLR-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erick Erickson updated SOLR-7836:
---------------------------------
Attachment: SOLR-7836.patch
This might do it. I'm running 1,000 iterations (or until morning, whichever
comes first) but it's gone through 150 or so already which was usually more
than enough to trigger the new deadlock I found so I'm hopeful. Actually, I
think this really the old deadlock, the first fix wasn't very good.
There are two things I'd appreciate any opinions on:
1> this patch moves the UpdateLog.add() command out of the ref counted
IndexWriter in DirectUpdateHandler2.addDoc0() in multiple places (refactored to
methods). But UpdateLog.add gets a new IndexWriter itself which is where the
deadlock appears to be as it's fighting with CoreContainer.reload() which also
calls DefaultSolrCoreState.newIndexWriter.
2> There are two nocommits, but they are to make it easy for someone to see the
other bit that other eyes would be most welcome on, I moved
if (ulog != null) ulog.add(cmd, true);
out of a synchronized block. This fits the pattern in other places in that code
too, so I'm not too worried, but wanted to draw attention to it if anyone wants
to look. Actually, there are a lot of operations on ulog.something that are
synchronized on solrCoreState.getUpdateLock(), and a bunch that aren't. What's
up there? The change marked by //nocommit matches the (non-synchronized) usages
in addDoc0() though.
Anyway, the problem was in DirectUpdateHandler2.addDoc0(). The ulog.add command
line being inside a ref-counted IndexWriter when adding documents (addAndDelete
case). ulog.add eventually tries to get a new Searcher which can be deadlocked
with another thread called from SolrCore.reload since the reload wants a new
IndexWriter.
I haven't run precommit or the full test suite yet, I want to make sure the
iterations I'm doing tonight work.
This looks more intrusive than it is. I refactored some methods out of
DirectUpdateHandler2.addDoc0() in order to make this pattern more visible. The
original code concealed the fact that the ref counted index writer surrounded
all the code, including the ulog.add.
this is the pattern for all the refactored methods:
RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
try {
IndexWriter writer = iw.get();
do the right thing
} finally {
iw.decref();
}
if (ulog != null) ulog.add(cmd, true);
I think this is a better approach than the original fix which passed the index
writer down to the addAndDelete method. we'd have had to pass the IndexWriter
on to the ulog.add command or something which would have been a pain (even if
it was possible).
New test case is in this patch. Note that it's a race condition to get this to
fail, so it needs to be run a lot. And using tests.iters apparently triggers
the timeouts so....
> Possible deadlock when closing refcounted index writers.
> --------------------------------------------------------
>
> Key: SOLR-7836
> URL: https://issues.apache.org/jira/browse/SOLR-7836
> Project: Solr
> Issue Type: Bug
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Attachments: SOLR-7836.patch, SOLR-7836.patch
>
>
> Preliminary patch for what looks like a possible race condition between
> writerFree and pauseWriter in DefaultSorlCoreState.
> Looking for comments and/or why I'm completely missing the boat.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]