[
https://issues.apache.org/jira/browse/SOLR-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831703#comment-16831703
]
Andrzej Bialecki commented on SOLR-12833:
------------------------------------------
[~yuanyun.cn] Hmm, I'm seeing occasional lock-ups when beasting
{{PeerSyncTest}} with stacktraces that point to the newly refactored methods in
{{DistributedUpdateProcessor}} and {{VersionBucket}} (specifically, the code
that is using the intrinsic monitors for locking). If we can't find the reason
soon then we may need to revert this patch, at least from {{branch_8x}} and
{{branch_8_1}}.
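For context, the general pattern in question (a timed wait on an intrinsic monitor) can be sketched roughly as follows. This is an illustration only, not the code from the patch: the class {{MonitorBucketSketch}} and the methods {{advanceTo}} and {{waitForVersion}} are hypothetical, and {{runWithLock}} / {{awaitNanos}} merely borrow the names visible in the stacktrace frames. The sketch shows how an interrupt during {{Object.wait}} can surface as a {{RuntimeException}} wrapping an {{InterruptedException}}.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for a bucket guarded by its intrinsic monitor.
// NOT the Solr class; names and structure are illustrative assumptions.
class MonitorBucketSketch {
    private long highest = 0;

    // Runs the action while holding this object's monitor.
    synchronized <T> T runWithLock(Supplier<T> action) {
        return action.get();
    }

    // Timed wait on the monitor; the caller must already hold it.
    void awaitNanos(long nanos) {
        if (nanos <= 0) {
            return; // wait(0, 0) would block indefinitely, so skip it
        }
        try {
            long millis = TimeUnit.NANOSECONDS.toMillis(nanos);
            int extraNanos = (int) (nanos % 1_000_000L);
            wait(millis, extraNanos);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and rethrow unchecked.
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Records a newly seen version and wakes up all waiters.
    synchronized void advanceTo(long version) {
        highest = Math.max(highest, version);
        notifyAll();
    }

    // Waits until the given version has been seen or the timeout elapses.
    // Returns true if the version arrived in time, false on timeout.
    boolean waitForVersion(long version, long timeoutNanos) {
        return runWithLock(() -> {
            long deadline = System.nanoTime() + timeoutNanos;
            while (highest < version) { // loop guards against spurious wakeups
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                awaitNanos(remaining);
            }
            return true;
        });
    }
}
```

In this sketch a waiter makes progress only if some other thread calls {{advanceTo}} (and thus {{notifyAll}}) or the timeout expires, so a missed notification or an overly long timeout would look like a lock-up from the outside.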
Here's an example stacktrace:
{code:java}
[beaster] 2> 9903 INFO (qtp1564460830-112) [ x:collection1]
o.a.s.s.SolrIndexSearcher Opening [Searcher@2d936b61[collection1] realtime]
[beaster] 2> 9905 INFO (qtp1564460830-112) [ x:collection1]
o.a.s.s.SolrIndexSearcher Opening [Searcher@2c12d484[collection1] realtime]
[beaster] 2> 9907 INFO (qtp1564460830-112) [ x:collection1]
o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/jeqeo/s path=/update
params={update.distrib=FROMLEADER&_version_=6004&wt=javabin&version=2}{deleteByQuery=val_i_dvo:6
(-6004)} 0 11
[beaster] 2> 9908 INFO (qtp1627373062-114) [ x:collection1]
o.a.s.u.PeerSync PeerSync: core=collection1 url= START
replicas=[http://127.0.0.1:50049/jeqeo/s/collection1] nUpdates=100
[beaster] 2> 9909 INFO (qtp1564460830-56) [ x:collection1]
o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0
result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110,
maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219,
numDocs=219, maxDoc=111}
[beaster] 2> 9909 INFO (qtp1564460830-56) [ x:collection1]
o.a.s.c.S.Request [collection1] webapp=/jeqeo/s path=/get
params={distrib=false&qt=/get&getFingerprint=9223372036854775807&wt=javabin&version=2}
status=0 QTime=0
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1]
o.a.s.u.IndexFingerprint IndexFingerprint millis:0.0
result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110,
maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219,
numDocs=219, maxDoc=110}
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1]
o.a.s.u.PeerSync We are already in sync. No need to do a PeerSync
[beaster] 2> 9910 INFO (qtp1627373062-114) [ x:collection1]
o.a.s.c.S.Request [collection1] webapp=/jeqeo/s path=/get
params={qt=/get&getVersions=100&sync=http://127.0.0.1:50049/jeqeo/s/collection1&wt=javabin&version=2}
status=0 QTime=2
[beaster] 2> 129922 INFO (TEST-PeerSyncTest.test-seed#[A1B6A536E7B4423F])
[ ] o.a.s.SolrTestCaseJ4 ###Ending test
...
[beaster] 2> 144960 INFO (qtp1564460830-112) [ x:collection1]
o.a.s.u.p.LogUpdateProcessorFactory [collection1] webapp=/jeqeo/s path=/update
params={update.distrib=FROMLEADER&distrib.inplace.prevversion=6000&wt=javabin&version=2}{}
0 135044
[beaster] 2> 144960 ERROR (qtp1564460830-112) [ x:collection1]
o.a.s.h.RequestHandlerBase java.lang.RuntimeException:
java.lang.InterruptedException
[beaster] 2> at
org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:68)
[beaster] 2> at
org.apache.solr.update.processor.DistributedUpdateProcessor.doWaitForDependentUpdates(DistributedUpdateProcessor.java:593)
[beaster] 2> at
org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$waitForDependentUpdates$1(DistributedUpdateProcessor.java:536)
[beaster] 2> at
org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
[beaster] 2> at
org.apache.solr.update.processor.DistributedUpdateProcessor.waitForDependentUpdates(DistributedUpdateProcessor.java:536)
[beaster] 2> at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:327)
[beaster] 2> at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
...
[beaster] 2> Caused by: java.lang.InterruptedException
[beaster] 2> at java.base/java.lang.Object.wait(Native Method)
[beaster] 2> at
org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:66)
[beaster] 2> ... 52 more
{code}
Here's how to reproduce this (it usually fails within the first 10 rounds):
{code:java}
cd solr/core
ant beast -Dbeast.iters=50 -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.slow=true -Dtests.badapples=true -Dtests.asserts=true
{code}
Some of the seeds that failed during beasting (but don't seem to fail when
running standalone):
{code:java}
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=35EDD6492A06CFE -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=fr-CD -Dtests.timezone=Europe/Brussels -Dtests.asserts=true \
  -Dtests.file.encoding=ISO-8859-1
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=A1B6A536E7B4423F -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=en-NF -Dtests.timezone=America/Dawson -Dtests.asserts=true \
  -Dtests.file.encoding=US-ASCII
ant test -Dtestcase=PeerSyncTest -Dtests.method=test \
  -Dtests.seed=A9180C308CF9355B -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=kab -Dtests.timezone=CTT -Dtests.asserts=true \
  -Dtests.file.encoding=ISO-8859-1
{code}
I also managed to capture a full thread dump when it locked up (see the
attachment).
> Use timed-out lock in DistributedUpdateProcessor
> ------------------------------------------------
>
> Key: SOLR-12833
> URL: https://issues.apache.org/jira/browse/SOLR-12833
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: update, UpdateRequestProcessors
> Affects Versions: 7.5, 8.0
> Reporter: jefferyyuan
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 7.7, 8.0
>
> Attachments: SOLR-12833-noint.patch, SOLR-12833.patch,
> SOLR-12833.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> There is a synchronized block that blocks other update requests whose IDs fall
> in the same hash bucket. An update waits indefinitely until it acquires the
> lock at the synchronized block, which can be a problem in some cases.
>
> Some add/update requests (for example, updates with spatial/shape analysis)
> may take a long time (30+ seconds or even more), which can cause the request
> to time out and fail.
> The client may retry the same request multiple times or for several minutes,
> which makes things worse.
> The server side receives all the update requests, but all except one can do
> nothing and have to wait. This wastes precious memory and CPU resources.
> We have seen cases where 2000+ threads are blocked at the synchronized lock
> and only a few updates are making progress. Each thread takes 3+ MB of memory,
> which causes OOM.
> Also, if the update can't get the lock within the expected time range, it's
> better to fail fast.
>
> We can have one configuration in solrconfig.xml:
> updateHandler/versionLock/timeInMill, so users can specify how long they want
> to wait for the version bucket lock.
> The default value can be -1, so that it behaves the same as before: wait
> forever until it gets the lock.
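
The proposal in the description above (a configurable lock timeout, with -1 meaning "wait forever") could be sketched along these lines. This is a minimal illustration under stated assumptions, not the actual patch: the class {{TimedBucketLockSketch}}, its constructor argument and the exception type are hypothetical, and reading the value from solrconfig.xml is left out entirely.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of a per-bucket lock with an optional timeout.
// NOT the Solr implementation; names are illustrative assumptions.
class TimedBucketLockSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final long lockTimeoutMs; // negative value: wait forever

    TimedBucketLockSketch(long lockTimeoutMs) {
        this.lockTimeoutMs = lockTimeoutMs;
    }

    // Runs the action under the lock, failing fast if the lock cannot be
    // acquired within the configured timeout.
    <T> T runWithLock(Supplier<T> action) {
        boolean acquired;
        if (lockTimeoutMs < 0) {
            lock.lock(); // old behaviour: block until the lock is free
            acquired = true;
        } else {
            try {
                acquired = lock.tryLock(lockTimeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                    "Interrupted while waiting for the bucket lock", e);
            }
        }
        if (!acquired) {
            throw new IllegalStateException(
                "Could not acquire the bucket lock within " + lockTimeoutMs + " ms");
        }
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }
}
```

With a non-negative timeout, a request that cannot get the bucket lock in time fails with an exception instead of tying up a thread, which is the fail-fast behaviour the description asks for.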