[
https://issues.apache.org/jira/browse/SOLR-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shalin Shekhar Mangar updated SOLR-6956:
----------------------------------------
Attachment: DeleteReplicaTest-failure.log
The full test log from jenkins is attached.
The following is the interesting excerpt from the log:
{code}
[junit4] 2> 357522 T2350 oasco.ZkStateWriter.writePendingUpdates going to
update_collection /collections/delLiveColl/state.json version: 10
[junit4] 2> 357523 T2349 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2450 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2425 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2381 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2450 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 11
[junit4] 2> 357524 T2425 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 11
[junit4] 2> 357524 T2381 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 11
[junit4] 2> 357559 T2321
oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished -
collection: delLiveColl
[junit4] 2> 357562 T2372 oass.SolrDispatchFilter.handleAdminRequest
[admin] webapp=null path=/admin/cores
params={core=delLiveColl_shard1_replica1&action=STATUS&indexInfo=true&wt=javabin&version=2}
status=0 QTime=1
[junit4] 2> 357565 T2339 oasha.CollectionsHandler.handleRemoveReplica
Remove replica:
replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2
[junit4] 2> 357566 T2364 oasc.DistributedQueue$LatchWatcher.process
NodeChildrenChanged fired on path /overseer/collection-queue-work state
SyncConnected
[junit4] 2> 357567 T2351 oasc.OverseerCollectionProcessor.run Overseer
Collection Processor: Get the message
id:/overseer/collection-queue-work/qn-0000000002 message:{
[junit4] 2> "operation":"deletereplica",
[junit4] 2> "collection":"delLiveColl",
[junit4] 2> "shard":"shard1",
[junit4] 2> "replica":"core_node1",
[junit4] 2> "onlyIfDown":"true"}
[junit4] 2> 357567 T2489 oasc.OverseerCollectionProcessor.processMessage
WARN OverseerCollectionProcessor.processMessage : deletereplica , {
[junit4] 2> "operation":"deletereplica",
[junit4] 2> "collection":"delLiveColl",
[junit4] 2> "shard":"shard1",
[junit4] 2> "replica":"core_node1",
[junit4] 2> "onlyIfDown":"true"}
[junit4] 2> 357567 T2489 oasc.SolrException.log ERROR Collection:
delLiveColl operation: deletereplica
failed:org.apache.solr.common.SolrException: Attempted to remove replica :
delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering'
[junit4] 2> at
org.apache.solr.cloud.OverseerCollectionProcessor.deleteReplica(OverseerCollectionProcessor.java:1079)
[junit4] 2> at
org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:631)
[junit4] 2> at
org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2862)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit4] 2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit4] 2> at java.lang.Thread.run(Thread.java:745)
[junit4] 2>
[junit4] 2> 357569 T2489 oasc.OverseerCollectionProcessor$Runner.run
Overseer Collection Processor: Message
id:/overseer/collection-queue-work/qn-0000000002 complete, response:{Operation
deletereplica caused exception:=org.apache.solr.common.SolrException: Attempted
to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but
state is 'recovering',exception={msg=Attempted to remove replica :
delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is
'recovering',rspCode=400}}
[junit4] 2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process
NodeDataChanged fired on path /overseer/collection-queue-work/qnr-0000000002
state SyncConnected
[junit4] 2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process
NodeChildrenChanged fired on path /overseer/collection-queue-work state
SyncConnected
[junit4] 2> 357571 T2364 oasc.DistributedQueue$LatchWatcher.process
NodeChildrenChanged fired on path /overseer/collection-queue-work state
SyncConnected
[junit4] 2> 357572 T2339 oass.SolrDispatchFilter.handleAdminRequest
[admin] webapp=null path=/admin/collections
params={replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2}
status=400 QTime=7
[junit4] 2> 357574 T2321 oas.SolrTestCaseJ4.tearDown ###Ending
deleteLiveReplicaTest
[junit4] 2> 357577 T2321 oasc.ChaosMonkey.monkeyLog monkey: stop shard!
55369
[junit4] 2> 357577 T2321 oasc.CoreContainer.shutdown Shutting down
CoreContainer instance=802937226
[junit4] 2> 357577 T2321 oasc.ZkController.publish publishing
core=collection1 state=down collection=control_collection
[junit4] 2> 357577 T2321 oasc.ZkController.publish numShards not found on
descriptor - reading it from system property
[junit4] 2> 357578 T2364 oasc.DistributedQueue$LatchWatcher.process
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
[junit4] 2> 357578 T2321 oasc.ZkController.publish publishing
core=delLiveColl_shard1_replica2 state=down collection=delLiveColl
[junit4] 2> 357579 T2350 oasc.Overseer$ClusterStateUpdater.run
processMessage: queueSize: 1, message = {
[junit4] 2> "core":"collection1",
[junit4] 2> "core_node_name":"core_node1",
[junit4] 2> "roles":null,
[junit4] 2> "base_url":"http://127.0.0.1:55369",
[junit4] 2> "node_name":"127.0.0.1:55369_",
[junit4] 2> "numShards":"2",
[junit4] 2> "state":"down",
[junit4] 2> "shard":"shard1",
[junit4] 2> "collection":"control_collection",
[junit4] 2> "operation":"state"} current state version: 6
[junit4] 2> 357579 T2321 oasc.SolrCore.close [collection1] CLOSING
SolrCore org.apache.solr.core.SolrCore@6eb4277d
[junit4] 2> 357579 T2350 oasco.ReplicaMutator.updateState Update state
numShards=2 message={
[junit4] 2> "core":"collection1",
[junit4] 2> "core_node_name":"core_node1",
[junit4] 2> "roles":null,
[junit4] 2> "base_url":"http://127.0.0.1:55369",
[junit4] 2> "node_name":"127.0.0.1:55369_",
[junit4] 2> "numShards":"2",
[junit4] 2> "state":"down",
[junit4] 2> "shard":"shard1",
[junit4] 2> "collection":"control_collection",
[junit4] 2> "operation":"state"}
[junit4] 2> 357579 T2321 oasc.ZkController.unRegisterConfListener a
listener was removed because of core close
[junit4] 2> 357579 T2321 oasu.DirectUpdateHandler2.close closing
DirectUpdateHandler2{commits=0,autocommits=0,soft
autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
[junit4] 2> 357579 T2321 oasu.SolrCoreState.decrefSolrCoreState Closing
SolrCoreState
[junit4] 2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter
SolrCoreState ref count has reached 0 - closing IndexWriter
[junit4] 2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter
closing IndexWriter with IndexWriterCloser
[junit4] 2> 357580 T2364 oascc.ZkStateReader$2.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2381 oascc.ZkStateReader$2.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2425 oascc.ZkStateReader$2.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2450 oascc.ZkStateReader$2.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2403 oascc.ZkStateReader$2.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2321 oasc.SolrCore.closeSearcher [collection1] Closing
main searcher on request.
[junit4] 2> 357586 T2350 oasc.Overseer$ClusterStateUpdater.run
processMessage: queueSize: 1, message = {
[junit4] 2> "core":"delLiveColl_shard1_replica2",
[junit4] 2> "core_node_name":"core_node2",
[junit4] 2> "roles":null,
[junit4] 2> "base_url":"http://127.0.0.1:55369",
[junit4] 2> "node_name":"127.0.0.1:55369_",
[junit4] 2> "numShards":"2",
[junit4] 2> "state":"down",
[junit4] 2> "shard":"shard1",
[junit4] 2> "collection":"delLiveColl",
[junit4] 2> "operation":"state"} current state version: 7
[junit4] 2> 357587 T2350 oasco.ReplicaMutator.updateState Update state
numShards=2 message={
[junit4] 2> "core":"delLiveColl_shard1_replica2",
[junit4] 2> "core_node_name":"core_node2",
[junit4] 2> "roles":null,
[junit4] 2> "base_url":"http://127.0.0.1:55369",
[junit4] 2> "node_name":"127.0.0.1:55369_",
[junit4] 2> "numShards":"2",
[junit4] 2> "state":"down",
[junit4] 2> "shard":"shard1",
[junit4] 2> "collection":"delLiveColl",
[junit4] 2> "operation":"state"}
[junit4] 2> 357589 T2490 oasc.DistributedQueue$LatchWatcher.process
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
[junit4] 2> 357589 T2350 oasco.ZkStateWriter.writePendingUpdates going to
update_collection /collections/delLiveColl/state.json version: 11
[junit4] 2> 357589 T2425 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357589 T2450 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357589 T2381 oascc.ZkStateReader$7.process A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357590 T2425 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 12
[junit4] 2> 357590 T2381 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 12
[junit4] 2> 357590 T2450 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 12
[junit4] 2> 357593 T2349 oascc.ZkStateReader.updateWatchedCollection
Updating data for delLiveColl to ver 12
{code}
In this run, the test uses 4 nodes, and the collection being deleted has 2 shards
with 4 replicas distributed across those 4 nodes. The overseer node itself hosts
a replica of the collection being deleted.
Notice that the Overseer updates the collection state, currently at version 10,
and 4 watcher threads fire to update the state of 'delLiveColl' to version 11.
However, only three of those watchers actually invoke
ZkStateReader.updateWatchedCollection; thread T2349 never executes it,
presumably because it is blocked waiting for the update lock.
I think this failure is linked to SOLR-6554, where my effort to make the
overseer faster causes the update lock to be held for a long time on the
overseer node, so state updates are not visible to anything except the Overseer
on that node. This causes the OverseerCollectionProcessor to operate on stale
state, and it may well be the reason for various other failures in our cloud
tests.
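The stale-read scenario above can be sketched with a minimal, self-contained Java example. All names here are hypothetical, not Solr's actual ZkStateReader internals: one thread holds a shared update lock for a long time (the slow overseer update), a watcher callback that needs the same lock blocks before it can bump the cached version, and any reader consulting the cache in the meantime still sees the old version.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class StaleWatchDemo {
    // Shared lock guarding updates to the cached cluster state (hypothetical).
    static final Object updateLock = new Object();
    // Cached state version; ZooKeeper is assumed to already hold version 11.
    static final AtomicInteger cachedVersion = new AtomicInteger(10);

    public static void main(String[] args) throws Exception {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Thread A: holds the update lock for a long time,
        // standing in for the slow batched overseer update.
        Thread holder = new Thread(() -> {
            synchronized (updateLock) {
                lockHeld.countDown();
                try { release.await(); } catch (InterruptedException ignored) {}
            }
        });
        holder.start();
        lockHeld.await();

        // Thread B: a watcher firing for a NodeDataChanged event;
        // it must take the same lock before it can apply version 11.
        Thread watcher = new Thread(() -> {
            synchronized (updateLock) {
                cachedVersion.set(11);
            }
        });
        watcher.start();

        // Give the watcher time to reach the lock; it is blocked either way,
        // so readers of the cache still observe the stale version 10.
        Thread.sleep(100);
        System.out.println("version seen while lock held: " + cachedVersion.get());

        // Only once the slow update finishes does the watcher get to run.
        release.countDown();
        watcher.join();
        holder.join();
        System.out.println("version after watcher ran: " + cachedVersion.get());
    }
}
```

Anything that reads the cache while the lock is held, like the OverseerCollectionProcessor checking replica state in the failure above, acts on version 10 even though version 11 already exists in ZooKeeper.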
> DeleteReplicaTest fails sometimes.
> ----------------------------------
>
> Key: SOLR-6956
> URL: https://issues.apache.org/jira/browse/SOLR-6956
> Project: Solr
> Issue Type: Test
> Reporter: Mark Miller
> Priority: Minor
> Attachments: DeleteReplicaTest-failure.log
>
>
> I still see fails of this test sometimes:
> {noformat}
> java.lang.AssertionError: Should have had a good message here
> at
> __randomizedtesting.SeedInfo.seed([D765D9019AAF2D1E:56835719EDF04D22]:0)
> at org.junit.Assert.fail(Assert.java:93)
> at org.junit.Assert.assertTrue(Assert.java:43)
> at
> org.apache.solr.cloud.DeleteReplicaTest.deleteLiveReplicaTest(DeleteReplicaTest.java:138)
> at
> org.apache.solr.cloud.DeleteReplicaTest.doTest(DeleteReplicaTest.java:89)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)