[
https://issues.apache.org/jira/browse/SOLR-12187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441279#comment-16441279
]
Tomás Fernández Löbbe commented on SOLR-12187:
----------------------------------------------
{code:java}
-      if (watcher.onStateChanged(liveNodes, collectionState)) {
-        removeCollectionStateWatcher(collection, watcher);
+      try {
+        if (watcher.onStateChanged(liveNodes, collectionState)) {
+          removeCollectionStateWatcher(collection, watcher);
+        }
+      } catch (Throwable throwable) {
+        LOG.warn("Error on calling watcher", throwable);
       }
{code}
Why {{Throwable}} and not {{Exception}}?
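To illustrate the difference, here is a minimal, self-contained sketch. {{WatcherNotifier}}, {{notifyWatcher}} and the {{warnings}} list are hypothetical stand-ins, not the actual Solr code, and a plain {{BooleanSupplier}} plays the role of the watcher. Catching {{Exception}} still isolates a misbehaving watcher, while {{Error}} subclasses such as {{OutOfMemoryError}} or {{AssertionError}} keep propagating instead of being reduced to a warning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the notification code in the patch; not Solr's API.
final class WatcherNotifier {
    // Stands in for LOG.warn(...) so the behaviour is observable without a logger.
    final List<String> warnings = new ArrayList<>();

    /** Returns true if the watcher asked to be removed. */
    boolean notifyWatcher(BooleanSupplier watcher) {
        try {
            return watcher.getAsBoolean();
        } catch (Exception e) {
            // A failing watcher is recorded and skipped; it is not removed.
            warnings.add("Error on calling watcher: " + e);
            return false;
        }
        // Errors are deliberately not caught here, so they reach the caller.
    }
}
```

With {{catch (Throwable ...)}} the last case would be logged and swallowed as well, which may or may not be what the patch intends.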
{code}
+    while (true) {
+      try {
+        CollectionAdminRequest.addReplicaToShard(collectionName, "shard1")
+            .process(cluster.getSolrClient());
+        break;
+      } catch (Exception e) {
+        // expected, when the node is not fully started
+        Thread.sleep(500);
+      }
+    }
{code}
Maybe better to cap the number of attempts or add a timeout? Otherwise we'll get
a confusing suite-level timeout if this command keeps failing.
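Something along these lines is what I have in mind. This is only a sketch: {{Retry}} and {{retryUntil}} are hypothetical names rather than existing test utilities, and the timeout and pause values are left to the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: retries an action until it succeeds or a deadline passes.
final class Retry {
    static <T> T retryUntil(Callable<T> action, long timeoutMs, long pauseMs) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry through an interrupt; let the test framework stop us.
                throw e;
            } catch (Exception e) {
                if (System.nanoTime() - deadline >= 0) {
                    // Fail with the last error attached instead of looping forever.
                    TimeoutException timeout =
                        new TimeoutException("Gave up after " + timeoutMs + " ms");
                    timeout.initCause(e);
                    throw timeout;
                }
                Thread.sleep(pauseMs);
            }
        }
    }
}
```

The test would then wrap the {{addReplicaToShard(...).process(...)}} call in a lambda and pass it in with a 500 ms pause and whatever overall timeout seems reasonable, so a node that never comes up fails the test with the underlying exception rather than hitting the suite timeout.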
> Replica should watch clusterstate and unload itself if its entry is removed
> ---------------------------------------------------------------------------
>
> Key: SOLR-12187
> URL: https://issues.apache.org/jira/browse/SOLR-12187
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Cao Manh Dat
> Priority: Major
> Attachments: SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch,
> SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch
>
>
> With the introduction of the autoscaling framework, we have seen an increase in
> the number of issues caused by race conditions between deleting a replica
> and other operations.
> Case 1: DeleteReplicaCmd fails to send an UNLOAD request to a replica and
> therefore forcefully removes its entry from clusterstate, but the replica
> still functions normally and is able to become a leader -> SOLR-12176
> Case 2:
> * DeleteReplicaCmd enqueues a DELETECOREOP (without sending a request to
> the replica, because the node is not live)
> * The node starts and the replica gets loaded
> * DELETECOREOP has not been processed yet, hence the replica is still present in
> clusterstate --> it passes checkStateInZk
> * DELETECOREOP is executed, DeleteReplicaCmd finishes
> ** result 1: the replica starts recovering, finishes, and publishes itself as
> ACTIVE --> state of the replica is ACTIVE
> ** result 2: the replica throws an exception (probably an NPE)
> --> state of the replica is DOWN, and it does not join leader election