[
https://issues.apache.org/jira/browse/SOLR-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946068#comment-13946068
]
ASF subversion and git services commented on SOLR-5796:
-------------------------------------------------------
Commit 1581196 from [~steve_rowe] in branch 'dev/trunk'
[ https://svn.apache.org/r1581196 ]
SOLR-5796: move CHANGES.txt entries to 4.7.1 section
> With many collections, leader re-election takes too long when a node dies or
> is rebooted, leading to some shards getting into a "conflicting" state about
> who is the leader.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-5796
> URL: https://issues.apache.org/jira/browse/SOLR-5796
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: Found on branch_4x
> Reporter: Timothy Potter
> Assignee: Mark Miller
> Fix For: 4.8, 5.0, 4.7.1
>
> Attachments: SOLR-5796.patch
>
>
> I'm doing some testing with a 4-node SolrCloud cluster against the latest rev in
> branch_4x, with many collections: 150 to be exact, each with 4 shards and rf=3,
> so 450 cores per node. The nodes are reasonably provisioned: -Xmx6g with 4 CPUs
> (m3.xlarge instances in EC2).
> The problem occurs when rebooting one of the nodes, say as part of a rolling
> restart of the cluster. If I kill one node and then wait for an extended
> period of time, such as 3 minutes, all of the leaders on the downed node
> (roughly 150) have time to fail over to other nodes in the cluster. When I
> then restart the downed node, since the leaders have all failed over
> successfully, the node starts up and all of its cores assume the replica role
> in their respective shards. That is the expected, healthy behavior.
> However, if I don't wait long enough for leader failover to complete on the
> other nodes before restarting the downed node, things go wrong. Specifically,
> when the dust settles, many of the previous leaders on the restarted node get
> stuck in the "conflicting" state seen in ZkController, starting around line 852
> in branch_4x:
> {quote}
> 852     while (!leaderUrl.equals(clusterStateLeaderUrl)) {
> 853       if (tries == 60) {
> 854         throw new SolrException(ErrorCode.SERVER_ERROR,
> 855             "There is conflicting information about the leader of shard: "
> 856                 + cloudDesc.getShardId() + " our state says:"
> 857                 + clusterStateLeaderUrl + " but zookeeper says:" + leaderUrl);
> 858       }
> 859       Thread.sleep(1000);
> 860       tries++;
> 861       clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId,
> 862           timeoutms);
> 863       leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms)
> 864           .getCoreUrl();
> 865     }
> {quote}
> As you can see, the code is trying to give this problem a little time to work
> itself out: 1 minute to be exact (60 tries with a 1-second sleep). Unfortunately,
> that doesn't seem to be long enough for a busy cluster with many collections. One
> might argue that 450 cores per node is asking too much of Solr, but I think this
> points to a bigger issue: a node coming back up isn't aware that it went down and
> that leader election is still running, slowly, on the other nodes. Moreover, once
> this problem occurs, it's not clear how to fix it other than shutting the node
> down again and waiting for leader failover to complete.
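> Just to illustrate what I mean (a rough sketch only, not a patch: the
> {{leaderConflictResolveWait}} property name and its default are made up for
> illustration, and the snippet assumes the surrounding ZkController context from
> the excerpt above), the hard-coded 60 tries could be driven by a configurable
> wait so a busy cluster can be given more time:
> {code:java}
> // Hypothetical: make the conflict-resolution wait configurable instead of
> // hard-coding 60 tries. The property name and default are illustrative
> // assumptions, not existing Solr settings.
> int conflictWaitMs = Integer.getInteger("leaderConflictResolveWait", 60000);
> int maxTries = conflictWaitMs / 1000;
> int tries = 0;
> while (!leaderUrl.equals(clusterStateLeaderUrl)) {
>   if (tries >= maxTries) {
>     throw new SolrException(ErrorCode.SERVER_ERROR,
>         "There is conflicting information about the leader of shard: "
>             + cloudDesc.getShardId() + " our state says:" + clusterStateLeaderUrl
>             + " but zookeeper says:" + leaderUrl);
>   }
>   Thread.sleep(1000);
>   tries++;
>   clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId, timeoutms);
>   leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms).getCoreUrl();
> }
> {code}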
> It's also interesting to me that /clusterstate.json was updated by the healthy
> node taking over the leader role, but /collections/<coll>/leaders/shard# was not.
> I added some debugging and it looks like the overseer queue is extremely backed
> up with work.
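> In case it's useful to others, this is a minimal standalone sketch for checking
> the backlog using the plain ZooKeeper client (assumptions: ensemble reachable at
> localhost:2181 and the standard /overseer/queue path):
> {code:java}
> import java.util.concurrent.CountDownLatch;
>
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class OverseerQueueCheck {
>   public static void main(String[] args) throws Exception {
>     // Assumption: ZooKeeper is reachable at localhost:2181; adjust for your ensemble.
>     CountDownLatch connected = new CountDownLatch(1);
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
>       if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>         connected.countDown();
>       }
>     });
>     connected.await();
>     // /overseer/queue holds the Overseer's pending work items in SolrCloud.
>     int pending = zk.getChildren("/overseer/queue", false).size();
>     System.out.println("Pending Overseer queue items: " + pending);
>     zk.close();
>   }
> }
> {code}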
> Maybe the solution here is simply to wait longer, but I'd also like feedback from
> the community on other options. I know there are plans to help scale the Overseer
> (e.g. SOLR-5476), so maybe that helps; I'm adding more debugging to confirm
> whether this is really due to Overseer backlog (which I suspect it is).
> In general, I'm a little confused about why leader state is kept in multiple
> places in ZK. Is there any background on why we have leader state in both
> /clusterstate.json and the leader path znode?
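> To make the two places concrete, here's another minimal sketch (same assumptions:
> plain ZooKeeper client at localhost:2181, plus a hypothetical collection1/shard1)
> that dumps both the per-shard leader znode and /clusterstate.json so they can be
> compared by eye:
> {code:java}
> import java.util.concurrent.CountDownLatch;
>
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class LeaderStateDump {
>   public static void main(String[] args) throws Exception {
>     // Assumptions: ZK at localhost:2181, collection "collection1", shard "shard1".
>     CountDownLatch connected = new CountDownLatch(1);
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
>       if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>         connected.countDown();
>       }
>     });
>     connected.await();
>
>     // Source 1: the per-shard leader znode written during leader election.
>     byte[] leaderNode = zk.getData("/collections/collection1/leaders/shard1", false, null);
>     System.out.println("leader znode:      " + new String(leaderNode, "UTF-8"));
>
>     // Source 2: the leader recorded in /clusterstate.json, maintained by the Overseer.
>     byte[] clusterJson = zk.getData("/clusterstate.json", false, null);
>     System.out.println("clusterstate.json: " + new String(clusterJson, "UTF-8"));
>
>     zk.close();
>   }
> }
> {code}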
> Also, here are some interesting side observations:
> a. If I use rf=2, this problem doesn't occur, presumably because leader failover
> happens more quickly and there's less Overseer work. That may be a red herring,
> but I can consistently reproduce the issue with rf=3 and not with rf=2; I suspect
> that's because there are only 300 cores per node instead of 450, which is just
> enough less work for the issue to resolve itself.
> b. To support that many cores, I had to set -Xss256k to reduce the thread stack
> size, since Solr uses a lot of threads during startup (the high point was around
> 800).
>
> This might be something we should recommend on the mailing list / wiki somewhere.
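> If anyone wants to check the thread numbers on their own cluster, here's a small
> sketch using the standard JMX thread bean (assumption: the Solr JVM was started
> with remote JMX enabled on port 9999; nothing here is Solr-specific):
> {code:java}
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadMXBean;
>
> import javax.management.MBeanServerConnection;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class SolrThreadPeak {
>   public static void main(String[] args) throws Exception {
>     // Assumption: Solr was started with e.g. -Dcom.sun.management.jmxremote.port=9999
>     // -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
>     JMXServiceURL url =
>         new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
>     JMXConnector connector = JMXConnectorFactory.connect(url);
>     MBeanServerConnection conn = connector.getMBeanServerConnection();
>     ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
>         conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
>     // The peak thread count since JVM start is what matters when sizing -Xss.
>     System.out.println("live threads = " + threads.getThreadCount()
>         + ", peak threads = " + threads.getPeakThreadCount());
>     connector.close();
>   }
> }
> {code}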
--
This message was sent by Atlassian JIRA
(v6.2#6252)