[
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845847#comment-13845847
]
Timothy Potter commented on SOLR-4260:
--------------------------------------
I don't have a fix yet, but I wanted to post an update here to get some feedback
on what I'm seeing ...
I have a simple SolrCloud configuration set up locally: 1 collection named
"cloud" with 1 shard and replicationFactor=2; here's the command I use to
create it:
curl
"http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"
The collection gets distributed across two nodes, cloud84:8984 and cloud85:8985,
with cloud84 assigned as the leader.
Here's an outline of the process I used to get my collection out-of-sync during
indexing:
1) start indexing docs using CloudSolrServer in SolrJ (a minimal sketch of this
indexer is included just after these steps) - direct updates go to the leader
and the replica remains in sync for as long as I let this process run
2) kill -9 the process for the replica cloud85
3) let indexing continue against cloud84 for a few seconds (just to get the
leader and replica out-of-sync once I bring the replica back online)
4) kill -9 the process for the leader cloud84 ... indexing halts of course as
there are no running servers
5) start the replica cloud85 but do not start the previous leader cloud84
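For reference, the indexer in step 1 boils down to a loop like the following
(a minimal SolrJ sketch, not my actual harness - the ZooKeeper address, field
names, and commit cadence are assumptions):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexUntilKilled {
      public static void main(String[] args) throws Exception {
        // CloudSolrServer watches ZooKeeper and routes updates to the current
        // shard leader, which is why leader and replica stay in sync while
        // both nodes are up.
        CloudSolrServer server = new CloudSolrServer("localhost:2181"); // ZK address is an assumption
        server.setDefaultCollection("cloud");
        for (int i = 0; i < 1000000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc" + i);
          doc.addField("text_t", "sample text for doc " + i); // field names are assumptions
          server.add(doc);
          if (i % 1000 == 0) {
            server.commit(); // periodic commits so the doc-count polling sees progress
          }
        }
        server.shutdown();
      }
    }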
Here are some key log messages as cloud85 - the replica - fires up ... my
annotations of the log messages are prefixed by [TJP >>
2013-12-11 11:43:22,076 [main-EventThread] INFO common.cloud.ZkStateReader -
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up
for shard shard1: total=2 found=1 timeoutin=139841
[TJP >> This looks good and is expected because cloud85 was not the leader
before it died, so it should not immediately assume it is the leader until it
sees more replicas
6) now start the previous leader cloud84 ...
Here are some key log messages from cloud85 as the previous leader cloud84 is
coming up ...
2013-12-11 11:43:24,085 [main-EventThread] INFO common.cloud.ZkStateReader -
Updating live nodes... (2)
2013-12-11 11:43:24,136 [main-EventThread] INFO solr.cloud.DistributedQueue -
LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
NodeChildrenChanged
2013-12-11 11:43:24,137 [Thread-13] INFO common.cloud.ZkStateReader -
Updating cloud state from ZooKeeper...
2013-12-11 11:43:24,138 [Thread-13] INFO solr.cloud.Overseer - Update state
numShards=1 message={
  "operation":"state",
  "state":"down",
  "base_url":"http://cloud84:8984/solr",
  "core":"cloud_shard1_replica2",
  "roles":null,
  "node_name":"cloud84:8984_solr",
  "shard":"shard1",
  "shard_range":null,
  "shard_state":"active",
  "shard_parent":null,
  "collection":"cloud",
  "numShards":"1",
  "core_node_name":"core_node1"}
[TJP >> state of cloud84 looks correct as it is still initializing ...
2013-12-11 11:43:24,140 [main-EventThread] INFO solr.cloud.DistributedQueue -
LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type
NodeChildrenChanged
2013-12-11 11:43:24,141 [main-EventThread] INFO common.cloud.ZkStateReader -
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
[TJP >> hmmmm ... cloud84 is listed in /live_nodes but it isn't "active" yet or
even recovering (see state above - it's currently "down") ... My thinking here
is that the ShardLeaderElectionContext needs to take the state of the replica
into account before deciding it should continue.
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.SyncStrategy - Sync replicas to
http://cloud85:8985/solr/cloud_shard1_replica1/
2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO
solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1
url=http://cloud85:8985/solr START
replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/] nUpdates=100
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN
solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1
url=http://cloud85:8985/solr couldn't connect to
http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success
[TJP >> whoops! of course it couldn't connect to cloud84 as it's still
initializing ...
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO
solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1
url=http://cloud85:8985/solr DONE. sync succeeded
2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.SyncStrategy - Sync Success - now sync replicas to me
2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: try
and ask http://cloud84:8984/solr/cloud_shard1_replica2/ to sync
2013-12-11 11:43:25,938 [coreLoadExecutor-3-thread-1] ERROR
solr.cloud.SyncStrategy - Sync request error:
org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
http://cloud84:8984/solr/cloud_shard1_replica2
[TJP >> ayep, cloud84 is still initializing so it can't respond to you Mr.
Impatient cloud85!
2013-12-11 11:43:25,939 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/:
Sync failed - asking replica (http://cloud84:8984/solr/cloud_shard1_replica2/)
to recover.
2013-12-11 11:43:25,940 [coreLoadExecutor-3-thread-1] INFO
solr.cloud.ShardLeaderElectionContext - I am the new leader:
http://cloud85:8985/solr/cloud_shard1_replica1/ shard1
[TJP >> oh no! the collection is now out-of-sync ... my test harness
periodically polls the replicas for their doc counts (a sketch of the poller
follows below) and at this point, we ended up with:
shard1: {
  http://cloud85:8985/solr/cloud_shard1_replica1/ = 300800 LEADER
  http://cloud84:8984/solr/cloud_shard1_replica2/ = 447600 diff:-146800 <--- this should be the real leader!
}
That is expected, of course, because cloud85 should *NOT* be the leader.
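The poller itself is nothing fancy: it queries each core directly with
distrib=false so every core reports only its own numDocs. A rough sketch
(core URLs are the ones from my setup; exception handling omitted):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    // Ask a single core for its own doc count; distrib=false keeps the query
    // from fanning out across the rest of the collection.
    HttpSolrServer core = new HttpSolrServer("http://cloud85:8985/solr/cloud_shard1_replica1");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("distrib", "false");
    long numDocs = core.query(q).getResults().getNumFound();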
So all that is interesting, but how to fix???
My first idea was to tackle the decision-making process
ShardLeaderElectionContext uses to decide whether it has enough replicas to
continue.
It's easy enough to do something like the following:
    int notDownCount = 0;
    Map<String,Replica> replicasMap = slices.getReplicasMap();
    for (Replica replica : replicasMap.values()) {
      ZkCoreNodeProps replicaCoreProps = new ZkCoreNodeProps(replica);
      String replicaState = replicaCoreProps.getState();
      log.warn(">>>> State of replica " + replica.getName() + " is " + replicaState + " <<<<");
      // only count replicas that are at least trying to participate
      if ("active".equals(replicaState) || "recovering".equals(replicaState)) {
        ++notDownCount;
      }
    }
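What I had in mind was to gate the election on that count, very roughly like
this (a hypothetical sketch, not a patch - the real decision point would be
wherever ShardLeaderElectionContext currently logs "Enough replicas found to
continue."):

    // Hypothetical gate: require every known replica to be at least recovering
    // before concluding that enough replicas were found, instead of relying on
    // /live_nodes membership alone.
    int expectedReplicas = replicasMap.size();
    if (notDownCount < expectedReplicas) {
      // keep waiting (up to the existing leaderVoteWait timeout) rather than
      // proceeding to "I may be the new leader - try and sync"
    }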
I was thinking I could use the notDownCount to make a better decision, but then
I ran into another issue related to replica state being stale. In my cluster,
given this /clusterstate.json:
{"cloud":{
"shards":{"shard1":{
"range":"80000000-7fffffff",
"state":"active",
"replicas":{
"core_node1":{
"state":"active",
"base_url":"http://cloud84:8984/solr",
"core":"cloud_shard1_replica2",
"node_name":"cloud84:8984_solr",
"leader":"true"},
"core_node2":{
"state":"active",
"base_url":"http://cloud85:8985/solr",
"core":"cloud_shard1_replica1",
"node_name":"cloud85:8985_solr"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"2"}}
If I kill -9 the Solr process running on 8985 (the replica), core_node2's state
remains "active" in /clusterstate.json.
When tailing the log on core_node1, I do see one notification from ZooKeeper
coming into the watcher set up by ZkStateReader about live nodes having changed:
2013-12-11 15:42:46,010 [main-EventThread] INFO common.cloud.ZkStateReader -
Updating live nodes... (1)
So after killing the process, /live_nodes is updated to only have one node, but
/clusterstate.json still thinks there are 2 healthy replicas for shard1,
instead of just 1.
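In other words, right now any state-based check would have to combine the
replica state from /clusterstate.json with /live_nodes membership. A rough
sketch of what I mean, against the 4.x ZkStateReader/ClusterState API as I
understand it (zkStateReader and log come from the surrounding core container;
treat the exact calls as an assumption):

    // A replica should only be treated as "up" if its recorded state says so
    // AND its node is still present under /live_nodes.
    ClusterState clusterState = zkStateReader.getClusterState();
    Set<String> liveNodes = clusterState.getLiveNodes();
    Slice slice = clusterState.getSlice("cloud", "shard1");
    for (Replica replica : slice.getReplicas()) {
      String state = replica.getStr(ZkStateReader.STATE_PROP);
      String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
      boolean reallyUp = "active".equals(state) && liveNodes.contains(nodeName);
      log.info(">>>> " + replica.getName() + " state=" + state + " reallyUp=" + reallyUp + " <<<<");
    }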
Of course, if I restart 8985, then it goes through a series of state changes
until it is marked active again, which looks correct.
Bottom line ... it seems there is something in SolrCloud that does not update a
replica's state when the node is killed. If a change to /live_nodes doesn't
trigger a refresh of replica state, what does?
I'm seeing this stale replica state issue in Solr 4.6.0 and in revision 1550300
of branch_4x - the latest from svn.
Not having fresh replica state blocks my idea for fixing
ShardLeaderElectionContext's decision-making process. I'm also curious about
the decision to register a node under /live_nodes before it is fully
initialized, but maybe that is a discussion for another time.
In any case, I wanted to get some feedback on my findings before moving forward
with a solution.
> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png,
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer we see inconsistencies between the leader and replica for
> some shards.
> Each core holds about 3.3k documents. For some reason, 5 out of 10 shards
> show a small deviation in the number of documents; the leader and replica
> deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention: there were small IDF differences for exactly the same record,
> causing it to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch-all queries also return different
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor
> of two, and we frequently reindex using a fresh build from trunk. I hadn't
> seen this issue for quite some time until a few days ago.