[
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856052#comment-13856052
]
Timothy Potter commented on SOLR-4260:
--------------------------------------
Thanks Mark, I suspected my test case was a little cherry-picked ... something
interesting happened when I also severed the connection between the replica and
ZK (i.e. the same test as above, but I also dropped the ZK connection on the
replica).
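(For reference: the drop here was done at the network level, but the Expired
event in the log below can also be forced in a test with the usual ZooKeeper
session-expiry trick: open a second client with the victim client's session id
and password, then close it. A minimal sketch; the connect string and the way
the victim ZooKeeper handle is obtained are assumptions, not part of the test
above.)

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ExpireZkSession {
  /**
   * Expires another client's ZooKeeper session by opening a second
   * connection with the same session id/password and closing it. The
   * original client then sees an Expired event, much like the replica
   * does below once the partition outlasts the session timeout.
   */
  public static void expire(String zkHost, ZooKeeper victim) throws Exception {
    ZooKeeper second = new ZooKeeper(zkHost, 10000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        // no-op watcher; we only need the connection to come up
      }
    }, victim.getSessionId(), victim.getSessionPasswd());
    Thread.sleep(1000);   // give the second connection a moment to establish
    second.close();       // closing it kills the shared session -> Expired on the victim
  }
}
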
2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Disconnected type:None path:null path:null
type:None
2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager
- zkClient has disconnected
>>> fixed the connection between replica and ZK here <<<
2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Expired type:None path:null path:null type:None
2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager
- Our previous ZooKeeper session was expired. Attempting to reconnect to
recover relationship with ZooKeeper...
2013-12-23 15:40:45,580 [main-EventThread] INFO
common.cloud.DefaultConnectionStrategy - Connection expired - starting a new
one...
2013-12-23 15:40:45,586 [main-EventThread] INFO common.cloud.ConnectionManager
- Waiting for client to connect to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:SyncConnected type:None path:null path:null
type:None
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Client is connected to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Connection with ZooKeeper reestablished.
2013-12-23 15:40:45,596 [main-EventThread] WARN solr.cloud.RecoveryStrategy -
Stopping recovery for zkNodeName=core_node3core=cloud_shard1_replica3
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=down
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController -
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:45,905 [qtp2124890785-14] INFO handler.admin.CoreAdminHandler
- It has been requested that we recover
2013-12-23 15:40:45,906 [qtp2124890785-14] INFO
solr.servlet.SolrDispatchFilter - [admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2}
status=0 QTime=2
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - publishing
core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - numShards
not found on descriptor - reading it from system property
2013-12-23 15:40:45,920 [Thread-17] INFO solr.update.DefaultSolrCoreState -
Running recovery - first canceling any ongoing recovery
2013-12-23 15:40:45,921 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
Starting recovery process. core=cloud_shard1_replica3
recoveringAfterStartup=false
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController -
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:48,613 [qtp2124890785-15] INFO solr.core.SolrCore -
[cloud_shard1_replica3] webapp=/solr path=/select
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:42,770 [qtp2124890785-13] INFO solr.core.SolrCore -
[cloud_shard1_replica3] webapp=/solr path=/select
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController -
There was a problem making a request to the
leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I
was asked to wait on state down for cloud86:8986_solr but I still do not see
the requested state. I see state: recovering live:false
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
at
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:347)
at org.apache.solr.cloud.ZkController.access$100(ZkController.java:85)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:225)
at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:118)
at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:93)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-23 15:42:45,963 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy -
Error while trying to recover.
core=cloud_shard1_replica3:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for cloud86:8986_solr but I still do
not see the requested state. I see state: recovering live:false
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:224)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)
2013-12-23 15:42:45,964 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy -
Recovery failed - trying again... (0) core=cloud_shard1_replica3
2013-12-23 15:42:45,964 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
Wait 2.0 seconds before trying to recover again (1)
2013-12-23 15:42:47,964 [RecoveryThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=recovering
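
The distrib=false selects above are the per-replica count check. The same
comparison can be scripted with SolrJ; a rough sketch using HttpSolrServer (the
client named in the stack traces), with hypothetical leader/replica core URLs
substituted in:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaCountCheck {

  /** Returns the numFound a single core reports for q, without fanning out. */
  static long localCount(String coreUrl, String q) throws Exception {
    HttpSolrServer server = new HttpSolrServer(coreUrl);
    try {
      SolrQuery query = new SolrQuery(q);
      query.setRows(0);
      query.set("distrib", "false"); // ask only this core, no distributed search
      return server.query(query).getResults().getNumFound();
    } finally {
      server.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical core URLs - substitute the real leader and replica addresses.
    long leader  = localCount("http://cloud85:8985/solr/cloud_shard1_replica1", "foo_s:bar");
    long replica = localCount("http://cloud86:8986/solr/cloud_shard1_replica3", "foo_s:bar");
    System.out.println("leader=" + leader + " replica=" + replica
        + (leader == replica ? " (consistent)" : " (INCONSISTENT)"));
  }
}
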
> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png,
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer, we see inconsistencies between the leader and replica for
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have
> a small deviation in the number of documents. The leader and replica deviate
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention: there were small IDF differences for exactly the same record,
> causing it to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch-all queries also return different
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor
> of two, and we frequently reindex using a fresh build from trunk. I hadn't
> seen this issue for quite some time until a few days ago.
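
For context on the quoted setup, indexing through CloudSolrServer in that era
looks roughly like the sketch below; the ZK connect string, collection name,
and field values are placeholders, not taken from the report:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble address and collection name.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    try {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("foo_s", "bar");
      server.add(doc);   // routed to the correct shard leader via cluster state
      server.commit();   // hard commit so a later count check sees the document
    } finally {
      server.shutdown();
    }
  }
}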