[
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856052#comment-13856052
]
Timothy Potter commented on SOLR-4260:
--------------------------------------
Thanks Mark, I suspected my test case was a little cherry-picked ... something
interesting happened when I also severed the connection between the replica and
ZK (i.e. the same test as above, but I also dropped the ZK connection on the
replica).
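(For reference: the drop here was done at the network level, but the Expired
event in the log below can also be forced in a test with the usual ZooKeeper
session-expiry trick: open a second client with the victim client's session id
and password, then close it. A minimal sketch; the connect string and the way
the victim ZooKeeper handle is obtained are assumptions, not part of the test
above.)

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ExpireZkSession {
  /**
   * Expires another client's ZooKeeper session by opening a second
   * connection with the same session id/password and closing it. The
   * original client then sees an Expired event, much like the replica
   * does below once the partition outlasts the session timeout.
   */
  public static void expire(String zkHost, ZooKeeper victim) throws Exception {
    ZooKeeper second = new ZooKeeper(zkHost, 10000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        // no-op watcher; we only need the connection to come up
      }
    }, victim.getSessionId(), victim.getSessionPasswd());
    Thread.sleep(1000);   // give the second connection a moment to establish
    second.close();       // closing it kills the shared session -> Expired on the victim
  }
}
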
2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Disconnected type:None path:null path:null
type:None
2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager
- zkClient has disconnected
>>> fixed the connection between replica and ZK here <<<
2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Expired type:None path:null path:null type:None
2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager
- Our previous ZooKeeper session was expired. Attempting to reconnect to
recover relationship with ZooKeeper...
2013-12-23 15:40:45,580 [main-EventThread] INFO
common.cloud.DefaultConnectionStrategy - Connection expired - starting a new
one...
2013-12-23 15:40:45,586 [main-EventThread] INFO common.cloud.ConnectionManager
- Waiting for client to connect to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:SyncConnected type:None path:null path:null
type:None
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Client is connected to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager
- Connection with ZooKeeper reestablished.
2013-12-23 15:40:45,596 [main-EventThread] WARN solr.cloud.RecoveryStrategy -
Stopping recovery for zkNodeName=core_node3core=cloud_shard1_replica3
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=down
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController -
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:45,905 [qtp2124890785-14] INFO handler.admin.CoreAdminHandler
- It has been requested that we recover
2013-12-23 15:40:45,906 [qtp2124890785-14] INFO
solr.servlet.SolrDispatchFilter - [admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2}
status=0 QTime=2
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - publishing
core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - numShards
not found on descriptor - reading it from system property
2013-12-23 15:40:45,920 [Thread-17] INFO solr.update.DefaultSolrCoreState -
Running recovery - first canceling any ongoing recovery
2013-12-23 15:40:45,921 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
Starting recovery process. core=cloud_shard1_replica3
recoveringAfterStartup=false
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController -
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:48,613 [qtp2124890785-15] INFO solr.core.SolrCore -
[cloud_shard1_replica3] webapp=/solr path=/select
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:42,770 [qtp2124890785-13] INFO solr.core.SolrCore -
[cloud_shard1_replica3] webapp=/solr path=/select
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController -
There was a problem making a request to the
leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I
was asked to wait on state down for cloud86:8986_solr but I still do not see
the requested state. I see state: recovering live:false
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
at
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:347)
at org.apache.solr.cloud.ZkController.access$100(ZkController.java:85)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:225)
at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:118)
at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:93)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-23 15:42:45,963 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy -
Error while trying to recover.
core=cloud_shard1_replica3:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for cloud86:8986_solr but I still do
not see the requested state. I see state: recovering live:false
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:224)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)
2013-12-23 15:42:45,964 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy -
Recovery failed - trying again... (0) core=cloud_shard1_replica3
2013-12-23 15:42:45,964 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
Wait 2.0 seconds before trying to recover again (1)
2013-12-23 15:42:47,964 [RecoveryThread] INFO solr.cloud.ZkController -
publishing core=cloud_shard1_replica3 state=recovering
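
The distrib=false selects above are the per-replica count check. The same
comparison can be scripted with SolrJ; a rough sketch using HttpSolrServer (the
client named in the stack traces), with hypothetical leader/replica core URLs
substituted in:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaCountCheck {

  /** Returns the numFound a single core reports for q, without fanning out. */
  static long localCount(String coreUrl, String q) throws Exception {
    HttpSolrServer server = new HttpSolrServer(coreUrl);
    try {
      SolrQuery query = new SolrQuery(q);
      query.setRows(0);
      query.set("distrib", "false"); // ask only this core, no distributed search
      return server.query(query).getResults().getNumFound();
    } finally {
      server.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical core URLs - substitute the real leader and replica addresses.
    long leader  = localCount("http://cloud85:8985/solr/cloud_shard1_replica1", "foo_s:bar");
    long replica = localCount("http://cloud86:8986/solr/cloud_shard1_replica3", "foo_s:bar");
    System.out.println("leader=" + leader + " replica=" + replica
        + (leader == replica ? " (consistent)" : " (INCONSISTENT)"));
  }
}
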
> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png,
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer, we see inconsistencies between the leader and replica for
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have
> a small deviation in the number of documents. The leader and replica deviate
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention: there were small IDF differences for exactly the same record,
> causing it to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch-all queries also return different
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor
> of two, and we frequently reindex using a fresh build from trunk. I hadn't
> seen this issue for quite some time until a few days ago.
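
For context on the quoted setup, indexing through CloudSolrServer in that era
looks roughly like the sketch below; the ZK connect string, collection name,
and field values are placeholders, not taken from the report:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble address and collection name.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    try {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("foo_s", "bar");
      server.add(doc);   // routed to the correct shard leader via cluster state
      server.commit();   // hard commit so a later count check sees the document
    } finally {
      server.shutdown();
    }
  }
}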