[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873841#comment-13873841 ]

Timothy Potter commented on SOLR-4260:
--------------------------------------

I was able to reproduce this issue on EC2 without any over-sharding (on the 
latest rev of branch_4x) ... basically 6 Solr nodes with 3 shards and RF=2, 
i.e. each replica gets its own Solr instance.
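(For reference, a 3x2 layout like that can be created with a Collections API 
call along these lines; the host and collection name are taken from the output 
below, the rest is illustrative rather than the exact command I ran.)

http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/admin/collections?action=CREATE&name=test&numShards=3&replicationFactor=2&maxShardsPerNode=1

Here's the output from my client app that traps the inconsistency: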

>>>>>>
Found 1 shards with mis-matched doc counts.
At January 16, 2014 12:18:08 PM MST
shard2: {
        
http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/
 = 62984 LEADER
        
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/ 
= 62980 diff:4
}
Details:
shard2
>> finished querying leader, found 62984 documents (62984)
>> finished querying 
>> http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/,
>>  found 62980 documents
Doc [182866] not found in replica: <doc boost="1.0"><field 
name="id">182866</field><field name="string_s">test</field><field 
name="int_i">-1257345242</field><field name="float_f">0.92657363</field><field 
name="double_d">0.5259114828332452</field><field name="text_en">this is a 
test</field><field name="_version_">1457415570117885953</field></doc>
Doc [182859] not found in replica: <doc boost="1.0"><field 
name="id">182859</field><field name="string_s">test</field><field 
name="int_i">991366909</field><field name="float_f">0.5311716</field><field 
name="double_d">0.10846350752086309</field><field name="text_en">this is a 
test</field><field name="_version_">1457415570117885952</field></doc>
Doc [182872] not found in replica: <doc boost="1.0"><field 
name="id">182872</field><field name="string_s">test</field><field 
name="int_i">824512897</field><field name="float_f">0.830366</field><field 
name="double_d">0.6560223698806142</field><field name="text_en">this is a 
test</field><field name="_version_">1457415570117885954</field></doc>
Doc [182876] not found in replica: <doc boost="1.0"><field 
name="id">182876</field><field name="string_s">test</field><field 
name="int_i">-1657831473</field><field name="float_f">0.4877965</field><field 
name="double_d">0.9214420679315872</field><field name="text_en">this is a 
test</field><field name="_version_">1457415570117885955</field></doc>
Sending hard commit after mis-match and then will wait for user to handle it ...
<<<<<<

So four missing docs: 182866, 182859, 182872, 182876
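
For reference, here's roughly what the consistency check in my client does (a 
minimal SolrJ 4.x sketch, not the actual test code; the class name and error 
handling are simplified): query each replica core directly with distrib=false 
and compare numFound.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaCountCheck {

  // Count docs in a single core, bypassing distributed search so we only see
  // what this core actually has.
  static long countDocs(String coreUrl) throws SolrServerException {
    HttpSolrServer server = new HttpSolrServer(coreUrl);
    try {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                // we only need numFound
      q.set("distrib", "false");   // don't fan out to the rest of the shard
      return server.query(q).getResults().getNumFound();
    } finally {
      server.shutdown();
    }
  }

  public static void main(String[] args) throws SolrServerException {
    long leader  = countDocs("http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2");
    long replica = countDocs("http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1");
    if (leader != replica) {
      System.out.println("shard2 mis-match, diff: " + Math.abs(leader - replica));
    }
  }
}

The diff:4 in the output above is just the difference between those two 
numFound values.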

Now I'm thinking this might be in the ConcurrentUpdateSolrServer logic. I added 
some detailed logging to show when JavabinLoader unmarshals a doc and when it 
is offered on the CUSS queue (to be sent to the replica). On the leader, here's 
the log around some of the docs that were lost:

2014-01-16 14:16:37,534 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182857
2014-01-16 14:16:37,534 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182857
/////////////////////////////////////
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182876
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182876
2014-01-16 14:16:37,558 [qtp417447538-17] INFO  
update.processor.LogUpdateProcessor  - [test_shard2_replica2] webapp=/solr 
path=/update params={wt=javabin&version=2} {add=[182704 (1457415570048679936), 
182710 (1457415570049728512), 182711 (1457415570049728513), 182717 
(1457415570056019968), 182720 (1457415570056019969), 182722 
(1457415570057068544), 182723 (1457415570057068545), 182724 
(1457415570058117120), 182730 (1457415570058117121), 182735 
(1457415570059165696), ... (61 adds)]} 0 72
/////////////////////////////////////
2014-01-16 14:16:37,764 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182880
2014-01-16 14:16:37,764 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182880
 

As you can see, the leader received the doc with ID 182859 at 2014-01-16 
14:16:37,552 and then queued it on the CUSS queue to be sent to the replica. On 
the replica, the log shows it receiving 182857 and then jumping to 182880 ... the 
4 missing docs (182866, 182859, 182872, 182876) were definitely queued in CUSS 
on the leader. I've checked the logs on all the other replicas and the docs 
didn't go there either. Here's the corresponding log on the replica:


2014-01-16 14:16:37,292 [qtp417447538-14] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica1 add: 182857
2014-01-16 14:16:37,293 [qtp417447538-14] INFO  
update.processor.LogUpdateProcessor  - [test_shard2_replica1] webapp=/solr 
path=/update 
params={distrib.from=http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[182841 (1457415570096914432), 182842 (1457415570096914433), 182843 
(1457415570096914434), 182844 (1457415570096914435), 182846 
(1457415570097963008), 182848 (1457415570097963009), 182850 
(1457415570099011584), 182854 (1457415570099011585), 182857 
(1457415570099011586)]} 0 2
2014-01-16 14:16:37,521 [qtp417447538-14] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica1 add: 182880


So it seems like a "batch" of docs queued on the leader just got missed ...
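
For context, here's a minimal sketch of the queue-then-stream pattern CUSS uses 
(SolrJ 4.x; the URL, queue size, and thread count are made up for illustration 
and are not what the leader's update distribution code uses internally):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussSketch {
  public static void main(String[] args) throws IOException, SolrServerException {
    // Internal queue of 10 requests, drained by 2 background runner threads.
    ConcurrentUpdateSolrServer cuss =
        new ConcurrentUpdateSolrServer("http://localhost:8985/solr/test_shard2_replica1", 10, 2);
    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      cuss.add(doc);             // returns as soon as the doc is on the queue
    }
    cuss.blockUntilFinished();   // wait for the runners to drain the queue
    cuss.commit();
    cuss.shutdown();
  }
}

The point is that add() returns once the doc is offered to the internal queue; 
the actual HTTP request to the target core happens later on a runner thread, 
which streams whatever batch it pulls off the queue. That hand-off is exactly 
the kind of window where a whole batch could get dropped.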


> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer, we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and replica deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing it to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return different 
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor 
> of two, and frequently reindex using a fresh build from trunk. I hadn't seen 
> this issue for quite some time until a few days ago.


