Thanks for these directions, Ian. We are running Lucene 2.9.1 on CentOs 5 64-bit machines. We do use compound file format, and will look into replacing it with the simple files, although I believe this will create too many files. We will also consider the rsync option. Thanks again, -- Yuval
-----Original Message----- From: Ian Lea [mailto:ian....@gmail.com] Sent: Tuesday, February 09, 2010 6:13 PM To: java-user@lucene.apache.org Subject: Re: Different replicas return different scores Since the update commands may run in different order on different shards you might get different sets of segments because merges happen to be triggered at different points in the different batches of updates. But you shouldn't have different numbers of deleted docs if you have really been applying the same updates to all the shards. Could some updates have been missed? Or docs added then deleted or something? Maybe there are other variations between the shards and that is causing the variation in query scores. As an alternative approach you could have one master index per shard that takes all the updates and then send that index out to the shard servers. If you don't use compound file format, and don't optimize, the file changes are typically quite small with default or sensible merge settings and can be distributed quickly using rsync. You can have more control by using MergePolicy and friends. What version of lucene are you running? -- Ian. On Tue, Feb 9, 2010 at 2:26 PM, Yuval Feinstein <yuv...@answers.com> wrote: > We are running a large sharded Lucene-based application. > Our configuration supports near real-time updates, by incrementally > Updating documents (using delete then add) on the shards. > Every shard is replicated to several machines in order to improve performance. > We replicate the shard by sending the same deletion and addition commands to > all the replicas, > Where they may be performed in a different order. (We delete a set of > documents, say 1000 at a time, > Then add them one-by-one semi-asynchronously). > Lately we have noticed a subtle difference in query scores across different > replicas of the same shard. > Further investigation showed that the only noticeable difference between the > replicas was the index directory structure: > 1. Different replicas have different sets of segments - most segment > files are the same, but some are different. > 2. The numbers of deleted documents are different between two replicas > of the same shard. > Is this a known behavior of Java Lucene? > How can we change this behavior? We want different replicas returning the > exact same score per query hits. > (We would rather not optimize the index as we believe this will harm > performance.) > > TIA, > Yuval and Ophir > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org