Hi All,

*Understanding of Duplicate Document Handling by Solr*

As per an older discussion on the Solr community list [ref mail: *Ranking of
duplicate documents on solr*], Solr handles duplicate documents [documents
with the same id present in multiple shards] by preferring the document that
is oldest according to indexed date; if the indexed dates are the same, it
compares _version_ and displays the document with the higher _version_.

We verified the aforementioned hypothesis for a few cases where the indexed
date was different and where it was the same, and the hypothesis turned out
accurate for all of the cases.


*Issue Details*
Recently, I found a document that does not follow the above hypothesis. The
document is present on 2 shards and its indexed date is the same on both,
yet the copy with the lower _version_ is the one being returned [contrary to
the above hypothesis]. To check whether the visible _version_ is correct, I
filtered for each copy by its _version_:
1. [query: *fq=id:{document-copy1-id} AND _version_:{document-copy1-version}*],
2. [query: *fq=id:{document-copy2-id} AND _version_:{document-copy2-version}*],
and found that one of the two copies is not returned once the fq on
_version_ is added.
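
For anyone who wants to reproduce the check, here is a minimal sketch of how
the two filter queries above could be built, under some assumptions: the
core URL, document id and version value are hypothetical placeholders, and
the function names are made up for illustration. It adds distrib=false so
that each request is answered only by the core it is sent to, which should
avoid any merging of duplicates across shards. It only builds the URLs and
does not contact Solr.

```python
from urllib.parse import urlencode


def copy_check_params(doc_id, version):
    """Parameters for a non-distributed lookup of one copy of a document."""
    return {
        "q": "*:*",
        # Same filter as in the mail: match the id and the expected _version_.
        "fq": f'id:"{doc_id}" AND _version_:{version}',
        "fl": "id,_version_",
        # Ask only the core that receives the request; no cross-shard merge.
        "distrib": "false",
    }


def copy_check_url(core_url, doc_id, version):
    """Full select URL for one core (core_url is a hypothetical placeholder)."""
    query = urlencode(copy_check_params(doc_id, version))
    return f"{core_url.rstrip('/')}/select?{query}"


if __name__ == "__main__":
    # Hypothetical core URL, id and version; substitute real values per copy.
    print(copy_check_url(
        "http://localhost:8983/solr/mycollection_shard1_replica_n1",
        "doc-1",
        1700000000000000000,
    ))
```

Running it once per shard, with that shard's core URL and the _version_
expected for its copy, would show whether each core really holds the
_version_ that the distributed query reports.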


*How does Solr set the _version_ field? Is it possible that the displayed
_version_ is incorrect? Does Solr maintain a different version internally
that can differ from the visible one? Is this the reason why the above
hypothesis is failing?*

I would appreciate any help regarding Solr's handling of duplicate documents
and the questions above!
