Hi All,

I verified the hypothesis shared by Deepak in this mail thread against several cases, some where the indexed date differed between the copies and some where it was the same, and it held in every one of them.
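
To make the rule I tested concrete, here is a minimal Python sketch of the hypothesis as summarised in this mail. It models the hypothesis only, not Solr's actual merge logic, and the names `Copy`, `indexed_date` and `pick_displayed` (as well as the version numbers) are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Copy:
    """One copy of a duplicated document, as seen on a single shard."""
    shard: str
    indexed_date: datetime  # hypothetical name for the indexed-date field
    version: int            # value of the _version_ field on that shard


def pick_displayed(copies):
    """Return the copy the hypothesis predicts will be displayed.

    Oldest indexed_date wins; on a tie, the higher version wins.
    """
    return min(copies, key=lambda c: (c.indexed_date, -c.version))


if __name__ == "__main__":
    same_day = datetime(2024, 8, 1)
    a = Copy("shard1", same_day, 100)
    b = Copy("shard2", same_day, 200)
    print(pick_displayed([a, b]).shard)  # shard2
```

The document I describe in this mail behaves like the opposite of the tie-break branch: with equal indexed dates, the copy with the lower version is the one returned.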

*Summarising the Hypothesis:* Solr handles duplicate documents [documents present in multiple shards] by preferring the copy that is oldest according to indexed date; if the indexed date is the same, it compares _version_ and displays the copy with the higher version.

However, I recently found a document that does not follow this hypothesis. The document is present on 2 shards and its indexed date is the same on both, yet the copy with the lower _version_ is the one being returned [contrary to the above hypothesis].

To check whether the visible version is correct, I filtered each copy on its version:
1. [query: fq=id:{document-copy1-id} AND _version_:{document-copy1-version}]
2. [query: fq=id:{document-copy2-id} AND _version_:{document-copy2-version}]

and found that one of the copies is not returned once the fq on version is added.

*How does Solr set the _version_ field? Is there a possibility that the version displayed is incorrect? Does Solr maintain a different version internally which can differ from the one visible?*
*Is this the reason why the above hypothesis is failing?*

Would appreciate any help regarding Solr's duplicate handling and my aforementioned doubts!

On Thu, Aug 1, 2024 at 4:38 PM Saksham Gupta <saksham.gu...@indiamart.com> wrote:

> Hi Deepak,
>
> Thanks for digging out such a detailed answer for my query. I did observe
> that the documents indexed earlier were the ones being displayed, but could
> not find any relevant documentation supporting this.
>
> Although, I could not understand the nuances pointed out in point 4, What
> do we mean by `If a commit happens between the first and
> second phase of the distributed search`, what is first and second
> phase here, and what issue will it cause?
>
> On Wed, Jul 31, 2024 at 12:24 PM Deepak Goel <deic...@gmail.com> wrote:
>
>> *Answer from Copilot:*
>>
>>
>> Ah, the intricate dance of Solr shards and their cosmic collisions! Let’s
>> unravel this like a digital detective, shall we?
>> 🕵️‍♂️
>>
>> When it comes to Solr and its distributed architecture, handling duplicate
>> documents across shards can be as tricky as juggling flaming torches while
>> riding a unicycle. But fear not—I’ve got some insights for you:
>>
>> 1. *Duplicate Documents and Shards:*
>>    - Imagine our document—a digital doppelgänger—migrating from one shard
>>      to another. It’s like a restless soul seeking a new home.
>>    - During this transition, both shards might harbor copies of the same
>>      document. They’re like twins separated at birth, each vying for the
>>      spotlight.
>> 2. *The Solr Query Showdown:*
>>    - Now, let’s stage a Solr query duel. Our query gallops across the
>>      shards, demanding answers.
>>    - If our document is the top-ranked contender in both shards, who
>>      emerges victorious? 🏆
>> 3. *The Winner Takes It All (Sort of):*
>>    - Solr, being the wise oracle it is, follows a simple rule: *“First
>>      come, first served.”*
>>    - When Solr discovers duplicate document IDs during distributed
>>      searching, it selects the *first document* it encounters and discards
>>      subsequent ones. It’s like a cosmic game of “finders keepers.”
>>    - So, whichever shard’s copy of the document was indexed first—the
>>      early bird with the freshest ink—takes the spotlight. The other copy
>>      bows out gracefully.
>> 4. *The Momentary Sync Shimmy:*
>>    - But wait! There’s a twist. If a commit happens between the first and
>>      second phase of the distributed search, the index might shimmy out of
>>      sync for a moment.
>>    - Picture this: Shard A says, “I’ve got the document!” Shard B says,
>>      “No, I’ve got it!” And Solr, in its infinite wisdom, says, “Hold my
>>      query, folks—I need to sync up.”
>>    - Eventually, harmony is restored, and the universe aligns itself. But
>>      for that brief moment, Solr juggles realities like a cosmic circus
>>      performer.
>> 5. *The Shard Key Sorcery:*
>>    - Remember the shard key? It’s like Solr’s secret handshake.
>>      You can use it to influence how documents are distributed across
>>      shards.
>>    - For example, if you want to spread documents related to a specific
>>      customer (let’s say “IBM”) across multiple shards, you can use a syntax
>>      like this: "shard_key/num!document_id". The /num part determines how
>>      many bits from the shard key contribute to the composite hash
>>      <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html>.
>> 6. *Balance and Scalability:*
>>    - To prevent hotspots, distribute documents evenly across shards.
>>      Balance is key!
>>    - Choose shard keys that reflect your data’s access patterns. Think of
>>      them as Solr’s cosmic compass.
>>    - And maintain flexibility—consider using composite IDs for easier
>>      scalability. It’s like Solr’s way of saying, “Why settle for one shard
>>      when you can have a whole constellation?”
>>
>> So, in the grand Solr arena, the early bird document wins the query race.
>> But remember, even in the digital cosmos, duplicates play by the
>> rules—mostly.
>>
>>
>> Deepak
>> "The greatness of a nation can be judged by the way its animals are treated
>> - Mahatma Gandhi"
>>
>> +91 73500 12833
>> deic...@gmail.com
>>
>> LinkedIn: www.linkedin.com/in/deicool
>>
>> "Plant a Tree, Go Green"
>>
>> Make In India : http://www.makeinindia.com/home
>>
>>
>> On Mon, Jul 29, 2024 at 10:11 PM Saksham Gupta
>> <saksham.gu...@indiamart.com.invalid> wrote:
>>
>> > Hi Solr Developers,
>> >
>> > Which solr document will be displayed if a duplicate instance of the same
>> > document is present?
>> >
>> > In our current solr architecture, there is a possibility that a document
>> > can move from one solr shard to another shard.
>> > While the document will eventually be deleted from its old shard, there
>> > will be some duration where multiple instances of this document will be
>> > present.
>> >
>> > Now, if a solr query executes on both these shards and this document is
>> > the top ranked document from both the shards, which document will be
>> > returned in solr result?
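
P.S. In case anyone wants to repeat the version check from the top of this mail against each shard separately, below is a rough Python sketch of how the two filter queries could be built. It relies on Solr's `distrib=false` request parameter so that the request is answered by the addressed core alone; the host, core name, document id and version in the usage line are placeholders I made up, and `shard_check_url` is just an illustrative helper name.

```python
from urllib.parse import urlencode


def shard_check_url(core_url, doc_id, version=None):
    """Build a /select URL that is answered by a single core only."""
    params = [
        ("q", "*:*"),
        ("fq", f'id:"{doc_id}"'),
        ("fl", "id,_version_"),
        ("distrib", "false"),  # do not fan the request out to other shards
    ]
    if version is not None:
        # Same idea as the fq on version described in the mail.
        params.append(("fq", f"_version_:{version}"))
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder core URL, id and version; substitute real values.
    print(shard_check_url("http://host1:8983/solr/coll_shard1_replica_n1",
                          "doc-1", 1806))
```

Running it once per shard, first without and then with the version argument, should show which copy stops matching when the fq on version is added.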