Hi Solr Developers,

Problem Statement
We have been using SolrCloud with implicit sharding, with the collection's data divided into 8 shards. To reduce response time, we planned to shard the data further, into 56 shards. Under this sharding strategy, one of the values of a multivalued field decides which shard a document goes to. But this has led to loss of documents in query results.

How is the loss happening?

Explaining the problem with an example. Consider 3 Solr documents:

Doc1 { FieldA: id21, id29, id60P; Field2: val2; }
Doc2 { FieldA: id19, id9, id8P; Field2: val1; }
Doc3 { FieldA: id101, id29, id108P; Field2: val4; }

While querying Solr, consider the query:

fq=FieldA: id21+id8+id108;

According to the previous sharding, Doc1, Doc2, and Doc3 are all returned, because the filter query matches at least one value in each document: id21 in Doc1, id8 in Doc2, and id108 in Doc3.

According to the new sharding, only Doc2 and Doc3 are returned. Doc1 is not included because the query is routed only to the shards corresponding to the values in the filter query (shard21, shard8, shard108), while Doc1 is stored on shard60. The query never reaches the shard that contains Doc1, so Doc1 is not returned in the results.

Probable Solutions

To deal with this, we could index the same document on multiple shards, one copy per value of the field. But handling indexing and deletion when the values of this field change would be very complicated, so such an index would be complex to maintain.

Is this the best approach, or is there a better way to achieve the goal without losing any documents?
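To make the example concrete, here is a minimal Python sketch that simulates both routing schemes in memory. It does not talk to Solr at all. The helper names (`shard_for`, `index_single`, `index_all_values`, `query`) are hypothetical, and it assumes that the value ending in "P" is the one used for routing and that a value such as id60P maps to shard60, which is only inferred from the example above.

```python
from collections import defaultdict

# The three documents from the example (FieldA values only).
DOCS = {
    "Doc1": ["id21", "id29", "id60P"],
    "Doc2": ["id19", "id9", "id8P"],
    "Doc3": ["id101", "id29", "id108P"],
}


def normalize(value):
    """Drop the trailing 'P' marker (assumed to flag the routing value)."""
    return value.rstrip("P")


def shard_for(value):
    """Hypothetical routing rule: 'id60' or 'id60P' -> 'shard60'."""
    return "shard" + normalize(value)[2:]


def index_single(docs):
    """New scheme: each document lives only on the shard of its 'P' value."""
    shards = defaultdict(dict)
    for doc_id, values in docs.items():
        routing_value = next(v for v in values if v.endswith("P"))
        shards[shard_for(routing_value)][doc_id] = values
    return shards


def index_all_values(docs):
    """Proposed workaround: one copy of the document per FieldA value."""
    shards = defaultdict(dict)
    for doc_id, values in docs.items():
        for value in values:
            shards[shard_for(value)][doc_id] = values
    return shards


def query(shards, wanted):
    """Send the filter only to the shards named by the filter values."""
    hits = set()
    for target in wanted:
        for doc_id, values in shards.get(shard_for(target), {}).items():
            if any(normalize(v) in wanted for v in values):
                hits.add(doc_id)  # the set removes duplicate copies
    return sorted(hits)


wanted = {"id21", "id8", "id108"}
print(query(index_single(DOCS), wanted))      # ['Doc2', 'Doc3']
print(query(index_all_values(DOCS), wanted))  # ['Doc1', 'Doc2', 'Doc3']
```

With one copy per value, the query finds Doc1 on shard21, but every update or delete now has to reach all shards holding a copy, and results have to be de-duplicated, which is the maintenance cost described above.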