Hi Solr Developers,

Problem Statement

We have been using SolrCloud with implicit sharding. The collection's data
was divided into 8 shards. To reduce response time, we decided to shard the
data further.

We therefore planned to split the Solr data into 56 shards. Under this
sharding strategy, one of the values of a multivalued field decides which
shard a document is stored on.

However, this has led to documents going missing from query results.

How is the loss happening? Here is the problem explained with an example.

Consider 3 Solr documents:

Doc1

{

FieldA: id21, id29, id60P;

Field2: val2;

}

Doc2

{

FieldA: id19, id9, id8P;

Field2: val1;

}

Doc3

{

FieldA: id101, id29, id108P;

Field2: val4;

}

While querying Solr:

Consider the query: fq=FieldA: id21+id8+id108;

Under the previous sharding, Doc1, Doc2, and Doc3 are all returned, because
the filter query matches at least one value present in each document: id21
in Doc1, id8 in Doc2, and id108 in Doc3.


Under the new sharding, only Doc2 and Doc3 are returned. Doc1 is not
included in the results because the query is routed only to the shards
corresponding to the values in the filter query (shard21, shard8, and
shard108), while Doc1 is stored on shard60.

INDEXING (figure not included)


QUERYING ON THIS COLLECTION

Our query never reaches the shard that contains Doc1, so Doc1 is not
returned in the results.
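
To make the example concrete, here is a small toy model of the routing
in plain Python. It is not Solr code. It assumes that the trailing "P" in
the examples marks the value that decides the shard, and that a shard is
named after that value (id60 goes to shard60). The function names are made
up for illustration.

```python
# Toy model of the routing described above; this is not Solr code.
# Assumption: a trailing "P" marks the value that decides the document's shard.

DOCS = {
    "Doc1": ["id21", "id29", "id60P"],
    "Doc2": ["id19", "id9", "id8P"],
    "Doc3": ["id101", "id29", "id108P"],
}


def routing_value(values):
    """Return the value marked with 'P', without the marker."""
    marked = [v for v in values if v.endswith("P")]
    return marked[0][:-1]


def plain_values(values):
    """Strip the 'P' marker so values can be compared with query terms."""
    return {v[:-1] if v.endswith("P") else v for v in values}


def build_index(docs):
    """Place each document on the single shard named after its routing value."""
    shards = {}
    for name, values in docs.items():
        shard = "shard" + routing_value(values)[2:]  # "id60" -> "shard60"
        shards.setdefault(shard, {})[name] = plain_values(values)
    return shards


def search(shards, terms, targeted):
    """Match documents containing any term.

    With targeted=True only the shards named after the query terms are
    visited; otherwise every shard is visited.
    """
    wanted = set(terms)
    visit = {"shard" + t[2:] for t in terms} if targeted else set(shards)
    hits = set()
    for shard in visit:
        for name, values in shards.get(shard, {}).items():
            if values & wanted:
                hits.add(name)
    return sorted(hits)


shards = build_index(DOCS)
query = ["id21", "id8", "id108"]
fan_out = search(shards, query, targeted=False)  # query sent to every shard
routed = search(shards, query, targeted=True)    # query sent to shard21, shard8, shard108
print(fan_out)
print(routed)
```

In this model the fan-out search finds all three documents, while the
routed search skips shard60 and so misses Doc1, even though Doc1 contains
id21.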

Probable Solutions

To deal with this, we could index the same document on multiple shards,
one for each value of the field. But handling indexing and deletion when
the values of this field change would be very complicated, so such an
index could be very complex to maintain.
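
As a rough illustration of the bookkeeping this would need, here is a
plain-Python sketch (again not Solr code). It assumes one copy of the
document per field value and the same shard naming as in the example; the
helpers shards_for, plan_update, and dedupe are hypothetical names. In this
toy model a query naming two values of the same document would also get
that document back from two shards, so results are merged by id.

```python
# Toy model of the "index on every matching shard" idea; this is not Solr code.
# All helper names here are hypothetical.

def shards_for(values):
    """Every shard a document must live on: one per field value."""
    return {"shard" + v[2:] for v in values}  # "id21" -> "shard21"


def plan_update(old_values, new_values):
    """Return (shards to write the new version to, shards to delete the stale copy from)."""
    old, new = shards_for(old_values), shards_for(new_values)
    return sorted(new), sorted(old - new)


def dedupe(per_shard_hits):
    """Merge per-shard result lists, keeping each document id once, in first-seen order."""
    seen = []
    for hits in per_shard_hits:
        for doc_id in hits:
            if doc_id not in seen:
                seen.append(doc_id)
    return seen


# Doc1 changes one value: id29 is replaced by id30.
write_to, delete_from = plan_update(
    ["id21", "id29", "id60"],
    ["id21", "id30", "id60"],
)
print(write_to)
print(delete_from)

# The same document coming back from two shards is reported once.
merged = dedupe([["Doc1", "Doc3"], ["Doc1"], ["Doc2"]])
print(merged)
```

Every update has to know the previous values of the field in order to
compute delete_from, which is where most of the maintenance cost would come
from.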

Is this the best approach, or is there a better way to achieve the goal
without losing any documents?
