I may have got this wrong, but I think it might be better to shard randomly, not on a value from one of your source documents, as otherwise certain searches will only hit some of the shards and possibly overload them.  This might also be the cause of the behaviour below.
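
To illustrate what sharding "randomly" could look like, here is a minimal Python sketch that places each document by a hash of its ID and sends every query to all shards. It is a toy model, not Solr code: `zlib.crc32` is only a stand-in for whatever hash a real router uses, and `shard_for`, `placement` and `search` are hypothetical names. The three documents are the ones from the quoted message, with the trailing "P" dropped from the values.

```python
import zlib

NUM_SHARDS = 56  # shard count taken from the quoted message

# The three example documents from the quoted message (field FieldA only).
docs = {
    "Doc1": {"id21", "id29", "id60"},
    "Doc2": {"id19", "id9", "id8"},
    "Doc3": {"id101", "id29", "id108"},
}

def shard_for(doc_id):
    """Pick a shard from the document ID alone (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

# Placement depends only on the ID, never on the values in FieldA.
placement = {name: shard_for(name) for name in docs}

def search(query_values):
    """Fan the query out to every shard and merge the hits."""
    hits = []
    for shard in range(NUM_SHARDS):
        for name, values in docs.items():
            if placement[name] == shard and values & query_values:
                hits.append(name)
    return sorted(hits)

result = search({"id21", "id8", "id108"})
print(result)
```

Because no query is ever restricted to a subset of shards, a document cannot be missed just because it sits on a shard the query did not name. The cost is that every query touches every shard.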

Charlie

On 30/11/2023 04:36, Saksham Gupta wrote:
Hi All,
Pinging again for some assistance.

On Wed, Nov 29, 2023 at 7:11 PM Saksham Gupta <saksham.gu...@indiamart.com>
wrote:

Hi Solr Developers,

Problem Statement

We have been using SolrCloud with implicit sharding. The data of the
collection was divided into 8 shards. In order to reduce the response time,
we thought of sharding the data further.

Therefore we planned on sharding the solr data into 56 shards to reduce
response time. According to this sharding strategy, one of the values of a
multivalued field is being used to decide the shard of the document.

But this has led to loss of documents.

How is the loss happening? Explaining the problem with an example:

Consider 3 solr Documents:

Doc1

{

FieldA: id21, id29, id60P;

Field2: val2;

}

Doc2

{

FieldA: id19, id9, id8P;

Field2: val1;

}

Doc3

{

FieldA: id101, id29, id108P;

Field2: val4;

}

While Querying on Solr:

Let’s consider the query: fq=FieldA: id21+id8+id108;

According to the previous sharding, Doc1, Doc2, and Doc3 will be returned in
the results, as the filter query matches at least one value present in
each document, i.e. id21 in Doc1, id8 in Doc2 and id108 in Doc3.


According to the new sharding, only Doc2 and Doc3 will be returned and
Doc1 will not be included in results because the query will be routed
only to the shards corresponding to values present in filter query i.e.
shard21,shard8,shard108 and Doc1 is present on shard60.
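
The difference between the two cases can be reproduced with a small Python sketch. It is a toy model, not Solr code, and it assumes that the value marked with "P" in each document is the one that decides its shard (which matches Doc1 sitting on shard60). The names `docs` and `search` are hypothetical.

```python
# Each document lives on the shard named after its "P" value.
docs = {
    "Doc1": {"values": {"id21", "id29", "id60"}, "shard": "shard60"},
    "Doc2": {"values": {"id19", "id9", "id8"}, "shard": "shard8"},
    "Doc3": {"values": {"id101", "id29", "id108"}, "shard": "shard108"},
}

def search(query_values, shards=None):
    """Return matching document names, optionally limited to some shards."""
    hits = []
    for name, doc in docs.items():
        if shards is not None and doc["shard"] not in shards:
            continue  # the query never reaches this shard
        if doc["values"] & query_values:
            hits.append(name)
    return sorted(hits)

query = {"id21", "id8", "id108"}

# Previous behaviour: the query is sent to every shard.
all_shards = search(query)

# New behaviour: the query is routed only to shards named after its values.
routed = search(query, shards={"shard" + v[2:] for v in query})

print(all_shards)
print(routed)
```

Doc1 matches the query through id21, but it is stored on shard60, which the routed query never visits.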

INDEXING


QUERYING ON THIS COLLECTION

And our query won’t even go to the shard that contains document1.
Therefore, document1 will not be returned in the results.

Probable Solutions

To deal with this, we can index the same document on multiple shards based
on all the values of the field. But handling indexing/deletion when the
values of this field are changed would be very complicated. So, this index
can be very complex to maintain.
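
The bookkeeping this would need can be sketched in Python. This is a toy model under stated assumptions, not Solr code: it assumes one shard per field value (id21 maps to shard21), `shards_for` and `plan_update` are hypothetical helpers, and id75 is a made-up new value used only for illustration.

```python
def shards_for(values):
    """Hypothetical mapping: one shard per field value, e.g. id21 -> shard21."""
    return {"shard" + v[2:] for v in values}

def plan_update(old_values, new_values):
    """Work out which shards need a delete and which need a (re)index."""
    old = shards_for(old_values)
    new = shards_for(new_values)
    to_delete = sorted(old - new)  # copies that would otherwise go stale
    to_index = sorted(new)         # every current copy must be rewritten
    return to_delete, to_index

# Doc1 loses id29 and gains the made-up value id75.
to_delete, to_index = plan_update(
    {"id21", "id29", "id60"},
    {"id21", "id60", "id75"},
)
print(to_delete)
print(to_index)
```

Every update then needs the previous values of the field in order to find the stale copies, and a query that reaches more than one of a document's shards would also return it more than once unless the results are de-duplicated.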

Is this the most optimal way or is there a better way to achieve the goal
and avoid losing any documents?


--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828

OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II
