On 5/18/22 08:42, Hasmik Sarkezians wrote:
Have a question about shard splitting and compositeId usage. We are
starting a solr collection with X number of shards for our multi-tenant
application. We are assuming that the number of shards will increase over
time as the number of customers grows as well as the customer data.
We are thinking of using the <customerId>/num!docId format to specify
multiple shards for my tenants depending on the number of records that we
will index. We will start with 4 shards and then my assumption is that we
use the shard split to add more shards to the collection.
customer size X = 1 shard and as such the compositeId would be
customer1!docId
customer size 5*X = 2 shards and as such the compositeId would be
customer2/1!docId
And now if I split the shards and the number of shards becomes 5, 6, 7, 8
what happens to the data? The point is I don't want the customer2 endup in
4 shards when we get to have 8 shards. If someone can shed some light here
I would appreciate it.
I wonder if you have a good understanding of how a compositeId works.
The prefix does not directly dictate what shard a document will end up
in. It determines how many bits of the full 32-bit ID hash will be
computed from the prefix and how many from the rest of the ID.
https://solr.apache.org/guide/8_11/shards-and-indexing-data-in-solrcloud.html#document-routing
Something not stated there is how many bits are used if the number is
not specified. Looking at the code, the default appears to be 16 if the
number of parts in the ID is 2, and 8 if the number of parts in the ID
is 3. I don't think it supports more than 3 parts.
When you split a shard, the hash range for the shard will be split, and
the range for the new shards will be smaller than any other shards that
were not split. So it may not be completely predictable which shards a
composite ID will be stored in when you split them. If you split ALL
shards in half, then a prefix that limited the number of shards to 2
could result in those documents being split across 4 shards, but
depending on how many documents there are with that prefix and EXACTLY
how the hashes end up being divided, it could be as low as 2 shards and
as high as 4. If there are a lot of documents with that prefix, chances
are that it would be 4 shards.
If you want explicit control over which shard a document ends up in, you
cannot use compositeId. You'll have to use the implicit router and
designate a field where the name of the shard will go. I don't think
splitting shards is possible with the implicit router.
Thanks,
Shawn