On 12/15/22 12:43, Dominique Bejean wrote:
I have a sharded collection distributed over several Solr nodes. Each Solr
node hosts one shard and one replica of another shard. Shards are huge (100
million documents). Queries use several filter queries (fq). The filterCache for
this number of documents can use a large amount of heap memory.

Terminology nit: Each node is hosting two replicas, each belonging to a different shard. It's not "a shard and a replica" ... it's "a replica of shard N and a replica of shard M". One replica of each shard becomes leader, but that is a mutable, temporary distinction. Any NRT or TLOG replica can become leader.

Is it a good idea to split shards by 2 or 4 in order to have shards with 50
or 25 million documents?
With a split by 4, a Solr node will host 8 replicas instead of 2, but with
a smaller filterCache for each replica.

The total amount of heap memory required for the filterCache will not be reduced by this. And there are other per-core memory structures that would need more total heap memory, not less.
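Here's a back-of-the-envelope sketch of why. A filterCache entry is, in the worst case, a bitset with one bit per document in the core, so the total across all replicas on a node depends on total documents, not on how they are sharded (function names and the cache size of 256 are illustrative):

```python
def filter_cache_bytes(docs_per_shard, shards_per_node, cache_size):
    """Worst-case filterCache heap on one node: each entry is a bitset
    of docs_per_shard bits (docs_per_shard / 8 bytes)."""
    return docs_per_shard // 8 * shards_per_node * cache_size

before = filter_cache_bytes(100_000_000, 2, 256)  # two 100M-doc replicas
after = filter_cache_bytes(25_000_000, 8, 256)    # eight 25M-doc replicas
print(before == after)  # True: total cache heap on the node is unchanged
```

The per-entry cost shrinks by 4, but the entry count across the node grows by 4, so the totals cancel out.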

If you have a LOT of CPU capacity and a fairly low query rate, more shards per Solr instance can yield a net increase in query performance. But if the query rate is high or the CPU core count is low, you're probably better off with fewer shards.
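If you do decide to experiment with it, splitting happens online via the Collections API; host, collection, and shard names below are placeholders:

```shell
# SPLITSHARD creates two sub-shards from shard1; the parent shard is
# left inactive once the split completes. Hypothetical names shown.
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"
```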

I don't expect to have better search performance, but I expect to have
faster warming and mainly less heap memory impacted by opening a searcher
during soft commit.  For instance, instead of having one large filterCache
warming up once each minute, 4 smaller filterCaches will warm up not at the
same time (hopefully).

Multiple smaller caches at the same time might warm a bit faster if system capacities are sufficient. But you seem to be under the impression that running eight smaller indexes instead of two larger indexes will drop your heap requirement. It won't. It would actually increase it, and due to efficiencies in the way Lucene builds each index, it would also increase your disk space requirements. I don't have any way of knowing how much these things would increase.
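For reference, warming cost is governed by the cache definition in solrconfig.xml; lowering autowarmCount (or size) there is a more direct lever than splitting shards. Values below are illustrative, not recommendations:

```xml
<!-- Illustrative values only. autowarmCount controls how many entries
     are regenerated against the new searcher at each commit. -->
<filterCache class="solr.CaffeineCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
```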

Multiple threads/processes is the best way to increase Solr indexing performance. Memory is the best way to increase general Solr performance -- lots of extra memory for disk caching, especially with really big indexes. That might mean greatly increasing the server count and also increasing the shard count, so each server's indexes are smaller and fit into the disk cache better.

At least you haven't mentioned running multiple Solr instances per machine. That only makes sense for REALLY large installs. I'm sorry to tell you that while yours is not small, we've heard from people who have billions of documents and terabytes of index data when counting only one replica of each shard. The JVM can use compressed object pointers (compressed oops) only when the heap stays below 32GB, so for really large systems, two Solr instances each with a 31GB heap actually have MORE memory available for Solr than one instance with a 64GB heap.
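If you ever do run multiple instances, the heap for each is set in solr.in.sh; the 31g value here is just a safe margin below the 32GB threshold:

```shell
# solr.in.sh -- keep each instance's heap below 32GB so the JVM can
# keep using compressed object pointers (compressed oops).
SOLR_HEAP="31g"
```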

Thanks,
Shawn
