On 12/15/22 12:43, Dominique Bejean wrote:
I have a sharded collection distributed over several Solr nodes. Each Solr
node hosts one shard and one replica of another shard. Shards are huge (100
million documents). Queries use several filter queries (fq). The filterCache for
this number of documents can use a large amount of heap memory.

Terminology nit: Each node is hosting two replicas, each belonging to a different shard. It's not "a shard and a replica" ... it's "a replica of shard N and a replica of shard M". One replica of each shard becomes leader, but that is a mutable, temporary distinction. Any NRT or TLOG replica can become leader.

Is it a good idea to split shards by 2 or 4 in order to have shards with 50
or 25 million documents?
With a split by 4, a Solr node will host 8 replicas instead of 2, but with
a smaller filterCache for each replica.

The total amount of heap memory required for the filterCache will not be reduced by this. And there are other per-core memory structures that would need more total heap memory, not less.
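Here's a back-of-the-envelope sketch of why. A filterCache entry is, in the worst case, a bitset with one bit per document in the core, so the total across all replicas on a node depends on total documents, not on how they are sharded (function names and the cache size of 256 are illustrative):

```python
def filter_cache_bytes(docs_per_shard, shards_per_node, cache_size):
    """Worst-case filterCache heap on one node: each entry is a bitset
    of docs_per_shard bits (docs_per_shard / 8 bytes)."""
    return docs_per_shard // 8 * shards_per_node * cache_size

before = filter_cache_bytes(100_000_000, 2, 256)  # two 100M-doc replicas
after = filter_cache_bytes(25_000_000, 8, 256)    # eight 25M-doc replicas
print(before == after)  # True: total cache heap on the node is unchanged
```

The per-entry cost shrinks by 4, but the entry count across the node grows by 4, so the totals cancel out.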

If you have a LOT of CPU capacity and a fairly low query rate, more shards per Solr instance can yield a net increase in query performance. But if the query rate is high or the CPU core count is low, you're probably better off with fewer shards.
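If you do decide to experiment with it, splitting happens online via the Collections API; host, collection, and shard names below are placeholders:

```shell
# SPLITSHARD creates two sub-shards from shard1; the parent shard is
# left inactive once the split completes. Hypothetical names shown.
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"
```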

I don't expect to have better search performance, but I expect to have
faster warming and mainly less heap memory impacted by opening a searcher
during soft commit.  For instance, instead of having one large filterCache
warming up once each minute, 4 smaller filterCaches will warm up not at the
same time (hopefully).

Multiple smaller caches at the same time might warm a bit faster if system capacities are sufficient. But you seem to be under the impression that running eight smaller indexes instead of two larger indexes will drop your heap requirement. It won't. It would actually increase it, and due to efficiencies in the way Lucene builds each index, it would also increase your disk space requirements. I don't have any way of knowing how much these things would increase.
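For reference, warming cost is governed by the cache definition in solrconfig.xml; lowering autowarmCount (or size) there is a more direct lever than splitting shards. Values below are illustrative, not recommendations:

```xml
<!-- Illustrative values only. autowarmCount controls how many entries
     are regenerated against the new searcher at each commit. -->
<filterCache class="solr.CaffeineCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
```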

Multiple threads/processes is the best way to increase Solr indexing performance. Memory is the best way to increase general Solr performance -- lots of extra memory for disk caching, especially with really big indexes. That might mean greatly increasing the server count and also increasing the shard count, so each server's indexes are smaller and fit into the disk cache better.

At least you haven't mentioned running multiple Solr instances per machine. That only makes sense for REALLY large installs. I'm sorry to tell you that while yours is not small, we've heard from people who have billions of documents and terabytes of index data when counting only one replica of each shard. The JVM can use compressed object pointers (compressed oops) only when the heap stays below 32GB, so for really large systems, two Solr instances each with a 31GB heap actually have MORE memory available for Solr than one instance with a 64GB heap.
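If you ever do run multiple instances, the heap for each is set in solr.in.sh; the 31g value here is just a safe margin below the 32GB threshold:

```shell
# solr.in.sh -- keep each instance's heap below 32GB so the JVM can
# keep using compressed object pointers (compressed oops).
SOLR_HEAP="31g"
```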

Thanks,
Shawn
