On 12/15/22 12:43, Dominique Bejean wrote:
I have a sharded collection distributed over several Solr nodes. Each node hosts one shard and one replica of another shard. The shards are huge (100 million documents), and queries use several filter queries, so the filterCache for this number of documents can consume a large amount of heap memory.
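For context on the heap claim above: a filterCache entry for a large core is typically stored as a bitset with one bit per document, so its size scales with document count. A back-of-the-envelope sketch (my own arithmetic, not from the thread; the cache size of 512 is the common solrconfig.xml default):

```python
def filter_entry_mb(num_docs: int) -> float:
    """Approximate heap (MB) for one bitset filterCache entry:
    one bit per document in the core."""
    return num_docs / 8 / 1024 / 1024

def filter_cache_mb(num_docs: int, cache_size: int) -> float:
    """Approximate worst-case heap (MB) for a completely full filterCache."""
    return filter_entry_mb(num_docs) * cache_size

# 100 million docs: each cached filter costs roughly 12 MB of heap.
per_entry = filter_entry_mb(100_000_000)
# With a cache size of 512, the worst case is on the order of 6 GB per core.
full_cache = filter_cache_mb(100_000_000, 512)
print(f"{per_entry:.1f} MB per entry, {full_cache:.0f} MB worst case")
```

Real usage is usually lower, since Solr can store sparse filters more compactly, but the bitset case is what matters when filters match many documents.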
Terminology nit: Each node is hosting two replicas of different shards. It's not "a shard and a replica" ... it's "two replicas of shard N". One replica becomes leader, but that is a mutable temporary distinction. Any NRT or TLOG replica can become leader.
Is it a good idea to split the shards by 2 or 4 in order to have shards with 50 or 25 million documents? With a split by 4, a Solr node would host 8 replicas instead of 2, but with a smaller filterCache for each replica.
The total amount of heap memory required for the filterCache will not be reduced by this. And there are other per-core memory structures that would need more total heap memory, not less.
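The point above can be shown with a quick sketch (my own arithmetic, assuming the same filters end up cached in every sub-shard, which is the common case since fq values rarely depend on the shard):

```python
def total_cache_mb(total_docs: int, shards: int, cache_size: int) -> float:
    """Worst-case filterCache heap (MB) summed across all shards,
    assuming bitset entries of one bit per document."""
    docs_per_shard = total_docs / shards
    per_entry_mb = docs_per_shard / 8 / 1024 / 1024
    return per_entry_mb * cache_size * shards

# 200 million docs total: 2 shards vs. a 4-way split into 8 shards.
before = total_cache_mb(200_000_000, 2, 512)
after = total_cache_mb(200_000_000, 8, 512)
# The per-entry size drops 4x, but the number of cores rises 4x,
# so the totals come out the same.
print(f"{before:.0f} MB before, {after:.0f} MB after")
```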
If you have a LOT of CPU capacity and a fairly low query rate, more shards per Solr instance can yield a net increase in query performance. But if the query rate is high or the CPU core count is low, you're probably better off with fewer shards.
I don't expect better search performance, but I do expect faster warming and, mainly, less heap pressure from the searchers opened on each softCommit. For instance, instead of one large filterCache warming up once a minute, 4 smaller filterCaches would (hopefully) warm up at different times.
Multiple smaller caches warming at the same time might finish a bit faster if the system has the capacity to spare. But you seem to be under the impression that running eight smaller indexes instead of two larger indexes will drop your heap requirement. It won't. It would actually increase it, and due to efficiencies in the way Lucene builds each index, it would also increase your disk space requirements. I don't have any way of knowing how much these things would increase.
Using multiple threads/processes is the best way to increase Solr indexing performance. Memory is the best way to increase general Solr performance -- lots of extra memory for disk caching, especially with really big indexes. That might mean greatly increasing the server count and also increasing the shard count, so each server's indexes are smaller and fit into the disk cache better.
At least you haven't mentioned running multiple Solr instances per machine. That only makes sense for REALLY large installs. I'm sorry to tell you that while yours is not small, we've heard from people who have billions of documents and terabytes of index data when counting only one replica of each shard. Java (and therefore Solr) gets better memory efficiency by keeping the heap below 32GB, because the JVM can use compressed object pointers there. So for really large systems, two Solr instances each with a 31GB heap actually have MORE memory available for Solr than one instance with a 64GB heap.
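The 32GB figure comes from HotSpot's compressed ordinary object pointers: with a heap under roughly 32GB, object references take 4 bytes; above that threshold every reference doubles to 8 bytes. A minimal sketch of the distinction (the threshold is approximate and depends on the JVM; this is an illustration, not a JVM API):

```python
def ref_bytes(heap_gb: int) -> int:
    """Object reference size in a HotSpot JVM: compressed oops
    (4-byte references) apply only while the heap fits under ~32 GB."""
    return 4 if heap_gb < 32 else 8

# Two 31 GB instances keep 4-byte references everywhere;
# one 64 GB instance pays 8 bytes per reference across the whole heap,
# so a large slice of the extra 2 GB is eaten by pointer overhead.
print(ref_bytes(31), ref_bytes(64))
```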
Thanks, Shawn
