We are running clusters with hundreds of nodes in a single SolrCloud with many collections, some of which have up to 2,048 shards. Generally, the challenges we have seen relate to handling updates to state.json, particularly on restarts of nodes in the cluster, since a node reboot results in a state change for that node's replicas in state.json. With a large number of shards, this can mean frequent reading/writing of a large state.json file, which can slow down how quickly nodes are able to restart. The state.json file itself can also exceed ZooKeeper's default 1MB limit on znode size. That limit is configurable in ZooKeeper, but it is something to be aware of. I'm working on upstreaming a PR to allow compression of state.json in ZK, which would help a bit here: https://github.com/apache/solr/pull/1267.
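For reference, the 1MB znode limit is ZooKeeper's jute.maxbuffer setting, and it is enforced on both sides of the connection, so raising it only works if you raise it on the ZooKeeper servers and on the Solr nodes. A minimal sketch of what that can look like (the 10MB value is just an illustration, not a recommendation):

    # On each ZooKeeper server, e.g. in conf/zookeeper-env.sh:
    SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=10485760"   # bytes (10MB)

    # On each Solr node, e.g. in solr.in.sh:
    SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=10485760"

Keep the value consistent everywhere, or reads of large znodes can still fail on whichever side has the smaller limit.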
Also, at FullStory, working with Ishan/Noble, we developed an alternative means of managing replica up/down state called Per Replica State (PRS), which you can read more about here: https://searchscale.com/blog/prs/. This has been effective for us when operating at large scale since it limits how much we have to read/write state.json. (A minimal sketch of enabling PRS follows the quoted thread below.)

On Tue, Jan 10, 2023 at 12:21 PM Wei <weiwan...@gmail.com> wrote:

> Thanks Walter. We have a 1-to-1 mapping: each physical server hosts a
> single Solr core, so the CPU resources per node are sufficient. Because
> the index size grows constantly, each shard can only hold X million
> documents, and we don't want shards to reach their maximum capacity, the
> number of shards can grow into the several hundreds. My question is, from
> a SolrCloud/ZooKeeper perspective: is there a hard limit or a performance
> impact when the number of shards (and correspondingly the number of
> nodes) reaches a certain threshold? Another option I see is to split the
> data into multiple separate small SolrClouds, but then we have to handle
> the aggregation of results from the different clouds outside of Solr,
> which has challenges like how to compare the respective Solr scores,
> handle sorting, etc.
>
> Thanks,
> Wei
>
> On Mon, Jan 9, 2023 at 7:39 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > On Jan 9, 2023, at 1:29 PM, Wei <weiwan...@gmail.com> wrote:
> >
> > > Is there a practical limit on the number of shards and nodes in a
> > > SolrCloud? We need to scale up the Solr cloud and wonder if there is
> > > concern when increasing to a couple of hundred shards and several
> > > thousand nodes in a single cloud. Any suggestions?
> >
> > It was challenging to manage with 8 shards and a replication factor of
> > 8. At that point, we scaled vertically to bigger AWS instances. It
> > scaled smoothly up to 72-CPU instances.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
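For anyone who wants to try PRS: it is enabled per collection at creation time through the Collections API (it shipped in Solr 8.8). A minimal sketch, with a hypothetical collection name and shard count:

    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=256&replicationFactor=2&perReplicaState=true"

With perReplicaState=true, each replica's up/down state lives in its own small znode under the collection rather than in the shared state.json, so a node restart only touches that node's replicas instead of rewriting the whole file.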