Here is what I would do. Take it with a grain of salt, but it has worked solidly for me.

Have a single master Solr node that takes all the incoming data as the indexer, and 
have it replicate out to replicas (these used to be called "slaves"; newer Solr 
releases call them "followers"). Point your users at one replica and your reports 
at another. Keeping each workload on one server keeps that server hot, in the sense 
that the more it is queried, the more of the index stays in memory. Put all of them 
behind a proxy like nginx so you control which server is hot and can fail over to 
the others when needed (a rough nginx sketch is below). SolrCloud is good in theory 
but, in my experience, won't be as fast or as reliable. I'm sure people will say 
otherwise, but standalone Solr is very fast and solid with enough metal to back it 
up: give it enough memory and an SSD big enough to hold your index and SolrCloud 
will never beat it.
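
For the proxy piece, a minimal nginx sketch of what I mean is below. The hostnames 
and ports are placeholders; the "backup" flag is what keeps all traffic on the first 
slave and only moves it to the next one when that box goes down:

    upstream solr_search {
        server solr-slave1:8983;            # primary: all user traffic goes here
        server solr-slave2:8983 backup;     # only used if slave1 is down
        server solr-slave3:8983 backup;
    }

    server {
        listen 80;
        location /solr/ {
            proxy_pass http://solr_search;
        }
    }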

Again, if I were doing it, and using the old naming convention:

One Solr master indexer -> one live master/slave (what the replication docs call a 
repeater) that replicates from it as needed -> your ten or so slaves, which 
replicate every 5 minutes or so (ten isn't needed; stick to three and see how it 
goes).
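
With classic standalone replication that whole chain is just the replication handler 
in solrconfig.xml. A rough sketch, with placeholder host and core names; note that 
newer Solr versions spell these elements leader/follower and leaderUrl instead of 
master/slave and masterUrl:

    <!-- on the indexer (master) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <!-- ship whatever config files the slaves need -->
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- on each slave: poll the master (or the repeater) every 5 minutes -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://solr-master:8983/solr/core1/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>

The repeater in the middle just has both the master and slave sections in the same 
handler.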

Users -> nginx proxy -> the slaves in order, but only ever send traffic to one of 
them: no round robin, just one primary, with failover to the next on failure.
Reports go to a different slave. That's fine, since it's the same data the users 
see, but the report queries won't compete with user queries for resources (second 
nginx sketch below).
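
To keep reports off the user-facing boxes I would just give them their own upstream 
inside the same nginx config, again with made-up hostnames:

    upstream solr_reports {
        server solr-slave3:8983;            # dedicated reporting slave
        server solr-slave1:8983 backup;     # fall back to a user slave if it dies
    }

    # inside the same server block as before
    location /solr-reports/ {
        proxy_pass http://solr_reports/solr/;
    }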

Tune each of these three roles to do what it is supposed to do; an indexing server 
wants different settings than a search server.
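
What I mean by that, as a rough solrconfig.xml sketch (the cache class is 
solr.CaffeineCache on Solr 8+; older versions use FastLRUCache, and the numbers here 
are just illustrative guesses, not recommendations):

    <!-- indexer: give the index writer room, skip cache warming -->
    <indexConfig>
      <ramBufferSizeMB>512</ramBufferSizeMB>
    </indexConfig>
    <query>
      <filterCache class="solr.CaffeineCache" size="64" autowarmCount="0"/>
    </query>

    <!-- search slaves: warm the caches so queries right after a
         replication cycle aren't cold -->
    <query>
      <filterCache class="solr.CaffeineCache" size="4096" autowarmCount="256"/>
      <queryResultCache class="solr.CaffeineCache" size="1024" autowarmCount="64"/>
    </query>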

Just my thoughts and experience with a few terabytes of index. Also be certain the 
machines have three times as much disk space free as a full index (segment merges 
and replication copies need the headroom), and keep your Java heap below 32 GB on 
everything so the JVM keeps using compressed object pointers. Servers, SSDs, and 
memory are cheap, and master/slave replication is the most reliable setup I've used.
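
The heap part is just a line in solr.in.sh (or however you pass JVM options); 31 GB 
keeps you safely under the compressed-oops cutoff, and the rest of the RAM is left 
for the OS to cache the index files:

    # bin/solr.in.sh
    SOLR_HEAP="31g"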




> On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <mar...@matosconsulting.com> 
> wrote:
> 
> Hi all,
> 
> I recently inherited a team/app that has been running on a single instance
> of SOLR for many years. An attempt was made to migrate to a 10 node cluster
> configuration and we immediately encountered some issues which appear to be
> related to the fact that data is being read from nodes where data
> replication had not yet completed. The highlights:
> 
> 
>   - 10 node cluster with 5 instances per DC with a mix of NRT and TLOG
>   - Data is sourced from another system in large batches throughout the
>   day (another system triggers our system on an adhoc basis, which then
>   refreshes data from the upstream system).
>      - These updates take from minutes to up to 2 hours
>      - We have an autoCommit of every 1 min and autoSoftCommit every 1 sec
>   - We also have numerous background processes which kick off on a
>   schedule (some every 15 mins, some hourly, some daily) which execute
>   queries and perform a variety of actions based on the current state of the
>   data
>      - e.g. New records = send an email notifying users of some things
>      they need to do
>      - e.g. Removed records = send an email notifying users of some updates
>      - (Significantly more complex than this.)
>      - Background jobs are NOT aware of whether or not a refresh (first
>      bullet) is currently underway
>   - Based on our investigation, we *think* our application is getting
>   incomplete results when executing queries during / shortly after data
>   refreshes, and making incorrect decisions (e.g. notifying users that some
>   records were removed when they actually weren't, followed by a future
>   notification that the records are back)
> 
> 
> Would appreciate any advice or things to consider based on the above.
> 
> Thank you!
