The likely cause of the issue is that replicas are not guaranteed to finish
commits simultaneously. Solr is eventually consistent
<https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A],
where B is ahead of A because their commits completed at different times.
The final request to A (which still hasn't committed) will make it look
like a document disappeared.
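
You can confirm this by querying each replica's core directly with
distrib=false right after an update and comparing counts; a quick sketch
(hosts, core names, and the doc id are made up, adjust to your cluster):

curl 'http://host1:8983/solr/coll_shard1_replica_n1/select?q=id:SOME_ID&distrib=false'
curl 'http://host2:8983/solr/coll_shard1_replica_n2/select?q=id:SOME_ID&distrib=false'

If numFound differs between the replicas and then converges after the next
commit, you're looking at commit skew rather than lost data.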

One thing you can try is to ensure records have an indexedDate field
identifying exactly when they were indexed, and then filter requests to
only look at index dates more than a commit interval in the past:

<field name="indexedDate" type="pdate" indexed="true" stored="true"
       default="NOW" /> <!-- Don't ever send a value for this field; let
Solr fill it in. -->

If you add that field and filter on

indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)

that will catch previously indexed data (the negated term, which matches
documents that predate the field; the *:* wrapper is needed because a
purely negative clause inside an OR matches nothing on its own) and all
data older than 2 minutes (the range; note that Solr date math allows no
spaces). That set of data should have reached consistency unless your
system is struggling under load and a replica simply can't keep up (in
which case you are about to have bigger problems).
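
Applied as a filter query, a request would look something like this (the
collection name is made up, and the fq value needs URL encoding in
practice):

http://localhost:8983/solr/mycollection/select?q=*:*&fq=indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)

Also note that because the fq contains NOW, it creates a fresh filterCache
entry on every request; prefixing the fq with {!cache=false} avoids
churning the cache.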

As a side note, your commit intervals are very aggressive, but I'm guessing
that's an attempt to get around the problems you are seeing? The filter of
course has to change if you relax your commit intervals substantially.
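
For reference, those intervals live in solrconfig.xml. A sketch of a more
relaxed setup (the values are illustrative, not a recommendation):

<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit every 60s: flush to disk -->
  <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>30000</maxTime>           <!-- soft commit every 30s: new docs become visible -->
</autoSoftCommit>

Whatever you choose, keep the filter window comfortably larger than the
soft commit interval, since that's what controls visibility.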

On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:

> Here is what I would do; take it with a grain of salt, but it works solid.
>
> Have a single master Solr node that takes all the data as the indexer,
> and a set of replicas (they used to be called "slaves," but it's not PC
> any more to call them that). Use one of these for your users and another
> for your reports. This will keep that server hot, as in the index will
> stay in memory the more it's used. Put all of them behind a proxy like
> nginx so you can control which server is hot and fail down to the others
> when needed. SolrCloud is good in theory but won't be as fast or
> reliable. This is based on my own experience, and I'm sure people will
> say otherwise, but standalone Solr is super fast and solid with enough
> metal to back it up: enough memory and an SSD big enough to hold your
> index, and SolrCloud will never be able to beat it.
>
> Again, if I were to do it, using the old naming conventions:
>
> One Solr master indexer -> one live master/slave that replicates as
> needed -> your ten or so slaves (10 is not needed, stick to three and
> see), which replicate every 5 or so minutes.
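>
> In solrconfig.xml terms, the old-style replication setup looks roughly
> like this (URLs and core names are made up):
>
> <!-- on the master -->
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">commit</str>
>   </lst>
> </requestHandler>
>
> <!-- on each slave, polling every 5 minutes -->
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="slave">
>     <str name="masterUrl">http://indexer:8983/solr/core1/replication</str>
>     <str name="pollInterval">00:05:00</str>
>   </lst>
> </requestHandler>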
>
> User -> nginx proxy -> slaves in order, but only use one: no round robin,
> just one, and fail over to the next on failure.
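>
> Something like this in nginx gets that behavior, since a server marked
> backup only gets traffic when the ones before it are down (hostnames are
> made up):
>
> upstream solr_search {
>   server slave1:8983;
>   server slave2:8983 backup;
>   server slave3:8983 backup;
> }
> server {
>   listen 80;
>   location /solr/ {
>     proxy_pass http://solr_search;
>   }
> }
>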
> Reports go to a different slave, which is fine since it's the same data
> the users see but won't compete for resources.
>
> Optimize each of these three things to do what it's supposed to do. An
> indexing server is different from a search server in that way.
>
> Just my thoughts and experience with a few terabytes of index. Also be
> certain the machines have three times as much disk space available as a
> full index takes, and keep your heap below 32GB on everything (above
> that the JVM loses compressed object pointers). Servers, SSDs, and
> memory are cheap; master/slave replication is the most reliable.
>
> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <mar...@matosconsulting.com>
> > wrote:
> >
> > Hi all,
> >
> > I recently inherited a team/app that has been running on a single
> > instance of Solr for many years. An attempt was made to migrate to a
> > 10-node cluster configuration, and we immediately encountered issues
> > which appear to be related to data being read from nodes where
> > replication had not yet completed. The highlights:
> >
> >   - 10-node cluster with 5 instances per DC and a mix of NRT and TLOG
> >     replicas
> >   - Data is sourced from another system in large batches throughout
> >     the day (another system triggers our system on an ad hoc basis,
> >     which then refreshes data from the upstream system).
> >      - These updates take from minutes up to 2 hours
> >      - We have an autoCommit every 1 min and an autoSoftCommit every
> >        1 sec
> >   - We also have numerous background processes which kick off on a
> >     schedule (some every 15 mins, some hourly, some daily), execute
> >     queries, and perform a variety of actions based on the current
> >     state of the data
> >      - e.g. new records = send an email notifying users of some things
> >        they need to do
> >      - e.g. removed records = send an email notifying users of some
> >        updates
> >      - (Significantly more complex than this.)
> >      - Background jobs are NOT aware of whether or not a refresh
> >        (first bullet) is currently underway
> >   - Based on our investigation, we *think* our application is getting
> >     incomplete results when executing queries during / shortly after
> >     data refreshes, and making incorrect decisions (e.g. notifying
> >     users that some records were removed when they actually weren't,
> >     followed by a later notification that the records are back)
> >
> > Would appreciate any advice or things to consider based on the above.
> >
> > Thank you!
>


-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
