This has been very helpful and we are making a number of changes to help resolve the issues.
Another question: our engineer has requested that we include distrib=true in our queries. My understanding is that this is implicit in our calls to a SolrCloud cluster. Any thoughts on how this works in practice?

Thank you,

--
Marcus R. Matos
Solution Architect || Software Development Leader || CISSP, CSSLP, ITIL v3
tel: 469/892-8962 || e-mail: mar...@matosconsulting.com
Sent via mobile; please excuse brevity and typos.
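For context on the question above: when a query is addressed to a collection in SolrCloud, distributed execution (distrib=true) is the default, so setting it explicitly should be a no-op; distrib=false is mainly useful for inspecting what a single core holds. A minimal sketch, assuming a hypothetical host, collection name, and replica core name:

  # Addressed to the collection: fanned out across all shards (distrib=true is implicit)
  http://localhost:8983/solr/mycollection/select?q=*:*

  # Addressed to one specific core with distrib=false: returns only that replica's local view,
  # which is handy for checking whether individual replicas have caught up after a commit
  http://localhost:8983/solr/mycollection_shard1_replica_n1/select?q=*:*&distrib=false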
On Mon, Aug 4, 2025, 4:02 PM Gus Heck <gus.h...@gmail.com> wrote:

> One last note: combining NRT and TLOG might also be contributing to the variation in how fast data appears. Probably best to stick to the recommended combinations...
>
> https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#combining-replica-types-in-a-cluster
>
> On Mon, Aug 4, 2025 at 4:54 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Actually it occurs to me (just after hitting send, of course) that using a field for that might still be problematic. I think it could still vary slightly, since the field value might not get created until the sub-request gets to the replica, and might be subject to local clock issues... probably safer to add a
> > https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
> > such that it is handled on the first receiving node instead.
> >
> > On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> >> The likely cause of the issue is that replicas are not guaranteed to finish commits simultaneously. Solr is eventually consistent
> >> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
> >> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A] where B is ahead of Replica A due to differing commit completion times. That final request to A (which still hasn't committed) will make it look like a document disappeared.
> >>
> >> One thing you can try is to ensure records have an indexedDate field identifying exactly when they were indexed, and then filter requests to only look at index dates more than a commit interval in the past.
> >>
> >> <field name="indexedDate" type="pdate" indexed="true" stored="true" default="NOW" /> <!-- Don't ever send a value for this field; let Solr fill it in. -->
> >>
> >> If you add that field and filter on
> >>
> >> indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)
> >>
> >> that will catch previously indexed data (the negated term) and all data older than 2 minutes (the range). That set of data should have attained consistency unless your system is struggling under load and a replica simply can't keep up (in which case you are about to have bigger problems).
> >>
> >> As a side note, your commit intervals are very aggressive, but I'm guessing that's an attempt to get around the problems you are seeing? The filter of course has to change if you relax your commit intervals substantially.
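For illustration, a minimal solrconfig.xml sketch of the TimestampUpdateProcessorFactory approach Gus links above (the chain name is made up, and fieldName is assumed to match the indexedDate field from the schema snippet; the key point is that the processor sits before DistributedUpdateProcessorFactory, so the timestamp is assigned once on the node that first receives the document rather than independently on each replica):

  <updateRequestProcessorChain name="stamp-indexed-date" default="true">
    <!-- Set indexedDate (if not already present) on the first receiving node -->
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">indexedDate</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>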
> >> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:
> >>
> >>> Here is what I would do; take it with a grain of salt, but it works solidly.
> >>>
> >>> Have a single master Solr node that takes all the data as the indexer, and have the "replicas" (they used to be called "slaves", but it's not PC any more to call them that) serve everything else. Your users and your reports each use one of these. This will keep that one server hot, as in the index will be in memory the more it's used. Put all of them behind a proxy like nginx so you can control which server is hot and fail down to the others when needed. SolrCloud is good in theory but won't be as fast or reliable. This is based on my own experience, and I'm sure people will say otherwise, but standalone Solr is super fast and solid with enough metal to back it up. With enough memory and an SSD hard drive to hold your index, SolrCloud will never be able to beat it.
> >>>
> >>> Again, if I were to do it, using the old naming conventions:
> >>>
> >>> One Solr master indexer -> one Solr live master/slave that replicates as needed -> your ten or so slaves (10 is not needed, stick to three and see), replicating every 5 or so minutes.
> >>>
> >>> Users -> nginx proxy -> slaves in order, but only use one; no round robin, just one, and fail over to the next on failure. Reports go to a different slave, but that's ok since it's the same data as what the users see, and it won't compete for resources.
> >>>
> >>> Optimize each of these three things to do what they are supposed to. An indexing server is different from a search server in that way.
> >>>
> >>> Just my thoughts and experience with a few terabytes of an index. Also be certain the machines have three times as much space ready as a full index, and keep your heap below 32GB on everything. Servers, SSDs and memory are cheap; master/slave replication is the most reliable.
> >>>
> >>> On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <mar...@matosconsulting.com> wrote:
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I recently inherited a team/app that has been running on a single instance of Solr for many years. An attempt was made to migrate to a 10-node cluster configuration, and we immediately encountered some issues which appear to be related to the fact that data is being read from nodes where data replication had not yet completed. The highlights:
> >>> >
> >>> > - 10-node cluster with 5 instances per DC, with a mix of NRT and TLOG
> >>> > - Data is sourced from another system in large batches throughout the day (another system triggers our system on an ad hoc basis, which then refreshes data from the upstream system).
> >>> > - These updates take from minutes up to 2 hours
> >>> > - We have an autoCommit of every 1 min and autoSoftCommit every 1 sec
> >>> > - We also have numerous background processes which kick off on a schedule (some every 15 mins, some hourly, some daily) which execute queries and perform a variety of actions based on the current state of the data
> >>> >   - e.g. New records = send an email notifying users of some things they need to do
> >>> >   - e.g. Removed records = send an email notifying users of some updates
> >>> >   - (Significantly more complex than this.)
> >>> > - Background jobs are NOT aware of whether or not a refresh (first bullet) is currently underway
> >>> > - Based on our investigation, we *think* our application is getting incomplete results when executing queries during / shortly after data refreshes, and making incorrect decisions (e.g. notifying users that some records were removed when they actually weren't, followed by a future notification that the records are back)
> >>> >
> >>> > Would appreciate any advice or things to consider based on the above.
> >>> >
> >>> > Thank you!
> >>
> >> --
> >> http://www.needhamsoftware.com (work)
> >> https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
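Tying Gus's side note back to the setup described in the original message: the commit intervals mentioned above would look roughly like this in solrconfig.xml. This is only a sketch using the values stated in the thread; openSearcher=false is the commonly recommended setting for hard commits and is an assumption here, not something stated in the messages.

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 1 min, as described above -->
    <openSearcher>false</openSearcher>  <!-- flush to disk without opening a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- soft commit every 1 sec; the interval Gus calls aggressive -->
  </autoSoftCommit>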