You are correct, Solr will distribute your request by default. Here's the code from SearchHandler (I assume you are asking about queries):
    rb.isDistrib = isDistrib(req);

    private boolean isDistrib(SolrQueryRequest req) {
      boolean isZkAware = req.getCoreContainer().isZooKeeperAware();
      boolean isDistrib = req.getParams().getBool(DISTRIB, isZkAware);
      if (!isDistrib) {
        // for back compat, a shards param with URLs like localhost:8983/solr
        // will mean that this search is distributed.
        final String shards = req.getParams().get(ShardParams.SHARDS);
        isDistrib = ((shards != null) && (shards.indexOf('/') > 0));
      }
      return isDistrib;
    }

Since you are running SolrCloud, isZooKeeperAware() will return true, and that "true" result becomes the default for the DISTRIB ("distrib") parameter. Generally one only ever explicitly sets this (to false) if for some reason one doesn't want to scatter-gather results across the shards in the cluster. That's usually only going to happen for internal requests or for troubleshooting, where you want to know exactly what a specific replica is responding with (or similar).

On Tue, Aug 5, 2025 at 2:53 PM Marcus R. Matos <mar...@matosconsulting.com> wrote:

> This has been very helpful and we are making a number of changes to help resolve the issues.
>
> Another question: our engineer has requested that we include distrib=true in our queries. My understanding is that this is implicit in our calls to a SolrCloud cluster. Any thoughts on how this works in practice?
>
> Thank you,
>
> --
> Marcus R. Matos
> Solution Architect || Software Development Leader || CISSP, CSSLP, ITIL v3
> tel: 469/892-8962 || e-mail: mar...@matosconsulting.com
>
> Sent via mobile; please excuse brevity and typos.
>
> On Mon, Aug 4, 2025, 4:02 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > One last note: combining NRT and TLOG might also be contributing to the variation in how fast data appears. Probably best to stick to the recommended combinations...
> >
> > https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#combining-replica-types-in-a-cluster
> >
> > On Mon, Aug 4, 2025 at 4:54 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> > > Actually it occurs to me (just after hitting send, of course) that using a field for that might still be problematic. I think it could still vary slightly, since the field value might not get created until the sub-request gets to the replica, and might be subject to local clock issues... probably safer to add a TimestampUpdateProcessorFactory
> > >
> > > https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
> > >
> > > so that it is handled on the first receiving node instead.
> > >
> > > On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > >> The likely cause of the issue is that replicas are not guaranteed to finish commits simultaneously. Solr is eventually consistent
> > >> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
> > >> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A] where B is ahead of Replica A due to differing commit completion times. That final request to A (which still hasn't committed) will make it look like a document disappeared.
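> > >>
> > >> If you want to confirm that this is what's happening, one rough way (a SolrJ sketch; the hosts, core names, and doc id below are made up, so substitute your own replica cores) is to ask each replica core for the same document directly, with distrib=false so the request doesn't fan out:
> > >>
> > >>     import java.util.List;
> > >>     import org.apache.solr.client.solrj.SolrQuery;
> > >>     import org.apache.solr.client.solrj.impl.Http2SolrClient;
> > >>
> > >>     public class ReplicaCompare {
> > >>       public static void main(String[] args) throws Exception {
> > >>         // Hypothetical core URLs -- one per replica of the same shard.
> > >>         List<String> coreUrls = List.of(
> > >>             "http://solr1:8983/solr/mycoll_shard1_replica_n1",
> > >>             "http://solr2:8983/solr/mycoll_shard1_replica_n2");
> > >>         SolrQuery q = new SolrQuery("id:SOME_DOC_ID");
> > >>         q.set("distrib", "false"); // answer from this core only, no scatter-gather
> > >>         for (String url : coreUrls) {
> > >>           try (Http2SolrClient core = new Http2SolrClient.Builder(url).build()) {
> > >>             System.out.println(url + " -> numFound=" + core.query(q).getResults().getNumFound());
> > >>           }
> > >>         }
> > >>       }
> > >>     }
> > >>
> > >> If the counts disagree briefly right after a commit and then converge, you are seeing eventual consistency rather than data loss.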
> > >>
> > >> One thing you can try is to ensure records have an indexedDate field identifying exactly when they were indexed, and then filter requests to only look at index dates more than a commit interval in the past.
> > >>
> > >>     <field name="indexedDate" type="pdate" indexed="true" stored="true" default="NOW" /> <!-- Don't ever send a value for this field; let Solr fill it in. -->
> > >>
> > >> If you add that field and filter on
> > >>
> > >>     indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)
> > >>
> > >> that will catch previously indexed data (the negated clause) and all data older than 2 minutes (the range). That set of data should have attained consistency unless your system is struggling under load and a replica simply can't keep up (in which case you are about to have bigger problems).
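> > >>
> > >> In the background jobs this could look something like the following (a minimal SolrJ sketch; the method, collection name, and 2-minute window are placeholders to adapt to your own setup and commit intervals):
> > >>
> > >>     import org.apache.solr.client.solrj.SolrClient;
> > >>     import org.apache.solr.client.solrj.SolrQuery;
> > >>     import org.apache.solr.client.solrj.response.QueryResponse;
> > >>
> > >>     public class SettledQueries {
> > >>       /** Wraps a background-job query so it only sees data old enough to have settled. */
> > >>       static QueryResponse settledQuery(SolrClient client, String collection, String mainQuery)
> > >>           throws Exception {
> > >>         SolrQuery q = new SolrQuery(mainQuery);
> > >>         // Only docs indexed more than 2 minutes ago, plus older docs with no indexedDate at all.
> > >>         q.addFilterQuery("indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)");
> > >>         return client.query(collection, q);
> > >>       }
> > >>     }
> > >>
> > >> Putting the clause in fq keeps it out of scoring; if these jobs run often you may want to round the date math (e.g. NOW/MINUTE-2MINUTES) so the filter can actually be reused from the filter cache.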
> > >>
> > >> As a side note, your commit intervals are very aggressive, but I'm guessing that's an attempt to get around the problems you are seeing? The filter of course has to change if you relax your commit intervals substantially.
> > >>
> > >> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:
> > >>
> > >>> Here is what I would do; take it with a grain of salt, but it works solidly.
> > >>>
> > >>> Have a single master Solr node that takes all the data as the indexer, and have replicas (they used to be called "slaves", but it's not PC any more to call them that). Serve your users and your reports from these. This will keep that one server hot, in the sense that the index will be in memory the more it's used. Put all of them behind a proxy like nginx so you can control which server is hot and fail down to the others when needed. SolrCloud is good in theory but won't be as fast or reliable. This is based on my own experience and I'm sure people will say otherwise, but standalone Solr is super fast and solid with enough metal to back it up: enough memory and an SSD to hold your index, and SolrCloud will never be able to beat it.
> > >>>
> > >>> Again, if I were to do it, using the old naming conventions:
> > >>>
> > >>> One Solr master indexer -> one live Solr master/slave that replicates as needed -> your ten or so slaves (10 is not needed; stick to three and see), replicating every 5 or so minutes.
> > >>>
> > >>> User -> nginx proxy -> slaves in order, but only use one: no round robin, just one, and fail over to the next on failure. Reports go to a different slave, but that's OK since it's the same data the users see, and it won't compete for resources.
> > >>>
> > >>> Optimize each of these three things to do what they are supposed to do. An indexing server is different from a search server in that way.
> > >>>
> > >>> Just my thoughts and experience with a few terabytes of an index. Also be certain the machines have three times as much space ready as a full index, and keep your heap below 32 GB on everything. Servers, SSDs, and memory are cheap, and master/slave replication is the most reliable.
> > >>>
> > >>> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <mar...@matosconsulting.com> wrote:
> > >>> >
> > >>> > Hi all,
> > >>> >
> > >>> > I recently inherited a team/app that has been running on a single instance of Solr for many years. An attempt was made to migrate to a 10 node cluster configuration and we immediately encountered some issues which appear to be related to the fact that data is being read from nodes where data replication had not yet completed. The highlights:
> > >>> >
> > >>> > - 10 node cluster with 5 instances per DC, with a mix of NRT and TLOG
> > >>> > - Data is sourced from another system in large batches throughout the day (another system triggers our system on an ad hoc basis, which then refreshes data from the upstream system).
> > >>> >   - These updates take from minutes up to 2 hours
> > >>> >   - We have an autoCommit every 1 min and an autoSoftCommit every 1 sec
> > >>> > - We also have numerous background processes which kick off on a schedule (some every 15 mins, some hourly, some daily) which execute queries and perform a variety of actions based on the current state of the data
> > >>> >   - e.g. New records = send an email notifying users of some things they need to do
> > >>> >   - e.g. Removed records = send an email notifying users of some updates
> > >>> >   - (Significantly more complex than this.)
> > >>> > - Background jobs are NOT aware of whether or not a refresh (first bullet) is currently underway
> > >>> > - Based on our investigation, we *think* our application is getting incomplete results when executing queries during / shortly after data refreshes, and making incorrect decisions (e.g. notifying users that some records were removed when they actually weren't, followed by a future notification that the records are back)
> > >>> >
> > >>> > Would appreciate any advice or things to consider based on the above.
> > >>> >
> > >>> > Thank you!

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)