One last note: combining NRT and TLOG replicas might also be contributing
to the variation in how fast data appears. It's probably best to stick to
the recommended combinations:
https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#combining-replica-types-in-a-cluster
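
If a rebuild is on the table, the simplest consistent choice is a single
replica type across the collection, e.g. all TLOG. A Collections API sketch
(the collection name, shard count, and replica count are placeholders):

  /admin/collections?action=CREATE&name=mycollection&numShards=2&tlogReplicas=3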


On Mon, Aug 4, 2025 at 4:54 PM Gus Heck <gus.h...@gmail.com> wrote:

> Actually, it occurs to me (just after hitting send, of course) that using
> a field default for this might still be problematic. I think the value
> could still vary slightly, since the default might not be applied until
> the sub-request reaches each replica, where it would be subject to local
> clock issues... It's probably safer to add a
> https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
> so that the timestamp is set once, on the first node that receives the
> update.
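>
> For that to work, the timestamp processor has to run before the
> distributed update processor, so it executes only on the receiving node.
> A minimal solrconfig.xml sketch (the chain name is made up):
>
>   <updateRequestProcessorChain name="add-indexed-date" default="true">
>     <processor class="solr.TimestampUpdateProcessorFactory">
>       <str name="fieldName">indexedDate</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory"/>
>     <processor class="solr.DistributedUpdateProcessorFactory"/>
>     <processor class="solr.RunUpdateProcessorFactory"/>
>   </updateRequestProcessorChain>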
>
> On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> The likely cause of the issue is that replicas are not guaranteed to
>> finish commits simultaneously. Solr is eventually consistent
>> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
>> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A]
>> where B is ahead of replica A due to differing commit completion times.
>> That final request to A (which still hasn't committed) will make it look
>> like a document disappeared.
>>
>> One thing you can try is to ensure records have an indexedDate field
>> identifying exactly when they were indexed, and then filter requests to
>> only look at index dates more than a commit interval in the past.
>>
>> <field name="indexedDate" type="pdate" indexed="true" stored="true"
>>        default="NOW" /> <!-- Never send a value for this field; let Solr
>>        fill it in. -->
>>
>> If you add that field, then filtering on
>>
>> indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)
>>
>> will catch previously indexed data that predates the field (the negated
>> clause) and all data indexed more than 2 minutes ago (the range). That
>> set of data should have attained consistency unless your system is
>> struggling under load and a replica simply can't keep up (in which case
>> you are about to have bigger problems).
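>>
>> One caveat: a filter containing bare NOW is different on every request
>> and so never re-uses the filter cache. Rounding makes the filter
>> identical for a minute at a time (a sketch, same placeholders as above):
>>
>> fq=indexedDate:[* TO NOW/MINUTE-2MINUTES] OR (*:* -indexedDate:*)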
>>
>> As a side note, your commit intervals are very aggressive, but I'm
>> guessing that's an attempt to get around the problems you are seeing? The
>> filter of course has to change if you relax your commit intervals
>> substantially.
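>>
>> For reference, a more relaxed setup might look like this in
>> solrconfig.xml (the numbers are illustrative, not a recommendation for
>> your workload):
>>
>>   <autoCommit>
>>     <maxTime>60000</maxTime>        <!-- hard commit every 60s -->
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>
>>   <autoSoftCommit>
>>     <maxTime>30000</maxTime>        <!-- visibility every 30s -->
>>   </autoSoftCommit>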
>>
>> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com>
>> wrote:
>>
>>> Here is what I would do; take it with a grain of salt, but it works
>>> solidly.
>>>
>>> Have a single master Solr node that takes all the data as the indexer,
>>> plus a set of replicas (these used to be called "slaves," though that
>>> term has fallen out of favor). Serve your users and your reports from
>>> the replicas. This keeps each serving server hot, in the sense that the
>>> more the index is used, the more of it stays in memory. Put all of them
>>> behind a proxy like nginx so you can control which server is hot and
>>> fail down to the others when needed. SolrCloud is good in theory but, in
>>> my experience, won't be as fast or as reliable; I'm sure people will say
>>> otherwise, but standalone Solr is super fast and solid with enough metal
>>> to back it up. Give it enough memory and an SSD big enough to hold your
>>> index, and SolrCloud will never beat it.
>>>
>>> Again, if I were to do it, using the old naming conventions:
>>>
>>> One Solr master indexer -> one live master/slave that replicates as
>>> needed -> your ten or so slaves (ten is not needed; start with three and
>>> see), each replicating every 5 minutes or so.
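>>>
>>> On the pulling side, that polling setup looks roughly like this in each
>>> slave's solrconfig.xml (host, core, and interval are placeholders; newer
>>> Solr versions spell these leader/follower instead of master/slave):
>>>
>>>   <requestHandler name="/replication" class="solr.ReplicationHandler">
>>>     <lst name="slave">
>>>       <str name="masterUrl">http://indexer-host:8983/solr/mycore/replication</str>
>>>       <str name="pollInterval">00:05:00</str>
>>>     </lst>
>>>   </requestHandler>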
>>>
>>> User -> nginx proxy -> slaves in order, but only use one at a time: no
>>> round robin, just a single server, failing over to the next on failure
>>> (nginx sketch below).
>>> Reports go to a different slave, which is fine since it is the same data
>>> the users see, but it won't compete for resources.
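>>>
>>> Something like this on the nginx side (host names are placeholders; the
>>> backup flag keeps a server out of rotation until the primary fails):
>>>
>>>   upstream solr_search {
>>>       server solr-slave1:8983;          # the one hot server
>>>       server solr-slave2:8983 backup;   # only used if slave1 is down
>>>       server solr-slave3:8983 backup;
>>>   }
>>>   server {
>>>       listen 80;
>>>       location /solr/ {
>>>           proxy_pass http://solr_search;
>>>       }
>>>   }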
>>>
>>> Optimize each of these three roles to do what it is supposed to do; an
>>> indexing server has different needs than a search server.
>>>
>>> Just my thoughts and experience with a few terabytes of index. Also be
>>> certain the machines have three times as much disk space available as a
>>> full index, and keep your heap below 32 GB on everything. Servers, SSDs,
>>> and memory are cheap, and master/slave replication is the most reliable.
>>>
>>>
>>>
>>>
>>> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <
>>> mar...@matosconsulting.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I recently inherited a team/app that has been running on a single
>>> > instance of Solr for many years. An attempt was made to migrate to a
>>> > 10-node cluster configuration, and we immediately ran into issues that
>>> > appear to be caused by data being read from nodes where replication had
>>> > not yet completed. The highlights:
>>> >
>>> >
>>> >   - 10-node cluster with 5 instances per DC and a mix of NRT and TLOG
>>> >   replicas
>>> >   - Data is sourced from another system in large batches throughout the
>>> >   day (another system triggers our system on an ad hoc basis, and we
>>> >   then refresh data from the upstream system).
>>> >      - These updates take anywhere from a few minutes up to 2 hours
>>> >      - We have an autoCommit every 1 min and an autoSoftCommit every
>>> >      1 sec
>>> >   - We also have numerous background processes which kick off on a
>>> >   schedule (some every 15 mins, some hourly, some daily) and which
>>> >   execute queries and perform a variety of actions based on the current
>>> >   state of the data
>>> >      - e.g. new records = send an email notifying users of some things
>>> >      they need to do
>>> >      - e.g. removed records = send an email notifying users of some
>>> >      updates
>>> >      - (Significantly more complex than this.)
>>> >      - Background jobs are NOT aware of whether or not a refresh (first
>>> >      bullet) is currently underway
>>> >   - Based on our investigation, we *think* our application is getting
>>> >   incomplete results when executing queries during or shortly after
>>> >   data refreshes, and making incorrect decisions (e.g. notifying users
>>> >   that some records were removed when they actually weren't, followed
>>> >   by a later notification that the records are back)
>>> >
>>> >
>>> > Would appreciate any advice or things to consider based on the above.
>>> >
>>> > Thank you!
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>>
>
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>


-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
