Actually it occurs to me (just after hitting send, of course) that using a field default for this might still be problematic: the value could still vary slightly, since I believe it isn't created until the sub-request reaches each replica, and so it's subject to local clock skew. It's probably safer to add a TimestampUpdateProcessorFactory
<https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html>
so that the timestamp is set once on the first receiving node instead.
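As a sketch, that processor would be wired into solrconfig.xml roughly like this (the chain name is arbitrary, and the field name `indexedDate` is assumed to match your schema). Because the timestamp processor is listed before DistributedUpdateProcessorFactory, it runs once on the node that first receives the update, so every replica indexes the same value:

```xml
<!-- Sketch: set indexedDate once, before the update is distributed to replicas. -->
<updateRequestProcessorChain name="add-timestamp" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">indexedDate</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With this in place you'd also drop the `default="NOW"` from the field definition, since the processor now supplies the value.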
On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:

> The likely cause of the issue is that replicas are not guaranteed to
> finish commits simultaneously. Solr is eventually consistent
> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A],
> where B is ahead of Replica A due to differing commit completion times.
> That final request to A (which still hasn't committed) will make it look
> like a document disappeared.
>
> One thing you can try is to ensure records have an indexedDate field
> identifying exactly when they were indexed, and then filter requests to
> only look at index dates more than a commit interval in the past:
>
> <field name="indexedDate" type="pdate" indexed="true" stored="true"
> default="NOW" /> <!-- Don't ever send a value for this field; let Solr
> fill it in. -->
>
> If you add that field and filter on
>
> indexedDate:[* TO NOW-2MINUTES] OR -indexedDate:*
>
> that will catch previously indexed data (the negated term) and all data
> older than 2 min (the range). That set of data should have attained
> consistency, unless your system is struggling under load and a replica
> simply can't keep up (in which case you are about to have bigger
> problems).
>
> As a side note, your commit intervals are very aggressive, but I'm
> guessing that's an attempt to get around the problems you are seeing?
> The filter of course has to change if you relax your commit intervals
> substantially.
>
> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:
>
>> Here is what I would do; take it with a grain of salt, but it works
>> solid.
>>
>> Have a single master Solr node that takes all the data as the indexer,
>> and have the "replicas" (they used to be called "slaves," but it's not
>> PC any more to call them that) serve the users; your reports should use
>> one of these too.
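A minimal Python sketch of building that "settled data only" filter, assuming the field is named indexedDate and the window is two minutes (both taken from the example above; the helper name is hypothetical). Docs indexed before the field existed carry no timestamp at all, so they're included via a negated clause wrapped in `(*:* -field:*)`, the usual Solr idiom for a pure-negative subquery:

```python
def settled_data_filter(field="indexedDate", window="2MINUTES"):
    """Return a Solr fq clause matching only data old enough to have
    reached consistency across replicas."""
    older_than_window = f"{field}:[* TO NOW-{window}]"
    missing_field = f"(*:* -{field}:*)"  # legacy docs with no timestamp
    return f"{older_than_window} OR {missing_field}"

# Attach it to every background-job query as an fq parameter, e.g.:
# params = {"q": "status:new", "fq": settled_data_filter()}
```

The window should stay comfortably larger than the hard-commit interval; widen it if you relax the commit settings.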
>> This will keep that one server hot, as in the index will be in memory
>> the more it's used. Put all of them behind a proxy like nginx so you
>> can control which server is hot and fail down to the others when
>> needed. SolrCloud is good in theory but won't be as fast or reliable.
>> This is based on my own experience, and I'm sure people will say
>> otherwise, but standalone Solr is super fast and solid with enough
>> metal to back it up. With enough memory and an SSD hard drive to hold
>> your index, SolrCloud will never be able to beat it.
>>
>> Again, if I were to do it, using the old naming conventions:
>>
>> One Solr master indexer -> one Solr live master/slave that replicates
>> as needed -> your ten or so slaves (10 is not needed; stick to three
>> and see), replicating every 5 or so minutes.
>>
>> User -> nginx proxy -> slaves in order, but only use one: no round
>> robin, just one, and fail over to the next on failure.
>> Reports go to a different slave, but that's OK since it's the same
>> data as what the users see, and it won't compete for resources.
>>
>> Optimize each of these three things to do what they are supposed to.
>> An indexing server is different from a search server in that way.
>>
>> Just my thoughts and experience with a few terabytes of an index. Also
>> be certain the machines have three times as much space ready as a full
>> index, and keep your heap below 32 GB on everything. Servers, SSDs,
>> and memory are cheap; master/slave replication is the most reliable.
>>
>> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <
>> mar...@matosconsulting.com> wrote:
>> >
>> > Hi all,
>> >
>> > I recently inherited a team/app that has been running on a single
>> > instance of SOLR for many years. An attempt was made to migrate to a
>> > 10 node cluster configuration, and we immediately encountered some
>> > issues which appear to be related to the fact that data is being
>> > read from nodes where data replication had not yet completed.
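The "one hot server, fail down to the others" proxy described above could be sketched in nginx like this (hostnames and ports are placeholders, not from the thread). The `backup` parameter means only the first server receives traffic; the others are used solely when it is down, which keeps a single replica's index cached in memory:

```nginx
# Sketch: one hot replica, failover-only backups, no round robin.
upstream solr_replicas {
    server replica1.example.com:8983;         # the hot server
    server replica2.example.com:8983 backup;  # used only on failure
    server replica3.example.com:8983 backup;
}

server {
    listen 80;
    location /solr/ {
        proxy_pass http://solr_replicas;
    }
}
```

Reports would go through a second, similar upstream pointed at a different replica so they don't compete with user traffic for cache.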
>> > The highlights:
>> >
>> >    - 10 node cluster with 5 instances per DC with a mix of NRT and
>> >    TLOG
>> >    - Data is sourced from another system in large batches throughout
>> >    the day (another system triggers our system on an ad hoc basis,
>> >    which then refreshes data from the upstream system)
>> >    - These updates take from minutes up to 2 hours
>> >    - We have an autoCommit every 1 min and an autoSoftCommit every
>> >    1 sec
>> >    - We also have numerous background processes which kick off on a
>> >    schedule (some every 15 mins, some hourly, some daily) which
>> >    execute queries and perform a variety of actions based on the
>> >    current state of the data
>> >       - e.g. new records = send an email notifying users of some
>> >       things they need to do
>> >       - e.g. removed records = send an email notifying users of some
>> >       updates
>> >       - (significantly more complex than this)
>> >    - Background jobs are NOT aware of whether or not a refresh
>> >    (first bullet) is currently underway
>> >    - Based on our investigation, we *think* our application is
>> >    getting incomplete results when executing queries during /
>> >    shortly after data refreshes, and making incorrect decisions
>> >    (e.g. notifying users that some records were removed when they
>> >    actually weren't, followed by a future notification that the
>> >    records are back)
>> >
>> > Would appreciate any advice or things to consider based on the
>> > above.
>> >
>> > Thank you!
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
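For reference, the commit cadence described in the original message (hard commit every minute, soft commit every second, the "very aggressive" intervals Gus flags) corresponds to a solrconfig.xml fragment roughly like this sketch; values are in milliseconds:

```xml
<!-- Sketch of the commit cadence described in the thread. -->
<autoCommit>
  <maxTime>60000</maxTime>          <!-- hard commit every 1 min -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>           <!-- soft commit every 1 sec -->
</autoSoftCommit>
```

If these intervals are relaxed, the 2-minute window in the filter query above needs to grow with them.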