This has been very helpful and we are making a number of changes to help resolve the issues.
Another question: our engineer has requested that we include distrib=true in our queries. My understanding is that this is implicit in our calls to a SolrCloud cluster. Any thoughts on how this works in practice?

Thank you,

--
Marcus R. Matos
Solution Architect || Software Development Leader || CISSP, CSSLP, ITIL v3
tel: 469/892-8962 || e-mail: mar...@matosconsulting.com
Sent via mobile; please excuse brevity and typos.
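For context on the question above: when a query is addressed to a collection in SolrCloud, distributed execution (distrib=true) is the default, so setting it explicitly should be a no-op; distrib=false is mainly useful for inspecting what a single core holds. A minimal sketch, assuming a hypothetical host, collection name, and replica core name:

  # Addressed to the collection: fanned out across all shards (distrib=true is implicit)
  http://localhost:8983/solr/mycollection/select?q=*:*

  # Addressed to one specific core with distrib=false: returns only that replica's local view,
  # which is handy for checking whether individual replicas have caught up after a commit
  http://localhost:8983/solr/mycollection_shard1_replica_n1/select?q=*:*&distrib=false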
On Mon, Aug 4, 2025, 4:02 PM Gus Heck <gus.h...@gmail.com> wrote:

> One last note: combining NRT and TLOG might also be contributing to the variation in how fast data appears. Probably best to stick to the recommended combinations...
>
> https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#combining-replica-types-in-a-cluster
>
> On Mon, Aug 4, 2025 at 4:54 PM Gus Heck <gus.h...@gmail.com> wrote:
>
> > Actually it occurs to me (just after hitting send, of course) that using a field for that might still be problematic. I think it could still vary slightly, since the field value might not get created until the sub-request gets to the replica, and might be subject to local clock issues... probably safer to add a
> > https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
> > such that it is handled on the first receiving node instead.
> >
> > On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> >> The likely cause of the issue is that replicas are not guaranteed to finish commits simultaneously. Solr is eventually consistent
> >> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
> >> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A] where B is ahead of Replica A due to differing commit completion times. That final request to A (which still hasn't committed) will make it look like a document disappeared.
> >>
> >> One thing you can try is to ensure records have an indexedDate field identifying exactly when they were indexed, and then filter requests to only look at index dates more than a commit interval in the past.
> >>
> >> <field name="indexedDate" type="pdate" indexed="true" stored="true" default="NOW" /> <!-- Don't ever send a value for this field; let Solr fill it in. -->
> >>
> >> If you add that field and filter on
> >>
> >> indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)
> >>
> >> that will catch previously indexed data (the negated term) and all data older than 2 minutes (the range). That set of data should have attained consistency unless your system is struggling under load and a replica simply can't keep up (in which case you are about to have bigger problems).
> >>
> >> As a side note, your commit intervals are very aggressive, but I'm guessing that's an attempt to get around the problems you are seeing? The filter of course has to change if you relax your commit intervals substantially.
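For illustration, a minimal solrconfig.xml sketch of the TimestampUpdateProcessorFactory approach Gus links above (the chain name is made up, and fieldName is assumed to match the indexedDate field from the schema snippet; the key point is that the processor sits before DistributedUpdateProcessorFactory, so the timestamp is assigned once on the node that first receives the document rather than independently on each replica):

  <updateRequestProcessorChain name="stamp-indexed-date" default="true">
    <!-- Set indexedDate (if not already present) on the first receiving node -->
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">indexedDate</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>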
> >> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:
> >>
> >>> Here is what I would do; take it with a grain of salt, but it works solidly.
> >>>
> >>> Have a single master Solr node that takes all the data as the indexer, and have the "replicas" (they used to be called "slaves", but it's not PC any more to call them that) serve everything else. Your users and your reports each use one of these. This will keep that one server hot, as in the index will be in memory the more it's used. Put all of them behind a proxy like nginx so you can control which server is hot and fail down to the others when needed. SolrCloud is good in theory but won't be as fast or reliable. This is based on my own experience, and I'm sure people will say otherwise, but standalone Solr is super fast and solid with enough metal to back it up. With enough memory and an SSD hard drive to hold your index, SolrCloud will never be able to beat it.
> >>>
> >>> Again, if I were to do it, using the old naming conventions:
> >>>
> >>> One Solr master indexer -> one Solr live master/slave that replicates as needed -> your ten or so slaves (10 is not needed, stick to three and see), replicating every 5 or so minutes.
> >>>
> >>> Users -> nginx proxy -> slaves in order, but only use one; no round robin, just one, and fail over to the next on failure. Reports go to a different slave, but that's ok since it's the same data as what the users see, and it won't compete for resources.
> >>>
> >>> Optimize each of these three things to do what they are supposed to. An indexing server is different from a search server in that way.
> >>>
> >>> Just my thoughts and experience with a few terabytes of an index. Also be certain the machines have three times as much space ready as a full index, and keep your heap below 32GB on everything. Servers, SSDs and memory are cheap; master/slave replication is the most reliable.
> >>>
> >>> On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <mar...@matosconsulting.com> wrote:
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I recently inherited a team/app that has been running on a single instance of Solr for many years. An attempt was made to migrate to a 10-node cluster configuration, and we immediately encountered some issues which appear to be related to the fact that data is being read from nodes where data replication had not yet completed. The highlights:
> >>> >
> >>> > - 10-node cluster with 5 instances per DC, with a mix of NRT and TLOG
> >>> > - Data is sourced from another system in large batches throughout the day (another system triggers our system on an ad hoc basis, which then refreshes data from the upstream system).
> >>> > - These updates take from minutes up to 2 hours
> >>> > - We have an autoCommit of every 1 min and autoSoftCommit every 1 sec
> >>> > - We also have numerous background processes which kick off on a schedule (some every 15 mins, some hourly, some daily) which execute queries and perform a variety of actions based on the current state of the data
> >>> >   - e.g. New records = send an email notifying users of some things they need to do
> >>> >   - e.g. Removed records = send an email notifying users of some updates
> >>> >   - (Significantly more complex than this.)
> >>> > - Background jobs are NOT aware of whether or not a refresh (first bullet) is currently underway
> >>> > - Based on our investigation, we *think* our application is getting incomplete results when executing queries during / shortly after data refreshes, and making incorrect decisions (e.g. notifying users that some records were removed when they actually weren't, followed by a future notification that the records are back)
> >>> >
> >>> > Would appreciate any advice or things to consider based on the above.
> >>> >
> >>> > Thank you!
> >>
> >> --
> >> http://www.needhamsoftware.com (work)
> >> https://a.co/d/b2sZLD9 (my fantasy fiction book)
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
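Tying Gus's side note back to the setup described in the original message: the commit intervals mentioned above would look roughly like this in solrconfig.xml. This is only a sketch using the values stated in the thread; openSearcher=false is the commonly recommended setting for hard commits and is an assumption here, not something stated in the messages.

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 1 min, as described above -->
    <openSearcher>false</openSearcher>  <!-- flush to disk without opening a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- soft commit every 1 sec; the interval Gus calls aggressive -->
  </autoSoftCommit>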