One last note: combining NRT and TLOG might also be contributing to the variation in how fast data appears. It's probably best to stick to the recommended combinations: https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#combining-replica-types-in-a-cluster
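For example (collection name and counts here are made up; adjust to your 10-node / 2-DC layout), a TLOG+PULL collection with no NRT replicas, which is one of the recommended combinations, can be created via the Collections API with something like:

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&nrtReplicas=0&tlogReplicas=2&pullReplicas=2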
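Also, for the TimestampUpdateProcessorFactory idea quoted below, a minimal sketch of the solrconfig.xml chain (untested; the indexedDate field name is carried over from the earlier message). The important part is that the timestamp processor sits before DistributedUpdateProcessorFactory, so the value is assigned once on the first node that receives the update rather than on each replica:

  <updateRequestProcessorChain name="add-timestamp" default="true">
    <!-- Runs pre-distribution: sets indexedDate once, on the receiving node -->
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">indexedDate</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

If you go that route you can drop the default="NOW" from the schema field, since the processor fills in the field whenever it is absent. The background jobs would then apply the filter from the message below as an fq, e.g. fq=indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*) (URL-encoded in practice).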
On Mon, Aug 4, 2025 at 4:54 PM Gus Heck <gus.h...@gmail.com> wrote:

> Actually it occurs to me (just after hitting send, of course) that using a
> field for that might still be problematic. I think it could still vary
> slightly, since the field value might not get created until the
> sub-request reaches the replica, and it might be subject to local clock
> issues... probably safer to add a
> https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html
> so that it is handled on the first receiving node instead.
>
> On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> The likely cause of the issue is that replicas are not guaranteed to
>> finish commits simultaneously. Solr is eventually consistent
>> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
>> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica
>> A], where B is ahead of A due to differing commit completion times. That
>> final request to A (which still hasn't committed) will make it look like
>> a document disappeared.
>>
>> One thing you can try is to ensure records have an indexedDate field
>> identifying exactly when they were indexed, and then filter requests to
>> only look at index dates more than a commit interval in the past:
>>
>> <field name="indexedDate" type="pdate" indexed="true" stored="true"
>> default="NOW" /> <!-- Don't ever send a value for this field; let Solr
>> fill it in. -->
>>
>> If you add that field and filter on
>>
>> indexedDate:[* TO NOW-2MINUTES] OR (*:* -indexedDate:*)
>>
>> that will catch previously indexed data (the negated term, for documents
>> indexed before the field existed) and all data older than 2 minutes (the
>> range). That set of data should have attained consistency unless your
>> system is struggling under load and a replica simply can't keep up (in
>> which case you are about to have bigger problems).
>>
>> As a side note, your commit intervals are very aggressive, but I'm
>> guessing that's an attempt to get around the problems you are seeing?
>> The filter of course has to change if you relax your commit intervals
>> substantially.
>>
>> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com>
>> wrote:
>>
>>> Here is what I would do; take it with a grain of salt, but it works
>>> solidly.
>>>
>>> Have a single master Solr node that takes all the data as the indexer,
>>> and have "replicas" (they used to be called "slaves," but it's not PC
>>> any more to call them that) serve your users and your reports. This
>>> will keep that one server hot, in that the index stays in memory the
>>> more it's used. Put all of them behind a proxy like nginx so you can
>>> control which server is hot and fail down to the others when needed.
>>> SolrCloud is good in theory but won't be as fast or as reliable. This
>>> is based on my own experience, and I'm sure people will say otherwise,
>>> but standalone Solr is super fast and solid with enough metal to back
>>> it up: enough memory and an SSD to hold your index, and SolrCloud will
>>> never be able to beat it.
>>>
>>> Again, if I were to do it, using the old naming conventions:
>>>
>>> One Solr master indexer -> one live master/slave that replicates as
>>> needed -> your ten or so slaves (10 is not needed; stick to three and
>>> see), replicating every 5 or so minutes.
>>>
>>> User -> nginx proxy -> slaves in order, but only use one: no round
>>> robin, just one, and fail over to the next on failure.
>>> Reports go to a different slave, but that's OK since it's the same
>>> data the users see, and it won't compete for resources.
>>>
>>> Optimize each of these three things to do what they are supposed to
>>> do. An indexing server is different from a search server in that way.
>>>
>>> Just my thoughts and experience with a few terabytes of an index.
>>> Also, be certain the machines have three times as much space ready as
>>> a full index, and keep your heap below 32 GB on everything. Servers,
>>> SSDs, and memory are cheap, and master/slave replication is the most
>>> reliable.
>>>
>>> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <
>>> mar...@matosconsulting.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I recently inherited a team/app that has been running on a single
>>> > instance of Solr for many years. An attempt was made to migrate to a
>>> > 10-node cluster configuration, and we immediately encountered some
>>> > issues which appear to be related to data being read from nodes
>>> > where replication had not yet completed. The highlights:
>>> >
>>> > - 10-node cluster with 5 instances per DC, with a mix of NRT and TLOG
>>> > - Data is sourced from another system in large batches throughout
>>> >   the day (another system triggers our system on an ad hoc basis,
>>> >   which then refreshes data from the upstream system)
>>> > - These updates take from minutes up to 2 hours
>>> > - We have an autoCommit every 1 min and an autoSoftCommit every 1 sec
>>> > - We also have numerous background processes which kick off on a
>>> >   schedule (some every 15 mins, some hourly, some daily), execute
>>> >   queries, and perform a variety of actions based on the current
>>> >   state of the data
>>> >   - e.g. new records = send an email notifying users of some things
>>> >     they need to do
>>> >   - e.g. removed records = send an email notifying users of some
>>> >     updates
>>> >   - (significantly more complex than this)
>>> > - Background jobs are NOT aware of whether a refresh (first bullet)
>>> >   is currently underway
>>> > - Based on our investigation, we *think* our application is getting
>>> >   incomplete results when executing queries during / shortly after
>>> >   data refreshes, and making incorrect decisions (e.g. notifying
>>> >   users that some records were removed when they actually weren't,
>>> >   followed by a future notification that the records are back)
>>> >
>>> > Would appreciate any advice or things to consider based on the above.
>>> >
>>> > Thank you!

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
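P.S. For anyone who wants to try the standalone leader/follower (formerly master/slave) setup Dave describes, a rough sketch of the classic replication config, with illustrative host/core names and his 5-minute poll, using the Solr 9 spellings. On each follower's solrconfig.xml:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="follower">
      <str name="leaderUrl">http://indexer-host:8983/solr/yourcore/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

And on the indexing node:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="leader">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
    </lst>
  </requestHandler>

Older releases spell these "slave"/"masterUrl" and "master", which matches the naming Dave is using.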