Actually it occurs to me (just after hitting send, of course) that using a field default for this might still be problematic: the value could still vary slightly, since I believe it isn't created until the sub-request reaches each replica, and so it's subject to local clock skew. It's probably safer to add a TimestampUpdateProcessorFactory
<https://solr.apache.org/docs/9_9_0/core/org/apache/solr/update/processor/TimestampUpdateProcessorFactory.html>
so that the timestamp is set once on the first receiving node instead.
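As a sketch, that processor would be wired into solrconfig.xml roughly like this (the chain name is arbitrary, and the field name `indexedDate` is assumed to match your schema). Because the timestamp processor is listed before DistributedUpdateProcessorFactory, it runs once on the node that first receives the update, so every replica indexes the same value:

```xml
<!-- Sketch: set indexedDate once, before the update is distributed to replicas. -->
<updateRequestProcessorChain name="add-timestamp" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">indexedDate</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With this in place you'd also drop the `default="NOW"` from the field definition, since the processor now supplies the value.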
On Mon, Aug 4, 2025 at 4:47 PM Gus Heck <gus.h...@gmail.com> wrote:

> The likely cause of the issue is that replicas are not guaranteed to
> finish commits simultaneously. Solr is eventually consistent
> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#ignoring-commits-from-client-applications-in-solrcloud>.
> If you make 3 fast requests, you can hit [Replica A, Replica B, Replica A],
> where B is ahead of Replica A due to differing commit completion times.
> That final request to A (which still hasn't committed) will make it look
> like a document disappeared.
>
> One thing you can try is to ensure records have an indexedDate field
> identifying exactly when they were indexed, and then filter requests to
> only look at index dates more than a commit interval in the past:
>
> <field name="indexedDate" type="pdate" indexed="true" stored="true"
> default="NOW" /> <!-- Don't ever send a value for this field; let Solr
> fill it in. -->
>
> If you add that field and filter on
>
> indexedDate:[* TO NOW-2MINUTES] OR -indexedDate:*
>
> that will catch previously indexed data (the negated term) and all data
> older than 2 min (the range). That set of data should have attained
> consistency, unless your system is struggling under load and a replica
> simply can't keep up (in which case you are about to have bigger
> problems).
>
> As a side note, your commit intervals are very aggressive, but I'm
> guessing that's an attempt to get around the problems you are seeing?
> The filter of course has to change if you relax your commit intervals
> substantially.
>
> On Mon, Aug 4, 2025 at 12:33 PM Dave <hastings.recurs...@gmail.com> wrote:
>
>> Here is what I would do; take it with a grain of salt, but it works
>> solid.
>>
>> Have a single master Solr node that takes all the data as the indexer,
>> and have the "replicas" (they used to be called "slaves," but it's not
>> PC any more to call them that) serve the users; your reports should use
>> one of these too.
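A minimal Python sketch of building that "settled data only" filter, assuming the field is named indexedDate and the window is two minutes (both taken from the example above; the helper name is hypothetical). Docs indexed before the field existed carry no timestamp at all, so they're included via a negated clause wrapped in `(*:* -field:*)`, the usual Solr idiom for a pure-negative subquery:

```python
def settled_data_filter(field="indexedDate", window="2MINUTES"):
    """Return a Solr fq clause matching only data old enough to have
    reached consistency across replicas."""
    older_than_window = f"{field}:[* TO NOW-{window}]"
    missing_field = f"(*:* -{field}:*)"  # legacy docs with no timestamp
    return f"{older_than_window} OR {missing_field}"

# Attach it to every background-job query as an fq parameter, e.g.:
# params = {"q": "status:new", "fq": settled_data_filter()}
```

The window should stay comfortably larger than the hard-commit interval; widen it if you relax the commit settings.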
>> This will keep that one server hot, as in the index will be in memory
>> the more it's used. Put all of them behind a proxy like nginx so you
>> can control which server is hot and fail down to the others when
>> needed. SolrCloud is good in theory but won't be as fast or reliable.
>> This is based on my own experience, and I'm sure people will say
>> otherwise, but standalone Solr is super fast and solid with enough
>> metal to back it up. With enough memory and an SSD hard drive to hold
>> your index, SolrCloud will never be able to beat it.
>>
>> Again, if I were to do it, using the old naming conventions:
>>
>> One Solr master indexer -> one Solr live master/slave that replicates
>> as needed -> your ten or so slaves (10 is not needed; stick to three
>> and see), replicating every 5 or so minutes.
>>
>> User -> nginx proxy -> slaves in order, but only use one: no round
>> robin, just one, and fail over to the next on failure.
>> Reports go to a different slave, but that's OK since it's the same
>> data as what the users see, and it won't compete for resources.
>>
>> Optimize each of these three things to do what they are supposed to.
>> An indexing server is different from a search server in that way.
>>
>> Just my thoughts and experience with a few terabytes of an index. Also
>> be certain the machines have three times as much space ready as a full
>> index, and keep your heap below 32 GB on everything. Servers, SSDs,
>> and memory are cheap; master/slave replication is the most reliable.
>>
>> > On Aug 4, 2025, at 11:53 AM, Marcus R. Matos <
>> mar...@matosconsulting.com> wrote:
>> >
>> > Hi all,
>> >
>> > I recently inherited a team/app that has been running on a single
>> > instance of SOLR for many years. An attempt was made to migrate to a
>> > 10 node cluster configuration, and we immediately encountered some
>> > issues which appear to be related to the fact that data is being
>> > read from nodes where data replication had not yet completed.
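The "one hot server, fail down to the others" proxy described above could be sketched in nginx like this (hostnames and ports are placeholders, not from the thread). The `backup` parameter means only the first server receives traffic; the others are used solely when it is down, which keeps a single replica's index cached in memory:

```nginx
# Sketch: one hot replica, failover-only backups, no round robin.
upstream solr_replicas {
    server replica1.example.com:8983;         # the hot server
    server replica2.example.com:8983 backup;  # used only on failure
    server replica3.example.com:8983 backup;
}

server {
    listen 80;
    location /solr/ {
        proxy_pass http://solr_replicas;
    }
}
```

Reports would go through a second, similar upstream pointed at a different replica so they don't compete with user traffic for cache.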
>> > The highlights:
>> >
>> >    - 10 node cluster with 5 instances per DC with a mix of NRT and
>> >    TLOG
>> >    - Data is sourced from another system in large batches throughout
>> >    the day (another system triggers our system on an ad hoc basis,
>> >    which then refreshes data from the upstream system)
>> >    - These updates take from minutes up to 2 hours
>> >    - We have an autoCommit every 1 min and an autoSoftCommit every
>> >    1 sec
>> >    - We also have numerous background processes which kick off on a
>> >    schedule (some every 15 mins, some hourly, some daily) which
>> >    execute queries and perform a variety of actions based on the
>> >    current state of the data
>> >       - e.g. new records = send an email notifying users of some
>> >       things they need to do
>> >       - e.g. removed records = send an email notifying users of some
>> >       updates
>> >       - (significantly more complex than this)
>> >    - Background jobs are NOT aware of whether or not a refresh
>> >    (first bullet) is currently underway
>> >    - Based on our investigation, we *think* our application is
>> >    getting incomplete results when executing queries during /
>> >    shortly after data refreshes, and making incorrect decisions
>> >    (e.g. notifying users that some records were removed when they
>> >    actually weren't, followed by a future notification that the
>> >    records are back)
>> >
>> > Would appreciate any advice or things to consider based on the
>> > above.
>> >
>> > Thank you!
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
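For reference, the commit cadence described in the original message (hard commit every minute, soft commit every second, the "very aggressive" intervals Gus flags) corresponds to a solrconfig.xml fragment roughly like this sketch; values are in milliseconds:

```xml
<!-- Sketch of the commit cadence described in the thread. -->
<autoCommit>
  <maxTime>60000</maxTime>          <!-- hard commit every 1 min -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>           <!-- soft commit every 1 sec -->
</autoSoftCommit>
```

If these intervals are relaxed, the 2-minute window in the filter query above needs to grow with them.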