Hi,

We have a SolrCloud setup indexing data for 4 collections, and we recently upgraded the cluster to improve scalability and availability. For comparison:
Old system:
1. 2 Solr nodes
2. 1 shard per collection
3. 2 TLOG replicas per shard (1 leader and 1 follower), so 2 replicas per collection
4. Each Solr node hosts 1 replica for each collection
5. Solr 7.7.1
6. Node information from the admin console for one of the nodes (both nodes are nearly identical in configuration and resource allocation, but run on different hosts):
   Linux 4.19.0-8-amd64, 16 CPU
   Memory: 94.4 GB
   File descriptors: 255/65535
   Disk: 6.0 TB, 59% used
   Load: 0.23

New system:
1. 4 Solr nodes
2. 2 shards per collection
3. 2 TLOG replicas per shard (1 leader and 1 follower), so 4 replicas per collection
4. Each Solr node hosts 1 replica for each collection
5. Solr 8.11.1
6. Node information from the admin console for one of the nodes (all nodes are nearly identical in configuration and resource allocation, but run on different hosts):
   Linux 4.19.0-18-amd64, 16 CPU
   Memory: 78.6 GB
   File descriptors: 321/65535
   Disk: 3.6 TB, 33% used
   Load: 2.86

We run a full indexing (full data import) job over the weekend for 3 of the collections, one after the other (the 1st on Friday, the 2nd on Saturday, the 3rd on Sunday). These jobs usually take anywhere between 17 and 30 hours to import the entire dataset into Solr. The full import happens in batches, and we use the Data Import Handler threads to spread the workload across 10 handlers on the leader replica: we send a /dataimport request for a batch of ids to the numbered handler (e.g. /dataimport1). The data can be sparse depending on the batch being imported and can vary in size. In the new system we regularly see the SQL queries that Solr runs during full import (range queries) getting stuck in the state "writing to net". Looking at the process list and running transactions at the time, the queries seem to have fetched the data but take a long time to send it over the network.
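For reference, one of the batched requests looks roughly like this. The host, collection name, and id-range parameter names below are illustrative placeholders (the exact parameters depend on our data-config.xml), not the literal production request:

```shell
# Illustrative sketch of one batched full-import request sent to a
# numbered DIH handler (dataimport1..dataimport10) on the leader replica.
# Host, collection, and id-range parameter names are placeholders.
curl "http://solr-host:8983/solr/mycollection/dataimport1" \
  --data-urlencode "command=full-import" \
  --data-urlencode "clean=false" \
  --data-urlencode "startId=1000000" \
  --data-urlencode "endId=1100000"
```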
We also have a delta import that runs every minute to index any new data added to the datasource after the max indexed id in Solr. So, whenever the full import stalls during the weekend, it seems to take the delta import down with it, stalling/hanging the whole indexing system. In the Solr server logs I see the following exception thrown multiple times:

2023-03-18 00:38:47.807 ERROR (Thread-9693) [ ] o.a.s.u.SolrCmdDistributor
java.io.IOException: Request processing has stalled for 100079ms with 100 remaining elements in the queue.
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449) ~[?:?]
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290) ~[?:?]
    at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345) ~[?:?]
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338) ~[?:?]
    at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244) ~[?:?]
    at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300) ~[?:?]
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:237) ~[?:?]
    at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245) ~[?:?]
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106) ~[?:?]
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) ~[?:?]
    at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118) ~[?:?]
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) ~[?:?]
    at org.apache.solr.update.processor.CloneFieldUpdateProcessorFactory$1.processAdd(CloneFieldUpdateProcessorFactory.java:469) ~[?:?]
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80) ~[?:?]
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:271) ~[?:?]
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:547) ~[?:?]
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:435) ~[?:?]
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:350) ~[?:?]
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:235) ~[?:?]
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:427) ~[?:?]
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:486) ~[?:?]
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:469) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

The above exception is thrown when the leader running the data import has finished fetching the data from the datasource and tries to send the update to its follower. The leader waits too long for the update to complete and then fails at the stall timeout. However, this only happens on the new system running Solr 8; on the old system running Solr 7, the full import never stalled and always finished successfully. Looking at the Solr documentation, I found that a new stall timeout was introduced with a default value of 15 seconds. I even tried increasing that value to 100 seconds (it has to be less than the Jetty idle timeout of 120 seconds) to give Solr more time to recover from the stall, but that had no effect other than delaying the inevitable stall exception. I haven't changed the indexing logic or the SQL queries when upgrading to the new SolrCloud system.
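For context, this is roughly how I raised the two timeouts, assuming the standard solr.in.sh setup; the system property names (solr.cloud.client.stallTime and solr.jetty.http.idleTimeout) are the ones used by Solr 8.x, but please correct me if I've misread them:

```shell
# solr.in.sh — illustrative config fragment.
# Stall timeout raised from the 15000 ms default to 100 s; it must stay
# below the Jetty idle timeout (120 s default), shown here explicitly.
SOLR_OPTS="$SOLR_OPTS -Dsolr.cloud.client.stallTime=100000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.jetty.http.idleTimeout=120000"
```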
Could someone help me understand whether there are any relevant changes in Solr 8 that might be causing this issue, and how I might investigate it? The issue only happens during the weekend full imports (heavy import operations with high network I/O). During the week we run delta imports every minute, but I never see the "Request processing has stalled" error then. I'm curious why only the new system running Solr 8 keeps stalling while the old system running Solr 7 did not, for an identical datasource and size of data.

Regards,
Abhi