Hi,

We have a SolrCloud setup indexing data into 4 collections, and we recently
upgraded both the cluster topology and the Solr version to improve
scalability and availability. For comparison:

Old system:
1. 2 Solr nodes
2. 1 shard per collection
3. 2 TLOG replicas per shard (1 leader and 1 follower), so 2 replicas per
collection
4. Each Solr node hosts 1 replica of each collection
5. Solr 7.7.1
6. Solr node information from the admin console for one of the nodes (both
nodes are nearly identical in configuration and resource allocation, but run
on different hosts):
   Linux 4.19.0-8-amd64, 16 CPUs
   Memory: 94.4Gb
   File descriptors: 255/65535
   Disk: 6.0Tb, used: 59%
   Load: 0.23

New system:
1. 4 Solr nodes
2. 2 shards per collection
3. 2 TLOG replicas per shard (1 leader and 1 follower), so 4 replicas per
collection
4. Each Solr node hosts 1 replica of each collection
5. Solr 8.11.1
6. Solr node information from the admin console for one of the nodes (all
nodes are nearly identical in configuration and resource allocation, but run
on different hosts):
   Linux 4.19.0-18-amd64, 16 CPUs
   Memory: 78.6Gb
   File descriptors: 321/65535
   Disk: 3.6Tb, used: 33%
   Load: 2.86

We run a full indexing (full data import) job over the weekend for 3 of the
collections, one after the other (the 1st on Friday, the 2nd on Saturday and
the 3rd on Sunday). These jobs usually take anywhere between 17 and 30 hours
to finish a full import of all the data into Solr. The full import happens in
batches, and we spread the workload across 10 data import handlers on the
leader replica: for each batch of ids to import we send a /dataimport request
to one of the numbered handlers (like /dataimport1). The data can be sparse
depending on the batch Solr is importing and can vary in size.
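
For illustration, each per-handler request is roughly of the following shape
(the host, collection name and the startId/endId parameter names are
placeholders, not our exact ones; in a setup like this the batch bounds would
be picked up in the SQL via ${dataimporter.request.<param>}):

    curl "http://solr-host:8983/solr/collection1/dataimport1?command=full-import&clean=false&commit=true&startId=1000000&endId=1100000"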

In the new system we regularly see the SQL queries that Solr runs during the
full import (range queries) getting stuck in the "writing to net" state.
Looking at the process list and the running transactions at the time, the
queries appear to have fetched the data but take a long time to send it over
the network. We also run a delta import every minute to index any new data
added to the datasource after the max indexed id in Solr. So whenever the
full import stalls during the weekend, it takes the delta import down with
it, causing the whole indexing system to stall/hang.
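
For reference, this is roughly how I spot the stuck queries ("writing to net"
is a MySQL-style thread state; the exact command depends on your datasource):

    # list statements that are still streaming their result set to the client
    mysql -e "SHOW FULL PROCESSLIST" | grep -i "writing to net"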

When I looked at the Solr server logs, I saw the following exception being
thrown multiple times:

2023-03-18 00:38:47.807 ERROR (Thread-9693) [   ] o.a.s.u.SolrCmdDistributor java.io.IOException: Request processing has stalled for 100079ms with 100 remaining elements in the queue.
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:449) ~[?:?]
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290) ~[?:?]
        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:345) ~[?:?]
        at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:338) ~[?:?]
        at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:244) ~[?:?]
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300) ~[?:?]
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:237) ~[?:?]
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245) ~[?:?]
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106) ~[?:?]
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) ~[?:?]
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118) ~[?:?]
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) ~[?:?]
        at org.apache.solr.update.processor.CloneFieldUpdateProcessorFactory$1.processAdd(CloneFieldUpdateProcessorFactory.java:469) ~[?:?]
        at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80) ~[?:?]
        at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:271) ~[?:?]
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:547) ~[?:?]
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:435) ~[?:?]
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:350) ~[?:?]
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:235) ~[?:?]
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:427) ~[?:?]
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:486) ~[?:?]
        at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:469) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

The above exception is thrown when the leader running the data import has
finished fetching the data from the datasource and tries to forward the
update to its follower: the leader waits too long for the update to be
processed and then fails once the stall timeout is exceeded. However, this
only happens on the new system running Solr 8. On the old system running
Solr 7, the full import never stalled and always went on to finish
successfully. Looking at the Solr documentation, I found that a new stall
timeout was introduced with a default value of 15 seconds. I tried increasing
that value to 100 seconds (it has to stay below the Jetty idle timeout of
120 seconds) to give Solr more time to recover from the stall, but that had
no effect other than delaying the inevitable stall exception.
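
For completeness, this is how I raised it. My understanding from the code is
that the property is solr.cloud.client.stallTime (in milliseconds) and that
the Jetty idle timeout it must stay under is solr.jetty.http.idleTimeout
(default 120000 ms); please correct me if I have those names wrong:

    # in solr.in.sh on every node, followed by a Solr restart
    SOLR_OPTS="$SOLR_OPTS -Dsolr.cloud.client.stallTime=100000"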

I didn't change the indexing logic or the SQL queries when upgrading to the
new SolrCloud system. Could someone help me understand whether there are any
relevant changes in Solr 8 that might be causing this issue, and how I might
investigate it? The issue only seems to happen during the weekend when we run
the full import (a heavy import operation with high network I/O). During the
week we run delta imports every minute, but I never see the "Request
processing has stalled" error then.

I'm curious why only the new system running Solr 8 keeps stalling, while the
older system running Solr 7 never did, for an identical datasource and data
volume.

Regards,
Abhi
