Hi Shawn, Thanks for the response. Based on the questions you have provided I have collected the following data.
a. This is the stack trace : 2021-12-16 18:05:11.834 ERROR (qtp1765690649-66565) [c:ss s:shard1 r:core_node24 x:ss_shard1_replica_n23] o.a.s.u.SolrCmdDistributor java.io.IOException: Request processing has stalled for 20032ms with 100 remaining elements in the queue. at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:448) at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290) at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344) at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:337) at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:242) at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribAdd(DistributedZkUpdateProcessor.java:300) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:233) at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245) at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103) at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:332) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:281) at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:236) at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:127) at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:122) at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221) at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) at org.eclipse.jetty.server.Server.handle(Server.java:500) at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383) at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375) at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:335) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135) at org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:170) at org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:125) at org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:348) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129) at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) at java.base/java.lang.Thread.run(Thread.java:834) b. Max heap on the server is 32GB c. We are using the default GC tuning - not customized anything around it. d. I see the following GC logs around that time corresponding to the error above. [2021-12-16T18:02:39.902+0000][323265.234s] GC(1711) Pause Young (Normal) (G1 Evacuation Pause) [2021-12-16T18:02:39.902+0000][323265.234s] GC(1711) Using 8 workers of 8 for evacuation [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Pre Evacuate Collection Set: 0.2ms [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Evacuate Collection Set: 46.1ms [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Post Evacuate Collection Set: 28.8ms [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Other: 2.2ms [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Eden regions: 1189->0(1187) [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Survivor regions: 39->41(154) [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Old regions: 219->219 [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Humongous regions: 185->185 [2021-12-16T18:02:39.979+0000][323265.311s] GC(1711) Metaspace: 86065K->86065K(92160K) [2021-12-16T18:02:39.980+0000][323265.311s] GC(1711) Pause Young (Normal) (G1 Evacuation Pause) 26092M->7101M(32768M) 77.166ms [2021-12-16T18:02:39.980+0000][323265.312s] GC(1711) User=0.57s Sys=0.00s Real=0.07s [2021-12-16T18:06:21.343+0000][323486.675s] GC(1712) Pause Young (Normal) (G1 Evacuation Pause) [2021-12-16T18:06:21.343+0000][323486.675s] GC(1712) Using 8 workers of 8 for evacuation [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Pre Evacuate Collection Set: 0.2ms [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Evacuate Collection Set: 63.6ms [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Post Evacuate Collection Set: 17.4ms [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Other: 2.3ms [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Eden regions: 1187->0(1198) [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Survivor regions: 41->30(154) [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Old regions: 219->219 [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Humongous regions: 197->197 [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Metaspace: 86065K->86065K(92160K) [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) Pause Young (Normal) (G1 Evacuation Pause) 26285M->7122M(32768M) 83.507ms [2021-12-16T18:06:21.426+0000][323486.758s] GC(1712) User=0.62s Sys=0.00s Real=0.08s [2021-12-16T18:07:45.174+0000][323570.506s] GC(1713) Pause Young (Normal) (G1 Evacuation Pause) [2021-12-16T18:07:45.174+0000][323570.506s] GC(1713) Using 8 workers of 8 for evacuation [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Pre Evacuate Collection Set: 0.2ms [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Evacuate Collection Set: 44.9ms [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Post Evacuate Collection Set: 11.8ms [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Other: 2.3ms [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Eden regions: 1198->0(1205) [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Survivor regions: 30->23(154) [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Old regions: 219->219 [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Humongous regions: 197->197 [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Metaspace: 86065K->86065K(92160K) [2021-12-16T18:07:45.233+0000][323570.565s] GC(1713) Pause Young (Normal) (G1 Evacuation Pause) 26290M->7020M(32768M) 59.056ms [2021-12-16T18:07:45.234+0000][323570.565s] GC(1713) User=0.43s Sys=0.00s Real=0.06s Regards, Nikhilesh Jannu On Thu, Dec 16, 2021 at 5:34 PM Shawn Heisey <apa...@elyograg.org> wrote: > On 12/16/2021 1:22 PM, Nikhilesh Jannu wrote: > > One of our SOLR Cloud cluster has started showing lot of errors related > to > > > > java.io.IOException: Request processing has stalled for 20029ms with 100 > > remaining elements in the queue. > > The error message will have significantly more detail than that. We > would need to see ALL the lines in the error message to have any hope of > figuring out where the problem is happening. > > What is the max heap size for Solr? Have you changed the GC tuning? > Can you provide the solr GC logs from that server? Note that you cannot > attach those logs to email because the mailing list will eat them ... > you will need to put them on a file-sharing website and provide a URL to > download them. > > Thanks, > Shawn >