Hi Mikhail,

Thanks for the response.

> This instance mostly idling, at that time it was coordinating one request
> and awaits shard's request to complete, see

The shard is waiting on itself. 10.128.193.11 is the private IP of the same
node where I took this stack trace. In the request below, one node has a PULL
replica and one node has an NRT replica, and we have set the preference to
PULL replicas.

httpShardExecutor-7-thread-939362-processing-x:im-search-03-08-22_shard1_replica_p17
r:core_node18
http://10.128.193.11:8985/solr/im-search-03-08-22_shard1_replica_p17/|http://10.128.99.14:8985/solr/im-search-03-08-22_shard1_replica_n1/
n:10.128.193.11:8985_solr c:im-search-03-08-22 s:shard1
[http://10.128.193.11:8985/solr/im-search-03-08-22_shard1_replica_p17/,
http://10.128.99.14:8985/solr/im-search-03-08-22_shard1_replica_n1/]
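For reference, the PULL preference above is set with the shards.preference
request parameter, roughly like this (shown here as a per-request parameter;
it can equally be applied as a default on the search handler):

  shards.preference=replica.type:PULL

e.g. /solr/im-search-03-08-22/select?q=...&shards.preference=replica.type:PULL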
I tried to track the internal requests for this main request, which took more
than 5 hours to execute with only 9k hits and returned a 0 status code
(successful). There were 12 requests with this RID. 8 of them completed
successfully at 10:41, but the other 4 only completed at 16:22. I checked the
response times of the internal requests, and none of them was greater than
100 ms. This means Solr was waiting on something before executing the
requests. What could that be?

> AFAIK ParallelGC despite its name is quite old and not really performant.

Earlier we were using Java 8 and G1GC with default settings. Recently we
decided to upgrade Java to 15. After the upgrade the application wasn't
performing well: even with fewer GC counts and less GC time, the system was
under load in peak hours. We experimented with ZGC, but that didn't help
either. We then tried Parallel GC, and the system was stable, with no sudden
load peaks in peak hours. That's why we are continuing with Parallel GC.
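For reference, these are roughly the collector switches we moved between in
those experiments (from memory; the surrounding tuning flags varied between
runs):

  -XX:+UseG1GC        (Java 8, default G1 settings)
  -XX:+UseZGC         (tried after the Java 15 upgrade)
  -XX:+UseParallelGC  (what we run today; full settings are in the quoted mail below)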
On Thu, Dec 8, 2022 at 5:31 PM Mikhail Khludnev <m...@apache.org> wrote:

> Hi Satya.
> This instance mostly idling, at that time it were coordinating one request
> and awaits shard request to complete see
>
> https://fastthread.io/same-state-threads.jsp?state=non-daemon&dumpId=1#panel111
>
> https://fastthread.io/same-state-threads.jsp?state=non-daemon&dumpId=1#panel118
>
> that another instance might have some clues in stacktrace. Also, if you
> have 500 errors there might be exceptions; slow query logging might be
> enabled and can give more clues for troubleshooting. AFAIK ParallelGC
> despite its name is quite old and not really performant.
>
> On Thu, Dec 8, 2022 at 2:28 PM Satya Nand <satya.n...@indiamart.com
> .invalid> wrote:
>
> > Hi,
> >
> > Greetings for the day,
> >
> > We are facing a strange problem in Solr cloud where a few requests are
> > taking hours to complete. Some requests return with a 0 status code and
> > some with a 500 status code. The recent request took more than 5 hours to
> > complete with only a 9k results count.
> >
> > These queries create problems in closing old searchers. Sometimes there
> > are 3-4 searchers where one is a new searcher and the others are just stuck
> > because a few queries are taking hours. Finally, the application slows
> > down horribly, and the load increases.
> >
> > I have downloaded the stack trace of the affected node and tried to analyze
> > this stack trace online, but I couldn't get many insights from it.
> >
> > Stack Trace:
> >
> > https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjIvMTIvOC9sb2dzLnR4dC0tMTAtNTUtMzA=&
> >
> > JVM Settings: We are using Parallel GC, can this be causing this much long
> > pause?
> >
> > -XX:+UseParallelGC
> > -XX:-OmitStackTraceInFastThrow
> > -Xms12g
> > -Xmx12g
> > -Xss256k
> >
> > What more can we check here to find the root cause and prevent this from
> > happening again?
> > Thanks in advance
>
> --
> Sincerely yours
> Mikhail Khludnev