Hi Mikhail,
Thanks for the response.

> This instance mostly idling, at that time it was coordinating one request
> and awaits shard's request to complete see


The shard is waiting on itself. 10.128.193.11 is the private IP of the same
node where I took this stack trace. In the request below, one node has a
PULL replica and one node has an NRT replica, and we have set the shard
preference to PULL replicas.

httpShardExecutor-7-thread-939362-processing-x:im-search-03-08-22_shard1_replica_p17
r:core_node18
http:////10.128.193.11:8985//solr//im-search-03-08-22_shard1_replica_p17//|http:////10.128.99.14:8985//solr//im-search-03-08-22_shard1_replica_n1//
n:10.128.193.11:8985_solr c:im-search-03-08-22 s:shard1
[http:////10.128.193.11:8985//solr//im-search-03-08-22_shard1_replica_p17//,
http:////10.128.99.14:8985//solr//im-search-03-08-22_shard1_replica_n1//]
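For reference, this is roughly how that PULL preference can be expressed
from SolrJ; a minimal sketch, where the ZooKeeper hosts and the query are
placeholders rather than our actual setup:

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PullPreferenceQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; ours differs.
        List<String> zkHosts = List.of("zk1:2181", "zk2:2181", "zk3:2181");
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            client.setDefaultCollection("im-search-03-08-22");

            SolrQuery q = new SolrQuery("*:*");
            // Ask the coordinator to route shard sub-requests to PULL replicas.
            q.set("shards.preference", "replica.type:PULL");

            QueryResponse rsp = client.query(q);
            System.out.println("numFound=" + rsp.getResults().getNumFound());
        }
    }
}

The same shards.preference=replica.type:PULL value can also be passed as a
plain request parameter or put in the request handler defaults.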

I tried to track the internal requests for this main request, which took
more than 5 hours to execute with only about 9k hits, and it still returned
a 0 status code (successful).

There were 12 internal requests with this RID. 8 of them completed
successfully at 10:41, but the other 4 only completed at 16:22. I checked
the response times of the internal requests, and none of them had a
response time greater than 100 ms. This suggests Solr was waiting on
something before it even started executing those requests.
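For completeness, this is roughly how I pulled those internal requests out
of the logs; a minimal sketch that assumes the default log location and
that the RID shows up as an rid= parameter in the request log lines (the
path and the RID below are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class RidGrep {
    public static void main(String[] args) throws Exception {
        // Placeholder log path; substitute the real one.
        Path log = Path.of("/var/solr/logs/solr.log");
        String rid = args[0]; // the RID of the slow top-level request

        try (Stream<String> lines = Files.lines(log)) {
            lines.filter(line -> line.contains("rid=" + rid))
                 // Each matching line carries its own timestamp, status and QTime.
                 .forEach(System.out::println);
        }
    }
}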

What could that be?


> AFAIK ParallelGC
> despite its name is quite old and not really performant.


Earlier we were using Java 8 and G1GC with default settings. Recently we
decided to upgrade to Java 15, and after the upgrade the application wasn't
performing well: even with fewer GC cycles and less total GC time, the
system was under heavy load during peak hours.

We experimented with ZGC, but that also didn't help.

We then tried Parallel GC, and the system has been stable, with no sudden
load spikes during peak hours. That's why we are continuing with Parallel GC.
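
For what it's worth, unified GC and safepoint logging (Java 15 syntax)
should tell us whether the collector is actually pausing during these
windows; a sketch, with the log path as a placeholder:

-Xlog:gc*,safepoint:file=/var/solr/logs/gc.log:time,uptime,level,tags:filecount=10,filesize=20m

If that log shows no long pauses between 10:41 and 16:22, the wait is most
likely elsewhere (a thread pool, a lock, or the other node) rather than GC.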




On Thu, Dec 8, 2022 at 5:31 PM Mikhail Khludnev <m...@apache.org> wrote:

> Hi Satya.
> This instance mostly idling, at that time it were coordinating one request
> and awaits shard request to complete see
>
> https://fastthread.io/same-state-threads.jsp?state=non-daemon&dumpId=1#panel111
>
>
> https://fastthread.io/same-state-threads.jsp?state=non-daemon&dumpId=1#panel118
> that another instance might have some clues in stacktrace. Also, if you
> have 500 errors there might be exceptions; slow query logging might be
> enabled and can give more clues for troubleshooting. AFAIK ParallelGC
> despite its name is quite old and not really performant.
>
> On Thu, Dec 8, 2022 at 2:28 PM Satya Nand <satya.n...@indiamart.com
> .invalid>
> wrote:
>
> > Hi,
> >
> > Greetings for the day,
> >
> > We are facing a strange problem in Solr cloud where a few requests are
> > taking hours to complete. Some requests return with a 0 status code and
> > some with a 500 status code. The recent request took more than 5 hours to
> > complete with only a 9k results count.
> >
> >
> > These queries create problems in closing old searchers,  Some times there
> > are 3-4 searchers where one is a new searcher and the others are just
> stuck
> > because a few queries are taking hours. Finally, the application slows
> > down horribly, and the load increases.
> >
> > I have downloaded the stack trace of the affected node and tried to
> analyze
> > this stack trace online. but I couldn't get many insights from it.
> > .
> >
> > Stack Trace:
> >
> >
> >
> https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjIvMTIvOC9sb2dzLnR4dC0tMTAtNTUtMzA=&;
> >
> > JVM Settings: We are using Parallel GC, can this be causing such long
> > pauses?
> >
> > -XX:+UseParallelGC
> > -XX:-OmitStackTraceInFastThrow
> > -Xms12g
> > -Xmx12g
> > -Xss256k
> >
> > What more we can check here to find the root cause and prevent this from
> > happening again?
> > Thanks in advance
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
