Thank you, Shawn, that was very helpful. I tried the G1HeapRegionSize setting: I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same query logs, but it didn't help; the same OOM error reproduced.
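For readers following along, here is a minimal sketch of how a region-size setting like this might be placed in solr.in.sh. It assumes the GC_TUNE variable shown in the quoted message below; only the 32m value comes from this thread, and the other flags actually used in the test are not stated.

```shell
# Hypothetical solr.in.sh fragment (an assumption, not the exact config used here).
# -XX:+UseG1GC selects G1; -XX:G1HeapRegionSize=32m requests the largest region size.
GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:G1HeapRegionSize=32m \
"
```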
I was able to capture a heap dump when the heap was almost full and generated a heap analysis report with MAT. I have uploaded it to my drive here: <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>. Whenever you can, could you please take a look and let me know your thoughts? Although the issue is reproducible only when the query uses LTR as a re-ranker, the report seems to imply that the core issue originates in the main libraries. Let me know what you think.

I will test with ZGC to see whether it can prevent the STW, old-generation full GC, and will let you know.

Thanks,
Rajani

On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid> wrote:

> On 1/3/24 13:33, rajani m wrote:
> > Solr query with LTR as a re-ranker is using full heap all of sudden and
> > triggering STW pause. Could you please take a look and let me know your
> > thoughts? What is causing this? The STW is putting nodes in an unhealthy
> > state causing nodes to restart and bringing the entire cluster down.
> >
> > As per logs, the issue seems to be related to LTR generating features at
> > query time. The model has 12 features and most features are solr query and
> > few field values. The error from the logs is copied below[2]. I'd say this
> > is a major bug as G1GC is supposed to avoid STW. What are your thoughts?
>
> G1 does not completely eliminate stop-the-world.
>
> One of the little details of G1GC operation concerns something called
> humongous objects.
>
> Any object larger than half the G1 region size is classified as
> humongous. These objects are allocated directly in the old region, and
> the only way they can be collected is during a full garbage collection.
>
> The secret to stellar performance with G1 is to eliminate, as much as
> possible, full GC cycles ... because there will always be a long STW
> with a full G1GC, but G1's region-specific collectors operate almost
> entirely concurrently with the application.
>
> You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter
> in your GC tuning ... but be aware that the max region size is 32m.
> Which means that no matter what when using G1, an object that is 16
> megabytes or larger will always be humongous. It is my understanding
> that LTR models can be many megabytes in size, but I have never used the
> feature myself.
>
> If you are running on Java 11 or later, I recommend giving ZGC a try.
> This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:
>
> GC_TUNE=" \
>   -XX:+UnlockExperimentalVMOptions \
>   -XX:+UseZGC \
>   -XX:+ParallelRefProcEnabled \
>   -XX:+ExplicitGCInvokesConcurrent \
>   -XX:+AlwaysPreTouch \
>   -XX:+UseNUMA \
> "
>
> ZGC promises extremely short GC pauses with ANY size heap, even
> terabytes. I haven't tested it with a large heap myself, but in my
> limited testing, its individual pauses were MUCH shorter than what I saw
> with G1. Throughput is lower than G1, but latency is AWESOME.
>
> One bit of warning ... ZGC always uses 64-bit pointers, so the advice
> you'll commonly see recommending a heap size below 32GB does not apply
> to ZGC. There is no advantage to a 31GB heap compared to 32GB when
> using ZGC.
>
> Thanks,
> Shawn
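
The half-region arithmetic from the quoted message can be sketched as a small shell helper. This is only an illustration under the rule described above (an object at half the region size or larger is humongous); `humongous_threshold_kb` is a hypothetical name, not a JVM or Solr tool.

```shell
#!/bin/sh
# Sketch: compute the size at which G1 would treat an object as humongous,
# assuming the "half a region" rule described in the quoted message.
# humongous_threshold_kb is a hypothetical helper used only for illustration.
humongous_threshold_kb() {
  region_mb="$1"                      # G1 region size in megabytes (1 to 32)
  echo $(( region_mb * 1024 / 2 ))    # half the region, expressed in kilobytes
}

# Print the threshold for each power-of-two region size up to the 32m maximum.
for size in 1 2 4 8 16 32; do
  echo "region=${size}m -> humongous at about $(humongous_threshold_kb "$size") KB"
done
```

With the 32m maximum, the threshold comes out to 16384 KB (16 MB), so raising the region size cannot help for any single allocation at or above that size.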