On 1/3/24 13:33, rajani m wrote:
A Solr query with LTR as a re-ranker is using the full heap all of a sudden and
triggering an STW pause. Could you please take a look and let me know your
thoughts? What is causing this? The STW pause is putting nodes in an unhealthy
state, causing nodes to restart and bringing the entire cluster down.

As per the logs, the issue seems to be related to LTR generating features at
query time. The model has 12 features; most are Solr queries and a
few are field values. The error from the logs is copied below[2]. I'd say this
is a major bug, as G1GC is supposed to avoid STW. What are your thoughts?

G1 does not completely eliminate stop-the-world pauses.

One of the little details of G1GC operation concerns something called humongous objects.

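Any object that is at least half the G1 region size is classified as humongous. These objects are allocated directly in the old generation. They are generally reclaimed only after a concurrent marking cycle or during a full garbage collection, although recent JVMs can sometimes free them eagerly during young collections.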

The secret to stellar performance with G1 is to eliminate, as much as possible, full GC cycles ... because there will always be a long STW with a full G1GC, whereas G1's region-by-region (young and mixed) collections pause the application only briefly and do their marking concurrently with it.
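
One way to check whether full GCs or humongous allocations are actually happening is to count the relevant lines in the GC log. The following is a minimal sketch, assuming Java 9+ unified GC logging (where full collections are logged as "Pause Full" and humongous-triggered cycles mention "Humongous"). The function name and the log path in the comment are hypothetical.

```shell
# Count GC log lines that mention full pauses and humongous allocations.
# Assumes Java 9+ unified logging (-Xlog:gc*); the patterns are plain substrings.
count_gc_events() {
  log_file="$1"
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  full=$(grep -c 'Pause Full' "$log_file" || true)
  humongous=$(grep -c 'Humongous' "$log_file" || true)
  echo "full=${full} humongous=${humongous}"
}

# Example call (hypothetical path; adjust to your install):
#   count_gc_events /var/solr/logs/solr_gc.log
```

A steadily growing "full" count alongside humongous allocations would be consistent with the explanation above, but it is only a hint, not proof of the cause.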

You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter in your GC tuning ... but be aware that the max region size is 32m. That means that with G1, no matter what, an object that is 16 megabytes or larger will always be humongous. It is my understanding that LTR models can be many megabytes in size, but I have never used the feature myself.
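
The arithmetic behind that 16 megabyte figure can be sketched as follows. The helper name is hypothetical, and the commented GC_TUNE line is only an illustration of how the region size might be requested, not a recommended full configuration.

```shell
# Humongous threshold in KB for a given G1 region size in MB
# (an object of at least half a region is treated as humongous).
humongous_threshold_kb() {
  region_mb="$1"
  echo $(( region_mb * 1024 / 2 ))
}

humongous_threshold_kb 32   # prints 16384, i.e. 16 MB for a 32m region

# A G1 setup that asks for that region size might look like this in solr.in.sh:
#   GC_TUNE="-XX:+UseG1GC -XX:G1HeapRegionSize=32m"
```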

If you are running on Java 11 or later, I recommend giving ZGC a try. This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:

GC_TUNE=" \
  -XX:+UnlockExperimentalVMOptions \
  -XX:+UseZGC \
  -XX:+ParallelRefProcEnabled \
  -XX:+ExplicitGCInvokesConcurrent \
  -XX:+AlwaysPreTouch \
  -XX:+UseNUMA \
"
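
A note on the first flag in that tuning: ZGC was an experimental feature in Java 11 through 14 and became a production feature in Java 15, so `-XX:+UnlockExperimentalVMOptions` is only required on the older versions (leaving it in on newer ones is harmless). A minimal sketch of choosing the flags by Java major version, with a hypothetical helper name:

```shell
# Print the ZGC selection flags for a given Java major version.
# ZGC was experimental in Java 11-14 and became a production feature in Java 15.
zgc_flags() {
  java_major="$1"
  if [ "$java_major" -lt 11 ]; then
    echo "ZGC is not available before Java 11" >&2
    return 1
  elif [ "$java_major" -lt 15 ]; then
    echo "-XX:+UnlockExperimentalVMOptions -XX:+UseZGC"
  else
    echo "-XX:+UseZGC"
  fi
}

# Example: build GC_TUNE for Java 17 (other flags can be appended as needed).
GC_TUNE="$(zgc_flags 17)"
```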

ZGC promises extremely short GC pauses with ANY size heap, even terabytes. I haven't tested it with a large heap myself, but in my limited testing, its individual pauses were MUCH shorter than what I saw with G1. Throughput is lower than G1, but latency is AWESOME.

One bit of warning ... ZGC always uses 64-bit pointers, so the advice you'll commonly see recommending a heap size below 32GB does not apply to ZGC. There is no advantage to a 31GB heap compared to 32GB when using ZGC.

Thanks,
Shawn
