reRankDocs is set to 1000. I would try with a lower number, like 100. If the 
best match is not in the top 100 documents, something is wrong with the base 
relevance algorithm.
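
As a minimal sketch of what lowering that value looks like in the request, assuming the standard LTR re-rank syntax (`rq={!ltr ...}`). The model name `myModel`, the query text, and the helper function are hypothetical placeholders, not taken from the actual setup:

```python
from urllib.parse import urlencode


def ltr_rerank_params(user_query, model, rerank_docs=100):
    """Build Solr request parameters that re-rank the top hits with an LTR model."""
    return {
        "q": user_query,
        # reRankDocs caps how many top documents from the base query are
        # passed to the LTR model for feature extraction and rescoring.
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
        "fl": "id,score",
    }


# Hypothetical query and model name, for illustration only.
params = ltr_rerank_params("laptop", "myModel", rerank_docs=100)
print(urlencode(params))
```

With fewer documents to rescore, fewer feature vectors have to be built per query, which should reduce per-request heap pressure.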

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 4, 2024, at 9:28 AM, rajani m <rajinima...@gmail.com> wrote:
> 
> Thank you Shawn, that was very helpful.  I have tried the G1HeapRegionSize
> setting. I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same
> query logs, but it didn't help; the same OOM error was reproduced.
> 
> I was able to capture the heap dump when the heap was almost full and have
> the heap analysis report generated by MAT, uploaded here on my drive
> <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>.
> Whenever you can, could you please take a look and let me know your
> thoughts? Although the issue is reproducible only when the query has LTR as
> the reranker, the report seems to imply that the core issue originates in
> the main libraries. Let me know what you think.
> 
> I will test with ZGC and see if it can prevent STW pauses and
> old-generation full GCs, and will let you know.
> 
> Thanks,
> Rajani
> 
> 
> On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid>
> wrote:
> 
>> On 1/3/24 13:33, rajani m wrote:
>>> A Solr query with LTR as a re-ranker is using the full heap all of a
>>> sudden and triggering an STW pause. Could you please take a look and let
>>> me know your thoughts? What is causing this? The STW pause is putting
>>> nodes in an unhealthy state, causing nodes to restart and bringing the
>>> entire cluster down.
>>> 
>>> As per the logs, the issue seems to be related to LTR generating features
>>> at query time. The model has 12 features; most are Solr queries and a few
>>> are field values. The error from the logs is copied below[2].  I'd say
>>> this is a major bug, as G1GC is supposed to avoid STW.  What are your
>>> thoughts?
>> 
>> G1 does not completely eliminate stop-the-world.
>> 
>> One of the little details of G1GC operation concerns something called
>> humongous objects.
>> 
>> Any object larger than half the G1 region size is classified as
>> humongous.  These objects are allocated directly in the old generation,
>> and they are reclaimed far less readily than ordinary objects; in the
>> worst case that only happens during a full garbage collection.
>> 
>> The secret to stellar performance with G1 is to eliminate, as much as
>> possible, full GC cycles ... because there will always be a long STW
>> with a full G1GC, but G1's region-specific collectors operate almost
>> entirely concurrently with the application.
>> 
>> You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter
>> in your GC tuning ... but be aware that the max region size is 32m.
>> That means that with G1, no matter what, an object that is 16
>> megabytes or larger will always be humongous.  It is my understanding
>> that LTR models can be many megabytes in size, but I have never used the
>> feature myself.
>> 
>> If you are running on Java 11 or later, I recommend giving ZGC a try.
>> This is the tuning I use in /etc/default/solr.in.sh.  I use OpenJDK 17:
>> 
>> GC_TUNE=" \
>>   -XX:+UnlockExperimentalVMOptions \
>>   -XX:+UseZGC \
>>   -XX:+ParallelRefProcEnabled \
>>   -XX:+ExplicitGCInvokesConcurrent \
>>   -XX:+AlwaysPreTouch \
>>   -XX:+UseNUMA \
>> "
>> 
>> ZGC promises extremely short GC pauses with ANY size heap, even
>> terabytes.  I haven't tested it with a large heap myself, but in my
>> limited testing, its individual pauses were MUCH shorter than what I saw
>> with G1.  Throughput is lower than G1, but latency is AWESOME.
>> 
>> One bit of warning ... ZGC always uses 64-bit pointers, so the advice
>> you'll commonly see recommending a heap size below 32GB does not apply
>> to ZGC.  There is no advantage to a 31GB heap compared to 32GB when
>> using ZGC.
>> 
>> Thanks,
>> Shawn
>> 
>> 
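
The humongous-object arithmetic in Shawn's quoted explanation can be sketched as below. This is only an illustration of the "half a region" rule as described in the thread; the helper name is hypothetical, and the 1m to 32m range of region sizes is assumed from the 32m maximum mentioned above:

```python
MIB = 1024 * 1024


def humongous_threshold_bytes(region_size_bytes):
    """Return half of one G1 region, the size at which allocations become humongous."""
    return region_size_bytes // 2


# Power-of-two region sizes from 1m up to the 32m maximum cited in the thread.
for region_mib in (1, 2, 4, 8, 16, 32):
    threshold = humongous_threshold_bytes(region_mib * MIB)
    print(f"-XX:G1HeapRegionSize={region_mib}m -> humongous at about {threshold / MIB:g} MiB")
```

Even at the largest region size, a single allocation of roughly 16 MiB still counts as humongous, which would explain why raising the region size alone did not help.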
