Try it with 100 and see if it runs out of heap. If it does not run out, then the size of reRankDocs is the cause.
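That is just a change to the rq parameter on the request, roughly like this (the model name and efi parameter are placeholders for whatever your setup uses):

  rq={!ltr model=myModel reRankDocs=100 efi.user_query=$q}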
You can increase the heap if you want to, but if the reranker is moving a document 1000 places in the result list, I would look seriously at improving the base relevance. You might include an aggregate popularity, for example. Maybe add overall recency (a rough sketch of that kind of boost is at the bottom of this message).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 4, 2024, at 10:28 AM, rajani m <rajinima...@gmail.com> wrote:
>
> Hi Wunder,
>
> The base ranker takes care of matching and ranking docs based on qf, pf2 and pf3. The LTR re-ranker looks at a bunch of user-behavior fields/features such as date (recency), popularity, favorited, and saves, and hence reranking 1k presents better quality than the top 100.
>
> Thanks,
> Rajani
>
> On Thu, Jan 4, 2024 at 12:33 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
>> reRankDocs is set to 1000. I would try with a lower number, like 100. If the best match is not in the top 100 documents, something is wrong with the base relevance algorithm.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Jan 4, 2024, at 9:28 AM, rajani m <rajinima...@gmail.com> wrote:
>>>
>>> Thank you Shawn, that was very helpful. I have tried the G1HeapRegionSize setting. I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same query logs, but it didn't help; it reproduced the same OOM error.
>>>
>>> I was able to capture the heap dump when the heap was almost full and have the heap analysis report generated by MAT, uploaded here on my drive <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>. Whenever you can, could you please take a look and let me know your thoughts? Although the issue is reproducible only when the query uses LTR as the reranker, the report seems to imply that the core issue originates in the main libraries. Let me know what you think.
>>>
>>> I will test with ZGC and see if it can prevent the STW old-generation full GC, and will let you know.
>>>
>>> Thanks,
>>> Rajani
>>>
>>> On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid> wrote:
>>>
>>>> On 1/3/24 13:33, rajani m wrote:
>>>>> A Solr query with LTR as a re-ranker is suddenly using the full heap and triggering STW pauses. Could you please take a look and let me know your thoughts? What is causing this? The STW is putting nodes in an unhealthy state, causing nodes to restart and bringing the entire cluster down.
>>>>>
>>>>> As per the logs, the issue seems to be related to LTR generating features at query time. The model has 12 features; most features are Solr queries and a few are field values. The error from the logs is copied below[2]. I'd say this is a major bug, as G1GC is supposed to avoid STW. What are your thoughts?
>>>>
>>>> G1 does not completely eliminate stop-the-world.
>>>>
>>>> One of the little details of G1GC operation concerns something called humongous objects.
>>>>
>>>> Any object larger than half the G1 region size is classified as humongous. These objects are allocated directly in the old region, and the only way they can be collected is during a full garbage collection.
>>>>
>>>> The secret to stellar performance with G1 is to eliminate, as much as possible, full GC cycles ...
>>>> because there will always be a long STW with a full G1GC, but G1's region-specific collectors operate almost entirely concurrently with the application.
>>>>
>>>> You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter in your GC tuning ... but be aware that the max region size is 32m. Which means that no matter what, when using G1, an object that is 16 megabytes or larger will always be humongous. It is my understanding that LTR models can be many megabytes in size, but I have never used the feature myself.
>>>>
>>>> If you are running on Java 11 or later, I recommend giving ZGC a try. This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:
>>>>
>>>> GC_TUNE=" \
>>>>   -XX:+UnlockExperimentalVMOptions \
>>>>   -XX:+UseZGC \
>>>>   -XX:+ParallelRefProcEnabled \
>>>>   -XX:+ExplicitGCInvokesConcurrent \
>>>>   -XX:+AlwaysPreTouch \
>>>>   -XX:+UseNUMA \
>>>> "
>>>>
>>>> ZGC promises extremely short GC pauses with ANY size heap, even terabytes. I haven't tested it with a large heap myself, but in my limited testing, its individual pauses were MUCH shorter than what I saw with G1. Throughput is lower than G1, but latency is AWESOME.
>>>>
>>>> One bit of warning ... ZGC always uses 64-bit pointers, so the advice you'll commonly see recommending a heap size below 32GB does not apply to ZGC. There is no advantage to a 31GB heap compared to 32GB when using ZGC.
>>>>
>>>> Thanks,
>>>> Shawn
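For the aggregate popularity and recency idea at the top, a rough sketch using additive edismax boost functions (the popularity and last_modified field names are made up; substitute your own):

  bf=log(sum(popularity,1))
  bf=recip(ms(NOW/HOUR,last_modified),3.16e-11,1,1)

The recip() term decays from 1 at index time toward roughly 0.5 after a year (3.16e-11 is about 1 / milliseconds-per-year); whether additive bf or a multiplicative boost parameter works better depends on the score scale of the base query.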