Hi Wunder,

The base ranker takes care of matching and ranking docs based on qf, pf2, and pf3. The LTR re-ranker looks at a bunch of user-behavior fields/features such as date (recency), popularity, favorites, and saves, so re-ranking the top 1000 gives better quality than re-ranking only the top 100. A rough sketch of the request is below.
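Roughly, the re-rank request looks like this (collection, field, and model names here are placeholders rather than our actual config):

curl "http://localhost:8983/solr/mycollection/select" \
  --data-urlencode "q=running shoes" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=title description" \
  --data-urlencode "pf2=title description" \
  --data-urlencode "pf3=title description" \
  --data-urlencode "rq={!ltr model=myLtrModel reRankDocs=1000 efi.user_query='running shoes'}" \
  --data-urlencode "fl=id,score"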
Thanks,
Rajani

On Thu, Jan 4, 2024 at 12:33 PM Walter Underwood <wun...@wunderwood.org> wrote:

> reRankDocs is set to 1000. I would try with a lower number, like 100. If
> the best match is not in the top 100 documents, something is wrong with
> the base relevance algorithm.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Jan 4, 2024, at 9:28 AM, rajani m <rajinima...@gmail.com> wrote:
> >
> > Thank you Shawn, that was very helpful. I tried the G1HeapRegionSize
> > setting. I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same
> > query logs, but it didn't help; it reproduced the same OOM error.
> >
> > I was able to capture a heap dump when the heap was almost full and have
> > the heap analysis report generated by MAT, uploaded here on my drive
> > <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>.
> > Whenever you can, could you please take a look and let me know your
> > thoughts? Although the issue is reproducible only when the query uses LTR
> > as the reranker, the report seems to imply that the core issue originates
> > in the main libraries. Let me know what you think.
> >
> > I will test with ZGC and see if it can prevent the stop-the-world
> > old-generation full GC, and will let you know.
> >
> > Thanks,
> > Rajani
> >
> > On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid>
> > wrote:
> >
> >> On 1/3/24 13:33, rajani m wrote:
> >>> A Solr query with LTR as a re-ranker is suddenly using the full heap
> >>> and triggering STW pauses. Could you please take a look and let me know
> >>> your thoughts? What is causing this? The STW is putting nodes in an
> >>> unhealthy state, causing nodes to restart and bringing the entire
> >>> cluster down.
> >>>
> >>> Per the logs, the issue seems to be related to LTR generating features
> >>> at query time. The model has 12 features; most are Solr queries and a
> >>> few are field values. The error from the logs is copied below [2]. I'd
> >>> say this is a major bug, as G1GC is supposed to avoid STW. What are
> >>> your thoughts?
> >>
> >> G1 does not completely eliminate stop-the-world.
> >>
> >> One of the little details of G1GC operation concerns something called
> >> humongous objects.
> >>
> >> Any object larger than half the G1 region size is classified as
> >> humongous. These objects are allocated directly in the old region, and
> >> the only way they can be collected is during a full garbage collection.
> >>
> >> The secret to stellar performance with G1 is to eliminate, as much as
> >> possible, full GC cycles ... because there will always be a long STW
> >> with a full G1GC, but G1's region-specific collectors operate almost
> >> entirely concurrently with the application.
> >>
> >> You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter
> >> in your GC tuning ... but be aware that the max region size is 32m,
> >> which means that when using G1, an object that is 16 megabytes or larger
> >> will always be humongous. It is my understanding that LTR models can be
> >> many megabytes in size, but I have never used the feature myself.
> >>
> >> If you are running on Java 11 or later, I recommend giving ZGC a try.
> >> This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:
> >>
> >> GC_TUNE=" \
> >>   -XX:+UnlockExperimentalVMOptions \
> >>   -XX:+UseZGC \
> >>   -XX:+ParallelRefProcEnabled \
> >>   -XX:+ExplicitGCInvokesConcurrent \
> >>   -XX:+AlwaysPreTouch \
> >>   -XX:+UseNUMA \
> >> "
> >>
> >> ZGC promises extremely short GC pauses with ANY size heap, even
> >> terabytes. I haven't tested it with a large heap myself, but in my
> >> limited testing, its individual pauses were MUCH shorter than what I saw
> >> with G1. Throughput is lower than G1, but latency is AWESOME.
> >>
> >> One bit of warning ... ZGC always uses 64-bit pointers, so the advice
> >> you'll commonly see recommending a heap size below 32GB does not apply
> >> to ZGC. There is no advantage to a 31GB heap compared to a 32GB heap
> >> when using ZGC.
> >>
> >> Thanks,
> >> Shawn
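P.S. For reference, this is roughly the G1 variant of GC_TUNE I tested in /etc/default/solr.in.sh before trying ZGC. It is just a sketch: heap-size flags are omitted, and apart from the 32m region size the flags shown are the usual Solr G1 defaults rather than anything specific to our setup.

GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:G1HeapRegionSize=32m \
  -XX:+ParallelRefProcEnabled \
  -XX:+AlwaysPreTouch \
"

With GC logging enabled (-Xlog:gc*), the per-collection "Humongous regions: X->Y" lines should show whether humongous allocations are piling up while the LTR queries run.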