Hi Wunder,

The base ranker takes care of matching and ranking docs based on qf, pf2, and pf3. The LTR re-ranker looks at a bunch of user-behavior fields/features such as date (recency), popularity, favorites, and saves, so re-ranking the top 1000 gives better quality than re-ranking only the top 100. A rough sketch of the request is below.
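Roughly, the re-rank request looks like this (collection, field, and model names here are placeholders rather than our actual config):

curl "http://localhost:8983/solr/mycollection/select" \
  --data-urlencode "q=running shoes" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=title description" \
  --data-urlencode "pf2=title description" \
  --data-urlencode "pf3=title description" \
  --data-urlencode "rq={!ltr model=myLtrModel reRankDocs=1000 efi.user_query='running shoes'}" \
  --data-urlencode "fl=id,score"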
Thanks,
Rajani

On Thu, Jan 4, 2024 at 12:33 PM Walter Underwood <wun...@wunderwood.org> wrote:

> reRankDocs is set to 1000. I would try with a lower number, like 100. If
> the best match is not in the top 100 documents, something is wrong with
> the base relevance algorithm.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Jan 4, 2024, at 9:28 AM, rajani m <rajinima...@gmail.com> wrote:
> >
> > Thank you Shawn, that was very helpful. I tried the G1HeapRegionSize
> > setting. I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same
> > query logs, but it didn't help; it reproduced the same OOM error.
> >
> > I was able to capture a heap dump when the heap was almost full and have
> > the heap analysis report generated by MAT, uploaded here on my drive
> > <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>.
> > Whenever you can, could you please take a look and let me know your
> > thoughts? Although the issue is reproducible only when the query uses LTR
> > as the reranker, the report seems to imply that the core issue originates
> > in the main libraries. Let me know what you think.
> >
> > I will test with ZGC and see if it can prevent the stop-the-world
> > old-generation full GC, and will let you know.
> >
> > Thanks,
> > Rajani
> >
> > On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid>
> > wrote:
> >
> >> On 1/3/24 13:33, rajani m wrote:
> >>> A Solr query with LTR as a re-ranker is suddenly using the full heap
> >>> and triggering STW pauses. Could you please take a look and let me know
> >>> your thoughts? What is causing this? The STW is putting nodes in an
> >>> unhealthy state, causing nodes to restart and bringing the entire
> >>> cluster down.
> >>>
> >>> Per the logs, the issue seems to be related to LTR generating features
> >>> at query time. The model has 12 features; most are Solr queries and a
> >>> few are field values. The error from the logs is copied below [2]. I'd
> >>> say this is a major bug, as G1GC is supposed to avoid STW. What are
> >>> your thoughts?
> >>
> >> G1 does not completely eliminate stop-the-world.
> >>
> >> One of the little details of G1GC operation concerns something called
> >> humongous objects.
> >>
> >> Any object larger than half the G1 region size is classified as
> >> humongous. These objects are allocated directly in the old region, and
> >> the only way they can be collected is during a full garbage collection.
> >>
> >> The secret to stellar performance with G1 is to eliminate, as much as
> >> possible, full GC cycles ... because there will always be a long STW
> >> with a full G1GC, but G1's region-specific collectors operate almost
> >> entirely concurrently with the application.
> >>
> >> You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter
> >> in your GC tuning ... but be aware that the max region size is 32m,
> >> which means that when using G1, an object that is 16 megabytes or larger
> >> will always be humongous. It is my understanding that LTR models can be
> >> many megabytes in size, but I have never used the feature myself.
> >>
> >> If you are running on Java 11 or later, I recommend giving ZGC a try.
> >> This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:
> >>
> >> GC_TUNE=" \
> >>   -XX:+UnlockExperimentalVMOptions \
> >>   -XX:+UseZGC \
> >>   -XX:+ParallelRefProcEnabled \
> >>   -XX:+ExplicitGCInvokesConcurrent \
> >>   -XX:+AlwaysPreTouch \
> >>   -XX:+UseNUMA \
> >> "
> >>
> >> ZGC promises extremely short GC pauses with ANY size heap, even
> >> terabytes. I haven't tested it with a large heap myself, but in my
> >> limited testing, its individual pauses were MUCH shorter than what I saw
> >> with G1. Throughput is lower than G1, but latency is AWESOME.
> >>
> >> One bit of warning ... ZGC always uses 64-bit pointers, so the advice
> >> you'll commonly see recommending a heap size below 32GB does not apply
> >> to ZGC. There is no advantage to a 31GB heap compared to a 32GB heap
> >> when using ZGC.
> >>
> >> Thanks,
> >> Shawn
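P.S. For reference, this is roughly the G1 variant of GC_TUNE I tested in /etc/default/solr.in.sh before trying ZGC. It is just a sketch: heap-size flags are omitted, and apart from the 32m region size the flags shown are the usual Solr G1 defaults rather than anything specific to our setup.

GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:G1HeapRegionSize=32m \
  -XX:+ParallelRefProcEnabled \
  -XX:+AlwaysPreTouch \
"

With GC logging enabled (-Xlog:gc*), the per-collection "Humongous regions: X->Y" lines should show whether humongous allocations are piling up while the LTR queries run.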