reRankDocs is set to 1000. I would try with a lower number, like 100. If the best match is not in the top 100 documents, something is wrong with the base relevance algorithm.
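For reference, this is just the reRankDocs local parameter on the documented {!ltr} rerank query; a sketch of the request (the model name and efi key here are placeholders, not from this thread):

  rq={!ltr model=myModel reRankDocs=100 efi.user_query="solr oom"}&fl=id,score

If the best match only surfaces with reRankDocs=1000, the first-pass query is not getting it anywhere near the top, and reranking harder is just masking that.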
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 4, 2024, at 9:28 AM, rajani m <rajinima...@gmail.com> wrote:
>
> Thank you Shawn, that was very helpful. I tried the G1HeapRegionSize
> setting. I set it to 32m (-XX:G1HeapRegionSize=32m) and replayed the same
> query logs, but it didn't help; the same OOM error was reproduced.
>
> I was able to capture the heap dump when the heap was almost full, and I
> have the heap analysis report generated by MAT, uploaded here on my drive
> <https://drive.google.com/file/d/1j1ghQB-zezTu8dje5pWJQE5A0qzyZ1ro/view?usp=sharing>.
> Whenever you can, could you please take a look and let me know your
> thoughts? Although the issue is reproducible only when the query uses LTR
> as the reranker, the report seems to imply that the core issue originates
> in the main libraries. Let me know what you think.
>
> I will test with ZGC and see if it can prevent the STW old-generation
> full GC, and will let you know.
>
> Thanks,
> Rajani
>
>
> On Thu, Jan 4, 2024 at 11:20 AM Shawn Heisey <apa...@elyograg.org.invalid>
> wrote:
>
>> On 1/3/24 13:33, rajani m wrote:
>>> A Solr query with LTR as a re-ranker is suddenly using the full heap
>>> and triggering STW pauses. Could you please take a look and let me know
>>> your thoughts? What is causing this? The STW pauses are putting nodes
>>> in an unhealthy state, causing them to restart and bringing the entire
>>> cluster down.
>>>
>>> As per the logs, the issue seems to be related to LTR generating
>>> features at query time. The model has 12 features; most are Solr
>>> queries and a few are field values. The error from the logs is copied
>>> below [2]. I'd say this is a major bug, as G1GC is supposed to avoid
>>> STW. What are your thoughts?
>>
>> G1 does not completely eliminate stop-the-world pauses.
>>
>> One of the little details of G1GC operation concerns something called
>> humongous objects.
>>
>> Any object larger than half the G1 region size is classified as
>> humongous. These objects are allocated directly in the old generation,
>> and the only way they can be collected is during a full garbage
>> collection.
>>
>> The secret to stellar performance with G1 is to eliminate full GC
>> cycles as much as possible, because there will always be a long STW
>> pause with a full G1GC, while G1's region-specific collectors operate
>> almost entirely concurrently with the application.
>>
>> You can set the G1 region size with the -XX:G1HeapRegionSize parameter
>> in your GC tuning, but be aware that the max region size is 32m, which
>> means that when using G1, an object of 16 megabytes or larger will
>> always be humongous. It is my understanding that LTR models can be many
>> megabytes in size, but I have never used the feature myself.
>>
>> If you are running on Java 11 or later, I recommend giving ZGC a try.
>> This is the tuning I use in /etc/default/solr.in.sh with OpenJDK 17:
>>
>> GC_TUNE=" \
>>   -XX:+UnlockExperimentalVMOptions \
>>   -XX:+UseZGC \
>>   -XX:+ParallelRefProcEnabled \
>>   -XX:+ExplicitGCInvokesConcurrent \
>>   -XX:+AlwaysPreTouch \
>>   -XX:+UseNUMA \
>> "
>>
>> ZGC promises extremely short GC pauses with ANY size heap, even
>> terabytes. I haven't tested it with a large heap myself, but in my
>> limited testing its individual pauses were MUCH shorter than what I saw
>> with G1. Throughput is lower than G1, but latency is AWESOME.
>>
>> One bit of warning: ZGC always uses 64-bit pointers, so the advice
>> you'll commonly see recommending a heap size below 32GB does not apply
>> to ZGC. There is no advantage to a 31GB heap compared to a 32GB heap
>> when using ZGC.
>>
>> Thanks,
>> Shawn
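A quick way to test the humongous-object theory before switching collectors is to turn on GC logging and look for humongous allocations. A minimal sketch for solr.in.sh, assuming JDK 9+ unified logging; the log path and rotation values are placeholders to adjust for your install:

  GC_LOG_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M"

Replay the failing LTR queries, then grep the log for "humongous". If G1 reports collections with a cause like "G1 Humongous Allocation", that would confirm oversized LTR feature/model objects are forcing the full GCs.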