To be clear, you have to stop passing Integer.MAX_VALUE as the value and use a reasonable one instead.
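For example, with SolrJ it could look something like this (just a rough sketch: the URL, collection and field names below are placeholders, and the limits are arbitrary; pick values that match what the application actually displays):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoundedFacetQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection -- adapt to your setup.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(50);                 // bounded rows instead of Integer.MAX_VALUE
            q.setFacet(true);
            q.addFacetField("category_s"); // placeholder facet field
            q.setFacetLimit(1000);         // bounded facet.limit instead of Integer.MAX_VALUE
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getFacetField("category_s").getValueCount() + " facet buckets returned");
        }
    }
}

The exact numbers don't matter much; the point is to keep rows and facet.limit bounded by what you actually need.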
On Thu, Jul 31, 2025 at 10:57 AM Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> Hi Antonio,
>
> Passing &facet.limit=Integer.MAX_VALUE or rows=Integer.MAX_VALUE might be
> the root of the issue you're encountering.
>
> What's likely happening is that the Java Virtual Machine (JVM), upon
> receiving such large parameter values, attempts to allocate an enormous
> amount of memory. This can lead to significant memory fragmentation, making
> it difficult for the garbage collector to function efficiently. As a
> result, overall performance may degrade or the system may become unstable.
>
> I've run into this problem multiple times with SolrCloud, where it often
> resulted in recurring OutOfMemoryError exceptions.
>
>
> On Thu, Jul 31, 2025 at 10:37 AM Antonio Nunziante <nunzia...@light-sf.com> wrote:
>
>> Dear Solr Community,
>>
>> I'm running Solr 8.11.1 in SolrCloud mode (3 nodes, 44GB heap each), and
>> I'm investigating a critical OutOfMemoryError.
>>
>> The GC logs show Solr attempting to allocate an object of 17179868936 bytes,
>> which I suspect corresponds to an object of size 2147483617 on a 64-bit JVM
>> (with an 8-byte word size): 2147483617 * 8 = 17179868936 bytes.
>>
>> In Java, Integer.MAX_VALUE is 2147483647, and the value 2147483617 is just a
>> little below it. It corresponds to MAX_ARRAY_LENGTH, the maximum safe Java
>> array size used internally by Lucene's ArrayUtil.oversize(), and probably
>> also in some other places in the Solr source code.
>>
>> /** Maximum length for an array (Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER). */
>> public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER;
>>
>> This leads to a 16GB allocation attempt on 64-bit JVMs, which is what
>> eventually triggers the OOM, more or less once per day on each node. Often,
>> when one node restarts, the others restart as well.
>>
>> Some details about our setup:
>>
>> * Linux Red Hat version 8.4
>> * OpenJDK 64-Bit Server VM (build 21+35-2513)
>> * Solr 8.11.1, 3 nodes, 44GB heap out of 64GB of total RAM (4GB of swap)
>> * G1GC, default parameters (-XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
>>   -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+UseG1GC
>>   -XX:+UseLargePages -XX:-OmitStackTraceInFastThrow -XX:MaxGCPauseMillis=250)
>>
>> Solr contains 42 collections, with 3 shards and 3 replicas each:
>>
>> * 21 collections are kept empty and are used as support when re-indexing
>>   (we index from scratch into an empty collection and then swap it with the
>>   currently active one by modifying aliases; the old one is then emptied)
>> * 21 collections contain documents, but only 6 of these handle most of the
>>   main search requests:
>>   * the number of documents per collection ranges from 20k to 70k
>>   * the average document size ranges from 10KB to 50KB
>>   * on disk the biggest shard is 750MB, and each node has a total of 5GB or
>>     6GB on disk
>>   * we have lots of dynamic fields (like *_s, *_b, *_d, etc.); each of these
>>     6 collections has from 20k to 40k such fields
>>
>> Requests are around 1000 per minute, mostly edismax queries, usually
>> retrieving fewer than 50 documents with 20 to 30 fields, some facets (around
>> 30 different fields), and some filters. We also sort on a couple of fields
>> per request (around 10 different fields overall).
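Just to sanity-check the arithmetic quoted above (a throwaway sketch; the 8-byte element size is an assumption for a 64-bit JVM):

public class AllocationMath {
    public static void main(String[] args) {
        long suspectedArrayLength = 2_147_483_617L; // value from the GC log analysis above
        long elementSize = 8L;                      // assumed 8-byte elements on a 64-bit JVM
        System.out.println(suspectedArrayLength * elementSize);       // 17179868936, the failed allocation request
        System.out.println(Integer.MAX_VALUE - suspectedArrayLength); // 30, the small gap below Integer.MAX_VALUE
    }
}

So the 17179868936B request in the GC log is indeed consistent with a single array of roughly MAX_ARRAY_LENGTH 8-byte elements, as described above.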
>>
>> Some of these requests send the &facet.limit parameter as Integer.MAX_VALUE,
>> but if this were the problem the OOMs should happen every minute, and that
>> is not our case (by the way, we are fixing this by forcing -1 instead of
>> MAX_VALUE).
>>
>> Here is the relevant solr_gc.log extract:
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Eden regions: 0->0(407)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Survivor regions: 0->0(87)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Old regions: 424->418
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Humongous regions: 512->512
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace: 99070K(100224K)->99002K(100224K) NonClass: 88782K(89472K)->88726K(89472K) Class: 10287K(10752K)->10276K(10752K)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Heap after GC invocations=689 (full 2):
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) garbage-first heap total 46137344K, used 30443099K [0x00007f5e8c000000, 0x00007f698c000000)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) region size 32768K, 0 young (0K), 0 survivors (0K)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace used 99002K, committed 100224K, reserved 1179648K
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) class space used 10276K, committed 10752K, reserved 1048576K
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Pause Full (G1 Compaction Pause) 29774M->29729M(45056M) 7798.255ms
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) User=47.22s Sys=0.05s Real=7.80s
>> [2025-07-30T07:56:35.977+0200][38364.361s] Attempt heap expansion (allocation request failed). Allocation request: 17179868936B
>> [2025-07-30T07:56:35.977+0200][38364.361s] Expand the heap. requested expansion amount: 17179868936B expansion amount: 17179869184B
>> [2025-07-30T07:56:35.977+0200][38364.361s] Did not expand the heap (heap already fully expanded)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Card Set Free Memory Task) (run: 14470.631ms) (cpu: 0.642ms)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run 14061.096ms after schedule)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run: 0.010ms) (cpu: 0.000ms)
>> [2025-07-30T07:56:35.989+0200][38364.373s] G1 Service Thread (Card Set Free Memory Task) (run 0.202ms after schedule)
>>
>> I need help identifying which part of Solr or Lucene could be responsible
>> for that allocation. We do not have millions of facet terms (I think the
>> maximum should be in the thousands) or unusually large result sets.
>>
>> If anyone can help by pointing to known causes, relevant classes, or
>> previous similar issues, it would be greatly appreciated.
>>
>> Thanks,
>> Antonio
>
> --
> Vincenzo D'Amore

--
Vincenzo D'Amore