Dear Solr Community,
I'm running Solr 8.11.1 in SolrCloud mode (3 nodes, 44GB heap each), and I'm investigating a critical OutOfMemoryError. The GC logs show Solr attempting to allocate an object of 17179868936 bytes, which I suspect corresponds to an array of 2147483617 8-byte elements on a 64-bit JVM: 2147483617 * 8 = 17179868936 bytes. In Java, Integer.MAX_VALUE is 2147483647, and 2147483617 is just below it. This points to MAX_ARRAY_LENGTH, the maximum safe Java array size used internally by Lucene's ArrayUtil.oversize(), and probably in other places in the Solr source code as well:

/** Maximum length for an array (Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER). */
public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER;

Growing an array to that length leads to a ~16GB allocation attempt on 64-bit JVMs, which is what eventually triggers the OOM, roughly once per day on each node. Often when one node restarts, the others restart as well.
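To make the arithmetic above easy to check, here is a standalone sketch (nothing Solr-specific; the class name is mine, and the exact array-header size depends on the JVM, so I only verify how close the element count is to Integer.MAX_VALUE):

```java
// Sanity check of the allocation-size arithmetic from the GC log.
public class AllocationMath {
    public static void main(String[] args) {
        long requestedBytes = 17_179_868_936L;      // allocation request from solr_gc.log
        long elements = requestedBytes / 8;         // assuming 8-byte elements (long[] or uncompressed refs)
        System.out.println(elements);               // 2147483617
        // Distance from Integer.MAX_VALUE; MAX_ARRAY_LENGTH subtracts the
        // JVM-dependent array header size, which lands in this neighborhood.
        System.out.println(Integer.MAX_VALUE - elements); // 30
    }
}
```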
Some details about our setup:
* Linux Red Hat version 8.4
* OpenJDK 64-Bit Server VM (build 21+35-2513)
* Solr 8.11.1, 3 nodes, 44GB heap out of 64GB of total RAM (plus 4GB of swap)
* G1GC, default parameters (-XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+UseG1GC -XX:+UseLargePages -XX:-OmitStackTraceInFastThrow -XX:MaxGCPauseMillis=250)

Solr contains 42 collections, each with 3 shards and 3 replicas:
* 21 collections are kept empty and used as support when re-indexing (we index from scratch into an empty collection, swap it with the currently active one by modifying aliases, and then empty the old one)
* 21 collections contain documents, but only 6 of these serve the main search requests:
  * the number of documents per collection ranges from 20k to 70k
  * average document size ranges from 10KB to 50KB
  * on disk the biggest shard is 750MB, and each node holds 5GB to 6GB in total
  * we have lots of dynamic fields (like *_s, *_b, *_d, etc.); each of these 6 collections has 20k to 40k of them

Requests run at around 1000 per minute, mostly edismax queries, usually retrieving fewer than 50 documents with 20 to 30 fields, some facets (around 30 different fields), and some filters, plus sorting on a couple of fields (around 10 different fields). Some of these requests send the &facet.limit parameter as Integer.MAX_VALUE, but if that were the problem the OOMs should happen every minute, and that is not our case (in any event, we are fixing this by forcing -1 instead of MAX_VALUE).
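For the facet.limit fix, this is the kind of change we are deploying: a sketch of a solrconfig.xml fragment using Solr's standard invariants mechanism (the handler name /select reflects our setup; adapt as needed):

```xml
<!-- Sketch: clamp facet.limit server-side so clients cannot send Integer.MAX_VALUE.
     Note facet.limit=-1 means "unlimited" to Solr, matching the fix described above. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <int name="facet.limit">-1</int>
  </lst>
</requestHandler>
```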
Here is the relevant solr_gc.log extract:

[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Eden regions: 0->0(407)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Survivor regions: 0->0(87)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Old regions: 424->418
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Humongous regions: 512->512
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace: 99070K(100224K)->99002K(100224K) NonClass: 88782K(89472K)->88726K(89472K) Class: 10287K(10752K)->10276K(10752K)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Heap after GC invocations=689 (full 2):
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) garbage-first heap total 46137344K, used 30443099K [0x00007f5e8c000000, 0x00007f698c000000)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) region size 32768K, 0 young (0K), 0 survivors (0K)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace used 99002K, committed 100224K, reserved 1179648K
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) class space used 10276K, committed 10752K, reserved 1048576K
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Pause Full (G1 Compaction Pause) 29774M->29729M(45056M) 7798.255ms
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) User=47.22s Sys=0.05s Real=7.80s
[2025-07-30T07:56:35.977+0200][38364.361s] Attempt heap expansion (allocation request failed). Allocation request: 17179868936B
[2025-07-30T07:56:35.977+0200][38364.361s] Expand the heap. requested expansion amount: 17179868936B expansion amount: 17179869184B
[2025-07-30T07:56:35.977+0200][38364.361s] Did not expand the heap (heap already fully expanded)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Card Set Free Memory Task) (run: 14470.631ms) (cpu: 0.642ms)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run 14061.096ms after schedule)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run: 0.010ms) (cpu: 0.000ms)
[2025-07-30T07:56:35.989+0200][38364.373s] G1 Service Thread (Card Set Free Memory Task) (run 0.202ms after schedule)

I need help identifying which part of Solr or Lucene could be responsible for that allocation. We do not have millions of facet terms (I think the maximum should be in the thousands) or unusually large result sets. If anyone can point to known causes, relevant classes, or previous similar issues, it would be greatly appreciated.

Thanks,
Antonio