Dear Solr Community,
I'm running Solr 8.11.1 in SolrCloud mode (3 nodes, 44GB heap each), and I'm investigating a critical OutOfMemoryError. The GC logs show Solr attempting to allocate an object of 17179868936 bytes, which I suspect corresponds to an array of 2147483617 8-byte elements on a 64-bit JVM: 2147483617 * 8 = 17179868936 bytes. In Java, Integer.MAX_VALUE is 2147483647, and 2147483617 is just below it. This points to MAX_ARRAY_LENGTH, the maximum safe Java array size used internally by Lucene's ArrayUtil.oversize(), and probably in other places in the Solr source code as well:

/** Maximum length for an array (Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER). */
public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER;

Growing an array to that length leads to a ~16GB allocation attempt on 64-bit JVMs, which is what eventually triggers the OOM, roughly once per day on each node. Often when one node restarts, the others restart as well.
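To make the arithmetic above easy to check, here is a standalone sketch (nothing Solr-specific; the class name is mine, and the exact array-header size depends on the JVM, so I only verify how close the element count is to Integer.MAX_VALUE):

```java
// Sanity check of the allocation-size arithmetic from the GC log.
public class AllocationMath {
    public static void main(String[] args) {
        long requestedBytes = 17_179_868_936L;      // allocation request from solr_gc.log
        long elements = requestedBytes / 8;         // assuming 8-byte elements (long[] or uncompressed refs)
        System.out.println(elements);               // 2147483617
        // Distance from Integer.MAX_VALUE; MAX_ARRAY_LENGTH subtracts the
        // JVM-dependent array header size, which lands in this neighborhood.
        System.out.println(Integer.MAX_VALUE - elements); // 30
    }
}
```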
Some details about our setup:
* Linux Red Hat version 8.4
* OpenJDK 64-Bit Server VM (build 21+35-2513)
* Solr 8.11.1, 3 nodes, 44GB heap out of 64GB of total RAM (plus 4GB of swap)
* G1GC, default parameters (-XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+UseG1GC -XX:+UseLargePages -XX:-OmitStackTraceInFastThrow -XX:MaxGCPauseMillis=250)

Solr contains 42 collections, each with 3 shards and 3 replicas:
* 21 collections are kept empty and used as support when re-indexing (we index from scratch into an empty collection, swap it with the currently active one by modifying aliases, and then empty the old one)
* 21 collections contain documents, but only 6 of these serve the main search requests:
  * the number of documents per collection ranges from 20k to 70k
  * average document size ranges from 10KB to 50KB
  * on disk the biggest shard is 750MB, and each node holds 5GB to 6GB in total
  * we have lots of dynamic fields (like *_s, *_b, *_d, etc.); each of these 6 collections has 20k to 40k of them

Requests run at around 1000 per minute, mostly edismax queries, usually retrieving fewer than 50 documents with 20 to 30 fields, some facets (around 30 different fields), and some filters, plus sorting on a couple of fields (around 10 different fields). Some of these requests send the &facet.limit parameter as Integer.MAX_VALUE, but if that were the problem the OOMs should happen every minute, and that is not our case (in any event, we are fixing this by forcing -1 instead of MAX_VALUE).
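For the facet.limit fix, this is the kind of change we are deploying: a sketch of a solrconfig.xml fragment using Solr's standard invariants mechanism (the handler name /select reflects our setup; adapt as needed):

```xml
<!-- Sketch: clamp facet.limit server-side so clients cannot send Integer.MAX_VALUE.
     Note facet.limit=-1 means "unlimited" to Solr, matching the fix described above. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <int name="facet.limit">-1</int>
  </lst>
</requestHandler>
```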
Here is the relevant solr_gc.log extract:

[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Eden regions: 0->0(407)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Survivor regions: 0->0(87)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Old regions: 424->418
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Humongous regions: 512->512
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace: 99070K(100224K)->99002K(100224K) NonClass: 88782K(89472K)->88726K(89472K) Class: 10287K(10752K)->10276K(10752K)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Heap after GC invocations=689 (full 2):
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) garbage-first heap total 46137344K, used 30443099K [0x00007f5e8c000000, 0x00007f698c000000)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) region size 32768K, 0 young (0K), 0 survivors (0K)
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace used 99002K, committed 100224K, reserved 1179648K
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) class space used 10276K, committed 10752K, reserved 1048576K
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Pause Full (G1 Compaction Pause) 29774M->29729M(45056M) 7798.255ms
[2025-07-30T07:56:35.977+0200][38364.361s] GC(688) User=47.22s Sys=0.05s Real=7.80s
[2025-07-30T07:56:35.977+0200][38364.361s] Attempt heap expansion (allocation request failed). Allocation request: 17179868936B
[2025-07-30T07:56:35.977+0200][38364.361s] Expand the heap. requested expansion amount: 17179868936B expansion amount: 17179869184B
[2025-07-30T07:56:35.977+0200][38364.361s] Did not expand the heap (heap already fully expanded)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Card Set Free Memory Task) (run: 14470.631ms) (cpu: 0.642ms)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run 14061.096ms after schedule)
[2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run: 0.010ms) (cpu: 0.000ms)
[2025-07-30T07:56:35.989+0200][38364.373s] G1 Service Thread (Card Set Free Memory Task) (run 0.202ms after schedule)

I need help identifying which part of Solr or Lucene could be responsible for that allocation. We do not have millions of facet terms (I think the maximum should be in the thousands) or unusually large result sets. If anyone can point to known causes, relevant classes, or previous similar issues, it would be greatly appreciated.

Thanks,
Antonio