To be clear, you have to stop passing Integer.MAX_VALUE as the value and use a reasonable one instead.
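For example, with SolrJ it could look something like this (just a rough sketch: the URL, collection and field names below are placeholders, and the limits are arbitrary; pick values that match what the application actually displays):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoundedFacetQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection -- adapt to your setup.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(50);                 // bounded rows instead of Integer.MAX_VALUE
            q.setFacet(true);
            q.addFacetField("category_s"); // placeholder facet field
            q.setFacetLimit(1000);         // bounded facet.limit instead of Integer.MAX_VALUE
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getFacetField("category_s").getValueCount() + " facet buckets returned");
        }
    }
}

The exact numbers don't matter much; the point is to keep rows and facet.limit bounded by what you actually need.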
On Thu, Jul 31, 2025 at 10:57 AM Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> Hi Antonio,
>
> Passing &facet.limit=Integer.MAX_VALUE or rows=Integer.MAX_VALUE might be
> the root of the issue you're encountering.
>
> What's likely happening is that the Java Virtual Machine (JVM), upon
> receiving such large parameter values, attempts to allocate an enormous
> amount of memory. This can lead to significant memory fragmentation, making
> it difficult for the garbage collector to function efficiently. As a
> result, overall performance may degrade or the system may become unstable.
>
> I've run into this problem multiple times with SolrCloud, where it often
> resulted in recurring OutOfMemoryError exceptions.
>
>
> On Thu, Jul 31, 2025 at 10:37 AM Antonio Nunziante <nunzia...@light-sf.com> wrote:
>
>> Dear Solr Community,
>>
>> I'm running Solr 8.11.1 in SolrCloud mode (3 nodes, 44GB heap each), and
>> I'm investigating a critical OutOfMemoryError.
>>
>> The GC logs show Solr attempting to allocate an object of 17179868936 bytes,
>> which I suspect corresponds to an object of size 2147483617 on a 64-bit JVM
>> (with an 8-byte word size): 2147483617 * 8 = 17179868936 bytes.
>>
>> In Java, Integer.MAX_VALUE is 2147483647, and the value 2147483617 is just a
>> little below it. It corresponds to MAX_ARRAY_LENGTH, the maximum safe Java
>> array size used internally by Lucene's ArrayUtil.oversize(), and probably
>> also in some other places in the Solr source code.
>>
>> /** Maximum length for an array (Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER). */
>> public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER;
>>
>> This leads to a 16GB allocation attempt on 64-bit JVMs, which is what
>> eventually triggers the OOM, more or less once per day on each node. Often,
>> when one node restarts, the others restart as well.
>>
>> Some details about our setup:
>>
>> * Linux Red Hat version 8.4
>> * OpenJDK 64-Bit Server VM (build 21+35-2513)
>> * Solr 8.11.1, 3 nodes, 44GB heap out of 64GB of total RAM (4GB of swap)
>> * G1GC, default parameters (-XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
>>   -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+UseG1GC
>>   -XX:+UseLargePages -XX:-OmitStackTraceInFastThrow -XX:MaxGCPauseMillis=250)
>>
>> Solr contains 42 collections, with 3 shards and 3 replicas each:
>>
>> * 21 collections are kept empty and are used as support when re-indexing
>>   (we index from scratch into an empty collection and then swap it with the
>>   currently active one by modifying aliases; the old one is then emptied)
>> * 21 collections contain documents, but only 6 of these handle most of the
>>   main search requests:
>>   * the number of documents per collection ranges from 20k to 70k
>>   * the average document size ranges from 10KB to 50KB
>>   * on disk the biggest shard is 750MB, and each node has a total of 5GB or
>>     6GB on disk
>>   * we have lots of dynamic fields (like *_s, *_b, *_d, etc.); each of these
>>     6 collections has from 20k to 40k such fields
>>
>> Requests are around 1000 per minute, mostly edismax queries, usually
>> retrieving fewer than 50 documents with 20 to 30 fields, some facets (around
>> 30 different fields), and some filters. We also sort on a couple of fields
>> per request (around 10 different fields overall).
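Just to sanity-check the arithmetic quoted above (a throwaway sketch; the 8-byte element size is an assumption for a 64-bit JVM):

public class AllocationMath {
    public static void main(String[] args) {
        long suspectedArrayLength = 2_147_483_617L; // value from the GC log analysis above
        long elementSize = 8L;                      // assumed 8-byte elements on a 64-bit JVM
        System.out.println(suspectedArrayLength * elementSize);       // 17179868936, the failed allocation request
        System.out.println(Integer.MAX_VALUE - suspectedArrayLength); // 30, the small gap below Integer.MAX_VALUE
    }
}

So the 17179868936B request in the GC log is indeed consistent with a single array of roughly MAX_ARRAY_LENGTH 8-byte elements, as described above.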
>>
>> Some of these requests send the &facet.limit parameter as Integer.MAX_VALUE,
>> but if this were the problem the OOMs should happen every minute, and that
>> is not our case (by the way, we are fixing this by forcing -1 instead of
>> MAX_VALUE).
>>
>> Here is the relevant solr_gc.log extract:
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Eden regions: 0->0(407)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Survivor regions: 0->0(87)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Old regions: 424->418
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Humongous regions: 512->512
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace: 99070K(100224K)->99002K(100224K) NonClass: 88782K(89472K)->88726K(89472K) Class: 10287K(10752K)->10276K(10752K)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Heap after GC invocations=689 (full 2):
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) garbage-first heap total 46137344K, used 30443099K [0x00007f5e8c000000, 0x00007f698c000000)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) region size 32768K, 0 young (0K), 0 survivors (0K)
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace used 99002K, committed 100224K, reserved 1179648K
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) class space used 10276K, committed 10752K, reserved 1048576K
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Pause Full (G1 Compaction Pause) 29774M->29729M(45056M) 7798.255ms
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) User=47.22s Sys=0.05s Real=7.80s
>> [2025-07-30T07:56:35.977+0200][38364.361s] Attempt heap expansion (allocation request failed). Allocation request: 17179868936B
>> [2025-07-30T07:56:35.977+0200][38364.361s] Expand the heap. requested expansion amount: 17179868936B expansion amount: 17179869184B
>> [2025-07-30T07:56:35.977+0200][38364.361s] Did not expand the heap (heap already fully expanded)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Card Set Free Memory Task) (run: 14470.631ms) (cpu: 0.642ms)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run 14061.096ms after schedule)
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Periodic GC Task) (run: 0.010ms) (cpu: 0.000ms)
>> [2025-07-30T07:56:35.989+0200][38364.373s] G1 Service Thread (Card Set Free Memory Task) (run 0.202ms after schedule)
>>
>> I need help identifying which part of Solr or Lucene could be responsible
>> for that allocation. We do not have millions of facet terms (I think the
>> maximum should be in the thousands) or unusually large result sets.
>>
>> If anyone can help by pointing to known causes, relevant classes, or
>> previous similar issues, it would be greatly appreciated.
>>
>> Thanks,
>> Antonio
>
> --
> Vincenzo D'Amore

--
Vincenzo D'Amore