Thank you, Vincenzo, for your answer.

We really appreciate your support, and we would like to ask you for a few more 
details if possible.

We'll try to avoid Integer.MAX_VALUE, but given that we pass these values on 
every search request, why doesn't this kind of problem occur more often during 
the day, rather than almost exactly once a day?

Does Solr try to allocate 16GB each time we pass Integer.MAX_VALUE? The heap is 
44GB, so it could hold only a few objects of that size (?)
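
As a quick back-of-the-envelope check of our own question (plain arithmetic on 
the numbers from the GC log below, not anything Solr-specific):

    public class HeapCapacityCheck {
        public static void main(String[] args) {
            long allocationRequest = 17_179_868_936L;          // failing allocation request from solr_gc.log (~16GB)
            long heapSize = 44L * 1024 * 1024 * 1024;           // 44GB heap per node
            System.out.println(heapSize / allocationRequest);   // 2 -> at most two such blocks could fit at once
        }
    }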

Thanks in advance,
Antonio


-----Original Message-----
From: Vincenzo D'Amore <v.dam...@gmail.com>
Sent: Thursday, 31 July 2025 11:01
To: users@solr.apache.org
Subject: Re: Recurring OutOfMemoryError in SolrCloud 8.11.1 - 16GB allocation failure

To be clear, you have to stop passing Integer.MAX_VALUE as the value and use a 
reasonable amount instead.
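
For example, assuming the requests are built with SolrJ (if you build raw URLs, 
the equivalent is simply a bounded value for &rows and &facet.limit; the field 
name below is just a placeholder), a bounded request could look like this:

    import org.apache.solr.client.solrj.SolrQuery;

    public class BoundedQueryExample {
        static SolrQuery buildQuery() {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(50);                  // instead of rows=Integer.MAX_VALUE
            q.setFacet(true);
            q.addFacetField("category_s");  // placeholder field name
            q.setFacetLimit(100);           // instead of facet.limit=Integer.MAX_VALUE
            return q;
        }
    }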

On Thu, Jul 31, 2025 at 10:57 AM Vincenzo D'Amore <v.dam...@gmail.com>
wrote:

> Hi Antonio,
>
> Passing &facet.limit=Integer.MAX_VALUE or rows=Integer.MAX_VALUE might 
> be the root of the issue you're encountering.
>
> What’s likely happening is that the Java Virtual Machine (JVM), upon 
> receiving such large parameter values, attempts to allocate an 
> enormous amount of memory. This can lead to significant memory 
> fragmentation, making it difficult for the garbage collector to 
> function efficiently. As a result, overall performance may degrade or the 
> system may become unstable.
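>
> Just to illustrate the general shape of the problem (this is not Solr's
> actual code, only a sketch of the pattern where a request parameter is
> used directly to size an array):
>
>   public class HugeAllocation {
>     public static void main(String[] args) {
>       // e.g. a limit taken straight from the request parameters
>       int limit = 2147483617;  // just below Integer.MAX_VALUE
>       // a long[] of this length needs roughly 16GB of contiguous heap in
>       // one go, the same size as the allocation request seen in the GC log
>       long[] counts = new long[limit];
>       System.out.println(counts.length);
>     }
>   }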
>
> I've run into this problem multiple times with SolrCloud, where it 
> often resulted in recurring OutOfMemoryError exceptions.
>
>
>
> On Thu, Jul 31, 2025 at 10:37 AM Antonio Nunziante 
> <nunzia...@light-sf.com>
> wrote:
>
>> Dear Solr Community,
>>
>>
>>
>> I'm running Solr 8.11.1 in SolrCloud mode (3 nodes, 44GB heap each), 
>> and I'm investigating a critical OutOfMemoryError.
>>
>>
>>
>> The GC logs show Solr attempting to allocate an object of 17179868936 
>> bytes, which I suspect corresponds to an array of 2147483617 elements 
>> on a 64-bit JVM (with an 8-byte word size): 2147483617 * 8 = 17179868936 
>> bytes.
>>
>> In Java, Integer.MAX_VALUE is 2147483647, and the value 2147483617 is 
>> just a little below it. This corresponds to "MAX_ARRAY_LENGTH", the 
>> maximum safe Java array size used internally by Lucene's 
>> ArrayUtil.oversize(), and probably also in some other places in the 
>> Solr source code.
>>
>>
>>
>>   /** Maximum length for an array (Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER). */
>>
>>   public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - RamUsageEstimator.NUM_BYTES_ARRAY_HEADER;
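>>
>> Just as a sanity check of that arithmetic (plain Java, not Solr code):
>>
>>   public class ArraySizeCheck {
>>     public static void main(String[] args) {
>>       long suspectedLength = 2_147_483_617L;     // just below Integer.MAX_VALUE
>>       System.out.println(suspectedLength * 8L);  // 17179868936, matching the GC log allocation request
>>     }
>>   }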
>>
>>
>>
>> This leads to a 16GB allocation attempt on 64-bit JVMs, which is what 
>> eventually triggers the OOM, more or less once per day on each node. 
>> Often when one node restarts, the other nodes restart as well.
>>
>>
>>
>> Some details about our setup:
>>
>> *       Red Hat Enterprise Linux 8.4
>> *       OpenJDK 64-Bit Server VM (build 21+35-2513)
>> *       Solr 8.11.1, 3 nodes, 44GB heap over 64GB of total RAM (4GB of
>> swap)
>> *       G1GC, default parameters (-XX:+AlwaysPreTouch
>> -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled 
>> -XX:+PerfDisableSharedMem -XX:+UseG1GC -XX:+UseLargePages 
>> -XX:-OmitStackTraceInFastThrow -XX:MaxGCPauseMillis=250)
>>
>>
>>
>> Solr contains 42 collections, 3 shards and 3 replicas each:
>>
>> *       21 collections are kept empty and are used as support when
>> re-indexing (we index from scratch into an empty collection and then 
>> swap it with the currently active one by modifying aliases; the old one 
>> is then emptied)
>> *       21 collections contain documents, but of these only 6 are the most
>> used for the main search requests:
>>
>> *       number of documents per collection ranges from 20k to 70k
>> *       average document size ranges from 10KB to 50KB
>> *       on disk the biggest shard is 750MB, and each node has a total of
>> 5GB or 6GB on disk
>> *       we have lots of dynamic fields (like *_s, *_b, *_d, etc.); each of
>> these 6 collections has from 20k to 40k of these fields
>>
>>
>>
>> Requests are around 1000 per minute, mostly edismax queries, usually 
>> retrieving fewer than 50 documents with 20 to 30 fields, some facets 
>> (around 30 different fields), and some filters. We also sort on a couple 
>> of fields per request (drawn from around 10 different fields).
>>
>> Some of these requests send the &facet.limit parameter as 
>> Integer.MAX_VALUE, but if this were the problem the OOMs should happen 
>> every minute, which is not our case (by the way, we are fixing this by 
>> forcing -1 instead of MAX_VALUE).
>>
>>
>>
>> Here is the relevant solr_gc.log extract:
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Eden regions: 
>> 0->0(407)
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Survivor regions:
>> 0->0(87)
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Old regions: 
>> 424->418
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Humongous regions:
>> 512->512
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Metaspace:
>> 99070K(100224K)->99002K(100224K) NonClass: 
>> 88782K(89472K)->88726K(89472K)
>> Class: 10287K(10752K)->10276K(10752K)
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Heap after GC
>> invocations=689 (full 2):
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688)  garbage-first 
>> heap total 46137344K, used 30443099K [0x00007f5e8c000000, 
>> 0x00007f698c000000)
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688)   region size 32768K, 0
>> young (0K), 0 survivors (0K)
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688)  Metaspace       used
>> 99002K, committed 100224K, reserved 1179648K
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688)   class space    used
>> 10276K, committed 10752K, reserved 1048576K
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) Pause Full (G1 
>> Compaction
>> Pause) 29774M->29729M(45056M) 7798.255ms
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] GC(688) User=47.22s 
>> Sys=0.05s Real=7.80s
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] Attempt heap expansion 
>> (allocation request failed). Allocation request: 17179868936B
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] Expand the heap. requested 
>> expansion amount: 17179868936B expansion amount: 17179869184B
>>
>> [2025-07-30T07:56:35.977+0200][38364.361s] Did not expand the heap 
>> (heap already fully expanded)
>>
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread (Card 
>> Set Free Memory Task) (run: 14470.631ms) (cpu: 0.642ms)
>>
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread 
>> (Periodic GC
>> Task) (run 14061.096ms after schedule)
>>
>> [2025-07-30T07:56:35.979+0200][38364.363s] G1 Service Thread 
>> (Periodic GC
>> Task) (run: 0.010ms) (cpu: 0.000ms)
>>
>> [2025-07-30T07:56:35.989+0200][38364.373s] G1 Service Thread (Card 
>> Set Free Memory Task) (run 0.202ms after schedule)
>>
>>
>>
>> I need help identifying which part of Solr or Lucene could be 
>> responsible for that allocation. We do not have millions of facet 
>> terms (I think the max should be in the thousands) or unusually large 
>> result sets.
>>
>> If anyone can point to known causes, relevant classes, or similar 
>> previous issues, it would be greatly appreciated.
>>
>>
>>
>> Thanks,
>>
>> Antonio
>>
>>
>>
>>
>
> --
> Vincenzo D'Amore
>
>

--
Vincenzo D'Amore
