I'm glad I double-checked: we do have the logs from the last 5 minutes before
the heap dump.

Most of the log lines (59265 of 68296) are "Client node outbound message
queue size exceeded slowClientQueueLimit, the client will be dropped
(consider changing 'slowClientQueueLimit' configuration property)", so I
would say that the server had some network issues.
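
A count like the one above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name is hypothetical, the log file path must be passed as an argument, and the match is a plain substring search on the warning text quoted above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class SlowClientLogCount {
    // Substring of the warning quoted above; a plain substring match is assumed to be enough.
    static final String MARKER = "exceeded slowClientQueueLimit";

    // Returns how many of the given log lines contain the slow-client warning.
    static long countSlowClientWarnings(List<String> lines) {
        return lines.stream().filter(line -> line.contains(MARKER)).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SlowClientLogCount <log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.printf("%d of %d lines are slow-client warnings%n",
                countSlowClientWarnings(lines), lines.size());
    }
}
```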

But if, as you say, the data structure is related to recovering from
failures, then once the server drops the client that structure should be
freed, shouldn't it?
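
To make the question concrete, here is a toy model of the behaviour I would expect. It is not Ignite's code: the class and method names are made up, and it only mimics the shape seen in the heap dump (a ConcurrentHashMap like `recoveryDescs` whose values hold a queue like `msgReqs`). In this model, messages stay reachable until they are acknowledged or until the whole entry is removed from the map.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only; hypothetical names, not the Ignite implementation.
class RecoveryQueueModel {
    // nodeId -> messages sent but not yet acknowledged.
    private final Map<UUID, ArrayDeque<byte[]>> unacked = new ConcurrentHashMap<>();

    // Remember a sent message so it could be resent after a reconnect.
    void send(UUID nodeId, byte[] msg) {
        ArrayDeque<byte[]> queue = unacked.computeIfAbsent(nodeId, id -> new ArrayDeque<>());
        synchronized (queue) {
            queue.addLast(msg);
        }
    }

    // The peer confirmed `count` messages: the oldest ones can be released.
    void ack(UUID nodeId, int count) {
        ArrayDeque<byte[]> queue = unacked.get(nodeId);
        if (queue == null) {
            return;
        }
        synchronized (queue) {
            for (int i = 0; i < count && !queue.isEmpty(); i++) {
                queue.pollFirst();
            }
        }
    }

    // Dropping the client removes the entry, so its queue becomes unreachable.
    void dropClient(UUID nodeId) {
        unacked.remove(nodeId);
    }

    // Total payload bytes still strongly referenced through the map.
    long retainedBytes() {
        long total = 0;
        for (ArrayDeque<byte[]> queue : unacked.values()) {
            synchronized (queue) {
                for (byte[] msg : queue) {
                    total += msg.length;
                }
            }
        }
        return total;
    }
}
```

If the real structure behaves like `dropClient` here, the memory should be released when the client is dropped; if the entry stays in the map, everything queued for that client remains on the heap.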

I don't know. Maybe we need to configure some kind of limit in the server
to avoid this situation.

On Tue, 30 Nov 2021 at 15:12, Eduard Llull Pou (<
eduard.ll...@bluekiri.com>) wrote:

> Hi Stephen,
>
> I have not gathered the logs produced around the time I generated the
> memory dump. I will dump the memory again when we reach the 5GB warning
> threshold and also I'll gather the log files in the server so we have all
> the related information.
>
> It should take less than a week to have another situation like this.
>
> On Tue, 30 Nov 2021 at 14:37, Stephen Darlington (<
> stephen.darling...@gridgain.com>) wrote:
>
>> I have not dug into the code, but judging from the property name, the
>> data structure is related to recovering from failures (recovery). Are these
>> out of memory errors happening around the time of other problems? Are you
>> seeing network issues? Do you see “long JVM pauses” in the logs?
>>
>> On 30 Nov 2021, at 12:28, Eduard Llull Pou <eduard.ll...@bluekiri.com>
>> wrote:
>>
>> Hi Ibrahim,
>>
>> We'll test it, but even if your suggested parameters reduce the number of
>> OOMs, the instance of the
>> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper class
>> will still retain a lot of memory, because the nodes of the `recoveryDescs`
>> ConcurrentHashMap are not weak references. As long as the nodes are
>> referenced by the ConcurrentHashMap, they won't be collected by the
>> garbage collector.
>>
>> A proper solution would be to find a way to reduce the number of entries
>> in the `recoveryDescs` ConcurrentHashMap.
>>
>> Going deeper, the values of the `recoveryDescs` ConcurrentHashMap are
>> instances of org.apache.ignite.internal.util.nio.GridNioRecoveryDescriptor,
>> which contain the `msgReqs` ArrayDeque, and most of the memory is retained
>> by the elements of that ArrayDeque. I see that the elements of the
>> `msgReqs` ArrayDeque are instances
>> of org.apache.ignite.internal.util.nio.GridNioServer$WriteRequestImpl.
>>
>> <image.png>
>>
>> On Tue, 30 Nov 2021 at 12:44, Ibrahim Altun (<
>> ibrahim.al...@segmentify.com>) wrote:
>>
>>> Hi,
>>>
>>> We faced the same problem for a long time;
>>> https://medium.com/@hoan.nguyen.it/how-did-g1gc-tuning-flags-affect-our-back-end-web-app-c121d38dfe56
>>> helped a lot in solving it on our side. We added the following GC
>>> parameters and that solved the problem in our case:
>>>
>>> -XX:ParallelGCThreads=6 -XX:ConcGCThreads=2 -XX:MaxGCPauseMillis=200
>>> -XX:InitiatingHeapOccupancyPercent=40
>>>
>>>
>>>
>>> On Tue, 30 Nov 2021 at 14:22, Eduard Llull Pou <
>>> eduard.ll...@bluekiri.com> wrote:
>>>
>>>> Hello Igniters,
>>>>
>>>> We have an Apache Ignite 2.10.0 cluster with several server nodes and a
>>>> bunch of thick client nodes. At least once a week, at least one of the
>>>> server nodes crashes because of a "java.lang.OutOfMemoryError:
>>>> Java heap space".
>>>>
>>>> The server JVMs are started with the ignite.sh script, setting:
>>>> JVM_OPTS=-server -Xms6g -Xmx6g -XX:+AlwaysPreTouch -XX:+UseG1GC
>>>> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
>>>> -Djava.net.preferIPv4Stack=true
>>>>
>>>> This is the heap usage of one of the servers
>>>> <image.png>
>>>>
>>>> Strangely, not all servers have this memory usage. Most of them never
>>>> go above 4.5GB of heap.
>>>>
>>>> I have a memory dump of one of the servers, taken after it had stayed at
>>>> 5GB of heap usage for several minutes. Using the Eclipse Memory Analyzer
>>>> I can see that, of the 3.8GB of live heap, 3.3GB are allocated in an
>>>> instance of
>>>> the org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper
>>>> class.
>>>>
>>>> <image.png>
>>>>
>>>> And almost all of the 3.3GB of that GridNioServerWrapper instance is
>>>> retained by the nodes of the recoveryDescs ConcurrentHashMap:
>>>> <image.png>
>>>>
>>>> Is there anything we can configure to avoid this map growing that
>>>> large? Is it a bug?
>>>>
>>>> I'm assuming that the ~2GB difference between the memory dump size
>>>> (3.8GB) and the Xmx value (6GB) consists of short-lived objects, which
>>>> don't appear in the dump because we used the `jmap -dump:live,...`
>>>> command to generate it.
>>>>
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> *Eduard Llull* | Technical Architect
>>>> eduard.ll...@bluekiri.com | +34 971925981
>>>>
>>>> *Bluekiri*
>>>> https://bluekiri.com
>>>> Blaise Pascal, ParcBit - Edificio Europa, bajos 07121 Palma (Spain)
>>>> <https://cloud.bluekiri.com/>
>>>> <https://cloud.withgoogle.com/partners/detail/?id=CIGAgICAgICzQg%3D%3D&language=en>
>>>>
>>>> <https://medium.com/bluekiri/bluekiri-is-now-silver-microsoft-partner-69887ad25d82>
>>>>
>>>> <https://medium.com/bluekiri/announcing-iso-27001-certification-b0923982441>
>>>> This email may be confidential and privileged. If you received this
>>>> communication by mistake, please don't forward it to anyone else, please
>>>> erase all copies and attachments, and please let me know that it has gone
>>>> to the wrong person. The above terms reflect a potential business
>>>> arrangement, are provided solely as a basis for further discussion, and are
>>>> not intended to be and do not constitute a legally binding obligation. No
>>>> legally binding obligations will be created, implied, or inferred until an
>>>> agreement in final form is executed in writing by all parties involved.
>>>>
>>>>
>>>
>>> --
>>> İbrahim Halil Altun | Senior Software Engineer
>>> +90 536 3327510 • segmentify.com <https://www.segmentify.com/>
>>> UK • Germany • Turkey
>>>
>>
>>
>

-- 

*Eduard Llull* | Technical Architect
eduard.ll...@bluekiri.com | +34 971925981

*Bluekiri*
https://bluekiri.com
Blaise Pascal, ParcBit - Edificio Europa, bajos 07121 Palma (Spain)
