Re: Lost node again.

Denis Magda Wed, 19 Aug 2020 20:35:20 -0700

John,

I would try to get to the bottom of the issue, especially if the case is
reproducible.


If it's not GC, then check whether it's the I/O (your logs show that the
checkpointing rate is high):

   - You can monitor checkpointing duration with a JMX tool
   <https://www.gridgain.com/docs/latest/administrators-guide/monitoring-metrics/metrics#monitoring-checkpointing-operations>
   or Control Center
   <https://www.gridgain.com/docs/control-center/latest/monitoring/metrics#checkpoint-duration>.
   - Configure write throttling
   <https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning#pages-writes-throttling>
   if the checkpointing buffer fills up quickly.
   - Ideally, storage files and WALs should be stored on different SSD media
   <https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning#keep-wals-separately>.
   SSDs also do their own garbage collection, and you might hit it frequently.
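
A rough sketch of the last two points as a Spring XML fragment, in case it
helps. The property names (writeThrottlingEnabled, storagePath, walPath,
walArchivePath) are the ones exposed by DataStorageConfiguration; the /mnt/...
paths are placeholders made up for illustration, so substitute your own mount
points and merge this into your existing dataStorageConfiguration bean rather
than pasting it as-is:

```xml
<property name="dataStorageConfiguration">
  <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
    <!-- Slow writers down gradually instead of stalling them when the
         checkpoint buffer fills up. -->
    <property name="writeThrottlingEnabled" value="true"/>

    <!-- Hypothetical mount points: data files on one device, WAL on another. -->
    <property name="storagePath" value="/mnt/ssd1/ignite/storage"/>
    <property name="walPath" value="/mnt/ssd2/ignite/wal"/>
    <property name="walArchivePath" value="/mnt/ssd2/ignite/wal-archive"/>

    <!-- Your existing defaultDataRegionConfiguration stays here unchanged. -->
  </bean>
</property>
```

If the nodes already hold persisted data, changing these paths most likely
means moving the existing files as well, so try it on a test node first.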

As for the failureDetectionTimeout, I would set it to 15 secs until your
cluster is battle-tested and well-tuned for your use case.
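
For reference, a minimal sketch of where that setting lives, assuming the usual
Spring XML setup (the bean id is only an example). Values are in milliseconds.
clientFailureDetectionTimeout is the separate property that applies to client
nodes; the 30000 shown is simply its default as far as I know, not a
recommendation:

```xml
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
  <!-- 15 seconds for server nodes (the default is 10000). -->
  <property name="failureDetectionTimeout" value="15000"/>

  <!-- Client nodes are governed by their own timeout. -->
  <property name="clientFailureDetectionTimeout" value="30000"/>

  <!-- The rest of your existing configuration goes here. -->
</bean>
```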

-
Denis


On Tue, Aug 18, 2020 at 10:37 AM John Smith <[email protected]> wrote:

> I don't see why we would get such a huge pause. In fact, I have provided GC
> logs before and we found nothing...
>
> All operations on the "big" partitioned cache with 3 million entries are
> puts or gets, plus a query on another cache which has 450 entries. There
> are no other caches.
>
> The nodes all have 6G on heap and 26G off heap.
>
> I think it could be IO related, but I can't seem to correlate it to IO. I
> saw some heavy IO usage, but the node failed well after that.
>
> Now my question is: should I set the failure detection timeout to 60s just
> for the sake of trying it? Isn't that too high? If I set the servers to
> 60s, how high should I set the clients?
>
> On Tue., Aug. 18, 2020, 7:32 a.m. Ilya Kasnacheev, <
> [email protected]> wrote:
>
>> Hello!
>>
>> [13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company]
>> Possible too long JVM pause: 41779 milliseconds.
>>
>> It seems that you have an overly long full GC. Either make sure it does
>> not happen, or increase failureDetectionTimeout to be longer than any
>> expected GC pause.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> On Mon, Aug 17, 2020 at 17:51, John Smith <[email protected]> wrote:
>>
>>> Hi guys, it seems every couple of weeks we lose a node... Here are the
>>> logs:
>>> https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0
>>>
>>> And some extra details. Maybe I need to do more tuning than what is
>>> already mentioned below, or maybe set a higher timeout?
>>>
>>> 3 server nodes and 9 clients (client = true)
>>>
>>> Performance-wise, the cluster is not handling any kind of high volume. On
>>> average it does about 15-20 puts/gets/queries (in any combination) per
>>> 30-60 seconds.
>>>
>>> The biggest cache we have holds 3 million records, distributed with 1
>>> backup, using the following template.
>>>
>>>           <bean id="cache-template-bean" abstract="true"
>>>                 class="org.apache.ignite.configuration.CacheConfiguration">
>>>             <!-- when you create a template via XML configuration,
>>>             you must add an asterisk to the name of the template -->
>>>             <property name="name" value="partitionedTpl*"/>
>>>             <property name="cacheMode" value="PARTITIONED" />
>>>             <property name="backups" value="1" />
>>>             <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
>>>           </bean>
>>>
>>> Persistence is configured:
>>>
>>>       <property name="dataStorageConfiguration">
>>>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>           <!-- Redefining the default region's settings -->
>>>           <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>               <property name="persistenceEnabled" value="true"/>
>>>
>>>               <property name="name" value="Default_Region"/>
>>>               <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>             </bean>
>>>           </property>
>>>         </bean>
>>>       </property>
>>>
>>> We also followed the tuning instructions for GC and I/O:
>>>
>>> if [ -z "$JVM_OPTS" ] ; then
>>>     JVM_OPTS="-Xms6g -Xmx6g -server -XX:MaxMetaspaceSize=256m"
>>> fi
>>>
>>> #
>>> # Uncomment the following GC settings if you see spikes in your
>>> # throughput due to Garbage Collection.
>>> #
>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>
>>> sysctl -w vm.dirty_writeback_centisecs=500
>>> sysctl -w vm.dirty_expire_centisecs=500
>>>
>>>
