Re: Ignite cluster stability problems under heavy load

Piotr Jagielski Mon, 18 Oct 2021 02:10:09 -0700

Hi again,

We managed to stabilize heap a bit for now, by tuning some general andcache configuration:


- enabled direct I/O

- enabled write throttling

- changed writeSynchronizationMode to FULL_ASYNC

We have now 5 days without peaks in heap (stabilized at around 4GB) andGC pauses (< 10ms). Update speed (dataStreamer) is around 5K/sec.


In the logs we have following entries:

2021-10-18 10:59:42 INFO Throttling is applied to page modifications[percentOfPartTime=0.64, markDirty=2120 pages/sec, checkpointWrite=1926pages/sec, estIdealMarkDirty=0 pages/sec, curDirty=0.02, maxDirty=0.13,avgParkTime=13575716 ns, pages: (total=382258, evicted=0,written=108269, synced=0, cpBufUsed=10503, cpBufTotal=259107)]2021-10-18 10:59:52 INFO Throttling is applied to page modifications[percentOfPartTime=0.60, markDirty=2367 pages/sec, checkpointWrite=2151pages/sec, estIdealMarkDirty=0 pages/sec, curDirty=0.04, maxDirty=0.22,avgParkTime=12109253 ns, pages: (total=382258, evicted=0,written=204076, synced=0, cpBufUsed=11903, cpBufTotal=259107)]2021-10-18 11:00:02 INFO Throttling is applied to page modifications[percentOfPartTime=0.25, markDirty=2577 pages/sec, checkpointWrite=2340pages/sec, estIdealMarkDirty=0 pages/sec, curDirty=0.07, maxDirty=0.31,avgParkTime=4708297 ns, pages: (total=382258, evicted=0, written=298971,synced=0, cpBufUsed=8287, cpBufTotal=259107)]

We also observed that fsync phase of checkpointing duration dropped from2-5 secs to under 20 millis.


So, is the throttling the thing that helped? Is our disk too slow?

W dniu 08.10.2021 o 08:44, Piotr Jagielski pisze:

Hi again,
Any advice? We're really struggling with our cluster stability - wehad to turn on throttling before sending data to DataStreamer, butproblems still happen.
In DataStreamer javadoc I found this:
perNodeParallelOperations(int) - sometimes data may be added to thedata streamer via addData(Object, Object) method faster than it can beput in cache. In this case, new buffered stream messages are sent toremote nodes before responses from previous ones are received. Thiscould cause unlimited heap memory utilization growth on local andremote nodes. To control memory utilization, this setting limitsmaximum allowed number of parallel buffered stream messages that arebeing processed on remote nodes. If this number is exceeded, thenaddData(Object, Object) method will block to control memoryutilization. Default is equal to CPU count on remote node multiply byDFLT_PARALLEL_OPS_MULTIPLIER.
This could be the case - we have unlimited heap memory utilizationgrowth, on histogram I see GridDhtAtomicSingleUpdateRequest andGridNearAtomicUpdateResponse classes.
How can we check that "data may be added... faster than it can be putin cache"? Is there any growing metric exposed via JMX to check?
Regarding HDD I managed to run hdparm:

/dev/sda1:
 Timing cached reads:   15036 MB in  2.00 seconds = 7525.21 MB/sec
 Timing buffered disk reads: 2664 MB in  3.00 seconds = 887.36 MB/sec

Regards,

Piotr


On 2021/10/06 12:26:07, Piotr Jagielski wrote:
> OK I managed to take a larger heap histogram - attached
>
> Also I found WARNs about long running cache futures:
>
> 2021-10-06 14:15:29 WARN  First 10 long running cache futures
> [total=5986879]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214273, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214275, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214277, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214279, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214281, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214283, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214285, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214287, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214289, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
> 2021-10-06 14:15:29 WARN  >>> Future [startTime=14:00:24.385,
> curTime=14:15:28.372, fut=GridDhtAtomicSingleUpdat
> eFuture [allUpdated=true, super=GridDhtAtomicAbstractUpdateFuture
> [futId=436214291, resCnt=0, addedReader=false,
>  dhtRes=TransformMapView
> {44055f7f-02d5-42bf-bcbe-2e78b45b7954=[res=false, size=1,nearSize=0]}]]]
>
>
> On 2021/10/06 11:00:15, Piotr Jagielski wrote:
> > Hi,
> >
> > Thanks for the quick answer.
> >
> > I've attached the config and logs with thread dumps. I can take heap
> > histogram when we'll experience problems again - for now wedisabled the
> > have update process to keep stability.
> >
> > Regarding HDD - this could be good point, I can see that
> > lastCheckpointFsyncDuration (regarding JMX stats) is the slow part:
> >
> >
> > Maybe
> >
>https://ignite.apache.org/docs/latest/persistence/persistence-tuning#pages-writes-throttling
>
> > would be a good idea?
> >
> > Regards
> >
> > On 2021/10/06 10:09:03, Anton Kurbanov wrote:
> > > Hello Piotr,
> > >
> > > Please share the configuration and the logs, preferably with thread
> dumps
> > > attached during the time when system_worker_blocked message pops up.
> > >
> > > It is difficult to diagnose an issue with only metrics involved,as for> > > example, the checkpoint itself has several phases which might belong
> > with
> > > different reasons for this. For example, if the fsync phase is
> slow, then
> > > the reason is the slow disk (probably a slow HDD).
> > >
> > > A few heap histograms taken might also be very helpful inidentifying
> > what
> > > kind of objects are alive in the heap and what are their GC roots
> to help
> > > identify which component is holding these objects.
> > >
> > > Best regards,
> > > Anton
> > >
> > > ср, 6 окт. 2021 г. в 12:57, Piotr Jagielski :
> > >
> > > > Hi,
> > > >
> > > > We experience stability problems on our Ignite cluster (2.10)under
> > heavy
> > > > load. Our cluster nodes are 3x 8 CPU, 32GB RAM.
> > > >
> > > > We mainly use 2 persistent caches:
> > > > - aggregates - only updates, around 6K records / sec, ~70 mlnrecords
> > > > total, stored mostly on disk (dataRegion maxSize = 4GB)
> > > > - customers - mainly reads by jdbc thin client + massive update
> of all
> > > > records once a day (~20 mln records) at about 60K records / sec,
> stored
> > > > off-heap (maxSize = 8GB)
> > > >
> > > > For updates we use DataStreamer with:
> > > > - perNodeParallelOperations = 5
> > > > - perNodeBufferSize = 500
> > > > - autoFlushFrequency = 1000 millis
> > > >
> > > > Under normal load (only aggregate updates) cluster behaves
> > normally, the
> > > > problems happen only during massive customer cache updates. We
> observe:
> > > > - Heap starvation (we have Xms4g / Xmx8g)
> > > > - Long gc pauses (up to 5 secs)
> > > > - SYSTEM_WORKER_BLOCKED logs
> > > > - Long checkpoint write times (up to 20 secs)
> > > > - Increasing Outbound message queue (> 100 entries)
> > > >
> > > > For now, we increased walSegmentSize to 256MB, any other options
> we can
> > > > adjust? Maybe something from this list
> > > >
> >https://ignite.apache.org/docs/latest/persistence/persistence-tuning? Is
> > > > the data streamer too fast for the cluster?
> > > >
> > > > I can provide more logs/configuration if needed.
> > > >
> > > > Regards,
> > > > Piotr
> > > >
> > > >
> > >
> >
>

Re: Ignite cluster stability problems under heavy load

Reply via email to