Hi,

We are experiencing stability problems on our Ignite cluster (2.10) under heavy load. The cluster has 3 nodes, each with 8 CPUs and 32 GB RAM.

We mainly use two persistent caches:
- aggregates - updates only, around 6K records/sec, ~70 mln records in total, stored mostly on disk (data region maxSize = 4 GB)
- customers - mostly reads via the JDBC thin client, plus a massive update of all records once a day (~20 mln records) at about 60K records/sec, stored off-heap (maxSize = 8 GB)

For the updates we use a DataStreamer with:
- perNodeParallelOperations = 5
- perNodeBufferSize = 500
- autoFlushFrequency = 1000 ms
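To make the load pattern concrete, here is a simplified version of our loading code (the Customer/key types and the dailySnapshot source are placeholders for our real ones; the streamer settings are the ones listed above):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

// Simplified sketch of the daily customer load.
void loadCustomers(Ignite ignite, Iterable<Customer> dailySnapshot) {
    try (IgniteDataStreamer<Long, Customer> streamer = ignite.dataStreamer("customers")) {
        streamer.allowOverwrite(true);            // entries replace existing records
        streamer.perNodeParallelOperations(5);
        streamer.perNodeBufferSize(500);
        streamer.autoFlushFrequency(1000);        // millis

        for (Customer c : dailySnapshot)          // ~20 mln records at ~60K records/sec
            streamer.addData(c.getId(), c);
    }
}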
Under normal load (aggregate updates only) the cluster behaves fine; the problems appear only during the massive customer cache updates. During those updates we observe:
- heap starvation (we run with Xms4g / Xmx8g)
- long GC pauses (up to 5 seconds)
- SYSTEM_WORKER_BLOCKED messages in the logs
- long checkpoint write times (up to 20 seconds)
- a growing outbound message queue (> 100 entries)

For now we have increased walSegmentSize to 256 MB (a sketch of our current storage configuration is at the bottom of this mail). Are there any other options we should adjust, maybe something from https://ignite.apache.org/docs/latest/persistence/persistence-tuning? Or is the data streamer simply too fast for the cluster?

I can provide more logs/configuration if needed.

Regards,
Piotr
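P.S. A simplified sketch of the storage configuration we currently run with (region names shortened, everything not shown is left at its default):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

DataStorageConfiguration storageCfg = new DataStorageConfiguration()
    .setWalSegmentSize(256 * 1024 * 1024)           // raised from the 64 MB default
    .setDataRegionConfigurations(
        new DataRegionConfiguration()
            .setName("aggregates")
            .setPersistenceEnabled(true)
            .setMaxSize(4L * 1024 * 1024 * 1024),   // 4 GB, data mostly on disk
        new DataRegionConfiguration()
            .setName("customers")
            .setPersistenceEnabled(true)
            .setMaxSize(8L * 1024 * 1024 * 1024));  // 8 GB off-heap

IgniteConfiguration cfg = new IgniteConfiguration()
    .setDataStorageConfiguration(storageCfg);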