Hi,

We are experiencing stability problems on our Ignite cluster (2.10) under heavy load. The cluster has 3 nodes, each with 8 CPUs and 32 GB RAM.

We mainly use two persistent caches:
- aggregates - updates only, around 6K records/sec, ~70 mln records in total, stored mostly on disk (data region maxSize = 4 GB)
- customers - mostly reads via the JDBC thin client, plus a massive update of all records once a day (~20 mln records) at about 60K records/sec, stored off-heap (maxSize = 8 GB)

For the updates we use a DataStreamer with:
- perNodeParallelOperations = 5
- perNodeBufferSize = 500
- autoFlushFrequency = 1000 ms
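To make the load pattern concrete, here is a simplified version of our loading code (the Customer/key types and the dailySnapshot source are placeholders for our real ones; the streamer settings are the ones listed above):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

// Simplified sketch of the daily customer load.
void loadCustomers(Ignite ignite, Iterable<Customer> dailySnapshot) {
    try (IgniteDataStreamer<Long, Customer> streamer = ignite.dataStreamer("customers")) {
        streamer.allowOverwrite(true);            // entries replace existing records
        streamer.perNodeParallelOperations(5);
        streamer.perNodeBufferSize(500);
        streamer.autoFlushFrequency(1000);        // millis

        for (Customer c : dailySnapshot)          // ~20 mln records at ~60K records/sec
            streamer.addData(c.getId(), c);
    }
}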
Under normal load (aggregate updates only) the cluster behaves fine; the problems appear only during the massive customer cache updates. During those updates we observe:
- heap starvation (we run with Xms4g / Xmx8g)
- long GC pauses (up to 5 seconds)
- SYSTEM_WORKER_BLOCKED messages in the logs
- long checkpoint write times (up to 20 seconds)
- a growing outbound message queue (> 100 entries)

For now we have increased walSegmentSize to 256 MB (a sketch of our current storage configuration is at the bottom of this mail). Are there any other options we should adjust, maybe something from https://ignite.apache.org/docs/latest/persistence/persistence-tuning? Or is the data streamer simply too fast for the cluster?

I can provide more logs/configuration if needed.

Regards,
Piotr
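P.S. A simplified sketch of the storage configuration we currently run with (region names shortened, everything not shown is left at its default):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

DataStorageConfiguration storageCfg = new DataStorageConfiguration()
    .setWalSegmentSize(256 * 1024 * 1024)           // raised from the 64 MB default
    .setDataRegionConfigurations(
        new DataRegionConfiguration()
            .setName("aggregates")
            .setPersistenceEnabled(true)
            .setMaxSize(4L * 1024 * 1024 * 1024),   // 4 GB, data mostly on disk
        new DataRegionConfiguration()
            .setName("customers")
            .setPersistenceEnabled(true)
            .setMaxSize(8L * 1024 * 1024 * 1024));  // 8 GB off-heap

IgniteConfiguration cfg = new IgniteConfiguration()
    .setDataStorageConfiguration(storageCfg);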