Hello Piotr,

Please share the configuration and the logs, preferably with thread dumps taken while the SYSTEM_WORKER_BLOCKED message appears.
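For example, the standard JDK tooling is enough for both the thread dumps and the heap histograms, assuming you can run jcmd against the Ignite JVM (replace <pid> with the node's process id):

    jcmd <pid> Thread.print            (thread dump)
    jcmd <pid> GC.class_histogram      (heap histogram)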
It is difficult to diagnose an issue from metrics alone. The checkpoint, for example, has several phases, and each of them can be long for a different reason: if the fsync phase is slow, the likely cause is a slow disk (probably an HDD).

A few heap histograms would also be very helpful to identify which kinds of objects stay alive on the heap and what their GC roots are, which in turn helps identify the component holding on to them.

Best regards,
Anton

Wed, Oct 6, 2021 at 12:57, Piotr Jagielski <p...@touk.pl>:
> Hi,
>
> We are experiencing stability problems on our Ignite cluster (2.10) under
> heavy load. Our cluster nodes are 3x 8 CPU, 32 GB RAM.
>
> We mainly use 2 persistent caches:
> - aggregates - only updates, around 6K records/sec, ~70 mln records
>   total, stored mostly on disk (dataRegion maxSize = 4GB)
> - customers - mainly reads via the JDBC thin client, plus a massive update
>   of all records once a day (~20 mln records) at about 60K records/sec,
>   stored off-heap (maxSize = 8GB)
>
> For updates we use a DataStreamer with:
> - perNodeParallelOperations = 5
> - perNodeBufferSize = 500
> - autoFlushFrequency = 1000 millis
>
> Under normal load (only aggregate updates) the cluster behaves normally;
> the problems happen only during the massive customer cache updates. We observe:
> - heap starvation (we have Xms4g / Xmx8g)
> - long GC pauses (up to 5 secs)
> - SYSTEM_WORKER_BLOCKED logs
> - long checkpoint write times (up to 20 secs)
> - an increasing outbound message queue (> 100 entries)
>
> For now we have increased walSegmentSize to 256MB; are there any other
> options we can adjust? Maybe something from this list?
> https://ignite.apache.org/docs/latest/persistence/persistence-tuning
> Is the data streamer too fast for the cluster?
>
> I can provide more logs/configuration if needed.
>
> Regards,
> Piotr
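For context, the settings described in this thread roughly map onto the Java configuration sketched below. The region and cache names, the split into two data regions, cache creation, and anything else not stated explicitly in the thread are assumptions for illustration only.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClusterConfigSketch {
    public static void main(String[] args) {
        // Persistent region sized for the "aggregates" cache (4 GB, as in the thread).
        DataRegionConfiguration aggregatesRegion = new DataRegionConfiguration()
            .setName("aggregates-region")               // name is an assumption
            .setPersistenceEnabled(true)
            .setMaxSize(4L * 1024 * 1024 * 1024);

        // Persistent region sized for the "customers" cache (8 GB, as in the thread).
        DataRegionConfiguration customersRegion = new DataRegionConfiguration()
            .setName("customers-region")                // name is an assumption
            .setPersistenceEnabled(true)
            .setMaxSize(8L * 1024 * 1024 * 1024);

        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setWalSegmentSize(256 * 1024 * 1024)       // 256 MB WAL segments
            .setDataRegionConfigurations(aggregatesRegion, customersRegion);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);

        // With native persistence the cluster must be activated before caches are used.
        ignite.cluster().state(ClusterState.ACTIVE);

        // Bulk load into the "customers" cache with the streamer settings from the thread.
        // Assumes the cache itself was created elsewhere (CacheConfiguration or SQL DDL).
        try (IgniteDataStreamer<Long, Object> streamer = ignite.dataStreamer("customers")) {
            streamer.perNodeParallelOperations(5);
            streamer.perNodeBufferSize(500);
            streamer.autoFlushFrequency(1000);          // millis

            // streamer.addData(key, value) for each of the ~20 mln records would go here.
        }
    }
}

The same storage and data-region settings can equally be expressed in the Spring XML configuration; the Java form is used here only to keep the sketch self-contained.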