Hi Courtney,

Have you looked at thread dumps taken while the server nodes are stuck?
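In case it helps, here is a minimal sketch of capturing a thread dump from inside the JVM, assuming the node runs embedded in your application (the log format suggests it might, but that is a guess). It uses only the standard `java.lang.management` API; the class and method names `ThreadDumpSketch` and `dumpAllThreads` are made up for illustration. Running `jstack <pid>` against the stuck process from outside gives the same kind of information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns a text dump of all live threads in the current JVM. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // Request lock details only where this JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // Print every frame; ThreadInfo.toString() shortens long stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

One way to use it would be to call `dumpAllThreads()` every minute or so from a daemon thread started before the node starts, then compare a few dumps: whatever the `main` thread is doing in all of them during that hour is the place to look.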
Wed, 15 Sep 2021 at 13:44, Courtney Robinson <courtney.robin...@hypi.io>:

> Hey all,
> We're trying to debug an issue in production where Ignite 2.8.1 is taking
> 1 hour *per node* to start.
> This cluster has 3 nodes and caches/tables have 2 backups i.e. each node
> has a replica so it takes 3 hours to restart all nodes.
> The nodes get stuck after outputting:
>
>> 2021-09-15 10:21:16.889 INFO [ArcOS,,,] 8 --- [ main]
>> o.a.i.i.p.cache.GridCacheProcessor [285] : Started cache in recovery
>> mode [name=*cache1*, id=-1556141001, group=hypi, dataRegionName=hypi,
>> mode=PARTITIONED, atomicity=ATOMIC, backups=2, mvcc=false]
>>
> then after it logs a similar message about *cache2* and carries on as if
> nothing happened.
> The log is always in this order and it is always these two caches.
> I believe this log happens after the cache is recovered so the problem is
> with cache2.
>
> There is only about 1GB in this cache2 that appears to have the problem.
>
> How can we find out what's causing Ignite to take an hour each on this
> cache?
>
> Regards,
> Courtney Robinson
> Founder and CEO, Hypi
> Tel: ++44 208 123 2413 (GMT+0)
> <https://hypi.io>