Hello Mike. From your description, it looks like a partition map exchange (PME) was hanging. If you would like a more detailed analysis, could you share your logs and thread dumps?
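
If firm policy allows sharing sanitized output, thread dumps from the server nodes would help most. The standard JDK tools (`jstack <pid>` or `jcmd <pid> Thread.print`) can take one from outside the process. As a rough sketch only, the same information can also be collected from inside a JVM with the standard `java.lang.management` API. The class name `ThreadDumpSketch` is hypothetical and not part of Ignite, and the code only sees threads of the JVM it runs in, so it would have to be called from within the node's process:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not an Ignite class: dumps all live threads of the current JVM.
public class ThreadDumpSketch {
    /** Returns a text dump of every live thread with its state, awaited lock, and stack. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only request lock details if this JVM supports reporting them.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean syncs = mx.isSynchronizerUsageSupported();

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : mx.dumpAllThreads(monitors, syncs)) {
            sb.append('"').append(ti.getThreadName()).append('"')
              .append(" id=").append(ti.getThreadId())
              .append(" state=").append(ti.getThreadState());
            // A blocked or waiting thread reports the lock it is waiting for and, if known, its owner.
            if (ti.getLockName() != null) {
                sb.append(" waiting on ").append(ti.getLockName());
                if (ti.getLockOwnerName() != null) {
                    sb.append(" owned by \"").append(ti.getLockOwnerName()).append('"');
                }
            }
            sb.append(System.lineSeparator());
            for (StackTraceElement frame : ti.getStackTrace()) {
                sb.append("    at ").append(frame).append(System.lineSeparator());
            }
            sb.append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking two or three dumps a few seconds apart usually makes it easier to tell a thread that is truly stuck from one that is merely slow.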
-------------------
Ilya

Mon, Sep 6, 2021 at 22:21, Mike Wiesenberg <mike.wiesenb...@gmail.com>:

> Using Ignite 2.10.0
>
> We had a frustrating series of issues with Ignite the other day. We're
> using a 4-node cluster with 1 backup per table and cacheMode set to
> Partitioned, and write behind enabled. We have a client that inserts data
> into caches and another client that listens for new data in those caches.
> (Apologies I can't paste logs or configuration due to firm policy)
>
> What happened:
>
> 1. We observed that our insertion client was not working after startup, it
> logged every 20 seconds that 'Still awaiting for initial partition map
> exchange.' This continued until we restarted the node it was trying to
> connect to, at which point the client connected to another node and the
> warning stopped.
>
> Possible Bug #1 - why didn't it automatically try a different node, or if
> it would have that same issue connecting to any node, why couldn't the
> cluster print an error and function anyhow?
>
> 2. After rebooting bad node #1, the insertion client still didn't work, it
> then started printing totally different warnings about 'First 10 long
> running cache futures [total=1]', whatever that means, and then printed the
> ID of a node. We killed that referenced node, and then everything started
> working.
>
> Again, why didn't the client switch to a good node automatically (or is
> there a way to configure such failover capability that I don't know about)?
>
> 3. In terms of root cause, it seems bad node #1 had a 'blocked
> system-critical thread' which according to the stack trace was blocked at
> CheckpointReadWriteLock.java line 69. Is there a way to automatically
> recover from this or handle this more gracefully? If not I will probably
> disable WAL (which I understand will disable checkpointing).
>
> Possible Bug #2 - why couldn't it recover from this lock if restarting
> fixed it?
>
> Regards, and thanks in advance, for any advice!
>