As I mentioned above, this was a hanging PME. PME is a cluster-wide operation that refreshes information about partition distribution across the nodes, and any cache operation has to wait for the PME to complete. That is why, in your case, reconnecting the client to another server node does not help: the root cause is the hanging PME itself, and that is what has to be resolved. To find it, read the logs carefully and determine which server node is hanging. Sometimes you have to unwind a chain of several server nodes to get to the root cause.
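As for automatic recovery from a blocked system-critical thread: you cannot unblock the node itself, but you can configure the server node to fail fast, so it drops out of the topology instead of stalling the exchange for everyone. Below is a rough sketch of such a configuration (the class name and the 30-second timeout are just example values, and please double-check the failure handler defaults for your Ignite version before relying on this):

import java.util.Collections;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailFastServerStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Report a blocked system-critical worker as a failure after 30 seconds
        // (example value) instead of only printing periodic warnings.
        cfg.setSystemWorkerBlockedTimeout(30_000L);

        // By default SYSTEM_WORKER_BLOCKED is in the ignored failure types, so it
        // is only logged. Clearing the ignored set makes the handler actually stop
        // or halt the node, letting the rest of the cluster finish the exchange
        // without it.
        StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler();
        failureHnd.setIgnoredFailureTypes(Collections.emptySet());
        cfg.setFailureHandler(failureHnd);

        Ignition.start(cfg);
    }
}

Note that Ignite will only stop or halt the affected JVM; restarting that node afterwards is up to your own orchestration (systemd, Kubernetes, a watchdog script, etc.).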
You can read more about PME here:
https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood

----------------
Ilya

Thu, Sep 9, 2021 at 21:21, Mike Wiesenberg <mike.wiesenb...@gmail.com>:

> Apologies, I can't paste logs due to firm policy. What do you think about my
> questions regarding clients switching to good nodes automatically?
>
> On Tue, Sep 7, 2021 at 4:14 AM Ilya Kazakov <kazakov.i...@gmail.com> wrote:
>
>> Hello Mike. According to your description, there was a hanging PME. If
>> you need a more detailed analysis, could you share your logs and thread dumps?
>>
>> -------------------
>> Ilya
>>
>> Mon, Sep 6, 2021 at 22:21, Mike Wiesenberg <mike.wiesenb...@gmail.com>:
>>
>>> Using Ignite 2.10.0
>>>
>>> We had a frustrating series of issues with Ignite the other day. We're
>>> using a 4-node cluster with 1 backup per table, cacheMode set to
>>> Partitioned, and write-behind enabled. We have a client that inserts data
>>> into caches and another client that listens for new data in those caches.
>>> (Apologies, I can't paste logs or configuration due to firm policy.)
>>>
>>> What happened:
>>>
>>> 1. We observed that our insertion client was not working after startup;
>>> it logged every 20 seconds that 'Still awaiting for initial partition map
>>> exchange'. This continued until we restarted the node it was trying to
>>> connect to, at which point the client connected to another node and the
>>> warning stopped.
>>>
>>> Possible Bug #1 - why didn't it automatically try a different node, or,
>>> if it would have hit that same issue connecting to any node, why couldn't
>>> the cluster print an error and function anyhow?
>>>
>>> 2. After rebooting bad node #1, the insertion client still didn't work;
>>> it then started printing totally different warnings about 'First 10 long
>>> running cache futures [total=1]', whatever that means, and then printed the
>>> ID of a node. We killed that referenced node, and then everything started
>>> working.
>>>
>>> Again, why didn't the client switch to a good node automatically (or is
>>> there a way to configure such failover capability that I don't know about)?
>>>
>>> 3. In terms of root cause, it seems bad node #1 had a 'blocked
>>> system-critical thread' which, according to the stack trace, was blocked at
>>> CheckpointReadWriteLock.java line 69. Is there a way to automatically
>>> recover from this or handle it more gracefully? If not, I will probably
>>> disable WAL (which I understand will disable checkpointing).
>>>
>>> Possible Bug #2 - why couldn't it recover from this lock if restarting
>>> fixed it?
>>>
>>> Regards, and thanks in advance, for any advice!