Apologies, I can't paste logs due to firm policy. What do you think about my questions regarding clients switching to good nodes automatically?
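
(For illustration only, a minimal sketch of the kind of thick-client discovery configuration the failover question is about: listing every server address so discovery can try another node if the first one it contacts is unreachable. Host names here are hypothetical, and this may not help if the partition map exchange is actually stuck cluster-wide rather than on one bad node.)

    import java.util.Arrays;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClientStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Start this process as a thick client rather than a server node.
            cfg.setClientMode(true);

            // List all four server nodes (hypothetical hosts), so discovery can
            // fall back to another address if one node is down or unresponsive.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList(
                "node1.example.com:47500..47509",
                "node2.example.com:47500..47509",
                "node3.example.com:47500..47509",
                "node4.example.com:47500..47509"));

            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(discoSpi);

            Ignite client = Ignition.start(cfg);
        }
    }
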
On Tue, Sep 7, 2021 at 4:14 AM Ilya Kazakov <kazakov.i...@gmail.com> wrote:

> Hello Mike. According to your description, there was a hanging PME. If you
> need a more detailed analysis, could you share your logs and thread dumps?
>
> -------------------
> Ilya
>
> Mon, Sep 6, 2021 at 22:21, Mike Wiesenberg <mike.wiesenb...@gmail.com>:
>
>> Using Ignite 2.10.0
>>
>> We had a frustrating series of issues with Ignite the other day. We're
>> using a 4-node cluster with 1 backup per table, cacheMode set to
>> Partitioned, and write-behind enabled. We have one client that inserts
>> data into caches and another client that listens for new data in those
>> caches. (Apologies, I can't paste logs or configuration due to firm
>> policy.)
>>
>> What happened:
>>
>> 1. We observed that our insertion client was not working after startup;
>> every 20 seconds it logged 'Still awaiting for initial partition map
>> exchange'. This continued until we restarted the node it was trying to
>> connect to, at which point the client connected to another node and the
>> warning stopped.
>>
>> Possible Bug #1: why didn't it automatically try a different node? Or,
>> if it would have hit the same issue connecting to any node, why couldn't
>> the cluster print an error and keep functioning anyway?
>>
>> 2. After rebooting bad node #1, the insertion client still didn't work.
>> It then started printing a completely different warning, 'First 10 long
>> running cache futures [total=1]', followed by the ID of a node. We killed
>> the referenced node, and then everything started working.
>>
>> Again, why didn't the client switch to a good node automatically (or is
>> there a way to configure such failover capability that I don't know
>> about)?
>>
>> 3. In terms of root cause, it seems bad node #1 had a 'blocked
>> system-critical thread' which, according to the stack trace, was blocked
>> at CheckpointReadWriteLock.java line 69. Is there a way to automatically
>> recover from this or handle it more gracefully? If not, I will probably
>> disable the WAL (which I understand will disable checkpointing).
>>
>> Possible Bug #2: why couldn't it recover from this lock if restarting
>> fixed it?
>>
>> Regards, and thanks in advance for any advice!
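
(Also for illustration, not a verified fix for point 3 above: a sketch of server-side failure handling so a node whose system-critical worker stays blocked past a timeout is halted automatically instead of hanging the cluster. The 60-second timeout is an arbitrary example value, and my understanding is that SYSTEM_WORKER_BLOCKED is in the handler's ignored failure types by default, so the ignored set has to be cleared for the handler to react to it.)

    import java.util.Collections;

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class ServerFailureHandling {
        public static IgniteConfiguration serverConfig() {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Treat a system-critical worker blocked for more than 60s as a node failure.
            cfg.setSystemWorkerBlockedTimeout(60_000L);

            // Stop or halt the hung node so clients can fail over to healthy nodes.
            StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler();

            // Assumption: SYSTEM_WORKER_BLOCKED is ignored by default, so clear the
            // ignored set to make the handler act on a blocked checkpoint/system thread.
            failureHnd.setIgnoredFailureTypes(Collections.emptySet());

            cfg.setFailureHandler(failureHnd);

            return cfg;
        }
    }
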