Re: Ignite CheckpointReadLock /Long running cache futures

Ilya Kazakov Tue, 07 Sep 2021 01:14:46 -0700

Hello Mike. From your description, it sounds like a partition map exchange
(PME) was hanging. If you need a more detailed analysis, could you share your
logs and thread dumps?


-------------------
Ilya

On Mon, 6 Sep 2021 at 22:21, Mike Wiesenberg <mike.wiesenb...@gmail.com> wrote:

> Using Ignite 2.10.0
>
> We had a frustrating series of issues with Ignite the other day. We're
> using a 4-node cluster with 1 backup per table, cacheMode set to
> Partitioned, and write-behind enabled. We have a client that inserts data
> into caches and another client that listens for new data in those caches.
> (Apologies, I can't paste logs or configuration due to firm policy.)
>
> What happened:
>
> 1. We observed that our insertion client was not working after startup. It
> logged 'Still awaiting for initial partition map exchange.' every 20
> seconds. This continued until we restarted the node it was trying to
> connect to, at which point the client connected to another node and the
> warning stopped.
>
>  Possible Bug #1 - why didn't it automatically try a different node? Or, if
> it would have had the same issue connecting to any node, why couldn't the
> cluster print an error and keep functioning anyway?
>
> 2. After rebooting bad node #1, the insertion client still didn't work. It
> then started printing totally different warnings about 'First 10 long
> running cache futures [total=1]', whatever that means, followed by the ID
> of a node. We killed that referenced node, and then everything started
> working.
>
>  Again, why didn't the client switch to a good node automatically (or is
> there a way to configure such failover capability that I don't know about)?
>
> 3. In terms of root cause, it seems bad node #1 had a 'blocked
> system-critical thread' which, according to the stack trace, was blocked at
> CheckpointReadWriteLock.java line 69. Is there a way to automatically
> recover from this or handle it more gracefully? If not, I will probably
> disable WAL (which I understand will disable checkpointing).
>
>  Possible Bug #2 - why couldn't it recover from this lock if restarting
> fixed it?
>
> Regards, and thanks in advance for any advice!
>
