Apologies, I can't paste logs due to firm policy. What do you think about my questions regarding clients switching to good nodes automatically?
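
(For illustration only, a minimal sketch of the kind of thick-client discovery configuration the failover question is about: listing every server address so discovery can try another node if the first one it contacts is unreachable. Host names here are hypothetical, and this may not help if the partition map exchange is actually stuck cluster-wide rather than on one bad node.)

    import java.util.Arrays;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClientStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Start this process as a thick client rather than a server node.
            cfg.setClientMode(true);

            // List all four server nodes (hypothetical hosts), so discovery can
            // fall back to another address if one node is down or unresponsive.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList(
                "node1.example.com:47500..47509",
                "node2.example.com:47500..47509",
                "node3.example.com:47500..47509",
                "node4.example.com:47500..47509"));

            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(discoSpi);

            Ignite client = Ignition.start(cfg);
        }
    }
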
On Tue, Sep 7, 2021 at 4:14 AM Ilya Kazakov <kazakov.i...@gmail.com> wrote:

> Hello Mike. According to your description, there was a hanging PME. If you
> need a more detailed analysis, could you share your logs and thread dumps?
>
> -------------------
> Ilya
>
> Mon, Sep 6, 2021 at 22:21, Mike Wiesenberg <mike.wiesenb...@gmail.com>:
>
>> Using Ignite 2.10.0
>>
>> We had a frustrating series of issues with Ignite the other day. We're
>> using a 4-node cluster with 1 backup per table, cacheMode set to
>> Partitioned, and write-behind enabled. We have one client that inserts
>> data into caches and another client that listens for new data in those
>> caches. (Apologies, I can't paste logs or configuration due to firm
>> policy.)
>>
>> What happened:
>>
>> 1. We observed that our insertion client was not working after startup;
>> every 20 seconds it logged 'Still awaiting for initial partition map
>> exchange'. This continued until we restarted the node it was trying to
>> connect to, at which point the client connected to another node and the
>> warning stopped.
>>
>> Possible Bug #1: why didn't it automatically try a different node? Or,
>> if it would have hit the same issue connecting to any node, why couldn't
>> the cluster print an error and keep functioning anyway?
>>
>> 2. After rebooting bad node #1, the insertion client still didn't work.
>> It then started printing a completely different warning, 'First 10 long
>> running cache futures [total=1]', followed by the ID of a node. We killed
>> the referenced node, and then everything started working.
>>
>> Again, why didn't the client switch to a good node automatically (or is
>> there a way to configure such failover capability that I don't know
>> about)?
>>
>> 3. In terms of root cause, it seems bad node #1 had a 'blocked
>> system-critical thread' which, according to the stack trace, was blocked
>> at CheckpointReadWriteLock.java line 69. Is there a way to automatically
>> recover from this or handle it more gracefully? If not, I will probably
>> disable the WAL (which I understand will disable checkpointing).
>>
>> Possible Bug #2: why couldn't it recover from this lock if restarting
>> fixed it?
>>
>> Regards, and thanks in advance for any advice!
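
(Also for illustration, not a verified fix for point 3 above: a sketch of server-side failure handling so a node whose system-critical worker stays blocked past a timeout is halted automatically instead of hanging the cluster. The 60-second timeout is an arbitrary example value, and my understanding is that SYSTEM_WORKER_BLOCKED is in the handler's ignored failure types by default, so the ignored set has to be cleared for the handler to react to it.)

    import java.util.Collections;

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class ServerFailureHandling {
        public static IgniteConfiguration serverConfig() {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Treat a system-critical worker blocked for more than 60s as a node failure.
            cfg.setSystemWorkerBlockedTimeout(60_000L);

            // Stop or halt the hung node so clients can fail over to healthy nodes.
            StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler();

            // Assumption: SYSTEM_WORKER_BLOCKED is ignored by default, so clear the
            // ignored set to make the handler act on a blocked checkpoint/system thread.
            failureHnd.setIgnoredFailureTypes(Collections.emptySet());

            cfg.setFailureHandler(failureHnd);

            return cfg;
        }
    }
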