Thanks very much for your prompt response, Gianluca.

Just for the community: I was able to solve this by running control.sh with
reset_lost_partitions for the individual caches.
It looks like that worked and the partition issue is resolved. I don't
expect any data loss, as we have set up all our caches with 2 replicas.
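
For anyone hitting the same issue, the command I ran for each affected
cache was along these lines (the cache names here are placeholders;
substitute your own):

  ./control.sh --cache reset_lost_partitions myCache1,myCache2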

Coming to the node which was not joining the cluster earlier: I removed it
from the baseline --> cleared its persistence store --> brought the node
back up --> added the node back to the baseline. This also seems to have
worked fine.
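
For reference, the baseline steps map to control.sh commands roughly like
this (the consistent ID is a placeholder for the failed node's ID):

  # show the current baseline topology and note the node's consistent ID
  ./control.sh --baseline

  # remove the failed node from the baseline
  ./control.sh --baseline remove <consistentId>

  # after wiping the node's persistence directories and restarting it,
  # add it back to the baseline
  ./control.sh --baseline add <consistentId>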

Thanks


On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti <gianluca.bone...@gmail.com>
wrote:

> Hello Naveen
>
> Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact.
> Three bugfix releases have been rolled out since then, up to the 2.16
> release.
>
> It seems you are restarting your cluster on a regular basis, so you'd
> better upgrade to 2.16 as soon as possible.
> Otherwise it will also be very difficult for people on a community-based
> mailing list, on volunteer time, to work out a solution for a cluster
> running a 2-year-old version.
>
> Besides that, you are not providing very much information about your
> cluster setup.
> How many nodes, what infrastructure, how many caches, overall data size.
> One could only guess you have more than 1 node running, with at least 1
> cache, and non-empty dataset. :)
>
> This document from GridGain may be helpful. I don't see an equivalent for
> Ignite, but it may still be worth checking out.
>
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
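>
> For reference, once a node has started in maintenance mode, the cleanup
> is driven through control.sh connected to that node, roughly like this
> (the --persistence subcommands below are as I recall them from recent
> Ignite versions; double-check against your release):
>
>   # list the maintenance tasks and corrupted files on the node
>   ./control.sh --persistence info
>
>   # back up, then clean, the corrupted cache files
>   ./control.sh --persistence backup corrupted
>   ./control.sh --persistence clean corrupted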
>
> On the other hand you should also check your failing node.
> If it is always the same node failing, then there should be some root
> cause apart from Ignite.
> Indeed if the node configuration is the same across all nodes, and just
> this one fails, you should also consider network issues (check
> connectivity and network latency between nodes, as in the quick check
> below) and hardware-related issues (faulty disks, faulty memory).
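>
> A quick first-pass check between two nodes can be as simple as this (the
> hostname is a placeholder):
>
>   # watch for packet loss and latency spikes between Ignite hosts
>   ping -c 20 ignite-node-2
>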
> In the end, one option might be to replace the faulty machine with a brand
> new one.
> In cloud environments this is actually quite cheap and easy to do.
>
> Cheers
> Gianluca
>
> On Wed, 29 May 2024 at 08:43, Naveen Kumar <naveen.band...@gmail.com>
> wrote:
>
>> Hello All
>>
>> We are using Ignite 2.13.0
>>
>> After a cluster restart, one of the nodes is not coming up, and in its
>> logs we are seeing this error: "Node requires maintenance, non-empty set
>> of maintenance tasks is found".
>>
>> We are also getting errors like "timeout is reached before computation
>> is completed" on other nodes.
>>
>> I can see that we have the control.sh script to back up and clean up the
>> corrupted files, but when I run the command, it fails.
>>
>> I have removed the node from the baseline and tried running it again,
>> but it is still failing.
>>
>> What could be the solution for this? The cluster is functioning, however
>> some requests are failing.
>>
>> Is there any way we can start the Ignite node in maintenance mode and
>> try running the clean-corrupted commands?
>>
>> Thanks
>> Naveen
>>
>>
>>

-- 
Thanks & Regards,
Naveen Bandaru
