Thanks very much for your prompt response Gianluca.

Just for the community: I could solve this by running control.sh with reset_lost_partitions for each individual cache. reset_lost_partitions looks like it worked, and the partition issue is resolved. I suppose there wouldn't be any data loss, as we have configured all our caches with 2 replicas.
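For the archives, a sketch of the commands involved, run from the `bin` directory of a server node. The cache names are placeholders for illustration; substitute the caches actually reporting lost partitions:

```shell
# List caches to identify the ones reporting lost partitions
# (the trailing "." is a regex matching all cache names).
./control.sh --cache list .

# Reset lost partitions for the affected caches.
# MyCache1,MyCache2 are placeholder names - use your actual cache names.
./control.sh --cache reset_lost_partitions MyCache1,MyCache2
```

With backups configured for every cache, the reset should only re-mark partitions as owned, with the data recovered from the surviving copies.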
Coming to the node which was not getting added to the cluster earlier: removed it from the baseline --> cleared all of its persistence store --> brought the node up --> added the node back to the baseline. This also seems to have worked fine.

Thanks

On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti <gianluca.bone...@gmail.com> wrote:

> Hello Naveen
>
> Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact.
> Three bugfix releases have been rolled out since then, up to the 2.16 release.
>
> It seems you are restarting your cluster on a regular basis, so you'd
> better upgrade to 2.16 as soon as possible.
> Otherwise it will also be very difficult for people on a community-based
> mailing list, on volunteer time, to work out a solution with a 2-year-old
> version.
>
> Besides that, you are not providing very much information about your
> cluster setup.
> How many nodes, what infrastructure, how many caches, overall data size?
> One can only guess you have more than 1 node running, with at least 1
> cache, and a non-empty dataset. :)
>
> This document from GridGain may be helpful; I don't see the same for
> Ignite, but it may still be worth checking out.
>
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>
> On the other hand, you should also check your failing node.
> If it is always the same node failing, then there may be some root
> cause apart from Ignite.
> Indeed, if the node configuration is the same across all nodes, and just
> this one fails, you should also consider network issues (check
> connectivity and network latency between nodes) and hardware-related issues
> (faulty disks, faulty memory).
> In the end, one option might be to replace the faulty machine with a brand
> new one.
> In cloud environments this is actually quite cheap and easy to do.
>
> Cheers
> Gianluca
>
> On Wed, 29 May 2024 at 08:43, Naveen Kumar <naveen.band...@gmail.com> wrote:
>
>> Hello All
>>
>> We are using Ignite 2.13.0.
>>
>> After a cluster restart, one of the nodes is not coming up, and in its
>> logs we are seeing this error: "Node requires maintenance, non-empty set of
>> maintenance tasks is found".
>>
>> We are also getting errors like "timeout is reached before computation is
>> completed" on the other nodes.
>>
>> I can see that we have the control.sh script to back up and clean up the
>> corrupted files, but when I run the command, it fails.
>>
>> I have removed the node from the baseline and tried to run it as well, but
>> it still fails.
>>
>> What could be the solution for this? The cluster is functioning,
>> however some requests are failing.
>>
>> Is there any way we can start the Ignite node in maintenance mode and try
>> running the clean-corrupted commands?
>>
>> Thanks
>> Naveen
>>
>>
>> --
>> Thanks & Regards,
>> Naveen Bandaru
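For reference, the remove/clean/re-add procedure described above looks roughly like the following sketch. The consistent ID and paths are placeholders; check your own configuration before running anything, as step 3 is destructive:

```shell
# 1. View the current baseline topology to find the node's consistent ID.
./control.sh --baseline

# 2. Remove the failing node from the baseline
#    (node1-consistent-id is a placeholder; --yes skips the confirmation prompt).
./control.sh --baseline remove node1-consistent-id --yes

# 3. On the failing node, with Ignite stopped, clear its persistence store.
#    $IGNITE_HOME/work is the default work directory - adjust if you have
#    configured custom storage/WAL paths. This deletes the node's local data!
rm -rf "$IGNITE_HOME/work/db" "$IGNITE_HOME/work/wal"

# 4. Start the node again, then add it back to the baseline so it
#    rebalances its data from the surviving backup copies.
./control.sh --baseline add node1-consistent-id --yes
```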
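On the original question about cleaning corrupted files while the node is in maintenance mode: if I remember the tooling correctly (worth verifying against the docs for your exact version), a node that enters maintenance mode starts up detached from the cluster, and control.sh pointed at that node exposes --persistence subcommands, roughly:

```shell
# Point control.sh at the node stuck in maintenance mode
# (localhost:11211 is the default connector host:port) and inspect
# which cache data files are considered corrupted.
./control.sh --host localhost --port 11211 --persistence info

# Back up the corrupted cache files before touching them.
./control.sh --host localhost --port 11211 --persistence backup corrupted

# Clean the corrupted cache data files, then restart the node so it
# leaves maintenance mode and rejoins the cluster.
./control.sh --host localhost --port 11211 --persistence clean corrupted
```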