Re: Issue replacing a dead node

2025-05-27 Thread Courtney
One last update: After kicking it more, it finally fully joined the cluster. The server was rebooted a third time, and after that it eventually reached the UN state. I wish I had kept the link, but I had read that someone had a similar issue joining a node to a cluster with 4.1.x and the answ

Re: Issue replacing a dead node

2025-05-23 Thread Courtney
Some updates after getting back to this. I did hardware tests and could not find any hardware issues. Instead of trying a replace, I went the route of removing the dead node entirely and then adding in a new node. The new node is still joining, but I am hitting some oddities in the log. When j

Re: Issue replacing a dead node

2025-05-16 Thread Sebastian Marsching
To add on to what Bowen already wrote, if you cannot find any reason in the logs at all, I would retry using different hardware. In the recent past I have seen two cases where strange Cassandra problems were actually caused by broken hardware (in both cases, a faulty memory module caused the i

Re: Issue replacing a dead node

2025-05-16 Thread Courtney
Is it bad to leave the replacement node up and running for hours even after the cluster forgets it in favor of the old node being replaced? I'll have to set the logging to trace. Debug produced nothing. I did stop the service, which produced errors in the other nodes in the datacenter since they had ope

Re: Issue replacing a dead node

2025-05-16 Thread Bowen Song via user
In my experience, a failed bootstrap or node replacement always leaves some traces in the logs. At the very minimum, there are going to be log entries about streaming sessions failing or aborting. I have never seen it silently fail or stop without leaving any traces in the log. I can't think of anything t

Re: Issue replacing a dead node

2025-05-15 Thread Courtney
I checked all the logs and really couldn't find anything. I couldn't find any sort of errors in dmesg, system.log, debug.log, gc.log (maybe up the log level?), systemd journal...the logs are totally clean. It just stops gossiping all of a sudden at 22GB of data each time, then the old node retu

Re: Issue replacing a dead node

2025-05-15 Thread Bowen Song via user
The dead node being replaced went back to the DN state, which indicates that the new replacement node failed to join the cluster, usually because the streaming was interrupted (e.g. by network issues, or long STW GC pauses). I would start looking for red flags in the logs, including Cassandra's logs, GC logs,
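
A first pass over the logs for red flags can be partly automated. The following is a minimal sketch, assuming plain-text log files. The function name `find_red_flags` and the default keyword list are hypothetical illustrations, not actual Cassandra or GC log formats, and should be adjusted to whatever the logs really contain.

```python
import re
import sys
from typing import Iterable, List, Sequence, Tuple

# Hypothetical keywords chosen for illustration only; they are not taken
# from any real Cassandra or GC log format. Adjust them to your own logs.
DEFAULT_KEYWORDS = ("error", "exception", "stream", "pause")


def find_red_flags(
    lines: Iterable[str],
    keywords: Sequence[str] = DEFAULT_KEYWORDS,
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing any keyword.

    Matching is case-insensitive and treats keywords as literal substrings.
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python scan_logs.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, line in find_red_flags(handle):
                print(f"{path}:{number}: {line}")
```

Passing several files at once (Cassandra's logs, GC logs, and so on) makes it easier to line up timestamps across them around the moment the node stops gossiping. A keyword scan like this only narrows down where to read; it does not replace reading the surrounding log lines.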