Hello,
I have been bootstrapping 4 new nodes into an existing production
cluster. The nodes were bootstrapped one at a time: the first 2
completed without errors, but I ran into issues with the 3rd. The 4th
node has not been started yet.
On bootstrapping the third node, the data streaming sessions completed
without issue, but bootstrapping did not finish. The node is still stuck
in the JOINING state roughly 19 hours after data streaming completed.
Other reports of this issue seem to be related either to network
connectivity issues between nodes, or multiple nodes bootstrapping
simultaneously. I haven't found any evidence of either situation, and
there are no errors or stack traces in the logs.
I'm looking for the safest way to proceed. I'm fine with removing the
hanging node altogether, but I'd like confirmation that doing so
wouldn't leave the cluster in a bad state, and to know which data points
I should look at to gauge the situation.
If removing the node and starting over is OK, is any other maintenance
on the existing nodes recommended? I've read of people
scrubbing/rebuilding nodes after coming out of this situation, but I'm
not sure if that's necessary.
Please let me know if any additional info would be helpful.
Thanks!
--
Chris Hornung