I'm going to take a stab at a hypothesis:

Sunday: I drain and decommission 2.3.4.193 *but* I forget to run node cleanup on the rest of the nodes. The ring looks clean but I did not see "Announcing that ..." in the logs.

Tuesday: the ghost node reappears on the ring for all nodes.

Could this be caused by old hinted handoffs for 2.3.4.193 that were processed at that time, causing the rest of the nodes to think that 2.3.4.193 is still present (albeit down)?

Should cleanup be run periodically? I run repair every few days (my gcgraceperiod is 10 days).

--
Alexis Lê-Quôc (@datadoghq)

On Wednesday, March 23, 2011 at 4:56 AM, aaron morton wrote:

> When the node starts it reads the stored token information from the
> LocationInfo CF in the System KS.
>
> It looks like the log message "is now part of the cluster" is only logged
> when an endpoint is added to a node's view of the ring via gossip. It is
> not logged when the endpoint is added during startup.
>
> In the logs for the 2.3.4.193 machine, was there a message that just says
> "Decommissioned" or starts with "Announcing that I have left the ring for"?
> These indicate it finished the decommission.
>
> Did it appear on all the nodes or just the 1.2.3.197?
>
> Aaron
>
> On 23 Mar 2011, at 11:23, Alexis Lê-Quôc wrote:
>
> > Hi,
> >
> > I've seen some strange occurrence of a deleted node reappearing all of
> > a sudden in the ring, which leads to my question: where is the ring
> > structure maintained (memory with local copies?) and what prompts it
> > to change? I appreciate any thoughts on the events below.
> >
> > I'm running 0.7.4 on 4 EC2 large machines with a replication factor
> > of 3. On Sunday I dropped a node that was misbehaving (drained then
> > decommissioned).
> > Everything was well until a few minutes ago:
> >
> > On 1.2.3.47 (nevermind the temporary key imbalance)
> > ubuntu@YYY:~$ nodetool -h localhost ring
> > 1.2.3.47   Up    Normal  17.89 GB  12.48%  0
> > 1.2.3.36   Up    Normal  27.72 GB  25.00%  42535295865117307932921825928971026432
> > 1.2.3.193  Up    Normal  42.14 GB  50.00%  127605887595351923798765477786913079296
> > 1.2.3.252  Up    Normal  36.71 GB  12.52%  148904621249875869977532879268261763219
> >
> > Then all of a sudden the node that used to sit in the middle shows up
> > (as "Down").
> > The machine itself was decommissioned over the week-end. It's
> > confirmed that it is not in play.
> >
> > ubuntu@YYY:~$ nodetool -h localhost ring
> > 1.2.3.47   Up    Normal  17.93 GB  12.48%  0
> > 1.2.3.36   Up    Normal  27.76 GB  25.00%  42535295865117307932921825928971026432
> > 2.3.4.193  Down  Normal  12.35 GB  25.00%  85070591730234615865843651857942052864
> > 1.2.3.193  Up    Normal  42.24 GB  25.00%  127605887595351923798765477786913079296
> > 1.2.3.252  Up    Normal  36.66 GB  12.52%  148904621249875869977532879268261763219
> >
> > From logs on each node:
> > 2011-03-22T21:30:17.040407+00:00 Node /2.3.4.193 is now part of the cluster
> > 2011-03-22T21:30:16.956335+00:00 Node /2.3.4.193 is now part of the cluster
> > 2011-03-22T21:30:18.887269+00:00 Node /2.3.4.193 is now part of the cluster
> > 2011-03-22T21:30:18.978861+00:00 Node /2.3.4.193 is now part of the cluster
> >
> > (a node coming back from the dead)
> >
> > On 1.2.3.193, trying to remove the ghost token...
> > ubuntu@XXX:~$ nodetool -h localhost ring
> >
> >                                              148904621249875869977532879268261763219
> > 1.2.3.47   Up    Normal   17.93 GB  12.48%  0
> > 1.2.3.36   Up    Normal   27.76 GB  25.00%  42535295865117307932921825928971026432
> > 2.3.4.193  Down  Leaving  12.35 GB  25.00%  85070591730234615865843651857942052864
> > 1.2.3.193  Up    Normal   52.06 GB  25.00%  127605887595351923798765477786913079296
> > 1.2.3.252  Up    Normal   43.11 GB  12.52%  148904621249875869977532879268261763219
> >
> > ubuntu@XXX:~$ nodetool -h localhost removetoken status
> > RemovalStatus: Removing token
> > (85070591730234615865843651857942052864). Waiting for replication
> > confirmation from [/1.2.3.193].
> >
> > (wait wait wait)
> >
> > ubuntu@XXX:~$ nodetool -h localhost removetoken force
> > RemovalStatus: Removing token
> > (85070591730234615865843651857942052864). Waiting for replication
> > confirmation from [/1.2.3.193].
> >
> > (fixed)
> > ubuntu@XXX:~$ nodetool -h localhost ring
> > 1.2.3.47   Up    Normal  17.93 GB  12.48%  0
> > 1.2.3.36   Up    Normal  27.76 GB  25.00%  42535295865117307932921825928971026432
> > 1.2.3.193  Up    Normal  53.73 GB  50.00%  127605887595351923798765477786913079296
> > 1.2.3.252  Up    Normal  43.11 GB  12.52%  148904621249875869977532879268261763219
> >
> > --
> > Alexis Lê-Quôc
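
A side note on the ring listings above: the "Owns" percentages appear to follow directly from the tokens, which may help when checking whether a ghost token is back. The following is only a rough sketch, not Cassandra's own code. It assumes RandomPartitioner with a token space of 2**127 (the thread never states the partitioner; the evenly spaced tokens merely suggest it) and assumes each node owns the range from the previous token up to its own. The names `ownership`, `show`, `HEALTHY` and `GHOST` are made up for illustration.

```python
from fractions import Fraction

# Assumption: RandomPartitioner, whose token space is taken here as 2**127.
RING_SIZE = 2 ** 127

# Tokens copied from the ring listings in the thread.
HEALTHY = {
    "1.2.3.47": 0,
    "1.2.3.36": 42535295865117307932921825928971026432,
    "1.2.3.193": 127605887595351923798765477786913079296,
    "1.2.3.252": 148904621249875869977532879268261763219,
}
GHOST = {"2.3.4.193": 85070591730234615865843651857942052864}


def ownership(tokens_by_node):
    """Return {node: Fraction of the ring owned}; tokens must be distinct.

    Each node is assumed to own the range (previous token, own token],
    with the lowest token wrapping around to the highest one.
    """
    ordered = sorted(tokens_by_node.items(), key=lambda item: item[1])
    shares = {}
    for i, (node, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]  # i == 0 wraps to the last entry
        # A span of 0 only happens for a single-node ring, which owns everything.
        span = (token - prev_token) % RING_SIZE or RING_SIZE
        shares[node] = Fraction(span, RING_SIZE)
    return shares


def show(title, tokens_by_node):
    """Print each node's share as a percentage, in token order."""
    print(title)
    for node, share in ownership(tokens_by_node).items():
        print(f"  {node:<10} {float(share) * 100:6.2f}%")


if __name__ == "__main__":
    show("without the ghost node", HEALTHY)
    show("with the ghost node", {**HEALTHY, **GHOST})
```

Under these assumptions, 1.2.3.193 drops from half of the ring to a quarter as soon as the token 85070591730234615865843651857942052864 is back, which is consistent with the 50.00% and 25.00% shown in the listings.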