We were seeing this issue on 2.0.1, with 9 nodes in one DC and 12 nodes in another. Each DC has a replication factor of 3 for all keyspaces. Does anyone know how to work around this and make nodetool removenode work?
On Fri, Jan 24, 2014 at 6:50 AM, Andrew Losey <and...@addthis.com> wrote:

> The problem described in ticket 6542,
> https://issues.apache.org/jira/browse/CASSANDRA-6542, has been observed
> in my environment. This isn't a new problem: it has been seen across
> several differently sized, vnode-enabled clusters for much longer than
> the age of the ticket. The problem has definitely been hanging around
> since 1.2.11 (we're on 1.2.12), and likely longer than that.
>
> About 10% of the time, depending on the size of the cluster, 'removenode'
> works. When it does, the list of IPs reported by 'removenode status'
> slowly shrinks.
>
> Typical output looks like this:
>
> "RemovalStatus: Removing token (1133935256116267454566500603062154024).
> Waiting for replication confirmation from
> [/xxx.xxx.xxx.xxx,/xxx.xxx.xxx.xxx,/etc,/etc]"
>
> Likewise, 'nodetool status' on each node shows the node to be removed
> in DownLeaving status. As each replication confirmation comes through,
> that node's IP disappears from the waiting list, and the removed node is
> no longer listed in 'nodetool status' on that respective node.
>
> But this rarely works the way it's supposed to. Typically, one or two
> nodes offer their replication confirmation and then, as described in the
> ticket, nothing else happens. After hours or even days of waiting, you
> have to use 'nodetool removenode force' to complete the process.
>
> Does this happen for everyone? If it does, what versions are you running?
> What's the size of your cluster? Have you observed any log entries that
> indicate a problem with the process? Are there any rain dances people do
> to make removenode work the first time? Maybe we can get a bump in
> visibility on this issue.
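
For anyone who wants to watch which nodes are still holding up the removal, here is a rough Python sketch that pulls the pending IPs out of the 'removenode status' output. It assumes the output looks like the sample quoted above; the exact wording may differ between Cassandra versions, and the function names (parse_removal_status, pending_hosts) are made up for illustration. It only reports the waiting list and deliberately does not call 'removenode force' for you.

```python
import re
import subprocess

# Pattern based on the sample output quoted in this thread (an assumption;
# other Cassandra versions may word the status line differently).
STATUS_RE = re.compile(
    r"Removing token \((?P<token>-?\d+)\)\.\s*"
    r"Waiting for replication confirmation from\s*\[(?P<hosts>[^\]]*)\]"
)


def parse_removal_status(text):
    """Return (token, [pending hosts]) or None if no removal status is found."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    hosts = [
        h.strip().lstrip("/")          # entries are printed as /ip
        for h in match.group("hosts").split(",")
        if h.strip()
    ]
    return match.group("token"), hosts


def pending_hosts(nodetool="nodetool"):
    """Run 'nodetool removenode status' and parse its output."""
    result = subprocess.run(
        [nodetool, "removenode", "status"],
        capture_output=True, text=True, check=True,
    )
    return parse_removal_status(result.stdout)
```

Calling pending_hosts() every few minutes and logging the result should show whether the list is still shrinking or has stalled, which is the point at which the original poster fell back to 'nodetool removenode force'.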