Hi all,

Here at Bump, we recently had a problem where a Riak node that we had wiped incorrectly reconnected to our cluster. This node had no knowledge of the ring state, so it assumed it owned all of the partitions. None of the nodes panicked or failed, so we had an undetected split-brain situation that caused application problems.

To guard against this in the future, I wrote a Nagios plugin that some of you might find useful:

https://github.com/xb95/nagios-plugins/blob/master/check_riak_ring.py

It can be run like this:

    check_riak_ring.py --down-ok hostA hostB hostC

It will then talk to those hosts, determine which nodes they're connected to, and recursively check until it has talked to every node in the cluster. It then examines the state of the ring that each node believes is true and alerts if any node disagrees.

The --down-ok flag is there so it doesn't alert if a node is unreachable. You can skip that flag if you'd prefer it to alert whenever it can't talk to a node.

-- 
Mark Smith // Operations Lead
m...@bumptechnologies.com

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
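
For anyone who wants the gist without reading the plugin, here is a minimal Python sketch of the crawl-and-compare idea described above. It is not the plugin's actual code: the `crawl_ring` function, the `fetch` callable, and the shape of the data it returns are all assumptions made for illustration. The return codes follow the usual Nagios convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
from collections import deque


def crawl_ring(seeds, fetch, down_ok=False):
    """Visit every reachable node and compare the ring view each one reports.

    `fetch(host)` is a hypothetical callable. It is assumed to return a dict
    with 'connected' (an iterable of peer hosts) and 'ring' (any hashable
    summary of that node's view of the ring), and to raise OSError when the
    host cannot be reached.
    Returns a (nagios_status, message) tuple.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated, order preserved
    seen = set(queue)
    views = {}  # ring summary -> hosts that reported it
    down = []   # hosts we could not talk to

    while queue:
        host = queue.popleft()
        try:
            info = fetch(host)
        except OSError:
            down.append(host)
            continue
        views.setdefault(info["ring"], []).append(host)
        # Follow each peer we have not visited yet.
        for peer in info["connected"]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)

    if len(views) > 1:
        groups = sorted(sorted(hosts) for hosts in views.values())
        return 2, "CRITICAL: nodes disagree about the ring: %r" % (groups,)
    if down and not down_ok:
        return 2, "CRITICAL: unreachable nodes: %r" % (sorted(down),)
    if not views:
        return 3, "UNKNOWN: no nodes could be reached"
    return 0, "OK: %d node(s) agree about the ring" % len(seen - set(down))
```

Passing `down_ok=True` mirrors the --down-ok behaviour: unreachable nodes are skipped rather than reported. Ring disagreement is checked first, so a split ring is reported even when some nodes are also down.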