Hi all,

Here at Bump, we recently had a problem where a Riak node that we
wiped incorrectly reconnected to our cluster. This node had no idea
about the ring status and so assumed it owned all of the partitions.
No node panicked or failed, so we had an undetected split-brain
situation going on that caused application problems.

To guard against this in the future, I wrote a Nagios plugin that some
of you might find useful:

    https://github.com/xb95/nagios-plugins/blob/master/check_riak_ring.py

It can be run like this:

    check_riak_ring.py --down-ok hostA hostB hostC

It will then talk to those hosts, determine which nodes they're
connected to, and recursively check until it has talked to every node
in the cluster. It then examines the state-of-the-ring that each node
believes is true and alerts if any node disagrees.
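
For readers who want a feel for the crawl-and-compare idea described
above, here is a minimal sketch. It is not the plugin's actual code.
The fetch_view callable, the shape of the dict it returns, and the
majority-vote comparison are all assumptions made for illustration;
how each node is really queried is left to the caller.

```python
from collections import Counter, deque


def crawl_cluster(seeds, fetch_view):
    """Visit every node reachable from the seed hosts.

    fetch_view(host) is a hypothetical callable (an assumption, not a
    Riak API). It should return a dict with two keys:
      'connected': iterable of node names this host is connected to
      'ring':      a hashable value describing the ring as this host sees it
    and raise OSError when the host cannot be reached.
    """
    views = {}          # host -> ring view reported by that host
    unreachable = []    # hosts we failed to talk to
    seen = set(seeds)
    queue = deque(dict.fromkeys(seeds))  # de-duplicated, order preserved
    while queue:
        host = queue.popleft()
        try:
            view = fetch_view(host)
        except OSError:
            unreachable.append(host)
            continue
        views[host] = view["ring"]
        # Follow each newly discovered peer exactly once.
        for peer in view["connected"]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return views, unreachable


def find_disagreements(views):
    """Return the hosts whose ring view differs from the most common one."""
    if not views:
        return []
    majority, _ = Counter(views.values()).most_common(1)[0]
    return sorted(host for host, ring in views.items() if ring != majority)
```

Comparing against the most common view is just one way to decide who
"disagrees"; a stricter check could alert whenever more than one
distinct view exists at all.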

The --down-ok flag is there so it doesn't alert if a node is
unreachable. You can skip using that flag if you'd prefer it to alert
whenever it can't talk to a node.
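
As a rough illustration of how a flag like --down-ok could feed into
the check result, here is a small sketch. The exit codes are the
standard Nagios plugin values (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN);
the function name, the message text, and the choice of CRITICAL for
both conditions are assumptions, not taken from the plugin itself.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def ring_check_status(disagreeing, unreachable, down_ok=False):
    """Map check findings to a (exit_code, message) pair.

    disagreeing: hosts whose ring view differs from the rest
    unreachable: hosts that could not be contacted
    down_ok:     when True, unreachable hosts alone do not raise an alert
    The severities chosen here are illustrative assumptions.
    """
    if disagreeing:
        hosts = ", ".join(sorted(disagreeing))
        return CRITICAL, "CRITICAL: ring disagreement on " + hosts
    if unreachable and not down_ok:
        hosts = ", ".join(sorted(unreachable))
        return CRITICAL, "CRITICAL: unreachable nodes: " + hosts
    if unreachable:
        hosts = ", ".join(sorted(unreachable))
        return OK, "OK: ring consistent (ignoring unreachable: " + hosts + ")"
    return OK, "OK: ring consistent"
```

A real plugin would print the message and pass the code to sys.exit()
so that Nagios picks up the state.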


-- 
Mark Smith // Operations Lead
m...@bumptechnologies.com

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
