On Fri, May 24, 2013 at 2:23 AM, Shane McEwan <sh...@mcewan.id.au> wrote:

> Firstly, the new "enable_health_checks" config option. It's not clear what
> impact this option will have on a busy cluster. The way I read it, if a
> riak_kv_vnode message queue goes over the "vnode_mailbox_limit"
> DisableThreshold then the riak_kv service is disabled. Doesn't this
> effectively take the node down? If you have a busy cluster isn't it
> possible that all nodes could hit this limit around the same time and bring
> the wholeicluster down? Especially if you're running close to the limit and
> one node gets taken offline then the other nodes try to take up the slack
> which puts them over the threshold. Or am I misunderstanding this feature?
>

The health check is primarily meant to aid in situations where you have a
lagging/slow node (often much worse than a stopped/unreachable node). In
this case the health check will detect the backed up work on that node and
divert traffic elsewhere, allowing it to catch up. So, the node is not
effectively down in the sense that it is running and processing work.
However, it will *not* be sent requests until EnableThreshold is met, and
is down in that regard. We suggest keeping the default values unless you
notice nodes being disabled by the health check regularly under normal
workload/expected latencies (which is extremely rare).
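For reference, here is a sketch of how these settings appear in app.config. Treat the section placement, the `{EnableThreshold, DisableThreshold}` tuple ordering, and the values shown as assumptions to verify against the documentation for your Riak release:

```erlang
%% app.config (sketch -- verify section names and values against
%% the docs for your Riak release)
[
 {riak_core, [
   %% Master switch for the vnode health checks
   {enable_health_checks, true}
 ]},
 {riak_kv, [
   %% Assumed shape: {EnableThreshold, DisableThreshold}.
   %% riak_kv is disabled on a node when a riak_kv_vnode mailbox
   %% exceeds DisableThreshold, and re-enabled once the mailbox
   %% drains back down to EnableThreshold.
   {vnode_mailbox_limit, {1, 5000}}
 ]}
].
```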

You have touched on one issue with the health check as it stands, which
will be addressed in pieces over the next couple of releases. If your
cluster is overloaded and the health check kicks in, then, although it's
still highly unlikely that all nodes will trip at the same time, it's not
infeasible that offloading more work onto the other overloaded nodes will
cause cascading health checks to kick in. This is because the health check
was designed to deal with slow nodes, not the overload case. In the next
release the health check will be deprecated (note: until you upgrade to
that release we suggest leaving it on) in favor of "Overload Protection"
[1] [2]. Overload protection differs from the health check, as its name
suggests, in that it deals with cluster overload. Unfortunately, as
implemented in the next release the two don't work well in concert: one
limit is always reached first. A future release will re-implement the
health check to allow it to play better with overload protection. Of
course, the best protection is to plan for the expected peak capacity your
cluster will need to handle [3]. Overload protection is a last line of
defense to try to prevent cascading failures in scenarios where the extra
capacity is insufficient.

Is there any way to query the riak_kv_vnode message queue length to get an
> idea of how close we are to hitting the limit?
>
>
You can run `riak-admin top -sort msg_q` [4].
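If you want something more granular than `top`, you can also inspect the vnode mailboxes directly from an attached console (`riak attach`). The snippet below is a sketch that assumes the internal `riak_core_vnode_manager:all_vnodes/1` API and its `{Mod, Index, Pid}` return shape; internal APIs like this may change between releases:

```erlang
%% From `riak attach`: print the message queue length of each
%% riak_kv vnode process running on this node.
[begin
     {message_queue_len, Len} =
         erlang:process_info(Pid, message_queue_len),
     io:format("~p: ~p~n", [Idx, Len])
 end
 || {riak_kv_vnode, Idx, Pid}
        <- riak_core_vnode_manager:all_vnodes(riak_kv_vnode)].
```

Comparing these lengths against your configured DisableThreshold gives a rough sense of headroom.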


> Secondly, reformatting the indexes to fix the 2i integer encoding problem.
> Do I need to run the "riak-admin reformat-indexes" on each node or just
> once for the whole cluster?
>
>
It should be run per-node -- letting each one complete before starting the
next.

Cheers,
Jordan

[1] https://github.com/basho/riak_kv/pull/547
[2] https://github.com/basho/riak_kv/pull/544
[3]
http://docs.basho.com/riak/1.3.1/references/appendices/Cluster-Capacity-Planning/
[4]
http://docs.basho.com/riak/latest/references/Command-Line-Tools---riak-admin/#top
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com