On Fri, May 24, 2013 at 2:23 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
> Firstly, the new "enable_health_checks" config option. It's not clear what
> impact this option will have on a busy cluster. The way I read it, if a
> riak_kv_vnode message queue goes over the "vnode_mailbox_limit"
> DisableThreshold then the riak_kv service is disabled. Doesn't this
> effectively take the node down? If you have a busy cluster isn't it
> possible that all nodes could hit this limit around the same time and bring
> the whole cluster down? Especially if you're running close to the limit and
> one node gets taken offline then the other nodes try to take up the slack
> which puts them over the threshold. Or am I misunderstanding this feature?

The health check is primarily meant to aid in situations where you have a lagging/slow node (often much worse than a stopped/unreachable node). In that case the health check detects the backed-up work on the node and diverts traffic elsewhere, allowing it to catch up. So the node is not effectively down in the sense that it is still running and processing work. However, it will *not* be sent requests until EnableThreshold is met, and it is down in that regard. We suggest keeping the default values unless you notice nodes being disabled by the health check regularly under normal workload/expected latencies (which is extremely rare).

You have touched on one issue with the health check as it stands, which will be addressed in pieces over the next couple of releases. If your cluster is overloaded and the health check kicks in, then although it's still highly unlikely that all nodes will trip it at the same time, it's not infeasible that offloading more work onto the other overloaded nodes will cause cascading health checks to kick in. This is because the feature was designed to deal with slow nodes, not with the overload case. In the next release the health check will be deprecated in favor of "Overload Protection" [1] [2] (note: until you upgrade to that release we suggest leaving the health check on).
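For reference, both knobs mentioned above live in the riak_kv section of app.config. A minimal sketch (the threshold values here are illustrative only; check the defaults shipped with your release before changing anything):

```erlang
%% app.config -- riak_kv section (values illustrative, not authoritative)
{riak_kv, [
    %% Turn the vnode mailbox health check on or off.
    {enable_health_checks, true},

    %% {EnableThreshold, DisableThreshold}: the riak_kv service is
    %% disabled when a riak_kv_vnode mailbox exceeds DisableThreshold,
    %% and re-enabled once the queue drains back below EnableThreshold.
    {vnode_mailbox_limit, {1, 5000}}
]}.
```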
Overload protection differs from the health check, as its name suggests, in that it deals with cluster overload. Unfortunately, as implemented in the next release the two don't work well in concert: one limit is always reached first. A future release will re-implement the health check so that it plays better with overload protection. Of course, the best protection is to plan for the expected peak capacity your cluster will need to handle [3]. Overload protection is a last line of defense to try to prevent cascading failures in scenarios where that extra capacity is insufficient.

> Is there any way to query the riak_kv_vnode message queue length to get an
> idea of how close we are to hitting the limit?

You can run `riak-admin top -sort msg_q` [4].

> Secondly, reformatting the indexes to fix the 2i integer encoding problem.
> Do I need to run the "riak-admin reformat-indexes" on each node or just
> once for the whole cluster?

It should be run per node, letting each one complete before starting the next.

Cheers,
Jordan

[1] https://github.com/basho/riak_kv/pull/547
[2] https://github.com/basho/riak_kv/pull/544
[3] http://docs.basho.com/riak/1.3.1/references/appendices/Cluster-Capacity-Planning/
[4] http://docs.basho.com/riak/latest/references/Command-Line-Tools---riak-admin/#top
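P.S. A minimal sketch of the per-node reformat sequence described above. The hostnames are hypothetical placeholders, and the `echo` keeps this a dry run that only prints the commands; drop it (and add your own completion check) to actually execute over ssh:

```shell
#!/bin/sh
# Hypothetical node list -- replace with your cluster's hosts.
NODES="riak1.example.com riak2.example.com riak3.example.com"

for node in $NODES; do
    # Reformat one node at a time, letting each complete before the
    # next starts. The echo makes this a dry run that prints the
    # command instead of running it.
    echo "ssh $node riak-admin reformat-indexes"
done
```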
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com