I'll leave the stats questions for someone else, but I'll try to
answer your rebalancing questions.

First off, what ring size is your cluster? If it's larger than 256,
you could be hitting a known gossip overload condition. Generally,
anything under 1024 is fine, but for EC2, 256/512 may be the real
practical limit. This condition is discussed a bit in the 1.0.2
release notes under the "Large Ring Sizes" and "Potential
Cluster/Gossip Overload" sections. The good news is that the upcoming
1.0.3 point release includes a gossip throttling capability that
reduces this problem, and the next major release of Riak includes
changes that remove the problem entirely.
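
As a quick way to check: ring size is the ring_creation_size setting in
the riak_core section of app.config (the stock default is 64), and I
believe riak-admin status reports it as well. For example, a 256
partition ring would have been created with:
====
{ring_creation_size, 256},
====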

Assuming the above isn't the issue, then you're simply looking at too
much rebalancing happening at once. There are a few issues here to
discuss.

First, the partition claiming algorithm traditionally used in Riak
does not perform minimal consistent hashing. In many cases, it falls
back to a complete re-claiming of the entire ring, and you therefore
end up with 80% of the cluster data being shuffled around during a
rebalance. In Riak 1.0, we added a new claim algorithm that almost
always results in minimal rebalancing. However, this algorithm is not
enabled by default. It will be enabled by default in the upcoming
1.0.3 and 1.1 releases. We already have a large EDS customer using the
new claim algorithm in production with great success on 1.0.2. I can
provide details on enabling it for 1.0.2, but it would be easiest to
just wait until 1.0.3 ships.
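
For the curious, my recollection is that turning it on amounts to
pointing riak_core at the v2 claim functions in its app.config
section. Treat the exact names below as an assumption to verify
against your 1.0.2 install before relying on them:
====
%% assumed to be the v2 claim functions exported by riak_core_claim in 1.0.2
{wants_claim_fun, {riak_core_claim, wants_claim_v2}},
{choose_claim_fun, {riak_core_claim, choose_claim_v2}},
====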

Now, regardless of the number of transfers, it is always a good idea
for rebalancing not to incapacitate your cluster. Rather than having
an explicit priority level between requests, Riak provides several
handoff-related knobs that can be tuned to minimize the impact of data
handoff. In 1.0.2, there are two knobs: handoff_concurrency and
forced_ownership_handoff. The first limits the total number of
outbound transfers that a node can initiate at one time; the default
is 4. The second determines how many primary ownership handoffs can
occur while the cluster is under load (normally, handoff only occurs
once a node has been idle for a default timeout of 1 minute). It is a
cluster-wide setting and defaults to 8. In other words, up to 8
handoffs can be triggered across the entire cluster while the cluster
is under load.

To prevent cluster overload, you should consider tuning these values.
For what it's worth, handoff_concurrency is being set to a default of
1 in the next release of Riak, as we found 4 to be a bit aggressive.
To change the values, add something like the following to the
riak_core section of the app.config file on each Riak node:
====
{handoff_concurrency, 1},
{forced_ownership_handoff, 1},
====
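
For context, and purely as a sketch (the surrounding entries are
placeholders, not a recommendation), the riak_core section would then
look roughly like:
====
{riak_core, [
             %% existing riak_core settings stay as they are, e.g.:
             {ring_creation_size, 256},
             %% throttle rebalancing:
             {handoff_concurrency, 1},
             {forced_ownership_handoff, 1}
            ]},
====
Note that app.config is only read at node startup, so the change takes
effect after a restart of each node.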

One final issue is that the handoff_concurrency limit only affects
outbound transfers. It's entirely possible for multiple nodes to
initiate outbound transfers to the same target node, thus overloading
that node. This will be changed in the next major release of Riak, so
that handoff_concurrency limits both incoming and outgoing transfers.
Of course, if your cluster is under load, then the only handoff that
will occur is the forced ownership transfers anyway. So, limiting
forced_ownership_handoff to 1 or 2 provides a similar solution, as it
limits transfers across the entire cluster.

Regards,
Joe


On Tue, Jan 10, 2012 at 11:17 AM, Elias Levy
<fearsome.lucid...@gmail.com> wrote:
> Good day,
>
> Yesterday we went through the exercise of doubling the size of our initial 3
> node cluster.  The rebalancing took four hours or so.  Each node now has
> between 7 and 8 GB of data stored in the Leveldb backend with an n_val of 3.
>
>
> During rebalancing the cluster became nearly useless. The CPUs were largely
> pegged, and IO was hit hard, as could be expected.  Op throughput dropped to
> the floor and services depending on Riak became unresponsive, as they waited
> on Riak ops and then failed with timeouts.  This was in EC2 m1.large
> instances with ephemeral drives.
>
> A few questions.
>
> Is this expected behavior?  Granted, these are EC2 boxes and Leveldb depends
> heavily on disk, but I can't imagine folks using Riak in production accepting
> this type of performance hit resulting from rebalancing.
>
> Are client ops prioritized higher than the rebalancing work?  If not, why
> not?
>
> I recall there was some mention that by using Riak EDS it was possible to
> fill in a new node slowly, and then have it join the cluster when it's nearly
> in sync, thus limiting the stampede effect that regular rebalancing seems to
> bring on.  Is that the case?  If so, is there any documentation on this?
>
>
> I am also starting to suspect the timing statistics that Riak reports.
>  We've been charting them, and comparing them before and after the doubling
> of the cluster size shows some odd results.  Nearly all stats in the older
> nodes (mean, median, 95 percentile, 99 percentile, 100 percentile) are about
> the same before and after the addition of new nodes.  This is the case even
> though there are now twice as many nodes to handle requests, and each node
> has about half as much data as before.  The only one that did change was the
> puts 100 percentile.  It increased during the rebalancing, and then stayed
> at its higher value.  How is this possible?
>
> In fact, I find the 100 percentile stat a bit incredible.  One of the boxes
> is showing an 8,012 second 100 percentile stat (yes, I did divide by one
> million to convert from microseconds).
>
> Or are the time stats not rolling stats?  Are they computed since the start
> of the node, meaning that the 100 percentile will never go down, only up?
>
> Elias
>



-- 
Joseph Blomstedt <j...@basho.com>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
