Thanks for the excellent writeup.

I have a few notes on your writeup and then a little history to help
explain the motivation for the v3 work.

The Claiming Problem

  One other property of the broader claim algorithm + claimant + handoff
  manager group of processes that's worth mentioning is safety during
  transitions.  The cluster should ensure that target N-val copies are
  always available, even mid-transition.  Much earlier in Riak's life the
  claim would just execute and ownership would transfer immediately,
  without first putting the data in place (fine, it's eventually
  consistent, right?), but that meant that if more than two vnodes in a
  preference list changed ownership, clients would read notfound until at
  least one copy of the object had been transferred.  The claimant now
  shepherds those transitions, so it should be safe.  Transferring the
  data before ownership fixed the notfound problem, but Riak lost agility
  in adding capacity to the cluster - existing data has to transfer to
  the new nodes before the old ones are freed up, and they continue to
  grow while waiting.  In hindsight, Ryan Zezeski's plan of just adding
  new capacity and proxying back to the original vnode is probably a
  better option.

  Predicting load on the cluster is also difficult when a single ring,
  with a target n-val set at creation time, is used for all buckets
  regardless of their n_val.  To compute the operations sent to each
  vnode you need to know the proportion of accesses going to each n_val
  (a rough sketch of that calculation is below).
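
  As an illustration (plain Erlang, not riak_core code - the ownership
  list and workload mix in demo/0 are invented), estimating per-node
  vnode operations from an n_val mix might look like:
```
-module(nval_load).
-export([estimate/2, demo/0]).

%% Owners: the ring as [{Position, Node}] in ring order.
%% Mix: [{NVal, FractionOfRequests}] - proportion of accesses per n_val.
%% Returns #{Node => expected vnode ops per client request}.
estimate(Owners, Mix) ->
    Q = length(Owners),
    Ring = list_to_tuple([Node || {_, Node} <- Owners]),
    %% With a uniform hash each partition starts 1/Q of the preflists.
    Slots = [ {element(((Start + Offset) rem Q) + 1, Ring), Frac / Q}
              || Start <- lists:seq(0, Q - 1),
                 {NVal, Frac} <- Mix,
                 Offset <- lists:seq(0, NVal - 1) ],
    lists:foldl(fun({Node, Load}, Acc) ->
                        maps:update_with(Node, fun(L) -> L + Load end, Load, Acc)
                end, #{}, Slots).

demo() ->
    Owners = [{P, lists:nth((P rem 4) + 1, [n1, n2, n3, n4])}
              || P <- lists:seq(0, 7)],
    %% e.g. 20% of requests to an n_val=1 bucket, 80% to an n_val=3 bucket.
    estimate(Owners, [{1, 0.2}, {3, 0.8}]).
```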

  There's also the problem that if a bucket is created with an n_val
  larger than target N, all bets are off about the number of physical
  nodes values are written to (*cough* strong consistency N=5).

  Having a partitioning-scheme-per-N-value is one way of sidestepping the
  load prediction and max-N problems (a trivial sketch of what I mean is
  below).
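
  Assuming one ownership list per n_val (everything here is invented,
  including using a simple mod placement rather than Riak's contiguous
  hash ranges):
```
-module(per_nval_rings).
-export([preflist/3]).

%% Rings: #{NVal => [Node]}, one ownership list per n_val, each claimed
%% with target_n_val equal to that n_val (and possibly its own size).
preflist(BKey, NVal, Rings) ->
    Owners = maps:get(NVal, Rings),
    Q = length(Owners),
    <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(BKey)),
    First = HashInt rem Q,
    [ lists:nth(((First + Offset) rem Q) + 1, Owners)
      || Offset <- lists:seq(0, NVal - 1) ].
```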

Proximity of Vnodes

  An alternate solution to the target_n_val problem is to change the way
  fallback partitions are added and apply an additional uniqueness
  constraint on physical nodes as fallbacks are added.  That provides
  safety against multiple node failures (although it can potentially
  cause loading problems).  I think you imply this at a couple of points
  when you talk about 'at runtime'.
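
  For example (a sketch only, not the actual riak_core_apl code - the
  module, function and argument names are invented), fallback selection
  with a physical-node uniqueness constraint might look like:
```
-module(unique_fallbacks).
-export([preflist/3]).

%% Positions: the full ring in preference order for the key (i.e. ring
%% positions rotated to start at the key's partition), [{Index, Node}].
%% Down: unreachable nodes.  N: number of replicas wanted.
preflist(Positions, Down, N) ->
    {Primaries, Rest} = lists:split(N, Positions),
    Up = [P || {_, Node} = P <- Primaries, not lists:member(Node, Down)],
    Used = [Node || {_, Node} <- Up],
    Up ++ pick_fallbacks(Rest, Down, Used, N - length(Up)).

pick_fallbacks(_Candidates, _Down, _Used, 0) ->
    [];
pick_fallbacks([], _Down, _Used, _Missing) ->
    [];  %% not enough distinct up nodes; real code would relax here
pick_fallbacks([{_, Node} = C | Rest], Down, Used, Missing) ->
    %% Skip candidates that are down or on a node already in the preflist.
    case lists:member(Node, Down) orelse lists:member(Node, Used) of
        true  -> pick_fallbacks(Rest, Down, Used, Missing);
        false -> [C | pick_fallbacks(Rest, Down, [Node | Used], Missing - 1)]
    end.
```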

Proximity of vnodes as the partition list wraps

  One kludge I considered for solving the wraparound problem is to go
  from a ring to a 'spiral', where you add target_n_val - 1 additional
  vnodes that alias the first few vnodes in the ring.

  Using the pathologically bad case of (vnodes) Q=4, (nodes) S=3,
  (nval) N=3:
```
  v0 | v1 | v2 | v3
  nA | nB | nC | nA

  p0 = [ {v1, nB}, {v2, nC}, {v3, nA} ]
  p1 = [ {v2, nC}, {v3, nA}, {v0, nA} ]  <<< Bad
  p2 = [ {v3, nA}, {v0, nA}, {v1, nB} ]  <<< Bad
  p3 = [ {v0, nA}, {v1, nB}, {v2, nC} ]
```
  You get 2/4 preflists violating target_n_val=3.

  If you extend the ring to allow aliasing (i.e. go beyond 2^160), but
  only use the aliased entries for node assignment:

```
  v0 | v1 | v2 | v3 | v0' | v1'
  nA | nB | nC | nA | nB  | nC

  p0 = [ {v1, nB}, {v2, nC},  {v3, nA} ]
  p1 = [ {v2, nC}, {v3, nA},  {v0', nB} ]
  p2 = [ {v3, nA}, {v0', nB}, {v1', nC} ]
  p3 = [ {v0, nA}, {v1, nB},  {v2, nC} ]
```
  The additional vnodes are never hashed to directly; they are only used
  when a preference list wraps past the end of the ring.
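
  A sketch of how preference lists could be built from the aliased list
  (module and names invented; v0_prime / v1_prime stand in for v0' /
  v1' above):
```
-module(spiral).
-export([preflists/2, demo/0]).

%% Owners: the real ring, [{Vnode, Node}] in ring order.
%% Aliases: extra assignments for the first target_n_val - 1 vnodes,
%% appended after the ring and used only for wraparound.
preflists({Owners, Aliases}, TargetN) ->
    Q = length(Owners),
    Extended = Owners ++ Aliases,
    %% The preflist for partition P starts at position (P+1) rem Q and
    %% takes TargetN consecutive entries from the extended list, so a
    %% preflist that would wrap instead runs into the aliases.
    [ {P, lists:sublist(Extended, ((P + 1) rem Q) + 1, TargetN)}
      || P <- lists:seq(0, Q - 1) ].

demo() ->
    Owners  = [{v0, nA}, {v1, nB}, {v2, nC}, {v3, nA}],
    Aliases = [{v0_prime, nB}, {v1_prime, nC}],
    preflists({Owners, Aliases}, 3).
```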


As you say, the v3 algorithm was written (by me) a long time ago and
never made it to production.  That was due to a few factors: partly
the non-determinism, partly because I didn't like the (very stupid)
optimization system tying up the claimant node for multiple seconds,
but more troublingly, when we did some commissioning tests for a large
customer, running with a ring size of 256 on 60 nodes, we saw a
performance drop of around 5% when the cluster was maxed out for
reads.  The diversity measurements were much 'better', in that the
v3-claimed cluster was far more diverse and performed better during
node failures, but the (unproven) fear that a greater number of
saturated disterl connections between nodes was dropping performance
without explanation stopped me from promoting it to the default.

The reason the v3 algorithm was created was to resolve problems with
longer-lived clusters created with the v2 claim that had had nodes
added and removed over time.  I don't remember all the details now,
but I think the cluster had a ring size of 1024 (to future-proof, as
there was no 2i/list-keys use on that cluster) and somewhere between
15 and 30 nodes.

In that particular configuration, the v2 algorithm had kept the original
sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
the new nodes in place, but that left many places where the original
sequential assignments still existed.

What we hadn't realized at the time was that sequential node assignment
is the *worst* possible plan for handling fallback load.

With N=3, if a node goes down, all of the responsibility for that
node is shifted to a single other node in the cluster.

```
n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8, S=4, target_n_val=4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3
```

With all nodes up, the number of times each node appears in a preflist
is equal: 6 * n1, 6 * n2, 6 * n3, 6 * n4 (N*Q/S each).

But during a single-node failure:
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.
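
That's easy to check with a quick sketch (the substitution rule here is
the simplified one implied by the table - replace a down owner with the
next up owner clockwise from its position - rather than the exact
riak_core_apl logic):
```
-module(seq_fallback).
-export([demo/0]).

owners() -> [n1, n2, n3, n4, n1, n2, n3, n4].   %% ring positions 0..7

owner(Pos, Owners) -> lists:nth((Pos rem length(Owners)) + 1, Owners).

%% Replace a down owner with the next up owner clockwise.
resolve(Pos, Owners, Down) ->
    case lists:member(owner(Pos, Owners), Down) of
        true  -> resolve(Pos + 1, Owners, Down);
        false -> owner(Pos, Owners)
    end.

%% The preflist for partition P starts at the next position, P+1.
preflist(P, NVal, Owners, Down) ->
    [resolve(P + Offset, Owners, Down) || Offset <- lists:seq(1, NVal)].

demo() ->
    Owners = owners(),
    Preflists = [preflist(P, 3, Owners, [n4]) || P <- lists:seq(0, 7)],
    Counts = lists:foldl(fun(Node, Acc) ->
                                 maps:update_with(Node, fun(C) -> C + 1 end, 1, Acc)
                         end, #{}, lists:flatten(Preflists)),
    %% Counts comes out as #{n1 => 12, n2 => 6, n3 => 6}.
    {Preflists, Counts}.
```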

In the real scenario, although the ring was no longer sequentially
assigned, there were still a large number of preference lists very
similar to the original assignment (as growing by a few nodes at that
ring size only reassigns preference lists in proportion to the
partitions the new nodes claim).

The production cluster was running fairly close to capacity, so the
increased loading during failure, even though it wasn't as bad as
doubled, was enough to push it over the performance 'step': tail
latencies degraded, the vnodes were overloaded, memory was exhausted,
and the next node crashed, causing a cascade.  This was before vnodes
had overload protection, so it would present differently now.


Another pre-claimant problem that shaped some of the earlier claim
code's vnode 'want' thresholds was that when nodes were individually
allowed to say whether they wanted to claim more vnodes (with the
wants_claim function, before calling choose_claim), there were some
states the cluster could get into where two nodes both decided they
were under capacity and continually tried to claim, causing a vnode to
flip-flop back and forth between them (that was a reason for writing
one of the early QuickCheck tests).


I'm not sure if you've encountered it or not, but riak_core_claim_sim
is also a good tool for testing the behavior of the claim functions
and the claimant.  You don't mention it in your write-up, but one of
the important functions of the claimant is to make sure it only
performs safe transitions between rings - it makes sure that the n_val
is not violated during handoff.



What to do?

  Fixing the claim algorithm is one way of doing things, but I worry
  it has a number of problems that are hard to solve (multi-AZ,
  multi-n_val, etc.).

  One more radical option is to dump the ring and just publish a table
  giving, for each vnode, the node and vnode hash you'd like to service
  it.  Riak doesn't really need consistent hashing - it doesn't *really*
  use its original form (the Dynamo A scheme), and is more of a hybrid
  of the B/C schemes.

  Use cluster metadata to publish out the tables, and update
  riak_core_apl to take the new data and serve up the preference lists
  (a rough sketch of the lookup side is below).  Obviously it trickles
  into things like the vnode and handoff managers, but it may be
  possible.
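
  The table format, hash choice and function names here are all invented
  for illustration - this is not the existing riak_core_apl API:
```
-module(table_apl).
-export([get_apl/3]).

-define(RINGTOP, 1 bsl 160).

%% Table: #{PartitionIdx => [Node]}, PartitionIdx in 0..Q-1, where each
%% node list was produced by whatever placement policy you like and
%% published to every node (e.g. via cluster metadata).
%% Assumes Q is a power of two, as with the current ring sizes.
get_apl(BKey, NVal, Table) ->
    Q = map_size(Table),
    <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(BKey)),
    Partition = HashInt div (?RINGTOP div Q),
    lists:sublist(maps:get(Partition, Table), NVal).
```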

  That gives you the advantage of no longer being constrained in how
  you assign the nodes - a separation of policy and execution.  You
  could keep the existing ring-based algorithms, or you could do
  something better.

  It may be interesting to change the number of vnodes and the hashing
  algorithm too.  Jordan West was a big fan of Jump Consistent Hashing
  at one point.

  The thing you give up if you lose the power-of-2 partitioning scheme
  is the ability to split and combine partitions.  Each partition in a
  64-vnode ring maps to exactly two (non-consecutive) partitions in a
  128-vnode ring, which is very nice for replicating between clusters
  with different ring sizes and for localizing where to look for data.

Good luck!




On Wed, May 17, 2017 at 6:37 AM Daniel Abrahamsson <hams...@gmail.com>
wrote:

> Thanks for the writeup and detailed investigation, Martin.
>
> We ran into these issues a few months ago when we expanded a 5 node
> cluster into an 8 node cluster. We ended up rebuilding the cluster and
> writing a small escript to verify that the generated riak ring lived up
> to our requirements (which were 1: to survive an AZ outage, and 2: to
> survive any 2 nodes going down at the same time).
>
> This will be a great document to refer to when explaining the subtleties
> of setting up a Riak cluster.
>
> //Daniel