-Description-

In Riak 2.1.0, the default configuration for handoff.ip causes vnodes
marked for transfer during handoff to be removed without their data
being transferred to the new destination nodes. A mandatory
configuration change (in riak.conf) resolves this issue. While not all
users are impacted by this issue, we recommend that all 2.1.0 users
upgrade to 2.1.1, which will be released shortly.

NOTE: This is known to occur for ownership handoff. Investigation as
to whether hinted handoff is affected is ongoing and this advisory
will be updated when more information is available.

-Affected Users-

All users of 2.1.0 who use riak.conf to configure their clusters are
potentially impacted. Users who configure their clusters with
app.config and vm.args are unaffected, but should still upgrade to
2.1.1 upon release.

To verify whether you are affected, run the following command on each
node in your cluster:
     riak config effective | grep handoff.ip

Affected nodes will report a handoff.ip of 127.0.0.1:
     handoff.ip = 127.0.0.1
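
If your nodes are reachable over SSH, one way to run the check across
the whole cluster at once is a small shell loop like the sketch below.
The hostnames node0, node1, and node2 are placeholders for your own
node hostnames, and the loop assumes the riak script is on the PATH of
the remote user:

     for node in node0 node1 node2; do
         echo "== $node =="
         ssh "$node" 'riak config effective | grep handoff.ip'
     done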

-Impact-

This bug impacts vnodes that are in the process of handoff. During
ownership handoff, data is looped back to the source node rather than
being transferred to the destination node, and once ownership handoff
completes, that data is removed from the source node. In the event of
significant ownership handoff, which can happen during cluster
expansion or contraction, all replicas of an object may be lost; data
loss occurs when every replica of an object is removed as a result of
this configuration issue. Replica loss can be triggered by cluster
membership changes or any other Riak cluster activity that triggers
handoff. Data loss can be avoided as long as at least one replica
still exists and the steps below are followed.

-Mitigation-

You can immediately mitigate the issue by setting the transfer limit
to zero across the cluster. Issue the following on any node:

     riak-admin transfer-limit 0
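
Setting the limit to 0 prevents new transfers from starting. To
confirm that no handoffs remain in flight, you can check the transfer
status from any node:

     riak-admin transfers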

Then, on all nodes, set handoff.ip in riak.conf to an external IP
address or to 0.0.0.0.
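
For example, a riak.conf entry that binds handoff to all interfaces
would look like this (substitute a specific external IP address if you
prefer to bind to a single interface):

     handoff.ip = 0.0.0.0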

Perform a rolling restart of Riak across your cluster to activate the
new setting.
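
A minimal sketch of one step of the rolling restart, performed on each
node in turn (the node name riak@node0.example.com is a placeholder),
is to stop the node, start it again, wait for the riak_kv service to
come back, and confirm the corrected setting before moving on to the
next node:

     riak stop
     riak start
     riak-admin wait-for-service riak_kv riak@node0.example.com
     riak config effective | grep handoff.ip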

After correcting the configuration and restarting the nodes, you
should run the Riak KV partition repair procedure on each cluster
member, as documented at
http://docs.basho.com/riak/latest/ops/running/recovery/repairing-partitions/
to recreate any missing replicas from the replicas still available
elsewhere in the cluster. We recommend performing the repair in a
round-robin fashion across the nodes of your cluster (node0, node1,
node2, etc.), repeating this round-robin pass "n_val - 1" times. For
example, with the default n_val of 3, you would run the Riak KV repair
twice across the entire cluster.

NOTE: It is important to execute the repairs in a round-robin fashion:
node0, node1, node2, and then repeat.
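
As a rough sketch of what the linked repair procedure involves on a
single node (the node name riak@node0.example.com is a placeholder;
treat the documentation above as authoritative), you attach to the
node's console, collect the partitions that node owns, and trigger a
repair for each one:

     riak attach
     (riak@node0.example.com)1> {ok, Ring} = riak_core_ring_manager:get_my_ring().
     (riak@node0.example.com)2> Partitions = [P || {P, 'riak@node0.example.com'} <- riak_core_ring:all_owners(Ring)].
     (riak@node0.example.com)3> [riak_kv_vnode:repair(P) || P <- Partitions].

Repairs run in the background; you can monitor overall transfer
activity with riak-admin transfers before moving on to the next node.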

A forthcoming 2.1.1 release will provide an updated default configuration.

Questions?

Please open a ticket with Basho if you have any questions about the above issue.

Cheers,

Tyler Hannan  |  Director of Technical Marketing
Basho Technologies
t: @tylerhannan
c: 720-280-9216

