The problem with unreachable nodes still remains, since you don't know how long they will be gone. The only 'safe' minimum time to keep deleted values is forever. This can easily be emulated in the application layer by using a special value (or Riak metadata, for example). So it's essentially a trade-off, like most things. If you are sure that no node will ever be down for more than 24 hours, your solution would work.

If it is really essential for an application that deleted keys never reappear, you should just store this information explicitly (that way you also know when the key was deleted, btw). If not, then one can live with the current behaviour, which is much simpler implementation-wise.
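For illustration, here is a minimal sketch of that approach in Python; riak_put and riak_get are hypothetical helpers standing in for whatever client library you actually use:

import time

# Hypothetical helpers; replace with calls to your Riak client of choice.
def riak_put(bucket, key, value): ...
def riak_get(bucket, key): ...

TOMBSTONE_FIELD = "deleted_at"

def logical_delete(bucket, key):
    # Store a marker instead of issuing a real DELETE; the timestamp
    # records when the key was logically removed.
    riak_put(bucket, key, {TOMBSTONE_FIELD: time.time()})

def logical_get(bucket, key):
    value = riak_get(bucket, key)
    if value is None or TOMBSTONE_FIELD in value:
        return None  # treat tombstones the same as missing keys
    return value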

I would just separate the two issues of logically deleting and physically deleting (the latter is just an operational issue, as opposed to an issue for your application design). Physical deletion could be handled by the storage backend. Bitcask already has a key expiration feature. If it were fixed so that expired keys are actually counted towards the triggering of merges, and the TTL could be set per key, you would be good to go ;-).
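For reference, the existing Bitcask expiration knob looks roughly like this in app.config (a sketch; as far as I know expiry_secs applies to the whole backend rather than per key, which is exactly the limitation described above):

{bitcask, [
    {data_root, "/var/lib/riak/bitcask"},   %% example path
    {expiry_secs, 86400}                    %% expire entries older than 24 hours
]}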

Btw, this whole issue is not really Riak-specific. It is essentially a consequence of eventual consistency, where you have to make a trade-off between the amount of bookkeeping information you want to store and the maximum amount of time (or number of updates) any part of the system can diverge from the rest of the system before you get undesired results.

Cheers,
Nico

On 16.06.2011 16:50, Kresten Krab Thorup wrote:
...when doing a delete, Riak actually stores a "deleted" record, but then deletes it for real too eagerly after that. There should be a configurable "zombie time" between requesting a delete and the "deleted record" being deleted for real, so that the deleted record's vector clock will show that the delete is more recent than the other value(s) in case those are later reconciled. The current infrastructure just doesn't have a good place to "enqueue" such a "delete this for real in 24 hours"-ish request.

Also, the master branch now has support for specifying a vector clock with a 
delete (in 14.x releases you can instead do a PUT w/ X-Riak-Deleted=true and a 
proper vector clock, and an empty content). That's better (more consistent), 
but not a real fix.
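Roughly, that 14.x workaround over the stock HTTP interface might look like the following sketch (host, bucket and key names are placeholders, and error handling is omitted):

import requests

url = "http://localhost:8098/riak/mybucket/mykey"
# Fetch the current vector clock so the tombstone dominates the old value.
vclock = requests.get(url).headers.get("X-Riak-Vclock")
requests.put(url,
             data=b"",  # empty content
             headers={"X-Riak-Vclock": vclock,
                      "X-Riak-Deleted": "true",
                      "Content-Type": "application/octet-stream"})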

Kresten

On 16/06/2011, at 11.58, "Nico Meyer" <nico.me...@adition.com> wrote:

Hello David,

This behaviour is quite expected if you think about how Riak works.
Assuming you use the default replication factor of n=3, each key is stored on 
all of your three nodes. If you delete a key while one node (let's call it A) 
is down, the key is deleted from the two nodes that are still up (let's call 
them B and C), and remains on the downed node A.
Once node A is up again, the situation is indistinguishable from B and C having a hard drive crash and losing all their data, in that A has the key and B and C know nothing about it.

If you do a GET of the deleted key at this point, the result depends on the r-value that 
you choose. For r>1 you will get a not_found on the first get. For r=1 you might get 
the data or a not_found, depending on which two nodes answer first 
(see https://issues.basho.com/show_bug.cgi?id=992 about basic quorum for an explanation).
Also, at that point read repair will kick in and re-replicate the key to all nodes, so 
subsequent GETs will always return the original datum.
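For example, over the HTTP interface the r value can be passed per request (a sketch with placeholder host and key names; the status codes in the comments follow the behaviour described above):

import requests

resp = requests.get("http://localhost:8098/riak/mybucket/mykey",
                    params={"r": "3"})
# With r > 1 the first request should return 404, since two replicas answer
# not_found. With r=1 you may get 200 or 404 depending on which nodes answer
# first, and read repair will then re-replicate the key either way.
print(resp.status_code)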

Listing keys, on the other hand, does not use a quorum but just does a set union of all keys on all the nodes in your cluster. Essentially it is equivalent to r=1 without basic quorum. The same is true for map/reduce queries, to my knowledge.

The essential problem is that a real physical delete is indistinguishable from 
data loss (or never having had the data in the first place), while those two 
things are logically different.
If you want to be sure that a key is deleted with all its replicas you must 
delete it with a write quorum setting of w=n. Also you need to tell Riak not to 
count fallback vnodes toward your write quorum. This feature is quite new and I 
believe only available in the head revision. Also I forgot the name of the 
parameter and don't know if it is even applicable for DELETEs.
Anyhow, if you do all this, your DELETEs will simply fail if any of the nodes 
that has a copy of the key is down (so in your case, if any node is down).
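As a sketch over HTTP (placeholder names, and assuming the rw parameter is what controls the delete quorum on 0.14), such a delete would look like:

import requests

# rw=3 asks Riak to complete the delete on all three replicas (n=3).
# Combined with the primaries-only setting mentioned above, this should fail
# rather than fall back when a node holding a replica is down.
resp = requests.delete("http://localhost:8098/riak/mybucket/mykey",
                       params={"rw": "3"})
print(resp.status_code)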

If you only want to logically delete, and don't care about freeing the disk 
space and RAM that is used by the key, you should use a special value, which is 
interpreted by your application as a not found. That way you also get proper 
conflict resolution between DELETEs and PUTs (say one client deletes a key 
while another one updates it).
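A sketch of that conflict resolution, assuming allow_mult is enabled, that values carry an application-level timestamp, and the tombstone format from the earlier example:

def resolve_siblings(siblings):
    # siblings: list of (timestamp, value) pairs as stored by the application;
    # the newest write wins, and a winning tombstone reads as not found.
    timestamp, value = max(siblings, key=lambda s: s[0])
    if isinstance(value, dict) and "deleted_at" in value:
        return None
    return value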

Cheers,
Nico

On 16.06.2011 00:55, David Mitchell wrote:
Erlang: R13B04
Riak: 0.14.2

I have a three node cluster, and while one node was down, I deleted every key 
in a certain bucket.  Then, I started the node that was down, and it joined the 
cluster.

Now, when I do a listing of the keys in this bucket, I get the entire list. I can also get the values from the bucket. However, when I try to delete the keys, the keys are not deleted.

Can anyone help me get the nodes back in a consistent state?  I have tried 
restarting the nodes.

David






_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
