As I understand it, it has to do with a node being up but missing the delete message (remember, if you apply the delete at CL.QUORUM, almost half the replicas can miss it and the delete still succeeds). Imagine that you have 3 nodes A, B, and C, each of which has a column 'foo' with a value 'bar'. Their state would be:

A: 'foo':'bar'
B: 'foo':'bar'
C: 'foo':'bar'
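To make this concrete, here is a toy Python sketch of the sequence walked through below. It is a hypothetical model, not Cassandra's actual implementation: one version per replica, last-write-wins by timestamp, tombstones dropped once they are older than gc_grace, and a read that repairs every replica with the winning version.

# Toy simulation of the undelete scenario described below.
# Hypothetical model, not Cassandra code: last-write-wins by timestamp,
# tombstones are dropped after gc_grace, and a read repairs all replicas.

GC_GRACE = 10  # seconds (Cassandra's actual default is 10 days)

# Each replica maps column -> (value, timestamp); value None is a tombstone.
replicas = {
    "A": {"foo": ("bar", 1)},
    "B": {"foo": ("bar", 1)},
    "C": {"foo": ("bar", 1)},
}

def delete(column, ts, reached):
    """Write a tombstone to the replicas the delete message reached."""
    for name in reached:
        replicas[name][column] = (None, ts)

def compact(now):
    """Drop tombstones older than gc_grace, as compaction eventually does."""
    for data in replicas.values():
        for col, (value, ts) in list(data.items()):
            if value is None and now - ts > GC_GRACE:
                del data[col]

def read(column):
    """Return the newest version across replicas and repair the others."""
    versions = [data[column] for data in replicas.values() if column in data]
    if not versions:
        return None
    winner = max(versions, key=lambda v: v[1])  # last write wins
    for data in replicas.values():              # read repair
        data[column] = winner
    return winner[0]

# The delete reaches A and B (a quorum) but the packet to C is lost.
delete("foo", ts=2, reached=["A", "B"])

# GCGraceSeconds pass and compaction purges the tombstones on A and B.
compact(now=2 + GC_GRACE + 1)

# A later read finds only C's stale copy and repairs it everywhere.
print(read("foo"))  # -> 'bar', the undelete

Running nodetool repair within GCGraceSeconds would have copied the tombstone to C before compaction could purge it, which is the point of the rule in question.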
We attempt to delete column 'foo', and the delete succeeds on nodes A and B (meaning that we succeeded at CL.QUORUM). Unfortunately, the packet going to node C runs afoul of the network gods and gets zapped in transit. The state is now:

A: 'foo':deleted
B: 'foo':deleted
C: 'foo':'bar'

If we try a read at this point at CL.QUORUM, we are guaranteed to get at least one record that 'foo' was deleted, and because of timestamps we know to tell the client as much. After GCGraceSeconds and a compaction, the tombstones on A and B are garbage-collected, so the state of the nodes will be:

A: None
B: None
C: 'foo':'bar'

Some time later, we attempt a read and just happen to get C's response first. The response will be that 'foo' is storing 'bar'. Not only that, but read repair happens as well, so the state becomes:

A: 'foo':'bar'
B: 'foo':'bar'
C: 'foo':'bar'

We have the infamous undelete.

----- Original Message -----
From: "A J" <s5a...@gmail.com>
To: user@cassandra.apache.org
Sent: Thursday, June 30, 2011 8:25:29 PM
Subject: Meaning of 'nodetool repair has to run within GCGraceSeconds'

I am a little confused about why nodetool repair has to run within GCGraceSeconds. The documentation at http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair is not very clear to me. How can a delete be 'unforgotten' if I don't run nodetool repair?

(I understand that if a node is down for more than GCGraceSeconds, I should not bring it back up without resyncing it completely; otherwise deletes may reappear. http://wiki.apache.org/cassandra/DistributedDeletes)

But I am not sure how exactly nodetool repair ties into this mechanism of distributed deletes.

Thanks for any clarifications.