First, some specifics:

> I think my problem is that I don't want to remember to run read repair. I

You are not expected to remember to do so manually. Typically, periodic
repairs are automated in some fashion, such as with a cron job on each
node that starts the repair, usually with some extra logic to avoid
running repair on all nodes at the same time.
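
As a rough sketch (the staggering scheme, paths and repair options here
are my own assumptions, not a standard recipe), a daily cron entry on
each node might invoke something like the following script, which uses
a hash of the hostname to pick that node's repair day so the nodes
don't all repair at once:

    #!/usr/bin/env python
    # Hypothetical cron-driven repair trigger (sketch, not a prescription).
    # Run daily from cron; each node only actually repairs on "its" day,
    # derived from a hash of its hostname, so nodes stagger themselves.
    import socket
    import subprocess
    import sys
    import zlib
    from datetime import date

    def repair_day(hostname):
        # Map this node deterministically onto one day of the week (0-6).
        return zlib.crc32(hostname.encode()) % 7

    def main():
        host = socket.gethostname()
        if date.today().weekday() != repair_day(host):
            return 0  # not this node's day; do nothing
        # '-pr' repairs only this node's primary range so the work is not
        # duplicated across the cluster; adjust the command and options to
        # your nodetool location and Cassandra version.
        return subprocess.call(["nodetool", "repair", "-pr"])

    if __name__ == "__main__":
        sys.exit(main())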

> want to know from cassandra that I "need" to run repair "now". This seems
> like a important functionality that need to be there. I don't really want to
> find out hard way that I forgot to run "repair" :)

See further below.

> Say Node A, B, C. Now A is inconsistent and needs repair. Now Node B goes
> down. Even with Quorum this will fail read and write. There could be other

With writes and reads at QUORUM, a read following a write is
guaranteed to see the write. If enough nodes are down that QUORUM
cannot be satisfied, the read operation will fail. Node B going down
in your scenario is not a problem. If your RF is 3, a write would have
been required to succeed on A and B, or B and C, or A and C. Since
reads have the same requirement, there is by definition always overlap
between the read set and the write set. That is the fundamental point
of using QUORUM.
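
To make the overlap concrete, here is a small illustrative sketch
(mine, not anything from Cassandra itself) that enumerates every
possible write quorum and read quorum for RF=3 and checks that each
pair shares at least one replica:

    # Quick illustration: any two quorums of a 3-node replica set share
    # at least one node, which is why a QUORUM read always sees the
    # latest QUORUM write.
    from itertools import combinations

    replicas = {"A", "B", "C"}
    quorum_size = len(replicas) // 2 + 1  # 2 when RF=3

    quorums = [set(q) for q in combinations(sorted(replicas), quorum_size)]
    for write_set in quorums:
        for read_set in quorums:
            assert write_set & read_set, "quorums must overlap"
    print("every write quorum overlaps every read quorum")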

> scenarios. Looks like repair is critical commands that is expected to be
> run, but "when"? Saying once within GCGraceSeconds might be ok for some but
> not for everyone where we want bring all nodes in sync ASAP.

Let me try to put it in a different light.

The reasons to use 'nodetool repair' seem to fall roughly into two categories:

(a) Ensuring that 'nodetool repair' has been run within GCGraceSeconds.
(b) Helping to increase the 'average' level of consistency as observed
by the application.

These two cases are very, very different. Cassandra makes certain
promises about data consistency that clients can control in part via
consistency levels. If (a) fails, such that a 'nodetool repair' was
not run in time, the cluster will behave *incorrectly*: it will fail
to satisfy the guarantees it supposedly promises (most visibly,
deletes can be forgotten and deleted data can reappear). This is
essentially a binary condition; either you run nodetool repair as
often as is required for correct functioning, or you don't. It is a
*hard* requirement, but it is entirely irrelevant until you actually
reach the limit imposed by GCGraceSeconds. There is no need to run
'repair' as soon as possible (for some definition of 'as soon as
possible') in order to satisfy (a). You're 100% fine until you're not,
at which point you've caused Cassandra to violate its guarantees. So -
it's *important* to run repair because of (a), but it is not *urgent*
to do so.

(b) on the other hand is very different. Assuming your application and
cluster are such that you want to run repair more often than
GCGraceSeconds requires, for whatever reason (for example, for
performance you want to use CL.ONE and turn off read repair, but your
data set is such that it's practical to run fairly frequent repairs to
keep inconsistencies down), it may be beneficial to do so. But this is
essentially a soft 'preference' for how often repairs should run;
there is no magic limit at which something breaks that did not break
before. It becomes a matter of setting a reasonable repair frequency
for your use case, and an individual node failing a repair once for
some obscure reason is not an issue.

For (b), you should be fine just triggering repair sufficiently often
for your needs, with no real need for strict monitoring or hard
requirements. Almost by definition the requirements are not strict; if
they were stricter, you should be using QUORUM, or perhaps ONE plus
read repair. So in this case, "remembering" is not a problem - you
just install a cron job that runs repair approximately often enough,
and don't worry about it.

For (a), there is the hard requirement. This is where you *really*
want the repair to complete, and preferably have some kind of
alarm/notification if a repair doesn't run in time.
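
As an illustration of what such a check might look like (the timestamp
file and thresholds below are assumptions; Cassandra does not maintain
this for you), you could have whatever wrapper runs 'nodetool repair'
touch a file on success, and a separate cron or monitoring check that
alerts when that file's age approaches GCGraceSeconds:

    # Hypothetical monitoring check (e.g. for cron or Nagios): alert if
    # the last successful repair on this node is older than a safety
    # margin below gc_grace_seconds. The timestamp file is assumed to be
    # touched by whatever wrapper runs 'nodetool repair'.
    import os
    import sys
    import time

    LAST_REPAIR_FILE = "/var/lib/cassandra/last_successful_repair"  # assumed path
    GC_GRACE_SECONDS = 10 * 24 * 3600   # default gc_grace_seconds (10 days)
    SAFETY_MARGIN = 2 * 24 * 3600       # alert two days before the limit

    def main():
        try:
            age = time.time() - os.path.getmtime(LAST_REPAIR_FILE)
        except OSError:
            print("CRITICAL: no record of a successful repair")
            return 2
        if age > GC_GRACE_SECONDS - SAFETY_MARGIN:
            print("WARNING: last successful repair was %.1f days ago" % (age / 86400))
            return 1
        print("OK: last successful repair was %.1f days ago" % (age / 86400))
        return 0

    if __name__ == "__main__":
        sys.exit(main())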

Note that for (b), it doesn't help to know the moment a write didn't
get replicated fully. That's bound to happen often (every time a node
is restarted, whenever there is some short hiccup, etc.). A single
write failing to replicate is an almost irrelevant event.

For (a) on the other hand, it *is* helpful and necessary to keep track
of the time of the last successful repair. Cassandra could be better
at making this easy, I think, but it is an entirely different problem
from detecting that "somewhere in the cluster, a non-zero amount of
writes may possibly have failed to replicate". The former is directly
relevant and important; the latter is almost always completely
irrelevant to the problem at hand.

Sorry to be harping on the same issue, but I really think it's worth
trying to be clear about this from the start :) If you do have a
use-case that somehow truly is not consistent with the above, it would
however be interesting to hear what it is.

Is the above clear? I'm thinking maybe it's worth adding to the FAQ
unless it's more confusing than helpful.

-- 
/ Peter Schuller
