> I think what I feel is that there is a need to know if repair is required
> flag in order for team to manage the cluster.

And again, repair is always required essentially. You should *always*
run it within the necessary period as determined by GCGraceSeconds.

> Atleast at minimum, Is there a flag somewhere that tells if repair was run
> within GCGracePeriod?

No, and it's not what you want either, since by the time that flag
says "false", it's already too late :) This is why my best suggestion
for a simple improvement would be to expose the time since the last
successful repair.

Currently this information is, to my knowledge, not exposed by
Cassandra so it is the responsibility of your deployment strategy to
monitor for this. One simple version (not to be used as-is) might be:

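  set -e # important: abort before the final mv if repair fails
  touch /path/to/flagfile.tmp # timestamp taken before repair starts
  nodetool -h localhost repair
  mv /path/to/flagfile.tmp /path/to/flagfile # publish only on success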

The mtime of /path/to/flagfile is the indicator of when repair
succeeded last, assuming a recent version of Cassandra where 'nodetool
repair' is blocking.

The key point is: what you want to monitor is the time since the last
successful repair. If that time exceeds some triggering high water
mark, someone needs to be informed because you are X hours away from
violating the requirements imposed by GCGraceSeconds.
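
To make the monitoring side concrete, here is a rough sketch (again, not
to be used as-is) of a check that compares the age of the flag file
against a warning threshold. The function names and the exit codes
(0 = ok, 1 = warn, 2 = flag file missing) are my own invention for
illustration, and it assumes GNU stat ("stat -c %Y"); on BSD/macOS the
equivalent would be "stat -f %m".

```shell
#!/bin/sh
# Hypothetical helper: print the number of seconds since the flag file
# was last modified. Assumes GNU stat; returns 2 if the file is missing.
repair_age_seconds() {
    flagfile=$1
    now=$(date +%s)
    mtime=$(stat -c %Y "$flagfile") || return 2
    echo $((now - mtime))
}

# Hypothetical check: warn when the last successful repair is older
# than warn_after seconds. Pick warn_after well below GCGraceSeconds
# so there is still time to react.
check_repair_age() {
    flagfile=$1
    warn_after=$2
    age=$(repair_age_seconds "$flagfile") || {
        echo "UNKNOWN: no flag file at $flagfile"
        return 2
    }
    if [ "$age" -ge "$warn_after" ]; then
        echo "WARNING: last successful repair ${age}s ago (threshold ${warn_after}s)"
        return 1
    fi
    echo "OK: last successful repair ${age}s ago"
    return 0
}
```

For example, with GCGraceSeconds at 864000 (10 days), something like
"check_repair_age /path/to/flagfile 604800" from cron or your monitoring
system would start complaining after 7 days, leaving 3 days to get a
repair through. The 7 day figure is only an illustration; the right
margin depends on how long your repairs take.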

(Cassandra could make this easier, but just be clear on what it is
that you're actually looking for. You're *not* looking for "has a
write been timed out ever in the cluster", but rather "are we closer
to GCGraceSeconds than some threshold which we normally should never
reach if repairs are functioning and running as intended".)

-- 
/ Peter Schuller
