> Yes but that doesn't really provide the monitoring that will really be
> helpful. If I don't realize it until 2 days then we potentially could be
> returning inconsistent results or not have data sync for 2 days until repair
> is run. It will be best to be able to monitor these things so that it can be
> run as soon as it is required (eg node down). Have such monitoring will be
> helpful for operations team to monitor also who may not know all internals
> of cassandra.
For the purposes of this discussion, nodes are effectively always down at some point during any non-trivial time window. You may have flapping in the ring, individual requests may time out, etc. Do not assume repair is not required just because you have not had some kind of major outage where a human became consciously aware that a node was officially "down".

Unless you really know what you're doing, the thing to monitor is the completion of repairs at sufficient frequency. In the event that repair *doesn't* run, there needs to be enough time left until tombstone expiry for someone to take some kind of action (whether that is re-running repair or temporarily re-configuring GCGraceSeconds is another matter).

Repair is not something that you only run in the event of some major issue; repair is a regularly scheduled operation for your typical cluster. The invariant required by Cassandra is that repairs complete prior to tombstones expiring (see URL in previous e-mail).

Some applications, given their combination of consistency levels, use case and requirements, may benefit from more frequent repair. But the important part is the minimum repair frequency mandated by Cassandra - and determined by GCGraceSeconds.

-- 
/ Peter Schuller
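P.S. As a rough sketch of the kind of check I mean (not something Cassandra provides out of the box): have your scheduled repair job record its completion time somewhere, and have monitoring alert when that timestamp gets too old relative to gc_grace_seconds, leaving a safety margin for a human to react. The file path, threshold values and Nagios-style exit codes below are just illustrative assumptions.

#!/usr/bin/env python3
# Sketch: alert if the last successful repair is older than
# gc_grace_seconds minus a safety margin. Assumes your repair cron job
# writes a unix timestamp to LAST_REPAIR_FILE on successful completion.
import sys
import time

LAST_REPAIR_FILE = "/var/run/cassandra/last_successful_repair"  # hypothetical path
GC_GRACE_SECONDS = 10 * 24 * 3600      # must match gc_grace_seconds on your column families
SAFETY_MARGIN_SECONDS = 3 * 24 * 3600  # time left for someone to take action

def main():
    try:
        with open(LAST_REPAIR_FILE) as f:
            last_repair = float(f.read().strip())
    except (IOError, ValueError):
        print("CRITICAL: no record of a successful repair")
        sys.exit(2)

    age = time.time() - last_repair
    if age > GC_GRACE_SECONDS - SAFETY_MARGIN_SECONDS:
        print("CRITICAL: last repair %.1f days ago; tombstones expire after %.1f days"
              % (age / 86400.0, GC_GRACE_SECONDS / 86400.0))
        sys.exit(2)

    print("OK: last repair %.1f days ago" % (age / 86400.0))
    sys.exit(0)

if __name__ == "__main__":
    main()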