On Thu, Nov 15, 2012 at 4:12 PM, Dwight Smith <dwight.sm...@genesyslab.com> wrote:
> I have a 4 node cluster, version 1.1.2, replication factor of 4, read/write
> consistency of 3, level compaction. Several questions.
Hinted Handoff is broken in your version [1] (and all versions between 1.0.0 and 1.0.3 [2]). Upgrade to 1.1.6 ASAP so that the answers below actually apply, since they assume working Hinted Handoff.

> 1) Should nodetool repair be run regularly to assure it has completed
> before gc_grace? If it is not run, what are the exposures?

If you do DELETE logical operations, yes. If not, no. gc_grace_seconds only applies to tombstones, and if you do not delete you have no tombstones. If you only DELETE in one columnfamily, that is the only one you have to repair within gc_grace. The exposure is zombie data: a node that missed a DELETE (and its associated tombstone) but still holds a previous value for that column or row can have that stale value resurrected and propagated by read repair.

> 2) If a node goes down, and is brought back up prior to the 1 hour
> hinted handoff expiration, should repair be run immediately?

In theory, if hinted handoff is working, no. This is a good thing, because otherwise simply restarting a node would trigger the need for repair. In practice, I would be shocked if anyone has scientifically tested it to the degree required to be certain all edge cases are covered, so I'm not sure I would rely on this being true, especially as key components of this guarantee, such as Hinted Handoff, can be broken for 3-5 point releases before anyone notices. It is because of this uncertainty that I recommend periodic repair even in clusters that don't do DELETE.

> 3) If the hinted handoff has expired, the plan is to remove the node
> and start a fresh node in its place. Does this approach cause problems?

Yes.

1) You've lost any data that was only ever replicated to this node. With RF>=3, this should be relatively rare, even with CL.ONE, because writes are much more likely to succeed-but-report-they-failed than vice versa. If you run periodic repair, you also cover the case where something gets under-replicated and then even less replicated as nodes are replaced.

2) When you replace the node (presumably using replace_token), it will only stream the relevant data from a single other replica. This means that, given three nodes A, B and C, where datum X is on A and B, and B fails, B's replacement might be bootstrapped using C as a source, decreasing your replica count of X by 1.

In order to deal with these issues, you need to run a repair of the affected node after bootstrapping/replace_tokening. Until this repair completes, CL.ONE reads might be stale or missing.

I think what operators really want is a path by which they can bootstrap and then repair, before returning the node to the cluster. Unfortunately there are significant technical reasons which prevent this from being trivial.

As such, I suggest increasing gc_grace_seconds and max_hint_window_in_ms to reduce the amount of repair you need to run. The negative of increasing gc_grace is that you store tombstones for longer before purging them. The negative of increasing max_hint_window_in_ms is that hints for a given token are stored in one row, and very wide rows can exhibit pathological behavior. Also, if you set max_hint_window_in_ms too high, you could cause cascading failure as nodes fill with hints and become less performant, thereby increasing the cluster-wide hint rate. Unless you have a very high write rate or really lazy ops people who leave nodes down for very long times, the cascading failure case is relatively unlikely.
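For concreteness, here is a rough sketch of the commands and settings discussed above. The host, keyspace and column family names, cron scheduling, and the token placeholder are made up for illustration; double-check the exact syntax against the docs for your 1.1.x version before relying on it.

    # Run repair periodically (e.g. from cron), well inside gc_grace_seconds.
    # Example host and keyspace names; append column family names to narrow it.
    nodetool -h node1.example.com repair MyKeyspace

    # cassandra.yaml: how long coordinators store hints for a down node,
    # in milliseconds. 3600000 (the 1 hour referenced above) is the default:
    #   max_hint_window_in_ms: 3600000

    # cassandra-cli: widen the repair window on a column family that takes
    # DELETEs by raising gc_grace (seconds); 864000 (10 days) is the default:
    #   update column family MyCF with gc_grace = 864000;

    # Replace a dead node at its old token by starting the replacement with
    # the replace_token system property, then repair it once it has joined:
    #   -Dcassandra.replace_token=<token_of_dead_node>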
=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4772
[2] https://issues.apache.org/jira/browse/CASSANDRA-3466

--
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb