First I need to vent.
<rant>
One of my Cassandra clusters is a dual data center setup, with DC1
acting as the primary and DC2 acting as a hot backup.
Well, guess what? I am pretty sure it has fallen behind on
replication. So I am told I need to run repair.
I run repair (with -pr) on DC2. The first time I run it, it gets
*stuck* (i.e., frozen) within the first 30 seconds, with no error or
message of any sort. I then run it again -- and it completes in seconds
on each node, with about 50 gigs of data on each.
That seems suspicious, so I do some research.
I am told on IRC that repair -pr only repairs a node's primary range,
which in my setup amounts to just the 100-token offset between DC1 and
DC2… Seriously???
The repair process is, indeed, a joke:
https://issues.apache.org/jira/browse/CASSANDRA-5396 . Repair is the
worst thing you can do to your cluster: it consumes enormous resources
and can leave your cluster in an inconsistent state. Oh, and by the way,
you must run it every week… Whoever invented that process must not
live in the real world, with real applications.
</rant>
No… let's have a constructive conversation.
How do I know, with certainty, that DC2 is up to date on
replication? I have a few options:
1) I set read repair chance to 100% on critical column families and
write a tool to scan every CF, every column of every row. This strikes
me as very silly.
Q1: Do I need to scan every column, or is reading one column enough
to trigger a read repair?
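For option 1, I imagine something along these lines (CQL3, with a
made-up keyspace and column family name), plus a tool that then reads
every row back to give read repair a chance to run:

    ALTER TABLE prod.critical_cf WITH read_repair_chance = 1.0;
    -- and then, for each row, the scanning tool issues something like:
    SELECT * FROM prod.critical_cf WHERE key = 'some-row-key';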
2) Can someone explain to me how repair works, such that I don't
totally trash my cluster or spill into the work week?
Is there any improvement and clarity in 1.2? How about 2.0?
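For reference, here is the procedure I *think* I am supposed to follow
(hostnames and keyspace are made up) -- please correct me if I have it
wrong:

    # run a primary-range repair on every node in BOTH data centers,
    # one node at a time, letting each finish before starting the next
    for host in dc1-node1 dc1-node2 dc2-node1 dc2-node2; do
        nodetool -h "$host" repair -pr my_keyspace
    done

    # meanwhile, watch progress on the node being repaired:
    nodetool -h dc1-node1 compactionstats   # validation compactions
    nodetool -h dc1-node1 netstats          # streaming between replicas

And, as I understand it, the whole thing has to complete within
gc_grace_seconds (10 days by default), which is where the "run it every
week" advice comes from.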
--
Regards,
Oleg Dulin
http://www.olegdulin.com