Re: (unofficial) Community Poll for Production Operators : Repair

Hiller, Dean Tue, 14 May 2013 04:48:34 -0700

We had to roll out a fix in Cassandra because a slow node was slowing down our 
Cassandra clients in 1.2.2 for some reason.  Every time we had a slow node, we 
found out quickly because performance degraded.  We tested this in QA and saw 
the same issue.  This means that a repair made a node slow, which in turn made 
our clients slow.  With this fix, which I think someone on our team is going to 
try to get back into Cassandra, the slow node no longer affects our clients.


I am curious though: if someone else used the "tc" program to simulate Linux 
packet delay on a single node, would your clients' response time get much 
slower?  We simulated a 500ms delay on one node to mimic the slow node. It 
seems the coordinator node was incorrectly waiting for BOTH responses at 
CL_QUORUM instead of just one (since the coordinator itself counted as one as 
well), or something like that.  (I don't know too much, as my colleague was the 
one who debugged this issue.)
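
A rough way to see why this matters is the toy model below. It is only a sketch under stated assumptions, not Cassandra code and not the actual fix: it assumes a replication factor of 3 (not stated in this thread), a coordinator that is itself a replica, and made-up latencies. The function names are hypothetical. The delay itself can be injected with netem, for example `tc qdisc add dev eth0 root netem delay 500ms`, where `eth0` is an assumed interface name.

```python
def quorum_size(replication_factor):
    """Number of replica responses needed for QUORUM."""
    return replication_factor // 2 + 1


def quorum_latency_ms(replica_latencies_ms, required):
    """Latency if the coordinator answers as soon as `required`
    replicas have responded (the k-th fastest response)."""
    return sorted(replica_latencies_ms)[required - 1]


def wait_for_all_latency_ms(replica_latencies_ms):
    """Latency if the coordinator waits for every replica,
    which is the behavior described above as incorrect."""
    return max(replica_latencies_ms)


# Assumed numbers: the coordinator's own replica (0 ms), one healthy
# remote replica (5 ms), and one remote replica with 500 ms added.
latencies = [0, 5, 505]
required = quorum_size(3)  # assumed RF=3, so 2 responses are needed

print("expected:", quorum_latency_ms(latencies, required), "ms")
print("wait for all:", wait_for_all_latency_ms(latencies), "ms")
```

In this model a single delayed replica should not change the client-visible latency at QUORUM, so a large jump in response time under the tc test would point at the coordinator waiting for more responses than it needs.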

Dean

From: Alain RODRIGUEZ <arodr...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 14, 2013 1:42 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: (unofficial) Community Poll for Production Operators : Repair

Hi Rob,

1) 1.2.2 on 6 to 12 EC2 m1.xlarge
2) Quorum reads and writes. Almost no deletes (just some TTLs)
3) Yes
4) On each node once a week (rolling repairs using crontab)
5) The only behavior that is quite odd or unexplained to me is why a repair 
doesn't fix a counter mismatch between 2 nodes. When I read my counters at 
CL.ONE I get inconsistent results (the counter value may change each time I 
read it, depending, I guess, on which node I read from). Reading at CL.QUORUM 
avoids this, but the data is still wrong on some nodes. As for performance, a 
repair is quite expensive to run, but doing it during a low-load period and in 
a rolling fashion works quite well and has no impact on the service.
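
As an illustration of the weekly rolling schedule in answer 4, here is a minimal sketch that spreads one crontab entry per node evenly across the week. It is not the actual setup described above: the node names, the even spacing, and the helper name are all assumptions, and it uses plain `nodetool repair` with no extra options.

```python
def weekly_repair_schedule(nodes):
    """Return {node: crontab line}, giving each node its own weekly slot.

    Slots are spaced evenly over the 168 hours of a week so that no two
    nodes start a repair at the same time.
    """
    if not 1 <= len(nodes) <= 168:
        raise ValueError("need between 1 and 168 nodes")
    step_hours = 168 // len(nodes)
    schedule = {}
    for i, node in enumerate(nodes):
        slot = i * step_hours
        day_of_week = slot // 24  # cron convention: 0 = Sunday
        hour = slot % 24
        # crontab fields: minute hour day-of-month month day-of-week command
        schedule[node] = f"0 {hour} * * {day_of_week} nodetool repair"
    return schedule


# Hypothetical node names for a 6-node cluster.
for node, line in weekly_repair_schedule(
        ["node1", "node2", "node3", "node4", "node5", "node6"]).items():
    print(node, "->", line)
```

Each line would go into the crontab of the corresponding node. The even spacing ignores traffic patterns, so the hours would need shifting to fit low-load periods, and the gap between slots has to be longer than a repair takes for the repairs to stay rolling.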

Hope this will help somehow. Let me know if you need more information.

Alain



2013/5/10 Robert Coli <rc...@eventbrite.com>
Hi!

I have been wondering how Repair is actually used by operators. If
people operating Cassandra in production could answer the following
questions, I would greatly appreciate it.

1) What version of Cassandra do you run, on what hardware?
2) What consistency level do you write at? Do you do DELETEs?
3) Do you run a regularly scheduled repair?
4) If you answered "yes" to 3, what is the frequency of the repair?
5) What has been your subjective experience with the performance of
repair? (Does it work as you would expect? Does its overhead have a
significant impact on the performance of your cluster?)

Thanks!

=Rob
