Ian,

In my experience I don’t get any output from repair (2.0.7) that is useful 
until the keyspace is finished.  Perhaps this has been solved but we do 
something much more painful:


We tail the log on the node that repair is running on, watch for the first 
repair session, and then count each “session completed” line.  Each keyspace 
being repaired will produce num_tokens worth of messages.

Find the start time:
$ grep AntiEntropy /var/log/cassandra/system.log | grep -m 1 "new session"
INFO [AntiEntropySessions:1] 2015-01-06 08:00:01,817 RepairSession.java (line 
244) [repair #1c1023c0-95b0-11e4-abc7-9d8c76a06ae7] new session: will sync 
/10.x.y.z, /10.x.y.z on range (2770269247941187446,2771538486312712323] for 
menomena.[x, y, z]
Note – you have to catch the *first* such message; there will be many more to 
follow.  It would be great if the log output had a differentiator so you could 
tell the initial start of a repair from the start of just another range.


So start_time = 2015-01-06 08:00:01,817


From there you count the “session completed” messages:
$ grep AntiEntropy /var/log/cassandra/system.log | grep "session completed" | wc -l

The lines being counted look like this:
INFO [AntiEntropySessions:192] 2015-01-06 14:35:13,874 RepairSession.java (line 
282) [repair #1c1023c0-95b0-11e4-abc7-9d8c76a06ae7] session completed 
successfully

Since I have num_tokens=256, if I see a count of 412 I know that OpsCenter (256 
ranges) is finished and menomena is roughly 60% finished (156 of its 256 ranges 
done, about 40% remaining).

As Jan said, you could then use this to calculate remaining time from the start 
time and the remainder of the ranges.
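
Here is a rough sketch of that calculation as a shell script, under a few 
assumptions: GNU date, the log path and message formats shown above, and 
knowing up front how many ranges to expect (num_tokens times the number of 
keyspaces being repaired).  The TOKENS and KEYSPACES values below are 
placeholders for your own numbers.

#!/bin/bash
LOG=/var/log/cassandra/system.log
TOKENS=256          # num_tokens on this node
KEYSPACES=2         # keyspaces included in this repair
TOTAL=$((TOKENS * KEYSPACES))

# The timestamp of the first "new session" line marks the start of the repair.
start=$(grep AntiEntropy "$LOG" | grep -m 1 "new session" | awk '{print $3" "$4}' | cut -d, -f1)
start_s=$(date -d "$start" +%s)

done_count=$(grep AntiEntropy "$LOG" | grep -c "session completed")
elapsed=$(( $(date +%s) - start_s ))
remaining=$((TOTAL - done_count))

# Extrapolate the average seconds per completed range over what is left.
if [ "$done_count" -gt 0 ]; then
  echo "completed $done_count/$TOTAL ranges in ${elapsed}s; roughly $(( elapsed * remaining / done_count ))s remaining"
fi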

I’ve found this gives an immediate indication of progress, rather than having 
to wait for the keyspace to be finished.  We are running 2.0.7; maybe some of 
this has since been exposed through nodetool repair (which would be sweet).  It 
seems to be more or less accurate, but please correct me if I am wrong.  We use 
this more for automatically detecting long-running repairs than for simply 
watching progress; our internal Zabbix server will whine to my team about it.
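
In case it helps, the Zabbix side of that can be a dead-simple item script; 
something along these lines, where the threshold, expected range total, and 
log path are all assumptions you would tune for your own cluster:

#!/bin/bash
LOG=/var/log/cassandra/system.log
MAX_HOURS=12        # hypothetical "this repair has run too long" threshold
TOTAL=512           # num_tokens * number of keyspaces being repaired

start=$(grep AntiEntropy "$LOG" | grep -m 1 "new session" | awk '{print $3" "$4}' | cut -d, -f1)
if [ -z "$start" ]; then echo 0; exit 0; fi     # no repair seen in this log

age_h=$(( ( $(date +%s) - $(date -d "$start" +%s) ) / 3600 ))
done_count=$(grep AntiEntropy "$LOG" | grep -c "session completed")

# Print 1 (alert) if the repair is still short of the expected range count
# after MAX_HOURS, otherwise 0; Zabbix just triggers on the value.
if [ "$done_count" -lt "$TOTAL" ] && [ "$age_h" -ge "$MAX_HOURS" ]; then
  echo 1
else
  echo 0
fi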


Jason Kushmaul | V.P. Mobile Engineering
4050 Hunsaker Drive | East Lansing, MI 48823 USA
517-337-2701 x 5225| 517-337-2754 (fax)

From: Jan [mailto:cne...@yahoo.com]
Sent: Thursday, March 19, 2015 4:04 PM
To: user@cassandra.apache.org
Subject: Re: best way to measure repair times?

Ian;

to respond to your specific question:

You could pipe the output of your repair into a file and subsequently determine 
the time taken.
example:

nodetool repair -dc DC1

[2014-07-24 21:59:55,326] Nothing to repair for keyspace 'system'
[2014-07-24 21:59:55,617] Starting repair command #2, repairing 490 ranges
  for keyspace system_traces (seq=true, full=true)
[2014-07-24 22:23:14,299] Repair session 323b9490-137e-11e4-88e3-c972e09793ca
  for range (820981369067266915,822627736366088177] finished
[2014-07-24 22:23:14,320] Repair session 38496a61-137e-11e4-88e3-c972e09793ca
  for range (2506042417712465541,2515941262699962473] finished


What to look for:

a)  Look for the specific name of the keyspace and the phrase 'Starting repair'.

b)  Look for the word 'finished'.

c)  Compute the average time per keyspace and you will have a rough idea of how 
long your repairs take on a regular basis.    This applies only to continual 
operational repair, not the first time it is run.
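
To make (c) concrete, here is a minimal sketch of that approach, assuming GNU 
date and the bracketed timestamp format shown above (the log file name is an 
arbitrary choice):

#!/bin/bash
LOG=repair-$(date +%F).log
nodetool repair -dc DC1 > "$LOG" 2>&1

# The first and last bracketed timestamps bound the whole run.
first=$(grep -m 1 -o '^\[[^]]*\]' "$LOG" | tr -d '[]' | cut -d, -f1)
last=$(grep -o '^\[[^]]*\]' "$LOG" | tail -n 1 | tr -d '[]' | cut -d, -f1)

echo "repair took $(( $(date -d "$last" +%s) - $(date -d "$first" +%s) )) seconds"

Per-keyspace times fall out the same way if you take the timestamps between 
successive "Starting repair command ... for keyspace" lines.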



hope this helps

Jan/





On Thursday, March 19, 2015 12:55 PM, Paulo Motta 
<pauloricard...@gmail.com> wrote:

From: http://www.datastax.com/dev/blog/modern-hinted-handoff
Repair and the fine print
At first glance, it may appear that Hinted Handoff lets you safely get away 
without needing repair. This is only true if you never have hardware failure. 
Hardware failure means that

 1.  We lose “historical” data for which the write has already finished, so 
there is nothing to tell the rest of the cluster exactly what data has gone 
missing
 2.  We can also lose hints-not-yet-replayed from requests the failed node 
coordinated
With sufficient dedication, you can get by with “only run repair after hardware 
failure and rely on hinted handoff the rest of the time,” but as your clusters 
grow (and hardware failure becomes more common) performing repair as a one-off 
special case will become increasingly difficult to do perfectly. Thus, we 
continue to recommend running a full repair weekly.


2015-03-19 16:42 GMT-03:00 Robert Coli <rc...@eventbrite.com>:
On Thu, Mar 19, 2015 at 12:13 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
Cassandra doesn't guarantee eventual consistency?

If you run regularly scheduled repair, it does. If you do not run repair, it 
does not.

Hinted handoff, for example, is considered an optimization for repair, and does 
not assert that it provides a consistency guarantee.

=Rob
http://twitter.com/rcolidba



--
Paulo Ricardo

--
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com
