Has anyone looked much at the maximum latency of cassandra read/write
requests?  (rather than the average latency and average throughput)

We've been struggling for quite some time trying to figure out why we
we see occasional read or write response times in the 100s of milliseconds
even on fast machines that normally respond in just a few milleconds.   
We've spent a lot of time attempting to trying to tune our benchark
and cassandra configurations to lower these maximum times.   There are
a lot of things that make it a little better, or a little worse, but we've
found it nearly impossible to eliminate these "outliers" completely.

A lot of our initial testing was done with a home-grown benchmark test
written in C++ and using the thrift interface.   However, now that we've 
recently upgraded from 0.6.8 to 1.0.3, that has allowed me to do some 
testing using the official java "stress" tool.   The problem, at least
for this purpose, is that the stress tool only reports *average*
response times over the measurement intervals.    This effectively
hides the large value if they are infrequent relative to measurement
interval.   I've modified the stress test, so that it also tracks the
maximum latency reported over each measurement interval. 

Here is an excerpt from a typical result:

$ bin/stress -d XXX -p 7777 -e QUORUM  -t 4  -i 1  -l 3 -c 1 -n 400000
total interval_op_rate interval_key_rate avg_latency elapsed_time max(millisec)
5780 5780 5780 6.098615916955017E-4 1 13
13837 8057 8057 5.003102891895247E-4 2 4
22729 8892 8892 4.7199730094466935E-4 3 4
31840 9111 9111 4.6504225661288555E-4 4 1
40925 9085 9085 4.6846450192625206E-4 5 1
49076 8151 8151 5.200588884799411E-4 6 100
...
3186625 8886 8886 4.786180508665316E-4 411 10
3195626 9001 9001 4.705032774136207E-4 412 1
3204574 8948 8948 4.710549843540456E-4 414 1
3213524 8950 8950 4.7195530726256986E-4 415 1
3217534 4010 4010 0.0010763092269326683 416 607
3226560 9026 9026 4.695324617770884E-4 417 1
3235425 8865 8865 4.7805978567399887E-4 418 1
3244177 8752 8752 4.848034734917733E-4 419 10
...


My patch adds the final column which logs the maximum response time 
over the one-second interval.  In most cases the average reponse time 
is under 1 msec, and though the maximum might be a bit larger, it's still
just a few milleseonds - usually under 10 msec.   But sometimes (like
interval 416) one of the response took 607 milliseconds.

These numbers aren't too bad if you are supporting an interactive 
application and don't mind a slightly slower response now and then as
long as the average stays low and throughput stays high.  But for 
other types of applications, these slow responses might be a problem.

I'm trying to understand if this is expected or not, and if there is
anything we can do about.   I'd also be interested in hearing from folks
that have run the stress test agains their own Cassandra clusters.   

If anyone wants to try this, I've included  my patch to the stress test 
below.   So far, I've only instrumented the default "insert" operation. 
(It also adusts the output to separate fields with spaces instead of
commas.  I find that easier to read and it caters to gnuplot)

diff -ur stress/src/org/apache/cassandra/stress/operations/Inserter.java 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/operations/Inserter.java
--- stress/src/org/apache/cassandra/stress/operations/Inserter.java     
2011-11-15 02:57:23.000000000 -0600
+++ 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/operations/Inserter.java
  2011-12-09 08:58:04.000000000 -0600
@@ -108,6 +108,11 @@
         session.operations.getAndIncrement();
         session.keys.getAndIncrement();
         session.latency.getAndAdd(System.currentTimeMillis() - start);
+
+long  delta = System.currentTimeMillis() - start;
+if ( delta > session.maxlatency.get() ) {
+session.maxlatency.set(delta);
+}
     }
 
     private Map<String, List<Mutation>> 
getSuperColumnsMutationMap(List<SuperColumn> superColumns)
diff -ur stress/src/org/apache/cassandra/stress/Session.java 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/Session.java
--- stress/src/org/apache/cassandra/stress/Session.java 2011-11-15 
02:57:23.000000000 -0600
+++ 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/Session.java
      2011-12-09 08:48:45.000000000 -0600
@@ -53,6 +53,7 @@
     public final AtomicInteger operations;
     public final AtomicInteger keys;
     public final AtomicLong    latency;
+    public final AtomicLong    maxlatency;
 
     static
     {
@@ -337,6 +338,7 @@
         operations = new AtomicInteger();
         keys = new AtomicInteger();
         latency = new AtomicLong();
+       maxlatency = new AtomicLong();
     }
 
     public int getCardinality()
diff -ur stress/src/org/apache/cassandra/stress/StressAction.java 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/StressAction.java
--- stress/src/org/apache/cassandra/stress/StressAction.java    2011-11-15 
02:57:23.000000000 -0600
+++ 
/home/psfales/MME/apache-cassandra-1.0.3-src/tools/stress/src/org/apache/cassandra/stress/StressAction.java
 2011-12-09 09:00:41.000000000 -0600
@@ -55,8 +55,7 @@
         int threadCount = client.getThreads();
         Consumer[] consumers = new Consumer[threadCount];
 
-        
output.println("total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time");
-
+        output.println("total interval_op_rate interval_key_rate avg_latency 
elapsed_time max(millisec)");
         int itemsPerThread = client.getKeysPerThread();
         int modulo = client.getNumKeys() % threadCount;
 
@@ -129,11 +128,13 @@
                 int opDelta = total - oldTotal;
                 int keyDelta = keyCount - oldKeyCount;
                 double latencyDelta = latency - oldLatency;
-
+               long maxdelta = client.maxlatency.get();
+               client.maxlatency.set(0);
                 long currentTimeInSeconds = (System.currentTimeMillis() - 
testStartTime) / 1000;
                 String formattedDelta = (opDelta > 0) ? 
Double.toString(latencyDelta / (opDelta * 1000)) : "NaN";
 
-                output.println(String.format("%d,%d,%d,%s,%d", total, opDelta 
/ interval, keyDelta / interval, formattedDelta, currentTimeInSeconds));
+                output.println(String.format("%d %d %d %s %d %d", total, 
opDelta / interval, keyDelta / interval, formattedDelta, 
currentTimeInSeconds,maxdelta));
+       
             }
         }
 


-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031

Reply via email to