Hi all,

I'm seeing strange latency spikes on writes and am trying to figure out 
what could cause them.

My setup:
- 3 nodes, writing at CL.ONE using Hector client, no reads
- Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no 
updates, RF 3
   - 2 CFs have small data (row count < 2000, row size < 500kB, column 
count/row < 15 000)
   - 1 CF has lots of binary data split into ~60kB columns (row count < 550 
000, row sizes < 2MB, column count/row < 40)
   - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB 
(bytes) / second
   - data is time series using timestamp as column key
- Cassandra 1.2.2 with 256 vnodes on each machine
- Key cache at default 100MB, no row cache
- 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 
1.6 (tried also 1.7), 4GB JVM heap, JNA enabled
- all nodes in the same DC, 1Gb network, sub ms latencies between nodes

cassandra.yaml: http://pastebin.com/MSr2prpb
cfstats: http://pastebin.com/Ax5vPUcY
example cfhistograms: http://pastebin.com/qYSL1MX3
example proxy histograms: http://pastebin.com/X3AGGEjh

With this setup I usually get quite good write latencies of less than 20ms, but 
sometimes (about once every few minutes) latencies momentarily spike to more 
than 300ms, maxing out at ~2.5 seconds. The spikes are short (< 1 s) and happen 
on all nodes (but not at the same time). Even though average latencies are very 
good, these spikes cause us headaches because of our SLA.

While investigating I have learned the following:
- No evident GC pressure (nothing in C* logs, GC logging shows collection 
pauses consistently < 30ms)
- Not I/O bound (disks provide ~1GB/s linear write and are mostly idle apart 
from memtable flushes every ~11s)
- No relation between spikes & compaction
- No queuing in memtable FlushWriter, no blocked memtable flushes
- Nothing alarming in logs
- No timeouts, no errors on the client side
- All clients (3 separate machines) experience the latency spikes 
simultaneously, which points to the cause being in C*, not in the client
- CPU load < 10% (< 20% while compacting)
- Latencies measured both from the client and observed using nodetool 
cfhistograms
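
For anyone who wants to line the spikes up against node-side events (GC log 
entries, flushes, compactions), here is a minimal sketch of how client-side 
measurements could be grouped into spike windows. It assumes latencies are 
available as (timestamp in seconds, latency in ms) pairs. The names 
find_spikes, Spike and MERGE_GAP_S are hypothetical, and the sample values 
are made up for illustration, not measured data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

SPIKE_THRESHOLD_MS = 300.0  # spike level mentioned above
MERGE_GAP_S = 1.0           # assumption: slow writes within 1 s form one spike


@dataclass
class Spike:
    start_s: float         # timestamp of the first slow write in the window
    end_s: float           # timestamp of the last slow write in the window
    max_latency_ms: float  # worst latency seen in the window
    samples: int           # number of slow writes in the window


def find_spikes(
    samples: Iterable[Tuple[float, float]],
    threshold_ms: float = SPIKE_THRESHOLD_MS,
    merge_gap_s: float = MERGE_GAP_S,
) -> List[Spike]:
    """Group writes slower than threshold_ms into contiguous spike windows."""
    spikes: List[Spike] = []
    for ts, latency_ms in sorted(samples):
        if latency_ms <= threshold_ms:
            continue  # normal write, not part of a spike
        if spikes and ts - spikes[-1].end_s <= merge_gap_s:
            # Close enough to the previous slow write: extend that window.
            last = spikes[-1]
            last.end_s = ts
            last.max_latency_ms = max(last.max_latency_ms, latency_ms)
            last.samples += 1
        else:
            spikes.append(Spike(ts, ts, latency_ms, 1))
    return spikes


if __name__ == "__main__":
    # Illustrative values only.
    example = [(0.0, 12.0), (0.5, 18.0), (120.1, 350.0),
               (120.4, 2500.0), (120.9, 15.0), (300.0, 410.0)]
    for spike in find_spikes(example):
        print(spike)
```

The start and end timestamps of each window can then be compared with the 
node logs for the same period.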

Now I'm running out of ideas about what might cause the spikes, since as I 
understand it there are really not that many places on the write path that 
could block.

Any ideas?

-Jouni
