Hi all, I'm experiencing strange latency spikes when writing and trying to figure out what could cause them.
My setup: - 3 nodes, writing at CL.ONE using Hector client, no reads - Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no updates, RF 3 - 2 CFs have small data (row count < 2000, row size < 500kB, column count/row < 15 000) - 1 CF has lots of binary data split into ~60kB columns (row count < 550 000, row sizes < 2MB, column count/row < 40) - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB (bytes) / second - data is time series using timestamp as column key - Cassandra 1.2.2 with 256 vnodes on each machine - Key cache at default 100MB, no row cache - 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 1.6 (tried also 1.7), 4GB JVM heap, JNA enabled - all nodes in the same DC, 1Gb network, sub ms latencies between nodes cassandra.yaml: http://pastebin.com/MSr2prpb cfstats: http://pastebin.com/Ax5vPUcY example cfhistograms: http://pastebin.com/qYSL1MX3 example proxy histograms: http://pastebin.com/X3AGGEjh With this setup I usually get quite nice write latencies of less than 20ms, but sometimes (~once in a every few minutes) latencies momentarily spike to more than 300ms maxing out at ~2.5 seconds. Spikes are short (< 1 s) and happen on all nodes (but not at the same time). Even if avg latencies are very good, these spikes cause us headaches due to our SLA. While investigating I have learned the following: - No evident GC pressure (nothing in C* logs, GC logging showing constantly < 30ms collection pauses) - No I/O bounds (disks provide ~1GB/s linear write and are mostly idle apart from memtable flushes for every ~11s) - No relation between spikes & compaction - No queuing in memtable FlushWriter, no blocked memtable flushes - Nothing alarming in logs - No timeouts, no errors on the client side - Each client (3 separate machines) experience latencies simultaneously which points to cause being in C*, not in the client - CPU load < 10% (< 20% while compacting) - Latencies measured both from the client and observed using nodetool cfhistograms Now I'm running out of ideas about what might cause the spikes as I have understood that there is really not that many places on the write path that could block. Any ideas? -Jouni