Hi Peter, I'm going to merge the response to your email with my other email from yesterday, since they pertain to the same issue. Sorry this is a little long, but I'm stumped and I'm trying to describe what I've investigated.
In a nutshell, in case someone has encountered this and won't read to the end: a write-heavy process is causing the ring to appear to "freeze" (=> utilization = 0%). Its Hector speed4j logs indicate failures and successes with max=38s, while other read/write processes all indicate max=28s. It looks like I've got a magic number I can't figure out.

> You do say "nodes handling the requests". Two things to always keep in
> mind is to (1) spread the requests evenly across all members of the
> cluster, and (2) if you are doing a lot of work per row key, spread it
> around and be concurrent so that you're not hitting a single row at a
> time, which will be under the responsibility of a single set of RF
> nodes (you want to put load on the entire cluster evenly if you want
> to maximize throughput).

I'm using Hector to connect to the cluster with autoDiscover=true. Furthermore, I see in my logs that updates do get sent to multiple nodes, so (1) is ok. Regarding (2), I may be running into this since data updates are very localized by design. I've distributed the keys by storage load, but I'm going to have to distribute them by read/write load, since the workload is anything but random and I'm using BOP. However, I never see an IO bottleneck when using iostat, see below.

> For starters, what *is* the throughput? How many counter mutations are
> you submitting per second?

I've got two processes doing writes in parallel. The one we are currently discussing ("Process A") only writes, while the other one ("Process B") reads 2 to 4x more data than it writes. Process A typically looks like this (numbers come from Hector). Each line below is one Cassandra batch, i.e. one Hector Mutator.execute():

15:15:53 Wrote 86 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (153 usecs)
15:15:53 Wrote 90 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (97 usecs)
15:15:54 Wrote 85 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (754 usecs)
15:15:54 Wrote 81 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (561 usecs)
15:15:54 Wrote 86 cassandra mutations using host 176.31.226.128(176.31.226.128):9160 (130 usecs)
15:15:54 Wrote 73 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (97 usecs)
15:15:54 Wrote 82 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (48 usecs)
15:15:56 Wrote 108 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (1653 usecs)
15:15:56 Wrote 84 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (23 usecs)

I'm pretty sure those are milliseconds and not the microseconds the Hector docs mention (see the last two lines and their timestamps), which would amount to 500 to 1000 mutations per second, with a min at 65 and a max at 3652. Cluster-wide, OpsCenter is reporting 10 write requests per second in the 20 min graph, but that can't be right. The exact number is somewhere in the thousands of keys read per second, but my problem with writes is really so big that it doesn't matter what the actual number is, see below.
What's really puzzling is this, found in the logs created by Hector for Process B:

Tag             Avg(ms)   Min     Max       Std Dev   95th       Count
WRITE.success_  1709.64   0.83    28982.61  6100.55   21052.93   267
READ.success_   262.64    17.25   1343.53   191.79    610.99     637

(+ hardly ever any failures)

At the same time, for Process A, I see this:

15:29:07 Tag             Avg(ms)    Min        Max        Std Dev   95th       Count
15:29:07 WRITE.success_  584.76     13.23      38042.17   4242.24   334.84     79
15:29:07 WRITE.fail_     38008.16   38008.16   38008.16   0.00      38008.16   1

(failures every minute)

So there is at least one WRITE which is very, very long: 28s for Process B and 38s for Process A. In fact, it looks like a magic timeout number, because I see those two numbers all the time in the logs:

WRITE.success_  1603.61  1.11  28829.06  6069.97  21987.07  152
WRITE.success_  307.56   0.81  29958.18  2879.91  39.98     918
WRITE.success_  1664.64  0.88  29953.52  6127.34  20023.88  276

However, I can't link it to anything. My Hector failover timeout is 2s and everything else is just default install values. Even if Hector were backing off multiple times until it worked, why would I always get the same 28/38s value?

When I get a log like these, there is always a "cluster freeze" during the preceding minute. By "cluster freeze", I mean that a couple of nodes go to 0% utilization (no cpu, no system, no io).

Once I noticed this, I shut down Process A and watched Process B's performance logs. It's all back to normal now:

Tag             Avg(ms)  Min     Max      Std Dev  95th    Count
WRITE.success_  5.43     0.91    253.49   21.91    10.96   245
READ.success_   397.07   125.56  1684.41  127.33   613.21  1024

I ran a really long ping during these tests and I never got a timeout; every ping was about 0.2 ms.

I may be overloading the cluster when Process A runs, but I would like to understand why so I can do something about it. What I'm trying to figure out is:
- why would counter writes time out at 28 or 38s (5-node cluster)?
- what could cause the cluster to "freeze" during those timeouts?

If you have any answers or ideas on how I could find an answer, that would be great.
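For completeness, the Hector side is essentially a stock setup; the only non-default bits are autoDiscoverHosts and the 2s timeout, roughly as sketched below. (This is a sketch with placeholder seed host and cluster name, and the 2s is shown here as the thrift socket timeout, which may not be the exact knob involved.)

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorConfigSketch {
    public static void main(String[] args) {
        CassandraHostConfigurator conf = new CassandraHostConfigurator("X.Y.Z.95:9160"); // placeholder seed node
        conf.setAutoDiscoverHosts(true);            // so updates get spread across the ring
        conf.setCassandraThriftSocketTimeout(2000); // the 2s timeout, in ms (assumed to be this setting)
        // Everything else (pool sizes, retry handling, cassandra.yaml) is left at default install values.
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", conf);   // placeholder cluster name
    }
}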