Interesting development: I changed the maximum size of the batches in "Process A" so that they went from about 90 mutations per execute() to about 35. All the weird 28s/38s maximum execution times are gone, all timeouts are gone, and everything is zipping along just fine. So the moral of the story for me is: only batch if you actually gain something from it, because it might break stuff.
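In case it helps anyone who hits this later, here's roughly what the capping looks like (a minimal sketch only; the cap value, column family, and class names are illustrative, not my actual code):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CappedBatchWriter {
    private static final int MAX_BATCH = 35; // was ~90 before the change

    private final Keyspace keyspace;
    private Mutator<String> mutator;
    private int pending = 0;

    public CappedBatchWriter(Keyspace keyspace) {
        this.keyspace = keyspace;
        this.mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    }

    // Queue one counter increment; flush as soon as the cap is reached.
    public void increment(String rowKey, String cf, String column, long delta) {
        mutator.addCounter(rowKey, cf,
                HFactory.createCounterColumn(column, delta, StringSerializer.get()));
        if (++pending >= MAX_BATCH) {
            flush();
        }
    }

    // Send whatever is queued as one Cassandra batch, i.e. one Mutator.execute().
    public void flush() {
        if (pending > 0) {
            mutator.execute();
            mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            pending = 0;
        }
    }
}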
Given this work-around, can anyone explain to me why this was happening?

2011/12/11 Philippe <watche...@gmail.com>

> Hi Peter,
> I'm going to mix the response to your email along with my other email from
> yesterday since they pertain to the same issue.
> Sorry this is a little long, but I'm stumped and I'm trying to describe
> what I've investigated.
>
> In a nutshell, in case someone has encountered this and won't read to the
> end: a write-heavy process is causing the ring to appear to "freeze"
> (=> utilization = 0%). Its Hector speed4j logs indicate failures and
> successes at max=38s while other read/write processes are all indicating
> max=28s. It looks like I've got a magic number I can't figure out.
>
>> You do say "nodes handling the requests". Two things to always keep in
>> mind are to (1) spread the requests evenly across all members of the
>> cluster, and (2) if you are doing a lot of work per row key, spread it
>> around and be concurrent so that you're not hitting a single row at a
>> time, which will be under the responsibility of a single set of RF
>> nodes (you want to put load on the entire cluster evenly if you want
>> to maximize throughput).
>
> I'm using Hector to connect to the cluster along with autoDiscover=true.
> Furthermore, I see in my logs that updates do get sent to multiple nodes,
> so (1) is OK.
> Regarding (2), I may be running into this since data updates are very
> localized by design. I've distributed the keys by storage load, but I'm
> going to have to distribute them by read/write load since the workload is
> anything but random and I'm using BOP. However, I never see an IO
> bottleneck when using iostat, see below.
>
>> For starters, what *is* the throughput? How many counter mutations are
>> you submitting per second?
>
> I've got two processes doing writes in parallel. The one we are currently
> discussing ("Process A") only writes, while the other one ("Process B")
> reads 2 to 4x more data than it writes.
>
> Process A typically looks like this (numbers come from Hector). Each line
> below is one Cassandra batch, i.e. one Hector Mutator.execute():
> 15:15:53 Wrote 86 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (153 usecs)
> 15:15:53 Wrote 90 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (97 usecs)
> 15:15:54 Wrote 85 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (754 usecs)
> 15:15:54 Wrote 81 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (561 usecs)
> 15:15:54 Wrote 86 cassandra mutations using host 176.31.226.128(176.31.226.128):9160 (130 usecs)
> 15:15:54 Wrote 73 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (97 usecs)
> 15:15:54 Wrote 82 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (48 usecs)
> 15:15:56 Wrote 108 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (1653 usecs)
> 15:15:56 Wrote 84 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (23 usecs)
> I'm pretty sure those durations are milliseconds and not microseconds,
> despite what the Hector docs say (see the last two lines & their
> timestamps), which would amount to 500 to 1000 mutations per second with
> a min at 65 and a max at 3652.
> Cluster-wide, OpsCenter is reporting 10 write requests per second in the
> 20-minute graph, but that can't be right.
>
> The exact number is somewhere in the thousands of keys read per second,
> but my problem with writes is really so big it doesn't matter what the
> actual number is, see below.
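(Aside on your point (2), to make sure I understand "spread it around and be concurrent": I take it to mean something like the following, where each worker hits a disjoint set of row keys, and therefore a different set of RF replicas. The shard split, pool size, and CF/column names are made up for the example; CappedBatchWriter is the batching wrapper sketched above.)

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import me.prettyprint.hector.api.Keyspace;

public class ShardedCounterLoad {
    // Each worker gets a disjoint shard of row keys, so concurrent batches
    // land on different rows rather than hammering one replica set.
    public static void run(final List<List<String>> keyShards, final Keyspace keyspace)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(keyShards.size());
        for (final List<String> shard : keyShards) {
            pool.submit(new Runnable() {
                public void run() {
                    CappedBatchWriter writer = new CappedBatchWriter(keyspace);
                    for (String key : shard) {
                        writer.increment(key, "Counters", "hits", 1L); // hypothetical CF/column
                    }
                    writer.flush();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}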
> What's really puzzling is this, found in the logs created by Hector for
> Process B:
>
> Tag             Avg(ms)   Min    Max       Std Dev   95th      Count
> WRITE.success_  1709.64   0.83   28982.61  6100.55   21052.93  267
> READ.success_   262.64    17.25  1343.53   191.79    610.99    637
> (+ hardly ever any failures)
>
> At the same time, for Process A, I see this:
> 15:29:07 Tag             Avg(ms)   Min       Max       Std Dev  95th      Count
> 15:29:07 WRITE.success_  584.76    13.23     38042.17  4242.24  334.84    79
> 15:29:07 WRITE.fail_     38008.16  38008.16  38008.16  0.00     38008.16  1
> (failures every minute)
>
> So there is at least one WRITE which is very, very long: 28s for Process B
> and 38s for Process A. In fact, it looks like a magic timeout number,
> because I see those two numbers all the time in the logs:
> WRITE.success_  1603.61  1.11  28829.06  6069.97  21987.07  152
> WRITE.success_  307.56   0.81  29958.18  2879.91  39.98     918
> WRITE.success_  1664.64  0.88  29953.52  6127.34  20023.88  276
> However, I can't link it to anything. My Hector failover timeout is 2s and
> everything else is just default install values. Even if Hector was backing
> off multiple times until it worked, why would I always get the same 28s/38s
> value?
>
> When I get a log like these, there is always a "cluster-freeze" during the
> preceding minute. By "cluster-freeze", I mean that a couple of nodes go to
> 0% utilization (no CPU, no system, no IO).
>
> Once I noticed this, I shut down Process A and watched Process B's
> performance logs. It's all back to normal now:
> Tag             Avg(ms)  Min     Max      Std Dev  95th    Count
> WRITE.success_  5.43     0.91    253.49   21.91    10.96   245
> READ.success_   397.07   125.56  1684.41  127.33   613.21  1024
>
> I ran a really long ping during these tests and I never got a timeout;
> every ping was about 0.2ms.
>
> I may be overloading the cluster when Process A runs, but I would like to
> understand why so I can do something about it. What I'm trying to figure
> out is:
> - why counter writes would time out at 28s or 38s (5-node cluster)
> - what could cause the cluster to "freeze" during those timeouts
>
> If you have any answers or ideas on how I could find an answer, that would
> be great.
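For reference, in case the magic 28s/38s turns out to be a stack of Hector retries on top of socket timeouts, these are the client-side knobs I know of (a sketch only; the host, cluster/keyspace names, and values are made up, and I'm not claiming this combination is what produces 28s or 38s):

import me.prettyprint.cassandra.model.QuorumAllConsistencyLevelPolicy;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.cassandra.service.FailoverPolicy;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorTimeouts {
    public static Keyspace connect() {
        CassandraHostConfigurator conf = new CassandraHostConfigurator("X.Y.Z.97:9160");
        conf.setAutoDiscoverHosts(true);             // same as my setup
        conf.setCassandraThriftSocketTimeout(10000); // Thrift socket timeout in ms;
                                                     // 0 means the OS default, which
                                                     // can block far longer

        Cluster cluster = HFactory.getOrCreateCluster("mycluster", conf);

        // Retry each request on up to 3 hosts, sleeping 2000 ms between
        // attempts. The worst case for one execute() is then roughly
        // retries x (socket timeout + sleep), which is the number to compare
        // against the 28s/38s ceilings in the speed4j logs above.
        FailoverPolicy failover = new FailoverPolicy(3, 2000);
        return HFactory.createKeyspace("myks", cluster,
                new QuorumAllConsistencyLevelPolicy(), failover);
    }
}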