Hi Peter, I'm going to merge the response to your email with my other email from yesterday, since they pertain to the same issue. Sorry this is a little long, but I'm stumped and I'm trying to describe what I've investigated.
In a nutshell, in case someone has encountered this and won't read to the end: a write-heavy process is causing the ring to appear to "freeze" (=> utilization = 0%). Its Hector speed4j logs indicate failures and successes with max=38s, while other read/write processes all indicate max=28s. It looks like I've got a magic number I can't figure out.

> You do say "nodes handling the requests". Two things to always keep in
> mind is to (1) spread the requests evenly across all members of the
> cluster, and (2) if you are doing a lot of work per row key, spread it
> around and be concurrent so that you're not hitting a single row at a
> time, which will be under the responsibility of a single set of RF
> nodes (you want to put load on the entire cluster evenly if you want
> to maximize throughput).

I'm using Hector to connect to the cluster with autoDiscover=true. Furthermore, I see in my logs that updates do get sent to multiple nodes, so (1) is ok. Regarding (2), I may be running into this since data updates are very localized by design. I've distributed the keys by storage load, but I'm going to have to distribute them by read/write load, since the workload is anything but random and I'm using BOP. However, I never see an IO bottleneck when using iostat, see below.

> For starters, what *is* the throughput? How many counter mutations are
> you submitting per second?

I've got two processes doing writes in parallel. The one we are currently discussing ("Process A") only writes, while the other one ("Process B") reads 2 to 4x more data than it writes. Process A typically looks like this (numbers come from Hector). Each line below is one Cassandra batch, i.e. one Hector Mutator.execute():

15:15:53 Wrote 86 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (153 usecs)
15:15:53 Wrote 90 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (97 usecs)
15:15:54 Wrote 85 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (754 usecs)
15:15:54 Wrote 81 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (561 usecs)
15:15:54 Wrote 86 cassandra mutations using host 176.31.226.128(176.31.226.128):9160 (130 usecs)
15:15:54 Wrote 73 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160 (97 usecs)
15:15:54 Wrote 82 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160 (48 usecs)
15:15:56 Wrote 108 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160 (1653 usecs)
15:15:56 Wrote 84 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160 (23 usecs)

I'm pretty sure those are milliseconds and not the microseconds the Hector docs mention (see the last two lines and their timestamps), which would amount to 500 to 1000 mutations per second, with a min at 65 and a max at 3652. Cluster-wide, OpsCenter is reporting 10 write requests per second in the 20 min graph, but that can't be right. The exact number is somewhere in the thousands of keys read per second, but my problem with writes is really so big that it doesn't matter what the actual number is, see below.
What's really puzzling is this, found in the logs created by Hector for Process B:

Tag             Avg(ms)   Min     Max       Std Dev   95th       Count
WRITE.success_  1709.64   0.83    28982.61  6100.55   21052.93   267
READ.success_   262.64    17.25   1343.53   191.79    610.99     637

(+ hardly ever any failures)

At the same time, for Process A, I see this:

15:29:07 Tag             Avg(ms)    Min        Max        Std Dev   95th       Count
15:29:07 WRITE.success_  584.76     13.23      38042.17   4242.24   334.84     79
15:29:07 WRITE.fail_     38008.16   38008.16   38008.16   0.00      38008.16   1

(failures every minute)

So there is at least one WRITE which is very, very long: 28s for Process B and 38s for Process A. In fact, it looks like a magic timeout number, because I see those two numbers all the time in the logs:

WRITE.success_  1603.61  1.11  28829.06  6069.97  21987.07  152
WRITE.success_  307.56   0.81  29958.18  2879.91  39.98     918
WRITE.success_  1664.64  0.88  29953.52  6127.34  20023.88  276

However, I can't link it to anything. My Hector failover timeout is 2s and everything else is just default install values. Even if Hector were backing off multiple times until it worked, why would I always get the same 28/38s value?

When I get a log like these, there is always a "cluster freeze" during the preceding minute. By "cluster freeze", I mean that a couple of nodes go to 0% utilization (no cpu, no system, no io).

Once I noticed this, I shut down Process A and watched Process B's performance logs. It's all back to normal now:

Tag             Avg(ms)  Min     Max      Std Dev  95th    Count
WRITE.success_  5.43     0.91    253.49   21.91    10.96   245
READ.success_   397.07   125.56  1684.41  127.33   613.21  1024

I ran a really long ping during these tests and I never got a timeout; every ping was about 0.2 ms.

I may be overloading the cluster when Process A runs, but I would like to understand why so I can do something about it. What I'm trying to figure out is:
- why would counter writes time out at 28 or 38s (5-node cluster)?
- what could cause the cluster to "freeze" during those timeouts?

If you have any answers or ideas on how I could find an answer, that would be great.
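For completeness, the Hector side is essentially a stock setup; the only non-default bits are autoDiscoverHosts and the 2s timeout, roughly as sketched below. (This is a sketch with placeholder seed host and cluster name, and the 2s is shown here as the thrift socket timeout, which may not be the exact knob involved.)

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorConfigSketch {
    public static void main(String[] args) {
        CassandraHostConfigurator conf = new CassandraHostConfigurator("X.Y.Z.95:9160"); // placeholder seed node
        conf.setAutoDiscoverHosts(true);            // so updates get spread across the ring
        conf.setCassandraThriftSocketTimeout(2000); // the 2s timeout, in ms (assumed to be this setting)
        // Everything else (pool sizes, retry handling, cassandra.yaml) is left at default install values.
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", conf);   // placeholder cluster name
    }
}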