Interesting development: I changed the maximum size of the batches in
"Process A" so that they went from about 90 mutations per execute() to about
35. All the weird 28s/38s maximum execution times are gone, all the timeouts
are gone, and everything is zipping along just fine. So the moral of the story
for me is: only batch if it actually gains you something, because it can break
things.
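
For reference, here is roughly what the capping looks like (a minimal sketch,
not my actual code: the column family "Counters", the column "hits" and the
keys are placeholders, and the batch is flushed every 35 mutations instead of
letting ~90 pile up per execute()):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CappedCounterWriter {
    // Flush every 35 mutations instead of letting ~90 accumulate per execute().
    private static final int MAX_BATCH = 35;

    public void writeCounters(Keyspace keyspace, Iterable<String> keys) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        int pending = 0;
        for (String key : keys) {
            // Placeholder CF and column names, incrementing a counter by 1.
            mutator.addCounter(key, "Counters", HFactory.createCounterColumn("hits", 1L));
            if (++pending >= MAX_BATCH) {
                mutator.execute();  // one Cassandra batch == one Mutator.execute()
                mutator = HFactory.createMutator(keyspace, StringSerializer.get());
                pending = 0;
            }
        }
        if (pending > 0) {
            mutator.execute();  // flush the remainder
        }
    }
}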

Given this work-around, can anyone explain to me why this was happening?

2011/12/11 Philippe <watche...@gmail.com>

> Hi Peter,
> I'm going to combine the response to your email with my other email from
> yesterday, since they pertain to the same issue.
> Sorry this is a little long, but I'm stumped and I'm trying to describe
> what I've investigated.
>
> In a nutshell, in case someone has encountered this and won't read this to
> the end: a write-heavy process is causing the ring to appear to "freeze" (=>
> utilization = 0%). Its Hector speed4j logs indicate failures and successes
> at max=38s, while other read/write processes are all indicating max=28s. It
> looks like I've got a magic number I can't figure out.
>
>> You do say "nodes handling the requests". Two things to always keep in
>> mind are to (1) spread the requests evenly across all members of the
>> cluster, and (2) if you are doing a lot of work per row key, spread it
>> around and be concurrent so that you're not hitting a single row at a
>> time, which will be under the responsibility of a single set of RF
>> nodes (you want to put load on the entire cluster evenly if you want
>> to maximize throughput).
>>
> I'm using Hector to connect to the cluster with autoDiscover=true.
> Furthermore, I see in my logs that updates do get sent to multiple nodes, so
> (1) is OK.
> Regarding (2), I may be running into this, since data updates are very
> localized by design. I've distributed the keys by storage load, but I'm
> going to have to distribute them by read/write load, since the workload is
> anything but random and I'm using BOP (ByteOrderedPartitioner). However, I
> never see an I/O bottleneck in iostat; see below.
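>
> For what it's worth, the connection setup is essentially stock Hector with
> host auto-discovery turned on, something along these lines (a sketch from
> memory; the seed host, cluster and keyspace names are placeholders):
>
> import me.prettyprint.cassandra.service.CassandraHostConfigurator;
> import me.prettyprint.hector.api.Cluster;
> import me.prettyprint.hector.api.Keyspace;
> import me.prettyprint.hector.api.factory.HFactory;
>
> public class ClientSetup {
>     // One seed host; autoDiscoverHosts=true lets Hector find the other
>     // ring members and spread requests across them (point 1 above).
>     public static Keyspace connect() {
>         CassandraHostConfigurator conf = new CassandraHostConfigurator("X.Y.Z.95:9160");
>         conf.setAutoDiscoverHosts(true);
>         Cluster cluster = HFactory.getOrCreateCluster("MyCluster", conf);
>         return HFactory.createKeyspace("MyKeyspace", cluster);
>     }
> }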
>
>
>> For starters, what *is* the throughput? How many counter mutations are
>> you submitting per second?
>>
> I've got two processes doing writes in parallel. The one we are currently
> discussing ("Process A") only writes while the other one ("Process B")
> reads 2 to 4x more data than it writes.
>
> Process A typically looks like this (numbers come from Hector). Each line
> below is one Cassandra batch, i.e. one Hector Mutator.execute():
> 15:15:53 Wrote 86 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160
> (153 usecs)
> 15:15:53 Wrote 90 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160
> (97 usecs)
> 15:15:54 Wrote 85 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160
> (754 usecs)
> 15:15:54 Wrote 81 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160
> (561 usecs)
> 15:15:54 Wrote 86 cassandra mutations using host
> 176.31.226.128(176.31.226.128):9160 (130 usecs)
> 15:15:54 Wrote 73 cassandra mutations using host X.Y.Z.97(X.Y.Z.97):9160
> (97 usecs)
> 15:15:54 Wrote 82 cassandra mutations using host X.Y.Z.96(X.Y.Z.96):9160
> (48 usecs)
> 15:15:56 Wrote 108 cassandra mutations using host X.Y.Z.95(X.Y.Z.95):9160
> (1653 usecs)
> 15:15:56 Wrote 84 cassandra mutations using host X.Y.Z.109(X.Y.Z.109):9160
> (23 usecs)
> I'm pretty sure those durations are milliseconds, not microseconds as the
> Hector docs say (see the last two lines and their timestamps). That would
> amount to 500 to 1000 mutations per second per batch, with a minimum around
> 65/s (108 mutations in 1653 ms) and a maximum around 3652/s (84 mutations
> in 23 ms). Cluster-wide, OpsCenter is reporting 10 write requests per second
> in the 20-minute graph, but that can't be right.
>
> The exact number is somewhere in the thousands of keys read per second, but
> my problem with writes is so big that the actual number doesn't really
> matter; see below.
>
>
> What's really puzzling is this, found in the logs created by Hector for
> Process B:
> Tag                     Avg(ms)      Min       Max   Std Dev      95th   Count
> WRITE.success_          1709.64     0.83  28982.61   6100.55  21052.93     267
> READ.success_            262.64    17.25   1343.53    191.79    610.99     637
> (+hardly ever any failures)
>
> At the same time, for Process A, I see this:
> 15:29:07 Tag                     Avg(ms)      Min       Max   Std Dev      95th   Count
> 15:29:07 WRITE.success_           584.76    13.23  38042.17   4242.24    334.84      79
> 15:29:07 WRITE.fail_            38008.16 38008.16  38008.16      0.00  38008.16       1
> (failures every minute)
>
> So there is at least one WRITE which is very, very long: 28s for Process B
> and 38s for Process A. In fact, it looks like a magic timeout number,
> because I see those two numbers all the time in the logs:
> WRITE.success_          1603.61     1.11  28829.06   6069.97  21987.07     152
> WRITE.success_           307.56     0.81  29958.18   2879.91     39.98     918
> WRITE.success_          1664.64     0.88  29953.52   6127.34  20023.88     276
> However, I can't link it to anything. My Hector failover timeout is 2s, and
> everything else is at its default install value. Even if Hector were backing
> off multiple times until it worked, why would I always get the same 28s/38s
> values?
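>
> To be concrete about "everything else is default": the only client-side
> timeout I touch is along the lines of the sketch below (I'm not certain
> setCassandraThriftSocketTimeout() is exactly where my 2s failover value
> lives, so treat that as an assumption; the point is that nothing I
> configure is anywhere near 28s or 38s):
>
> import me.prettyprint.cassandra.service.CassandraHostConfigurator;
>
> public class TimeoutSettings {
>     // Sketch: the only timeout explicitly set on the client; everything
>     // else stays at Hector defaults. The value is in milliseconds.
>     public static CassandraHostConfigurator withTimeouts(String seedHost) {
>         CassandraHostConfigurator conf = new CassandraHostConfigurator(seedHost);
>         conf.setCassandraThriftSocketTimeout(2000); // 2s, nowhere near 28s/38s
>         return conf;
>     }
> }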
>
> When I get a log like these, there is always a "cluster freeze" during the
> preceding minute. By "cluster freeze", I mean that a couple of nodes go to
> 0% utilization (no user CPU, no system CPU, no I/O).
>
> Once I noticed this, I shut down Process A and watched Process B's
> performance logs. It's all back to normal now:
> Tag                     Avg(ms)      Min       Max   Std Dev      95th   Count
> WRITE.success_             5.43     0.91    253.49     21.91     10.96     245
> READ.success_            397.07   125.56   1684.41    127.33    613.21    1024
>
> I ran a really long ping during these tests and never got a timeout; every
> ping was around 0.2 ms.
>
> I may be overloading the cluster when Process A runs, but I would like to
> understand why, so I can do something about it. What I'm trying to figure
> out is:
>  - why counter writes would time out at 28s or 38s (5-node cluster)
>  - what could cause the cluster to "freeze" during those timeouts
>
> If you have any answers or ideas on how I could find an answer, that would
> be great.
>
