On 5/6/11 9:47 PM, Jonathan Ellis wrote:
> On Fri, May 6, 2011 at 5:13 PM, Alex Araujo
> <cassandra-us...@alex.otherinbox.com> wrote:
>> I raised the default MAX_HEAP setting from the AMI to 12GB (~80% of
>> available memory).
> This is going to make GC pauses larger for no good reason.
Good point - I'm only doing writes at the moment. I will revert the change and raise it conservatively once I add reads to the mix.
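For reference, the override in question lives in conf/cassandra-env.sh. A minimal sketch of what I'm reverting (the 12G value was my change; the AMI's stock values differ):

    # conf/cassandra-env.sh (excerpt)
    #MAX_HEAP_SIZE="12G"    # my ~80%-of-RAM override; a bigger heap means longer GC pauses
    #HEAP_NEWSIZE="800M"    # young-generation size that went with it
    # leaving both commented out lets the script auto-size the heap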

>> raised concurrent_writes to 300 based on a (perhaps arbitrary?)
>> recommendation in 'Cassandra: The Definitive Guide'
> That's never been a good recommendation.
That recommendation also seemed to contradict the '8 * number of cores' rule of thumb. I set concurrent_writes back to the default of 32.
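In cassandra.yaml terms, assuming these are 4-core instances (which the ~330% CPU readings below suggest), the rule of thumb lands right on the default:

    # conf/cassandra.yaml (excerpt)
    # rule of thumb: concurrent_writes ~= 8 * cores -> 8 * 4 = 32
    concurrent_writes: 32    # reverted from 300
    concurrent_reads: 32     # untouched default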

>> Based on the above, would I be correct in assuming that frequent memtable
>> flushes and/or commitlog I/O are the likely bottlenecks?
> Did I miss where you said what CPU usage was?
I observed a consistent 200-350% CPU initially, and 300-380% once 'hot', across all runs. Here is an average-case sample from top:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15108 cassandr  20   0 5406m 4.5g  15m S  331 30.4  89:32.50 jsvc

> How many replicas are you writing?

Replication factor is 3.
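For completeness, the keyspace is defined with something like the following in the 0.7 CLI (the keyspace name is just a placeholder):

    create keyspace Keyspace1
        with replication_factor = 3
        and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';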

> Recent testing suggests that putting the commitlog on the raid0 volume
> is better than on the root volume on ec2, since the root isn't really
> a separate device.

I migrated the commitlog to the raid0 volume and retested with the above changes. I/O appeared more consistent in iostat.
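The change itself is one line in cassandra.yaml plus a restart (the mount point below is my layout, not anything standard):

    # conf/cassandra.yaml (excerpt)
    # was: commitlog_directory: /var/lib/cassandra/commitlog  (root volume)
    commitlog_directory: /raid0/cassandra/commitlog    # now on the md0 raid0 volume
    data_file_directories:
        - /raid0/cassandra/data

With that in place, here's an average case (%util in the teens):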

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.84    4.05   13.97    3.04   18.42   23.68

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00  222.00     0.00 18944.00    85.33    13.80   62.16   0.59  13.00
xvdc              0.00     0.00    0.00  231.00     0.00 19480.00    84.33     5.80   25.11   0.78  18.00
xvdd              0.00     0.00    0.00  228.00     0.00 19456.00    85.33    17.43   76.45   0.57  13.00
xvde              0.00     0.00    0.00  229.00     0.00 19464.00    85.00    10.41   45.46   0.44  10.00
md0               0.00     0.00    0.00  910.00     0.00 77344.00    84.99     0.00    0.00   0.00   0.00

and worst case (%util above 60):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44.33    0.00   24.54    0.82   15.46   14.85

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     1.00    0.00    4.00     0.00    40.00    10.00     0.15   37.50  22.50   9.00
xvdb              0.00     0.00    0.00  427.00     0.00 36440.00    85.34    54.12  147.85   1.69  72.00
xvdc              0.00     0.00    1.00  295.00     8.00 25072.00    84.73    34.56   84.32   2.13  63.00
xvdd              0.00     0.00    0.00  355.00     0.00 30296.00    85.34    94.49  257.61   2.17  77.00
xvde              0.00     0.00    0.00  373.00     0.00 31768.00    85.17    68.50  189.33   1.88  70.00
md0               0.00     0.00    1.00 1418.00     8.00 120824.00   85.15     0.00    0.00   0.00   0.00
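For completeness, md0 in the output above is the four ephemeral disks striped together; the setup was along these lines (chunk size and filesystem are whatever the image's init chose, so treat this as a sketch):

    # stripe the four ephemerals into one raid0 device and mount it
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
          /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
    mkfs.xfs /dev/md0
    mount /dev/md0 /raid0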

Overall, results were roughly the same. The most noticeable difference was that no timeouts occurred until the number of client threads reached 350 (previously 200):

+----------+----------+----------+----------+----------+----------+----------+
| Server   | Client   | --keep-  | Columns  | Client   | Total    | Combined |
| Nodes    | Nodes    | going    |          | Threads  | Threads  | Rate (wr |
|          |          |          |          |          |          | ites/s)  |
+==========+==========+==========+==========+==========+==========+==========+
| 4        | 3        | N        | 10000000 | 150      | 450      | 21241    |
+----------+----------+----------+----------+----------+----------+----------+
| 4        | 3        | N        | 10000000 | 200      | 600      | 21536    |
+----------+----------+----------+----------+----------+----------+----------+
| 4        | 3        | N        | 10000000 | 250      | 750      | 19451    |
+----------+----------+----------+----------+----------+----------+----------+
| 4        | 3        | N        | 10000000 | 300      | 900      | 19741    |
+----------+----------+----------+----------+----------+----------+----------+

Those results are after I compiled and deployed the latest cassandra-0.7 branch with the patch for https://issues.apache.org/jira/browse/CASSANDRA-2578 applied. Thoughts?
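In case anyone wants to reproduce: each of the 3 client nodes ran contrib's py_stress along these lines (flag mapping is from memory; -t is the per-node 'Client Threads' column above, and the host names are placeholders):

    # contrib/py_stress on each client node; -t varied per row (150/200/250/300)
    python stress.py -d server1,server2,server3,server4 \
           -o insert -n 10000000 -t 150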

