Hi Alex,

This has been a useful thread; we've been comparing your numbers with
our own tests.

Why did you choose four big instances rather than more smaller ones?

For $8/hr you get four m2.4xl instances with a total of 8 disks.
For $8.16/hr you could have twelve m1.xl instances with a total of 48
disks, 3x the disk space, a bit less total RAM and much more CPU.

When an instance fails, you have a 25% loss of capacity with 4 or an
8% loss of capacity with 12.
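
Here is a quick back-of-the-envelope sketch of that comparison. It uses only the totals quoted above; the per-instance disk counts are derived by dividing those totals, and the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterOption:
    """One way of spending roughly the same hourly budget (illustrative only)."""
    name: str
    instances: int
    total_cost_per_hour: float  # USD, total quoted in this message
    total_disks: int            # total quoted in this message

    @property
    def loss_on_single_failure(self) -> float:
        """Fraction of cluster capacity lost when one instance fails."""
        return 1 / self.instances

    @property
    def disks_per_instance(self) -> float:
        """Derived from the quoted totals, not from an instance spec sheet."""
        return self.total_disks / self.instances


options = [
    ClusterOption("m2.4xl", instances=4, total_cost_per_hour=8.00, total_disks=8),
    ClusterOption("m1.xl", instances=12, total_cost_per_hour=8.16, total_disks=48),
]

for o in options:
    print(
        f"{o.instances:>2} x {o.name}: ${o.total_cost_per_hour:.2f}/hr, "
        f"{o.total_disks} disks ({o.disks_per_instance:.0f} per instance), "
        f"{o.loss_on_single_failure:.0%} capacity lost per instance failure"
    )
```

The capacity figure is simply 1/N, so it assumes load and data are spread evenly across the ring.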

I don't think it makes sense (especially on EC2) to run fewer than 6
instances; we are mostly starting at 12-15.
We can also spread the instances over three EC2 availability zones,
with RF=3 and one copy of the data in each zone.
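
To make the zone layout concrete, here is a minimal sketch. It assumes the replica placement really does put one copy in each zone, and the zone names and helper functions are hypothetical, not any Cassandra or EC2 API.

```python
from collections import Counter
from itertools import cycle

# Hypothetical zone labels; substitute the real availability zone names.
ZONES = ["zone-a", "zone-b", "zone-c"]


def assign_zones(node_count, zones=ZONES):
    """Round-robin nodes across zones so each zone gets an equal share."""
    return [zone for _, zone in zip(range(node_count), cycle(zones))]


def replicas_left_after_zone_loss(replication_factor, zones=ZONES):
    """Replicas of a row that survive one zone outage.

    Assumes replicas are spread evenly, i.e. RF / len(zones) copies per zone.
    """
    replicas_per_zone = replication_factor // len(zones)
    return replication_factor - replicas_per_zone


placement = assign_zones(12)
per_zone = Counter(placement)

print(dict(per_zone))                    # nodes per zone
print(replicas_left_after_zone_loss(3))  # copies of each row left if a zone is lost
```

With 12 nodes that is 4 per zone, and losing a whole zone still leaves two of the three copies of every row.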

Cheers
Adrian


On Wed, May 11, 2011 at 5:25 PM, Alex Araujo
<cassandra-us...@alex.otherinbox.com> wrote:
> On 5/9/11 9:49 PM, Jonathan Ellis wrote:
>>
>> On Mon, May 9, 2011 at 5:58 PM, Alex Araujo <cassandra-...> wrote:
>>>> How many replicas are you writing?
>>>
>>> Replication factor is 3.
>>
>> So you're actually spot on the predicted numbers: you're pushing
>> 20k*3=60k "raw" rows/s across your 4 machines.
>>
>> You might get another 10% or so from increasing memtable thresholds,
>> but bottom line is you're right around what we'd expect to see.
>> Furthermore, CPU is the primary bottleneck which is what you want to
>> see on a pure write workload.
>>
> That makes a lot more sense.  I upgraded the cluster to 4 m2.4xlarge
> instances (68GB of RAM/8 CPU cores) in preparation for application stress
> tests and the results were impressive @ 200 threads per client:
>
> +--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
> | Server Nodes | Client Nodes | --keep-going |   Columns    |    Client    |    Total     |  Rep Factor  |  Test Rate   | Cluster Rate |
> |              |              |              |              |   Threads    |   Threads    |              |  (writes/s)  |  (writes/s)  |
> +==============+==============+==============+==============+==============+==============+==============+==============+==============+
> |      4       |      3       |      N       |   10000000   |     200      |     600      |      3       |    44644     |    133931    |
> +--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
>
> The issue I'm seeing with app stress tests is that the rate will be
> comparable/acceptable at first (~100k w/s) and will degrade considerably
> (~48k w/s) until a flush and restart.  CPU usage will correspondingly be
> high at first (500-700%) and taper down to 50-200%.  My data model is pretty
> standard (<This> is pseudo-type information):
>
> Users<Column>
> "UserId<32CharHash>" : {
>    "email<String>": "a...@b.com",
>    "first_name<String>": "John",
>    "last_name<String>": "Doe"
> }
>
> UserGroups<SuperColumn>
> "GroupId<UUID>": {
>    "UserId<32CharHash>": {
>        "date_joined<DateTime>": "2011-05-10 13:14.789",
>        "date_left<DateTime>": "2011-05-11 13:14.789",
>        "active<short>": "0|1"
>    }
> }
>
> UserGroupTimeline<Column>
> "GroupId<UUID>": {
>    "date_joined<TimeUUID>": "UserId<32CharHash>"
> }
>
> UserGroupStatus<Column>
> "CompositeId('GroupId<UUID>:UserId<32CharHash>')": {
>    "active<short>": "0|1"
> }
>
> Every new User has a row in Users and a ColumnOrSuperColumn in the other 3
> CFs (total of 4 operations).  One notable difference is that the RAID0 on
> this instance type (surprisingly) only contains two ephemeral volumes and
> appears a bit more saturated in iostat, although not enough to clearly stand
> out as the bottleneck.  Is the bottleneck in this scenario likely memtable
> flush and/or commitlog rotation settings?
>
> RF = 2; ConsistencyLevel = One; -Xmx = 6GB; concurrent_writes: 64; all other
> settings are the defaults.  Thanks, Alex.
>
