Pardon the long delay - went on holiday and got sidetracked before I
could return to this project.
@Joaquin - The DataStax AMI uses a RAID0 configuration on an instance
store's ephemeral drives.
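(For reference, a minimal sketch of such a RAID0 across the four
m1.xlarge ephemeral volumes - the device names and filesystem here are
assumptions and the AMI's actual setup may differ; /raid0 matches the
data partition mentioned below:)

mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde   # stripe the ephemeral drives
mkfs.xfs /dev/md0                               # format the array
mkdir -p /raid0 && mount /dev/md0 /raid0        # mount it for the data directory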
@Jonathan - you were correct about the client node being the
bottleneck. I set up 3 XL client instances to run contrib/stress
against the 4-node XL Cassandra cluster and incrementally raised the
number of threads on the clients until I started seeing timeouts.
I set the following memory settings for the client JVMs: -Xms2G -Xmx10G
On the Cassandra nodes I raised the AMI's default MAX_HEAP setting to
12GB (~80% of available memory) and otherwise used the default AMI
cassandra.yaml settings until timeouts started appearing (at 200
threads per client; 600 total threads), then raised concurrent_writes
to 300 based on a (perhaps arbitrary?) recommendation in 'Cassandra:
The Definitive Guide' to scale that setting with the number of client
threads. The client nodes were in the same AZ as the Cassandra nodes,
and I set the --keep-going option on the clients for every other run
>= 200 threads.
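In config form, those settings work out to roughly the following
(a sketch; file locations assume the DataStax AMI layout):

# flags for the client JVMs running contrib/stress:
#   -Xms2G -Xmx10G

# conf/cassandra-env.sh on the Cassandra nodes:
MAX_HEAP_SIZE="12G"

# conf/cassandra.yaml on the Cassandra nodes (only once timeouts appeared):
concurrent_writes: 300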
Results
+--------+--------+---------+----------+---------+---------+----------+
| Server | Client | --keep- | Columns  | Client  | Total   | Combined |
| Nodes  | Nodes  | going   |          | Threads | Threads | Rate     |
+========+========+=========+==========+=========+=========+==========+
|      4 |      3 | N       | 10000000 |      25 |      75 |    13771 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       | 10000000 |      50 |     150 |    16853 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       | 10000000 |      75 |     225 |    18511 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       | 10000000 |     150 |     450 |    20013 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       |  7574241 |     200 |     600 |    22935 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | Y       | 10000000 |     200 |     600 |    19737 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       |  9843677 |     250 |     750 |    20869 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | Y       | 10000000 |     250 |     750 |    21217 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | N       |  5015711 |     300 |     900 |    24177 |
+--------+--------+---------+----------+---------+---------+----------+
|      4 |      3 | Y       | 10000000 |     300 |     900 |   206134 |
+--------+--------+---------+----------+---------+---------+----------+

(Combined Rate is the aggregate writes/s across the three clients;
runs without --keep-going stopped short of the 10,000,000 requested
inserts once timeouts appeared.)
Other Observations
* `vmstat` showed no swapping during runs
* `iostat -x` always showed zeros for avgqu-sz, await, and %util on
the /raid0 (data) partition; 0-150, 0-334ms, and 0-60% respectively
for the / (commitlog) partition
* %steal from iostat ranged from 8-26% every run (one node had an almost
constant 26% while the others averaged closer to 10%)
* `nodetool tpstats` never showed more than tens of Pending ops in
RequestResponseStage and no more than 1-2K Pending ops in
MutationStage. Usually a single node would register ops while the
others showed zeros
* After all test runs, Memtable Switch Count was 1385 for
Keyspace1.Standard1
* Load average on the Cassandra nodes was very high the entire time,
especially for tests where each client ran > 100 threads. Here's one
sample @ 200 threads each (600 total):
[i-94e8d2fb] alex@cassandra-qa-1:~$ uptime
17:18:26 up 1 day, 19:04, 2 users, load average: 20.18, 15.20, 12.87
[i-a0e5dfcf] alex@cassandra-qa-2:~$ uptime
17:18:26 up 1 day, 18:52, 2 users, load average: 22.65, 25.60, 21.71
[i-92dde7fd] alex@cassandra-qa-3:~$ uptime
17:18:26 up 1 day, 18:44, 2 users, load average: 24.19, 28.29, 20.17
[i-08caf067] alex@cassandra-qa-4:~$ uptime
17:18:26 up 1 day, 18:37, 2 users, load average: 31.74, 20.99, 13.97
* Average resource utilization on the client nodes was 10-80% CPU and
5-25% memory, depending on the number of threads. Load average was
always negligible (presumably because there was no I/O)
* After a few runs and truncate operations on Keyspace1.Standard1, the
ring became unbalanced before runs:
[i-94e8d2fb] alex@cassandra-qa-1:~$ nodetool -h localhost ring
Address         Status State   Load       Owns    Token
                                                  127605887595351923798765477786913079296
10.240.114.143  Up     Normal  2.1 GB     25.00%  0
10.210.154.63   Up     Normal  330.19 MB  25.00%  42535295865117307932921825928971026432
10.110.63.247   Up     Normal  361.38 MB  25.00%  85070591730234615865843651857942052864
10.46.143.223   Up     Normal  1.6 GB     25.00%  127605887595351923798765477786913079296
and after runs:
[i-94e8d2fb] alex@cassandra-qa-1:~$ nodetool -h localhost ring
Address         Status State   Load       Owns    Token
                                                  127605887595351923798765477786913079296
10.240.114.143  Up     Normal  3.9 GB     25.00%  0
10.210.154.63   Up     Normal  2.05 GB    25.00%  42535295865117307932921825928971026432
10.110.63.247   Up     Normal  2.07 GB    25.00%  85070591730234615865843651857942052864
10.46.143.223   Up     Normal  3.33 GB    25.00%  127605887595351923798765477786913079296
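(Roughly the commands behind the observations above; Memtable Switch
Count comes from `nodetool cfstats`, and the 5-second sampling
intervals are arbitrary:)

vmstat 5                         # si/so columns show swapping
iostat -x 5                      # avgqu-sz, await, %util, %steal per device
uptime                           # load averages
nodetool -h localhost tpstats    # Pending ops per stage (MutationStage, etc.)
nodetool -h localhost cfstats    # per-CF stats, incl. Memtable Switch Count
nodetool -h localhost ring       # per-node Load and token ownership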
Based on the above, would I be correct in assuming that frequent
memtable flushes and/or commitlog I/O are the likely bottlenecks?
Could %steal be partially contributing to the low throughput numbers
as well? If a single XL node can do ~12k writes/s, would it be
reasonable to expect ~40k writes/s with the above workload and number
of nodes?
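(The ~40k figure is just naive linear scaling, assuming replication
factor and consistency level match the single-node case:)

echo $((12000 * 4))   # 48000 writes/s ceiling; ~40k leaves headroom for overhead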
Thanks for your help, Alex.
On 4/25/11 11:23 AM, Joaquin Casares wrote:
Did the images have EBS storage or Instance Store storage?
Typically EBS volumes aren't the best to be benchmarking against:
http://www.mail-archive.com/user@cassandra.apache.org/msg11022.html
Joaquin Casares
DataStax
Software Engineer/Support
On Wed, Apr 20, 2011 at 5:12 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
A few months ago I was seeing 12k writes/s on a single EC2 XL. So
something is wrong.
My first suspicion is that your client node may be the bottleneck.
On Wed, Apr 20, 2011 at 2:56 PM, Alex Araujo
<cassandra-us...@alex.otherinbox.com> wrote:
> Does anyone have any Ec2 benchmarks/experiences they can share?
I am trying
> to get a sense for what to expect from a production cluster on
Ec2 so that I
> can compare my application's performance against a sane
baseline. What I
> have done so far is:
>
> 1. Launched a 4 node cluster of m1.xlarge instances in the same
availability
> zone using PyStratus
(https://github.com/digitalreasoning/PyStratus). Each
> node has the following specs (according to Amazon):
> 15 GB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 1,690 GB instance storage
> 64-bit platform
>
> 2. Changed the default PyStratus directories in order to have
commit logs on
> the root partition and data files on ephemeral storage:
> commitlog_directory: /var/cassandra-logs
> data_file_directories: [/mnt/cassandra-data]
>
> 3. Gave each node 10GB of MAX_HEAP; 1GB HEAP_NEWSIZE in
> conf/cassandra-env.sh
>
> 4. Ran `contrib/stress/bin/stress -d node1,..,node4 -n 10000000
-t 100` on a
> separate m1.large instance:
> total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
> ...
> 9832712,7120,7120,0.004948514851485148,842
> 9907616,7490,7490,0.0043189949802413755,852
> 9978357,7074,7074,0.004560353967289125,863
> 10000000,2164,2164,0.004065933558194335,867
>
> 5. Truncated Keyspace1.Standard1:
> # /usr/local/apache-cassandra/bin/cassandra-cli -host localhost
-port 9160
> Connected to: "Test Cluster" on x.x.x.x/9160
> Welcome to cassandra CLI.
>
> Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
> [default@unknown] use Keyspace1;
> Authenticated to keyspace: Keyspace1
> [default@Keyspace1] truncate Standard1;
> null
>
> 6. Expanded the cluster to 8 nodes using PyStratus and sanity
checked using
> nodetool:
> # /usr/local/apache-cassandra/bin/nodetool -h localhost ring
> Address Status State Load Owns
> Token
> x.x.x.x Up Normal 1.3 GB 12.50%
> 21267647932558653966460912964485513216
> x.x.x.x Up Normal 3.06 GB 12.50%
> 42535295865117307932921825928971026432
> x.x.x.x Up Normal 1.16 GB 12.50%
> 63802943797675961899382738893456539648
> x.x.x.x Up Normal 2.43 GB 12.50%
> 85070591730234615865843651857942052864
> x.x.x.x Up Normal 1.22 GB 12.50%
> 106338239662793269832304564822427566080
> x.x.x.x Up Normal 2.74 GB 12.50%
> 127605887595351923798765477786913079296
> x.x.x.x Up Normal 1.22 GB 12.50%
> 148873535527910577765226390751398592512
> x.x.x.x Up Normal 2.57 GB 12.50%
> 170141183460469231731687303715884105728
>
> 7. Ran `contrib/stress/bin/stress -d node1,..,node8 -n 10000000
-t 100` on a
> separate m1.large instance again:
> total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
> ...
> 9880360,9649,9649,0.003210443956226165,720
> 9942718,6235,6235,0.003206934154398794,731
> 9997035,5431,5431,0.0032615939761032457,741
> 10000000,296,296,0.002660033726812816,742
>
> In a nutshell, 4 nodes inserted at 11,534 writes/sec and 8 nodes
inserted at
> 13,477 writes/sec.
>
> Those numbers seem a little low to me, but I don't have anything
to compare
> to. I'd like to hear others' opinions before I spin my wheels
with the
> number of nodes, threads, memtable, memory, and/or GC
settings. Cheers,
> Alex.
>
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com