What are your storage-conf settings for the Memtable thresholds? One thing that could cause a lot of CPU usage is flushing the memtables too frequently and then having to do lots of compaction. With that much available heap space you could definitely go larger than the default thresholds. Also, do you not have any swap space set up on the machine? It is a good idea to at least set up a swap file so that the system can use it when it needs to.
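For reference, the thresholds I mean are the memtable settings in storage-conf.xml. I'm writing the 0.6-era element names from memory, so double-check them against the sample storage-conf.xml that ships with your build, but the idea is to let each memtable grow larger before it gets flushed and compacted, e.g. something like:

    <MemtableThroughputInMB>128</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.6</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

And for swap, a rough sketch of adding a 2G swap file (adjust the size, and the path if /mnt isn't where your ephemeral disk is mounted):

    # create, format and enable a 2G swap file, then make it survive reboots
    dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048
    mkswap /mnt/swapfile
    swapon /mnt/swapfile
    echo '/mnt/swapfile none swap sw 0 0' >> /etc/fstab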
We are running a two node cluster using Amazon large EC2 instances as well. The cluster is using a replication factor of 2, and most of my writes and reads are at a consistency level of ONE except for a few QUORUM calls. The only difference in my JVM opts is that my max is set at 6G. I have the two ephemeral disks set up as a RAID 0 array and that is where I'm storing the data. The commit logs are going to the default location, so they are using the local disk.

We currently have more than 90G of data running on these and have only had issues with CPU utilization when our code was accidentally duplicating content to one of the servers. That duplication caused the server to be in a state of constant major compaction, and it couldn't keep up with new writes. In the end, I completely dropped that server and spun up another one to take its place, since the one good server had all the data anyway. So it might have also been an issue with that box.

One more question: are all of the instances in the same region?

Lee Parker

On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:

> Here are the current jvm args and java version:
>
> # Arguments to pass to the JVM
> JVM_OPTS=" \
>         -ea \
>         -Xms128M \
>         -Xmx7G \
>         -XX:TargetSurvivorRatio=90 \
>         -XX:+AggressiveOpts \
>         -XX:+UseParNewGC \
>         -XX:+UseConcMarkSweepGC \
>         -XX:+CMSParallelRemarkEnabled \
>         -XX:+HeapDumpOnOutOfMemoryError \
>         -XX:SurvivorRatio=128 \
>         -XX:MaxTenuringThreshold=0 \
>         -Dcom.sun.management.jmxremote.port=8080 \
>         -Dcom.sun.management.jmxremote.ssl=false \
>         -Dcom.sun.management.jmxremote.authenticate=false"
>
> java -version outputs:
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> So pretty much the defaults aside from the 7Gig max heap. CPU is totally hammered right now, and it is receiving 0 ops/sec from me since I disconnected it from our application until I can figure out what's going on.
>
> Running top on the machine I get:
>
> top - 18:56:32 up 2 days, 20:57, 2 users, load average: 14.97, 15.24, 15.13
> Tasks: 87 total, 5 running, 82 sleeping, 0 stopped, 0 zombie
> Cpu(s): 40.1%us, 33.9%sy, 0.0%ni, 0.1%id, 0.0%wa, 0.0%hi, 1.3%si, 24.6%st
> Mem: 7872040k total, 3618764k used, 4253276k free, 387536k buffers
> Swap: 0k total, 0k used, 0k free, 1655556k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3  5846:35  java
>
> I have jconsole up and running, and the jconsole VM Summary tab says:
> - Total physical memory: 7,872,040 K
> - Free physical memory: 4,253,036 K
> - Total swap space: 0 K
> - Free swap space: 0 K
> - Committed virtual memory: 8,096,648 K
>
> Is there a specific thread I can look at in jconsole that might give me a clue? It's weird that it's still at 100% CPU even though it's getting no traffic from outside right now. I suppose it might still be talking across the machines though.
>
> Also, stopping and starting cassandra on one of the 4 machines caused the CPU to go back down to almost normal levels.
>
> Here's the ring:
>
> Address       Status  Load     Range                                     Ring
>                                170141183460469231731687303715884105728
> 10.251.XX.XX  Up      2.15 MB  42535295865117307932921825928971026432   |<--|
> 10.250.XX.XX  Up      2.42 MB  85070591730234615865843651857942052864   |   |
> 10.250.XX.XX  Up      2.47 MB  127605887595351923798765477786913079296  |   |
> 10.250.XX.XX  Up      2.46 MB  170141183460469231731687303715884105728  |-->|
>
> Any thoughts?
>
> Best,
>
> Curt
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>
>
> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <green...@gmail.com> wrote:
>
>> Can you provide us with the current JVM args? Also, what kind of workload are you giving the ring (ops/sec)?
>>
>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>
>>> Hello Cassandra users+experts,
>>>
>>> Hopefully someone will be able to point me in the right direction. We have cassandra 0.6.1 working on our test servers and we *thought* everything was great and ready to move to production. We are currently running a ring of 4 large EC2 instances (http://aws.amazon.com/ec2/instance-types/) in production with a replication factor of 3 and a QUORUM consistency level. We ran a test on 1% of our users, and everything was writing to and reading from cassandra great for the first 3 hours. After that point CPU usage spiked to 100% and stayed there, basically on all 4 machines at once. This smells to me like a GC issue, and I'm looking into it with jconsole right now. If anyone can help me debug this and get cassandra all the way up and running without the CPU spiking, I would be forever in their debt.
>>>
>>> I suspect that anyone else running cassandra on large EC2 instances might just be able to tell me what JVM args they are successfully using in a production environment, whether they upgraded from Cassandra 0.6.1 to 0.6.2, and whether they went to batched writes because of bug 1014 (https://issues.apache.org/jira/browse/CASSANDRA-1014). That might answer all my questions.
>>>
>>> Is there anyone on the list who is using large EC2 instances in production? Would you be kind enough to share your JVM arguments and any other tips?
>>>
>>> Thanks for any help,
>>> Curt
>>> --
>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>> http://apps.facebook.com/happyhabitat
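One more note on the jconsole question above: I don't think jconsole will show you per-thread CPU directly, but you can usually find the spinning thread by matching top's per-thread view against a jstack thread dump. A rough sketch, using the PID from your top output (2566); the <tid> placeholder is whatever thread id top reports as hottest:

    top -H -p 2566                 # per-thread view; note the id of the hottest thread
    printf '%x\n' <tid>            # convert that thread id to hex
    jstack 2566 | grep -A 20 'nid=0x<hex>'

If the busy threads turn out to be the ParNew/CMS GC threads, that pretty much confirms the GC theory; if it's a compaction or memtable-flush thread, it points back at the memtable/compaction settings. Turning on GC logging in cassandra.in.sh (-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log, or whatever log path suits you) is another cheap way to confirm it either way.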