Agreed, I'll bring Xmx down (a rough sketch of that change is at the bottom of this mail). I also just saw in the storage conf that a higher value for MemtableFlushAfterMinutes is suggested, since otherwise you might get a "flush storm" of all your memtables flushing at once. I've changed that as well.
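For reference, that setting lives in storage-conf.xml and looks roughly like this (the 1440 below is only an example figure, not a recommendation or the value we settled on; tune it to your own write pattern):

    <!-- Maximum time to leave a dirty memtable unflushed.  Set it high
         enough that all of your memtables don't hit this limit, and
         flush, at the same moment.  1440 (24h) is illustrative only. -->
    <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>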
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com, http://apps.facebook.com/happyhabitat

On Mon, May 17, 2010 at 5:27 PM, Mark Greene <green...@gmail.com> wrote:
> Since you only have 7.5GB of memory, it's a really bad idea to set your
> heap space to a max of 7GB. Remember, the java process heap will be larger
> than what Xmx is allowed to grow to. If you reach this level, you can
> start swapping, which is very, very bad. As Brandon pointed out, you haven't
> exhausted your physical memory yet, but you still want to lower Xmx to
> something like 5, maybe 6 GB.
>
>
> On Mon, May 17, 2010 at 7:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>
>> Here are the current JVM args and Java version:
>>
>> # Arguments to pass to the JVM
>> JVM_OPTS=" \
>>         -ea \
>>         -Xms128M \
>>         -Xmx7G \
>>         -XX:TargetSurvivorRatio=90 \
>>         -XX:+AggressiveOpts \
>>         -XX:+UseParNewGC \
>>         -XX:+UseConcMarkSweepGC \
>>         -XX:+CMSParallelRemarkEnabled \
>>         -XX:+HeapDumpOnOutOfMemoryError \
>>         -XX:SurvivorRatio=128 \
>>         -XX:MaxTenuringThreshold=0 \
>>         -Dcom.sun.management.jmxremote.port=8080 \
>>         -Dcom.sun.management.jmxremote.ssl=false \
>>         -Dcom.sun.management.jmxremote.authenticate=false"
>>
>> java -version outputs:
>> java version "1.6.0_20"
>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>
>> So pretty much the defaults aside from the 7 GB max heap. CPU is totally
>> hammered right now, and it is receiving 0 ops/sec from me since I
>> disconnected it from our application until I can figure out what's going on.
>>
>> Running top on the machine I get:
>>
>> top - 18:56:32 up 2 days, 20:57, 2 users, load average: 14.97, 15.24, 15.13
>> Tasks:  87 total,  5 running,  82 sleeping,  0 stopped,  0 zombie
>> Cpu(s): 40.1%us, 33.9%sy, 0.0%ni, 0.1%id, 0.0%wa, 0.0%hi, 1.3%si, 24.6%st
>> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
>> Swap:        0k total,        0k used,        0k free,  1655556k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3  5846:35  java
>>
>> I have jconsole up and running, and the jconsole VM Summary tab says:
>> - Total physical memory:    7,872,040 K
>> - Free physical memory:     4,253,036 K
>> - Total swap space:         0 K
>> - Free swap space:          0 K
>> - Committed virtual memory: 8,096,648 K
>>
>> Is there a specific thread I can look at in jconsole that might give me a
>> clue? It's weird that it's still at 100% CPU even though it's getting no
>> traffic from outside right now. I suppose it might still be talking across
>> the machines though.
>>
>> Also, stopping and starting Cassandra on one of the 4 machines caused the
>> CPU to go back down to almost normal levels.
>>
>> Here's the ring:
>>
>> Address       Status   Load      Range                                      Ring
>>                                  170141183460469231731687303715884105728
>> 10.251.XX.XX  Up       2.15 MB   42535295865117307932921825928971026432    |<--|
>> 10.250.XX.XX  Up       2.42 MB   85070591730234615865843651857942052864    |   |
>> 10.250.XX.XX  Up       2.47 MB   127605887595351923798765477786913079296   |   |
>> 10.250.XX.XX  Up       2.46 MB   170141183460469231731687303715884105728   |-->|
>>
>> Any thoughts?
>>
>> Best,
>>
>> Curt
>> --
>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>> http://apps.facebook.com/happyhabitat
>>
>>
>> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <green...@gmail.com> wrote:
>>
>>> Can you provide us with the current JVM args? Also, what type of
>>> workload are you giving the ring (op/s)?
>>>
>>>
>>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>>
>>>> Hello Cassandra users+experts,
>>>>
>>>> Hopefully someone will be able to point me in the correct direction. We
>>>> have Cassandra 0.6.1 working on our test servers and we *thought* everything
>>>> was great and ready to move to production. We are currently running a ring
>>>> of 4 large-instance EC2 (http://aws.amazon.com/ec2/instance-types/)
>>>> servers in production with a replication factor of 3 and a QUORUM
>>>> consistency level. We ran a test on 1% of our users, and everything was
>>>> writing to and reading from Cassandra fine for the first 3 hours. After
>>>> that point CPU usage spiked to 100% and stayed there, on essentially all 4
>>>> machines at once. This smells to me like a GC issue, and I'm looking into it
>>>> with jconsole right now. If anyone can help me debug this and get Cassandra
>>>> all the way up and running without the CPU spiking, I would be forever in
>>>> their debt.
>>>>
>>>> I suspect that anyone else running Cassandra on large EC2 instances
>>>> might just be able to tell me what JVM args they are successfully using in a
>>>> production environment, whether they upgraded to Cassandra 0.6.2 from 0.6.1,
>>>> and whether they went to batched writes due to bug 1014
>>>> (https://issues.apache.org/jira/browse/CASSANDRA-1014). That might answer
>>>> all my questions.
>>>>
>>>> Is there anyone on the list who is using large EC2 instances in
>>>> production? Would you be kind enough to share your JVM arguments and any
>>>> other tips?
>>>>
>>>> Thanks for any help,
>>>> Curt
>>>> --
>>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>>> http://apps.facebook.com/happyhabitat
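P.S. For anyone searching the archives later, the heap change Mark suggests amounts to roughly the following in cassandra.in.sh. Only -Xmx differs from what's quoted above, and 5G is just one point in the 5-6 GB range he mentioned, not a tuned value:

    # Same flags as before, with the max heap lowered so the whole java
    # process (heap + thread stacks + native overhead) fits comfortably
    # inside the instance's ~7.5 GB of RAM instead of pushing it into swap.
    JVM_OPTS=" \
            -ea \
            -Xms128M \
            -Xmx5G \
            -XX:TargetSurvivorRatio=90 \
            -XX:+AggressiveOpts \
            -XX:+UseParNewGC \
            -XX:+UseConcMarkSweepGC \
            -XX:+CMSParallelRemarkEnabled \
            -XX:+HeapDumpOnOutOfMemoryError \
            -XX:SurvivorRatio=128 \
            -XX:MaxTenuringThreshold=0 \
            -Dcom.sun.management.jmxremote.port=8080 \
            -Dcom.sun.management.jmxremote.ssl=false \
            -Dcom.sun.management.jmxremote.authenticate=false"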