What are your storage-conf settings for the Memtable thresholds? One thing that could cause a lot of CPU usage is flushing the memtables too frequently and then having to do lots of compaction. With that much available heap space you could definitely go larger than the default thresholds. Also, do you not have any swap space set up on the machine? It is a good idea to at least set up a swap file so that the system can use it when it needs to.
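For reference, the thresholds I mean are the memtable settings in storage-conf.xml. I'm writing the 0.6-era element names from memory, so double-check them against the sample storage-conf.xml that ships with your build, but the idea is to let each memtable grow larger before it gets flushed and compacted, e.g. something like:

    <MemtableThroughputInMB>128</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.6</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

And for swap, a rough sketch of adding a 2G swap file (adjust the size, and the path if /mnt isn't where your ephemeral disk is mounted):

    # create, format and enable a 2G swap file, then make it survive reboots
    dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048
    mkswap /mnt/swapfile
    swapon /mnt/swapfile
    echo '/mnt/swapfile none swap sw 0 0' >> /etc/fstab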
We are running a two node cluster using Amazon large EC2 instances as well. The cluster is using a replication factor of 2, and most of my writes and reads are at a consistency level of ONE except for a few QUORUM calls. The only difference in my JVM opts is that my max is set at 6G. I have the two ephemeral disks set up as a RAID 0 array and that is where I'm storing the data. The commit logs are going to the default location, so they are using the local disk.

We currently have more than 90G of data running on these and have only had issues with CPU utilization when our code was accidentally duplicating content to one of the servers. That duplication caused the server to be in a state of constant major compaction, and it couldn't keep up with new writes. In the end, I completely dropped that server and spun up another one to take its place, since the one good server had all the data anyway. So it might have also been an issue with that box.

One more question: are all of the instances in the same region?

Lee Parker

On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:

> Here are the current jvm args and java version:
>
> # Arguments to pass to the JVM
> JVM_OPTS=" \
>         -ea \
>         -Xms128M \
>         -Xmx7G \
>         -XX:TargetSurvivorRatio=90 \
>         -XX:+AggressiveOpts \
>         -XX:+UseParNewGC \
>         -XX:+UseConcMarkSweepGC \
>         -XX:+CMSParallelRemarkEnabled \
>         -XX:+HeapDumpOnOutOfMemoryError \
>         -XX:SurvivorRatio=128 \
>         -XX:MaxTenuringThreshold=0 \
>         -Dcom.sun.management.jmxremote.port=8080 \
>         -Dcom.sun.management.jmxremote.ssl=false \
>         -Dcom.sun.management.jmxremote.authenticate=false"
>
> java -version outputs:
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> So pretty much the defaults aside from the 7Gig max heap. CPU is totally hammered right now, and it is receiving 0 ops/sec from me since I disconnected it from our application until I can figure out what's going on.
>
> Running top on the machine I get:
>
> top - 18:56:32 up 2 days, 20:57, 2 users, load average: 14.97, 15.24, 15.13
> Tasks: 87 total, 5 running, 82 sleeping, 0 stopped, 0 zombie
> Cpu(s): 40.1%us, 33.9%sy, 0.0%ni, 0.1%id, 0.0%wa, 0.0%hi, 1.3%si, 24.6%st
> Mem: 7872040k total, 3618764k used, 4253276k free, 387536k buffers
> Swap: 0k total, 0k used, 0k free, 1655556k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3  5846:35  java
>
> I have jconsole up and running, and the jconsole VM Summary tab says:
> - Total physical memory: 7,872,040 K
> - Free physical memory: 4,253,036 K
> - Total swap space: 0 K
> - Free swap space: 0 K
> - Committed virtual memory: 8,096,648 K
>
> Is there a specific thread I can look at in jconsole that might give me a clue? It's weird that it's still at 100% CPU even though it's getting no traffic from outside right now. I suppose it might still be talking across the machines though.
>
> Also, stopping and starting cassandra on one of the 4 machines caused the CPU to go back down to almost normal levels.
>
> Here's the ring:
>
> Address       Status  Load     Range                                     Ring
>                                170141183460469231731687303715884105728
> 10.251.XX.XX  Up      2.15 MB  42535295865117307932921825928971026432   |<--|
> 10.250.XX.XX  Up      2.42 MB  85070591730234615865843651857942052864   |   |
> 10.250.XX.XX  Up      2.47 MB  127605887595351923798765477786913079296  |   |
> 10.250.XX.XX  Up      2.46 MB  170141183460469231731687303715884105728  |-->|
>
> Any thoughts?
>
> Best,
>
> Curt
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>
>
> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <green...@gmail.com> wrote:
>
>> Can you provide us with the current JVM args? Also, what kind of workload are you giving the ring (ops/sec)?
>>
>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>
>>> Hello Cassandra users+experts,
>>>
>>> Hopefully someone will be able to point me in the right direction. We have cassandra 0.6.1 working on our test servers and we *thought* everything was great and ready to move to production. We are currently running a ring of 4 large EC2 instances (http://aws.amazon.com/ec2/instance-types/) in production with a replication factor of 3 and a QUORUM consistency level. We ran a test on 1% of our users, and everything was writing to and reading from cassandra great for the first 3 hours. After that point CPU usage spiked to 100% and stayed there, basically on all 4 machines at once. This smells to me like a GC issue, and I'm looking into it with jconsole right now. If anyone can help me debug this and get cassandra all the way up and running without the CPU spiking, I would be forever in their debt.
>>>
>>> I suspect that anyone else running cassandra on large EC2 instances might just be able to tell me what JVM args they are successfully using in a production environment, whether they upgraded from Cassandra 0.6.1 to 0.6.2, and whether they went to batched writes because of bug 1014 (https://issues.apache.org/jira/browse/CASSANDRA-1014). That might answer all my questions.
>>>
>>> Is there anyone on the list who is using large EC2 instances in production? Would you be kind enough to share your JVM arguments and any other tips?
>>>
>>> Thanks for any help,
>>> Curt
>>> --
>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>> http://apps.facebook.com/happyhabitat
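One more note on the jconsole question above: I don't think jconsole will show you per-thread CPU directly, but you can usually find the spinning thread by matching top's per-thread view against a jstack thread dump. A rough sketch, using the PID from your top output (2566); the <tid> placeholder is whatever thread id top reports as hottest:

    top -H -p 2566                 # per-thread view; note the id of the hottest thread
    printf '%x\n' <tid>            # convert that thread id to hex
    jstack 2566 | grep -A 20 'nid=0x<hex>'

If the busy threads turn out to be the ParNew/CMS GC threads, that pretty much confirms the GC theory; if it's a compaction or memtable-flush thread, it points back at the memtable/compaction settings. Turning on GC logging in cassandra.in.sh (-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log, or whatever log path suits you) is another cheap way to confirm it either way.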