The main diagnostic signs of the problem I was seeing are very high system CPU with no user CPU utilization (check with top or sar -u), vmstat showing one process waiting for run time but never seeming to get it, a high page scan rate, and no Cassandra error messages (although nodes dying did *seem* to correlate with flushing memtables and compaction). I am also using a 64-bit kernel.
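For anyone who wants to check for the same signature without watching top and vmstat by hand, here is a rough Python sketch. It assumes a Linux box with /proc/stat and /proc/vmstat; the function names (parse_cpu_times, parse_page_scans, cpu_shares, sample) are my own and not part of any Cassandra tooling, and the exact pgscan counter names vary by kernel version, so treat the counter matching as a best guess rather than a definitive list.

```python
import time


def parse_cpu_times(stat_text):
    """Return (user, system, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            user = values[0] + values[1]  # user + nice
            system = values[2]
            # Sum user, nice, system, idle, iowait, irq, softirq, steal.
            # Guest time is skipped because it is already counted in user.
            total = sum(values[:8])
            return user, system, total
    raise ValueError("no aggregate 'cpu' line found")


def parse_page_scans(vmstat_text):
    """Sum the kswapd and direct-reclaim page scan counters from /proc/vmstat.

    Assumption: counters are named pgscan_kswapd* and pgscan_direct*
    (per-zone on older kernels, aggregated on newer ones).
    """
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "pgscan_direct_throttle":
            continue  # counts throttle events, not pages scanned
        if key.startswith(("pgscan_kswapd", "pgscan_direct")):
            total += int(value)
    return total


def cpu_shares(before, after):
    """Return (user %, system %) of CPU time between two parse_cpu_times() samples."""
    d_user = after[0] - before[0]
    d_system = after[1] - before[1]
    d_total = after[2] - before[2]
    if d_total <= 0:
        return 0.0, 0.0
    return 100.0 * d_user / d_total, 100.0 * d_system / d_total


def _read(path):
    with open(path) as handle:
        return handle.read()


def sample(interval=5.0):
    """Return (user %, system %, pages scanned per second) over one interval."""
    cpu_before = parse_cpu_times(_read("/proc/stat"))
    scans_before = parse_page_scans(_read("/proc/vmstat"))
    time.sleep(interval)
    cpu_after = parse_cpu_times(_read("/proc/stat"))
    scans_after = parse_page_scans(_read("/proc/vmstat"))
    user_pct, system_pct = cpu_shares(cpu_before, cpu_after)
    return user_pct, system_pct, (scans_after - scans_before) / interval
```

Calling sample() on a node while it is misbehaving should show the pattern described above if it is the same problem: system % far above user %, together with a page scan rate that stays high from one interval to the next. What counts as "high" depends on the machine, so compare against a healthy node rather than a fixed threshold.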
I was having nodes die every few hours, but ever since I switched from mmap (auto = mmap on 64-bit) to mmap_index_only, things have been rock solid. No downtime in 48+ hours. You haven't really provided enough information to determine whether you are having the same problem I was, but if you think so, I would recommend you at least try switching to mmap_index_only.

Can one of the Cassandra devs, or anybody who knows about memory mapping, comment on my particular mmap situation? I have been thinking about it, and the start of my problems seemed to correlate with my active dataset and single sstable sizes growing beyond the amount of free system memory (12 GB; my nodes have 24 GB total, with 12 GB for the Cassandra heap). Does memory mapping somehow force the data to stay in memory, or prevent that memory from being reclaimed for other purposes? Google does not turn up any nice simple answers.

Dan

From: Christopher Kung [mailto:chris.k...@gmail.com]
Sent: December-22-10 4:09
To: user@cassandra.apache.org
Subject: Cassandra Node Routinely Goes Down - 0.7 RC2

Hey All,

I have been having problems running 0.7 RC2 where one of my two nodes routinely goes down. Sometimes both of them go down. I am running the nodes on Ubuntu Lucid LTS 64-bit with kernel version 2.6.32. Currently, both nodes are running on micro instances on EC2. I will eventually migrate to large instances, but I can't seem to get Cassandra to stay up for more than 1 day at a time.

I saw another post recently where someone else was having a similar problem, and the solution was to change the disk access mode to mmap_index_only rather than auto. Anyway, the machines are 64-bit, despite being underpowered, so I don't see why that's necessary.

I checked my logs and there are no error messages. Are the nodes just running into resource issues?

Thanks.
Chris