I have been having severe and strange reliability problems within my Cassandra cluster. This weekend, all four of my nodes were down at once. Even now I am loosing one every few hours. I have attached output from all the system monitoring commands I can think of.
What seems to happen is that the java process locks up and sits and has 100% system cpu usage (but no user-CPU) (there are 8 cores so 100%=1/8 total capacity). JMX freezes and the node effectively dies, but there is typically nothing unusual in the Cassandra logs. About the only thing which seems to be correlated is the flushing of memtables tables. One of the strangest stats I am getting when in this state is memory paging: 3727168.00 pages scanned/second (see sar -B output). Occasionally, if I leave the process alone (~1 h) it recovers (maybe 1 in 5 times), otherwise the only way to terminate the Cassandra process is with a kill -9. When this happens, Cassandra memory usage (as reported by JMX before it dies) is also reasonable (ex 6 GB out of 12 GB heap and 24 GB system). This feels more like a system level problem than a Cassandra problem so I have tried diversifying my cluster, one node runs Ubuntu 10.10, the other three 10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22). Neither change seems be correlated with the problem. These are pretty much stock ubuntu installs so nothing special on that front. Now this has been a relatively sudden development and I can potentially attribute it to a few things: 1. Upgrading to RC2 2. Ever increasing amounts of data (there is less than 100 gb per node so this should not be the problem). 3. Migrating from a set of machines where data+commit log directories were on four small raid 5 hds to machines with two 500 gig drives: one for data and one for commitlog + os. I have seen more IO wait on these new machines. But they have the same memory and system settings. I am about at my wits end on this one, any help would be appreciated.
vmstat output procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 5472 93056 8528 10748436 0 0 90 74 88 11 3 2 94 2 1 0 5472 93052 8528 10748460 0 0 0 4 409 521 0 18 82 0 1 0 5472 93052 8528 10748460 0 0 0 0 365 485 0 16 84 0 1 0 5472 93052 8528 10748460 0 0 0 0 358 481 0 11 89 0 1 0 5472 93052 8528 10748460 0 0 0 0 369 484 0 12 88 0
TOP Output top - 10:58:19 up 12:57, 2 users, load average: 1.07, 1.21, 1.51 Tasks: 185 total, 1 running, 184 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 11.1%sy, 0.0%ni, 88.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 24733336k total, 24643380k used, 89956k free, 8752k buffers Swap: 1952760k total, 5472k used, 1947288k free, 10748616k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1442 dhendry 20 0 91.9g 20g 8.4g S 100 88.4 263:15.73 java 1101 root 20 0 11280 584 448 S 0 0.0 0:46.48 irqbalance 4352 dhendry 20 0 19220 1472 1064 R 0 0.0 0:00.02 top 1 root 20 0 23704 1764 1216 S 0 0.0 0:01.54 init 2 root 20 0 0 0 0 S 0 0.0 0:00.02 kthreadd 3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
output of sar -u 1 5 Linux 2.6.32-21-server (talk12) 12/20/2010 _x86_64_ (8 CPU) 11:12:25 AM CPU %user %nice %system %iowait %steal %idle 11:12:26 AM all 0.00 0.00 18.20 0.00 0.00 81.80 11:12:27 AM all 0.00 0.00 13.70 0.41 0.00 85.89 11:12:28 AM all 0.00 0.00 10.52 0.00 0.00 89.48 11:12:29 AM all 0.00 0.00 16.94 0.00 0.00 83.06 11:12:30 AM all 0.00 0.00 14.92 0.00 0.00 85.08 Average: all 0.00 0.00 14.32 0.08 0.00 85.60
Output of sar -B 1 5 Linux 2.6.32-21-server (talk12) 12/20/2010 _x86_64_ (8 CPU) 11:12:59 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 11:13:00 AM 0.00 0.00 38.00 0.00 61.00 0.00 3727168.00 0.00 0.00 11:13:01 AM 0.00 0.00 39.00 0.00 61.00 0.00 3714400.00 0.00 0.00 11:13:02 AM 0.00 0.00 33.00 0.00 61.00 0.00 3724384.00 0.00 0.00 11:13:03 AM 0.00 0.00 33.00 0.00 61.00 0.00 3716192.00 0.00 0.00 11:13:04 AM 0.00 0.00 33.00 0.00 61.00 0.00 3713184.00 0.00 0.00 Average: 0.00 0.00 35.20 0.00 61.00 0.00 3719065.60 0.00 0.00
iostat output: avg-cpu: %user %nice %system %iowait %steal %idle 2.30 0.39 1.95 1.83 0.00 93.54 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 2.03 26.04 496.73 1216622 23207610 sdb 21.59 1425.70 697.33 66609722 32579816 avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 16.53 0.00 0.00 83.47 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 10.59 0.00 0.00 89.41 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0
output of free -m total used free shared buffers cached Mem: 24153 24039 114 0 9 10459 -/+ buffers/cache: 13570 10582 Swap: 1906 5 1901
Cassandra log: INFO [COMMIT-LOG-WRITER] 2010-12-20 10:44:41,936 CommitLogSegment.java (line 50) Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1292863481936.log INFO [COMMIT-LOG-WRITER] 2010-12-20 10:45:39,167 CommitLogSegment.java (line 50) Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1292863539166.log INFO [MutationStage:26] 2010-12-20 10:48:29,140 ColumnFamilyStore.java (line 640) switching in a fresh Memtable for UserEventsByUser at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292863539166.log', position=49812421) INFO [MutationStage:26] 2010-12-20 10:48:29,140 ColumnFamilyStore.java (line 944) Enqueuing flush of memtable-usereventsbyu...@1784371754(162606261 bytes, 2097613 operations) INFO [FlushWriter:1] 2010-12-20 10:48:29,140 Memtable.java (line 155) Writing memtable-usereventsbyu...@1784371754(162606261 bytes, 2097613 operations)