I stopped writing to the cluster more than 8 hours ago; at worst, I could only be getting a periodic memtable dump (I think).
Running 16 QUORUM read threads, getting 600 records/second.

Sar for all 3 nodes (collected almost simultaneously):

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     10.86      0.00      2.61     44.47      0.00     42.06

Average:        tps      rtps      wtps   bread/s   bwrtn/s
Average:     284.76    283.96      0.80  14541.83      7.17
----------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     14.33      0.00      2.99     31.45      0.00     51.23

Average:        tps      rtps      wtps   bread/s   bwrtn/s
Average:     219.26    217.96      1.30   4320.16     90.22
----------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     51.76      0.00      7.50     28.38      0.00     12.35

Average:        tps      rtps      wtps   bread/s   bwrtn/s
Average:     164.72    163.73      0.99  15892.17      8.72

And the client:
------------------------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.35      0.00      0.89      0.00      0.00     98.77

Average:        tps      rtps      wtps   bread/s   bwrtn/s
Average:       0.90      0.10      0.80     25.60     27.20

From: Avinash Lakshman [mailto:avinash.laksh...@gmail.com]
Sent: Thursday, April 08, 2010 10:15 AM
To: user@cassandra.apache.org
Subject: Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

The sawtooth wave in memory utilization could be memtable dumps. I/O wait in TCP happens when you are overwhelming the server with requests. Could you run sar and find out how many bytes/sec you are receiving/transmitting?

Cheers
Avinash

On Thu, Apr 8, 2010 at 7:45 AM, Mark Jones <mjo...@imagehawk.com> wrote:

I don't see any way to increase the # of active Deserializers in storage-conf.xml.

Tpstats more than 8 hours after insert/read stop:

Pool Name                    Active   Pending   Completed
FILEUTILS-DELETE-POOL             0         0         227
STREAM-STAGE                      0         0           1
RESPONSE-STAGE                    0         0    76724280
ROW-READ-STAGE                    8      4091     1138277
LB-OPERATIONS                     0         0           0
MESSAGE-DESERIALIZER-POOL         1   1849826    78135012
GMFD                              0         0      136886
LB-TARGET                         0         0           0
CONSISTENCY-MANAGER               0         0        1803
ROW-MUTATION-STAGE                0         0    68669717
MESSAGE-STREAMING-POOL            0         0           0
LOAD-BALANCER-STAGE               0         0           0
FLUSH-SORTER-POOL                 0         0           0
MEMTABLE-POST-FLUSHER             0         0         438
FLUSH-WRITER-POOL                 0         0         438
AE-SERVICE-STAGE                  0         0           3
HINTED-HANDOFF-POOL               0         0           3

More than 30 minutes later (with no reads or writes to the cluster):

Pool Name                    Active   Pending   Completed
FILEUTILS-DELETE-POOL             0         0         227
STREAM-STAGE                      0         0           1
RESPONSE-STAGE                    0         0    76724280
ROW-READ-STAGE                    8      4098     1314304
LB-OPERATIONS                     0         0           0
MESSAGE-DESERIALIZER-POOL         1   1663578    78336771
GMFD                              0         0      142651
LB-TARGET                         0         0           0
CONSISTENCY-MANAGER               0         0        1803
ROW-MUTATION-STAGE                0         0    68669717
MESSAGE-STREAMING-POOL            0         0           0
LOAD-BALANCER-STAGE               0         0           0
FLUSH-SORTER-POOL                 0         0           0
MEMTABLE-POST-FLUSHER             0         0         438
FLUSH-WRITER-POOL                 0         0         438
AE-SERVICE-STAGE                  0         0           3
HINTED-HANDOFF-POOL               0         0           3

The other 2 nodes in the cluster have Pending counts of 0, but this node seems hung indefinitely, processing requests that should have long ago timed out for the client.

top is showing a huge amount of I/O wait, but I'm not sure how to track where the wait is happening below here.

I now have jconsole up and running on this machine, and the memory usage appears to be a sawtooth wave, going from 1 GB up to 4 GB over 3 hours, then plunging back to 1 GB and resuming its climb.

top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached
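
A note on the sar question above: the averages posted only cover CPU and block-device activity, not the receive/transmit rates Avinash asked for. Something like the following (assuming a sysstat build that supports the -n flag) would capture per-interface throughput; the interval and count are arbitrary:

    # Sample network throughput every 10 seconds, 6 samples.
    # Receive/transmit rates appear as rxbyt/s and txbyt/s
    # (rxkB/s and txkB/s on newer sysstat versions).
    sar -n DEV 10 6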
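For tracking where the I/O wait in top is going, a per-device view from the same sysstat/procps toolset may help; this is only a sketch of the usual starting point, not a prescription:

    # Extended per-device stats (utilization, queue size, average wait),
    # refreshed every 10 seconds.
    iostat -x 10

    # System-wide swap-in/swap-out and block I/O rates; worth watching here
    # since the top output shows roughly 1.9 GB of swap in use.
    vmstat 10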
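On the storage-conf.xml question: the deserializer pool itself does not appear to be exposed there, but if I recall the 0.6-era config correctly, the read stage that is pinned at 8 Active in the tpstats output is sized by ConcurrentReads. A hypothetical change (the values are illustrative only) would look like:

    <!-- Number of reads the node services concurrently (ROW-READ-STAGE);
         a default of 8 would match the Active count in the tpstats above. -->
    <ConcurrentReads>16</ConcurrentReads>
    <ConcurrentWrites>32</ConcurrentWrites>

Raising it only helps if the disks can absorb more concurrent reads; with ~40% iowait already, the backlog may simply move from the queue to the devices.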