I forgot to mention: when things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted Cassandra once while a node was in this state, and it took 45 minutes to start (primarily reading SSTables). Typically, a node starts in about 5 minutes.
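For reference, the per-minute sampling is just the stock tools run on an interval; something like the following would catch a node in the act (a rough sketch, assuming the sysstat package is installed for iostat, and using the drive names from the volume-creation script quoted below):

vmstat 60        # "wa" = I/O wait, "st" = CPU time stolen by the hypervisor
iostat -x -d 60  # per-device await/%util; a single degraded ephemeral drive
                 # (sdb-sde) should stand out from the others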
Thanks,
-Mike

On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

> Hello,
>
> We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues.
>
> So it appears that the Cassandra node runs fine, then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up and stays up indefinitely. Even non-Cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Before an issue appears, we typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity.
>
> - Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue.
> - Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal.
>
> One relatively recent change we've made is upgrading to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances we had the same setup, although it only used 2 ephemeral drives. We chose LVM over mdadm because we were having issues with mdadm reliably re-creating the RAID volume on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade).
>
> For reference, this is the script we used to create the logical volume:
>
> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
> sleep 2
> mkfs.xfs /dev/mnt_vg/mnt_lv
> sleep 3
> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
> sleep 3
>
> Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones in the same region have yet to show an issue.
>
> It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated.
>
> Thanks,
> -Mike
>
>
> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
>
>> Thanks.
>>
>> We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though.
>>
>> -Mike
>>
>> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>>
>>> top command? st : time stolen from this vm by the hypervisor
>>>
>>> jason
>>>
>>>
>>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mthero...@yahoo.com> wrote:
>>> Sorry, not sure what CPU steal is :)
>>>
>>> I have the AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages.
>>>
>>> -Mike
>>>
>>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>>
>>>>> The messages appear right after the node "wakes up".
>>>> Are you tracking CPU steal?
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Consultant
>>>> New Zealand
>>>>
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 25/04/2013, at 4:15 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>
>>>>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux <mthero...@yahoo.com> wrote:
>>>>>> Another related question. Once we see messages being dropped on one node, our Cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue?
>>>>>
>>>>> If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors.
>>>>>
>>>>> =Rob
>>>>
>>>
>>>
>>
>
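P.S. For anyone hitting the same mdadm re-assembly problem mentioned in the quoted thread above, the usual workaround (a rough sketch; we haven't re-verified it on m1.xlarge) is to persist the array definition after creating it, rather than switching to LVM:

mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm --detail --scan >> /etc/mdadm.conf   # record the array so it re-assembles on boot
mkfs.xfs /dev/md0
mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data

And to double-check the stripe layout of the LVM volume we're using now:

lvs -o +devices mnt_vg   # shows the physical volumes and stripe count behind mnt_lv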