We've also had issues with ephemeral drives in a single AZ in us-east-1, so much so that we no longer use that AZ. Our issues tended to be obvious from instance boot, though - they wouldn't suddenly degrade.
On Apr 28, 2013, at 2:27 PM, Alex Major wrote:

> Hi Mike,
>
> We had issues with the ephemeral drives when we first got started, although
> we never got to the bottom of it, so I can't help much with troubleshooting,
> unfortunately. Contrary to a lot of the comments on the mailing list, we've
> actually had a lot more success with EBS drives (PIOPS!). I'd definitely
> suggest trying a stripe of 4 EBS drives (RAID 0) with PIOPS.
>
> You could be having a noisy-neighbour problem; I don't believe that m1.large
> or m1.xlarge instances get all of the actual hardware, and virtualisation on
> EC2 still sucks at isolating resources.
>
> We've also had more success with Ubuntu on EC2 - not so much with our
> Cassandra nodes, but some of our other services didn't run as well on Amazon
> Linux AMIs.
>
> Alex
>
>
> On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux <mthero...@yahoo.com> wrote:
> I forgot to mention,
>
> When things go really bad, I'm seeing I/O waits in the 80-95% range. I
> restarted Cassandra once when a node was in this situation, and it took 45
> minutes to start (primarily reading SSTables). Typically, a node would start
> in about 5 minutes.
>
> Thanks,
> -Mike
>
> On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:
>
>> Hello,
>>
>> We've done some additional monitoring, and I think we have more
>> information. We've been collecting vmstat information every minute,
>> attempting to catch a node with issues.
>>
>> So it appears that the Cassandra node runs fine. Then suddenly, without
>> any correlation to any event that I can identify, the I/O wait time goes
>> way up and stays up indefinitely. Even non-Cassandra I/O activities (such
>> as snapshots and backups) start causing large I/O wait times when they
>> typically would not. Before an issue, we would typically see I/O wait
>> times of 3-4% with very few processes blocked on I/O. Once this issue
>> manifests itself, I/O wait times for the same activities jump to 30-40%
>> with many blocked processes. The I/O wait times do go back down when there
>> is literally no activity.
>>
>> - Updating the node to the latest Amazon Linux patches and rebooting the
>>   instance doesn't correct the issue.
>> - Backing up the node and replacing the instance does correct the issue.
>>   I/O wait times return to normal.
>>
>> One relatively recent change we've made is that we upgraded to m1.xlarge
>> instances, which have 4 ephemeral drives available. We create a logical
>> volume from the 4 drives with the idea that we should be able to get
>> increased I/O throughput. When we ran m1.large instances, we had the same
>> setup, although it only used 2 ephemeral drives. We chose LVM over mdadm
>> because we were having issues getting mdadm to create the RAID volume
>> reliably on restart (and research showed that this was a common problem).
>> LVM just worked (and had worked for months before this upgrade).
>>
>> For reference, this is the script we used to create the logical volume:
>>
>> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
>> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
>> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
>> sleep 2
>> mkfs.xfs /dev/mnt_vg/mnt_lv
>> sleep 3
>> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
>> sleep 3
>>
>> Another tidbit... thus far (and this may be only a coincidence), we've only
>> had to replace DB nodes within a single availability zone within us-east.
>> Other availability zones in the same region have yet to show an issue.
>>
>> It looks like I'm going to need to replace a third DB node today. Any
>> advice would be appreciated.
>>
>> Thanks,
>> -Mike
>>
>>
>> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
>>
>>> Thanks.
>>>
>>> We weren't monitoring this value when the issue occurred, and this
>>> particular issue has not appeared for a couple of days (knock on wood).
>>> Will keep an eye out, though.
>>>
>>> -Mike
>>>
>>> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>>>
>>>> top command? st: time stolen from this VM by the hypervisor
>>>>
>>>> jason
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mthero...@yahoo.com>
>>>> wrote:
>>>> Sorry, not sure what CPU steal is :)
>>>>
>>>> I have the AWS console with detailed monitoring enabled... things seem
>>>> to track close to the minute, so I can see the CPU load go to 0... then
>>>> jump at about the minute Cassandra reports the dropped messages.
>>>>
>>>> -Mike
>>>>
>>>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>>>
>>>>>> The messages appear right after the node "wakes up".
>>>>> Are you tracking CPU steal?
>>>>>
>>>>> -----------------
>>>>> Aaron Morton
>>>>> Freelance Cassandra Consultant
>>>>> New Zealand
>>>>>
>>>>> @aaronmorton
>>>>> http://www.thelastpickle.com
>>>>>
>>>>> On 25/04/2013, at 4:15 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>>
>>>>>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux
>>>>>> <mthero...@yahoo.com> wrote:
>>>>>>> Another related question. Once we see messages being dropped on one
>>>>>>> node, our Cassandra client appears to see this, reporting errors. We
>>>>>>> use LOCAL_QUORUM with an RF of 3 on all queries. Any idea why clients
>>>>>>> would see an error? If only one node reports an error, shouldn't the
>>>>>>> consistency level prevent the client from seeing an issue?
>>>>>>
>>>>>> If the client is talking to a broken/degraded coordinator node, RF/CL
>>>>>> are unable to protect it from RPCTimeout. If it is unable to
>>>>>> coordinate the request in a timely fashion, your clients will get
>>>>>> errors.
>>>>>>
>>>>>> =Rob
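
A rough sketch of what Alex's suggestion - four provisioned-IOPS (PIOPS) EBS
volumes striped in RAID 0 - might look like, assuming the volumes are already
attached as /dev/xvdf through /dev/xvdi (device names, volume count and the
/data mount point are illustrative, not from the thread). mdadm is shown here;
the LVM script Michael posted would stripe the volumes just as well:

    # RAID 0 stripe across 4 EBS volumes (hypothetical device names)
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    # same filesystem and mount options as the ephemeral setup in the thread
    mkfs.xfs /dev/md0
    mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data

Since unreliable array assembly on reboot is what pushed Michael to LVM,
persisting the array definition (for example, appending the output of
mdadm --detail --scan to /etc/mdadm.conf) is the part worth testing.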
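
Michael mentions collecting vmstat information every minute to catch a node as
it degrades; the thread doesn't include that collection script, but a minimal
sketch of the idea, run from cron every minute, could look like this (the log
path is an assumption):

    #!/bin/sh
    # io-watch.sh - append one timestamped vmstat sample per run.
    # vmstat's second output line is the live sample (the first is the
    # average since boot); the "wa" column is I/O wait and "st" is CPU steal.
    {
        date
        vmstat 1 2 | tail -n 1
    } >> /var/log/io-watch.log

A sudden, sustained jump in "wa" (from the ~3-4% baseline to 30-40%) or a
non-zero "st" in these samples can then be lined up against the times
Cassandra logs dropped messages.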
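
For the two checks suggested at the end of the thread - CPU steal and dropped
messages - the standard Linux tools plus Cassandra's own nodetool are enough;
for example, run on the affected node (no extra setup assumed):

    vmstat 1 5         # watch the "st" column; sustained non-zero values mean
                       # the hypervisor is stealing CPU from this instance
    top                # the same figure shows up as %st on the Cpu(s) line
    nodetool tpstats   # the dropped-message counts at the bottom show which
                       # message types (MUTATION, READ, ...) are being shed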