We've also had issues with ephemeral drives in a single AZ in us-east-1, so much so that we no longer use that AZ. Our issues tended to be obvious from instance boot, though - they wouldn't suddenly degrade.
On Apr 28, 2013, at 2:27 PM, Alex Major wrote:

> Hi Mike,
>
> We had issues with the ephemeral drives when we first got started, although
> we never got to the bottom of it, so I can't help much with troubleshooting,
> unfortunately. Contrary to a lot of the comments on the mailing list, we've
> actually had a lot more success with EBS drives (PIOPS!). I'd definitely
> suggest trying a stripe of 4 EBS drives (RAID 0) with PIOPS.
>
> You could be having a noisy-neighbour problem; I don't believe that m1.large
> or m1.xlarge instances get all of the actual hardware, and virtualisation on
> EC2 still sucks at isolating resources.
>
> We've also had more success with Ubuntu on EC2 - not so much with our
> Cassandra nodes, but some of our other services didn't run as well on Amazon
> Linux AMIs.
>
> Alex
>
>
> On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux <mthero...@yahoo.com> wrote:
> I forgot to mention,
>
> When things go really bad, I'm seeing I/O waits in the 80-95% range. I
> restarted Cassandra once when a node was in this situation, and it took 45
> minutes to start (primarily reading SSTables). Typically, a node would start
> in about 5 minutes.
>
> Thanks,
> -Mike
>
> On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:
>
>> Hello,
>>
>> We've done some additional monitoring, and I think we have more
>> information. We've been collecting vmstat information every minute,
>> attempting to catch a node with issues.
>>
>> So it appears that the Cassandra node runs fine. Then suddenly, without
>> any correlation to any event that I can identify, the I/O wait time goes
>> way up and stays up indefinitely. Even non-Cassandra I/O activities (such
>> as snapshots and backups) start causing large I/O wait times when they
>> typically would not. Before an issue, we would typically see I/O wait
>> times of 3-4% with very few processes blocked on I/O. Once this issue
>> manifests itself, I/O wait times for the same activities jump to 30-40%
>> with many blocked processes. The I/O wait times do go back down when there
>> is literally no activity.
>>
>> - Updating the node to the latest Amazon Linux patches and rebooting the
>>   instance doesn't correct the issue.
>> - Backing up the node and replacing the instance does correct the issue.
>>   I/O wait times return to normal.
>>
>> One relatively recent change we've made is that we upgraded to m1.xlarge
>> instances, which have 4 ephemeral drives available. We create a logical
>> volume from the 4 drives with the idea that we should be able to get
>> increased I/O throughput. When we ran m1.large instances, we had the same
>> setup, although it only used 2 ephemeral drives. We chose LVM over mdadm
>> because we were having issues getting mdadm to create the RAID volume
>> reliably on restart (and research showed that this was a common problem).
>> LVM just worked (and had worked for months before this upgrade).
>>
>> For reference, this is the script we used to create the logical volume:
>>
>> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
>> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
>> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
>> sleep 2
>> mkfs.xfs /dev/mnt_vg/mnt_lv
>> sleep 3
>> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
>> sleep 3
>>
>> Another tidbit... thus far (and this may be only a coincidence), we've only
>> had to replace DB nodes within a single availability zone within us-east.
>> Other availability zones in the same region have yet to show an issue.
>>
>> It looks like I'm going to need to replace a third DB node today. Any
>> advice would be appreciated.
>>
>> Thanks,
>> -Mike
>>
>>
>> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
>>
>>> Thanks.
>>>
>>> We weren't monitoring this value when the issue occurred, and this
>>> particular issue has not appeared for a couple of days (knock on wood).
>>> Will keep an eye out, though.
>>>
>>> -Mike
>>>
>>> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>>>
>>>> top command? st: time stolen from this VM by the hypervisor
>>>>
>>>> jason
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mthero...@yahoo.com>
>>>> wrote:
>>>> Sorry, not sure what CPU steal is :)
>>>>
>>>> I have the AWS console with detailed monitoring enabled... things seem
>>>> to track close to the minute, so I can see the CPU load go to 0... then
>>>> jump at about the minute Cassandra reports the dropped messages.
>>>>
>>>> -Mike
>>>>
>>>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>>>
>>>>>> The messages appear right after the node "wakes up".
>>>>> Are you tracking CPU steal?
>>>>>
>>>>> -----------------
>>>>> Aaron Morton
>>>>> Freelance Cassandra Consultant
>>>>> New Zealand
>>>>>
>>>>> @aaronmorton
>>>>> http://www.thelastpickle.com
>>>>>
>>>>> On 25/04/2013, at 4:15 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>>
>>>>>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux
>>>>>> <mthero...@yahoo.com> wrote:
>>>>>>> Another related question. Once we see messages being dropped on one
>>>>>>> node, our Cassandra client appears to see this, reporting errors. We
>>>>>>> use LOCAL_QUORUM with an RF of 3 on all queries. Any idea why clients
>>>>>>> would see an error? If only one node reports an error, shouldn't the
>>>>>>> consistency level prevent the client from seeing an issue?
>>>>>>
>>>>>> If the client is talking to a broken/degraded coordinator node, RF/CL
>>>>>> are unable to protect it from RPCTimeout. If it is unable to
>>>>>> coordinate the request in a timely fashion, your clients will get
>>>>>> errors.
>>>>>>
>>>>>> =Rob
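
A rough sketch of what Alex's suggestion - four provisioned-IOPS (PIOPS) EBS
volumes striped in RAID 0 - might look like, assuming the volumes are already
attached as /dev/xvdf through /dev/xvdi (device names, volume count and the
/data mount point are illustrative, not from the thread). mdadm is shown here;
the LVM script Michael posted would stripe the volumes just as well:

    # RAID 0 stripe across 4 EBS volumes (hypothetical device names)
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    # same filesystem and mount options as the ephemeral setup in the thread
    mkfs.xfs /dev/md0
    mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data

Since unreliable array assembly on reboot is what pushed Michael to LVM,
persisting the array definition (for example, appending the output of
mdadm --detail --scan to /etc/mdadm.conf) is the part worth testing.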
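
Michael mentions collecting vmstat information every minute to catch a node as
it degrades; the thread doesn't include that collection script, but a minimal
sketch of the idea, run from cron every minute, could look like this (the log
path is an assumption):

    #!/bin/sh
    # io-watch.sh - append one timestamped vmstat sample per run.
    # vmstat's second output line is the live sample (the first is the
    # average since boot); the "wa" column is I/O wait and "st" is CPU steal.
    {
        date
        vmstat 1 2 | tail -n 1
    } >> /var/log/io-watch.log

A sudden, sustained jump in "wa" (from the ~3-4% baseline to 30-40%) or a
non-zero "st" in these samples can then be lined up against the times
Cassandra logs dropped messages.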
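
For the two checks suggested at the end of the thread - CPU steal and dropped
messages - the standard Linux tools plus Cassandra's own nodetool are enough;
for example, run on the affected node (no extra setup assumed):

    vmstat 1 5         # watch the "st" column; sustained non-zero values mean
                       # the hypervisor is stealing CPU from this instance
    top                # the same figure shows up as %st on the Cpu(s) line
    nodetool tpstats   # the dropped-message counts at the bottom show which
                       # message types (MUTATION, READ, ...) are being shed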