Thanks for the update Philippe. Other people have previously reported high await on a single volume, but I don't think it had been blamed on noisy neighbours before. It's interesting that you can have noisy neighbours for IO only.
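For anyone chasing the same symptom: one lightweight way to keep an eye on per-device await between iostat runs is to sample /proc/diskstats and compute it from the deltas (await is just time spent on IO divided by IOs completed over the interval). A minimal sketch, assuming a Linux guest and Python 3, with the ephemeral devices assumed to be named xvdb..xvde (adjust DEVICES to your actual raid members):

#!/usr/bin/env python3
# Rough per-device r_await / w_await sampler based on /proc/diskstats.
# Assumes a Linux guest; device names below are an assumption, adjust to your raid members.
import time

DEVICES = {"xvdb", "xvdc", "xvdd", "xvde"}   # hypothetical names for the 4 raid0 members
INTERVAL = 5                                  # seconds between samples

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            name = parts[2]
            if name in DEVICES:
                # reads completed, ms spent reading, writes completed, ms spent writing
                stats[name] = (int(parts[3]), int(parts[6]), int(parts[7]), int(parts[10]))
    return stats

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    for dev in sorted(cur):
        dr   = cur[dev][0] - prev[dev][0]   # reads completed during the interval
        drms = cur[dev][1] - prev[dev][1]   # ms spent reading during the interval
        dw   = cur[dev][2] - prev[dev][2]   # writes completed
        dwms = cur[dev][3] - prev[dev][3]   # ms spent writing
        r_await = drms / dr if dr else 0.0
        w_await = dwms / dw if dw else 0.0
        print(f"{dev}: r_await={r_await:.1f}ms w_await={w_await:.1f}ms")
    print("---")
    prev = cur

If one device consistently shows await several times higher than its raid siblings while throughput stays even, that is the same pattern as in the screenshots below.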
Out of interest, was there much steal reported in top or iostat? (A rough way to check this is sketched at the end of this mail.)

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 6/12/2013, at 4:42 am, Philippe Dupont <pdup...@teads.tv> wrote:

> Hi again,
>
> I have much more information on this case:
>
> We did further investigation on the affected nodes and found await problems on one of the 4 disks in the RAID:
> http://imageshack.com/a/img824/2391/s7q3.jpg
>
> Here is the iostat output from the node:
> http://imageshack.us/a/img7/7282/qq3w.png
>
> You can see that the read and write throughput are exactly the same on the 4 disks of the instance, so the RAID 0 looks good enough. Yet the global await, r_await and w_await are 3 to 5 times higher on the xvde disk than on the other disks.
>
> We reported this to Amazon support, and here is their answer:
>
> "Hello,
>
> I deeply apologize for any inconvenience this has been causing you and thank you for the additional information and screenshots.
>
> Using the instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the underlying hardware it is currently using and I can see it appears to have a noisy neighbor leading to the higher "await" time on that particular device. Since most AWS services are multi-tenant, situations can arise where one customer's resources have the potential to impact the performance of a different customer's resources that reside on the same underlying hardware (a "noisy neighbor"). While these occurrences are rare, they are nonetheless inconvenient and I am very sorry for any impact it has created.
>
> I have also looked into the initial instance referred to when the case was created ("i-xxxxxxx") and cannot see any existing issues (neighboring or otherwise) causing any I/O performance impact; however, at the time the case was created, evidence on our end suggests there was a noisy neighbor then as well. Can you verify if you are still experiencing above-average "await" times on this instance?
>
> If you would like to mitigate the impact of encountering "noisy neighbors", you can look into our Dedicated Instance option; Dedicated Instances launch on hardware dedicated to only a single customer (though this can feasibly lead to a situation where a customer is their own noisy neighbor). However, this is an option available only to instances that are being launched into a VPC and may require modification of the architecture of your use case. I understand the instances belonging to your cluster in question have been launched into EC2-Classic, I just wanted to bring this to your attention as a possible solution. You can read more about Dedicated Instances here:
> http://aws.amazon.com/dedicated-instances/
>
> Again, I am very sorry for the performance impact you have been experiencing due to having noisy neighbors. We understand the frustration and are always actively working to increase capacity so the effects of noisy neighbors are lessened. I hope this information has been useful and if you have any additional questions whatsoever, please do not hesitate to ask!"
>
> To conclude, the only solution, other than moving to a VPC with Dedicated Instances, is to replace this instance with a new one and hope not to get other "noisy neighbors"...
> I hope that will help someone.
>
> Philippe
>
>
> 2013/11/28 Philippe DUPONT <pdup...@teads.tv>
> Hi,
>
> We have a Cassandra cluster of 28 nodes.
> Each one is an EC2 m1.xlarge based on the DataStax AMI, with 4 volumes in RAID 0.
>
> Here is the ticket we opened with Amazon support:
>
> "This RAID is created using the DataStax public AMI: ami-b2212dc6. Sources are also available here: https://github.com/riptano/ComboAMI
>
> As you can see in the attached screenshot (http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently one of the volumes gets fully used (100%) while the 3 others stay at low utilization.
>
> Because of this, the node becomes slow and the whole Cassandra cluster is impacted. We are losing data due to write failures, and losing availability for our customers.
>
> It was in this state for one hour, and we decided to restart it.
>
> We have already removed 3 other instances because of this same issue."
> (see other screenshots)
> http://imageshack.com/a/img824/2391/s7q3.jpg
> http://imageshack.com/a/img10/556/zzk8.jpg
>
> Amazon support took a close look at the instance as well as its underlying hardware for any potential health issues, and both seem to be healthy.
>
> Has anyone already experienced something like this?
>
> Should I contact the AMI author instead?
>
> Thanks a lot,
>
> Philippe.
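On the steal question above: a quick way to watch it over time, rather than eyeballing top, is to sample the aggregate "cpu" line in /proc/stat. A minimal sketch, assuming a Linux guest and Python 3 (the value corresponds to the %st column in top and the %steal column of iostat -c):

#!/usr/bin/env python3
# Rough CPU steal-time sampler based on the aggregate "cpu" line in /proc/stat.
# Assumes a Linux guest; reports the same figure top shows as %st.
import time

def read_cpu():
    # fields after "cpu": user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    steal = fields[7] if len(fields) > 7 else 0
    return steal, sum(fields[:8])

prev_steal, prev_total = read_cpu()
while True:
    time.sleep(5)
    steal, total = read_cpu()
    d_total = total - prev_total
    pct = 100.0 * (steal - prev_steal) / d_total if d_total else 0.0
    print(f"steal: {pct:.1f}%")
    prev_steal, prev_total = steal, total

If this stays near zero while one device's await climbs, that matches the "noisy neighbours for IO only" picture above.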