Thanks for the update Philippe. Other people have previously reported high await on a single volume, but I don't think it had been blamed on noisy neighbours before. It's interesting that you can have noisy neighbours for IO only.
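For anyone chasing the same symptom: one lightweight way to keep an eye on per-device await between iostat runs is to sample /proc/diskstats and compute it from the deltas (await is just time spent on IO divided by IOs completed over the interval). A minimal sketch, assuming a Linux guest and Python 3, with the ephemeral devices assumed to be named xvdb..xvde (adjust DEVICES to your actual raid members):

#!/usr/bin/env python3
# Rough per-device r_await / w_await sampler based on /proc/diskstats.
# Assumes a Linux guest; device names below are an assumption, adjust to your raid members.
import time

DEVICES = {"xvdb", "xvdc", "xvdd", "xvde"}   # hypothetical names for the 4 raid0 members
INTERVAL = 5                                  # seconds between samples

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            name = parts[2]
            if name in DEVICES:
                # reads completed, ms spent reading, writes completed, ms spent writing
                stats[name] = (int(parts[3]), int(parts[6]), int(parts[7]), int(parts[10]))
    return stats

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    for dev in sorted(cur):
        dr   = cur[dev][0] - prev[dev][0]   # reads completed during the interval
        drms = cur[dev][1] - prev[dev][1]   # ms spent reading during the interval
        dw   = cur[dev][2] - prev[dev][2]   # writes completed
        dwms = cur[dev][3] - prev[dev][3]   # ms spent writing
        r_await = drms / dr if dr else 0.0
        w_await = dwms / dw if dw else 0.0
        print(f"{dev}: r_await={r_await:.1f}ms w_await={w_await:.1f}ms")
    print("---")
    prev = cur

If one device consistently shows await several times higher than its raid siblings while throughput stays even, that is the same pattern as in the screenshots below.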
Out of interest, was there much steal reported in top or iostat? (A rough way to check this is sketched at the end of this mail.)

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 6/12/2013, at 4:42 am, Philippe Dupont <pdup...@teads.tv> wrote:

> Hi again,
>
> I have much more information on this case:
>
> We did further investigation on the affected nodes and found await problems on one of the 4 disks in the RAID:
> http://imageshack.com/a/img824/2391/s7q3.jpg
>
> Here is the iostat output from the node:
> http://imageshack.us/a/img7/7282/qq3w.png
>
> You can see that the read and write throughput are exactly the same on the 4 disks of the instance, so the RAID 0 looks good enough. Yet the global await, r_await and w_await are 3 to 5 times higher on the xvde disk than on the other disks.
>
> We reported this to Amazon support, and here is their answer:
>
> "Hello,
>
> I deeply apologize for any inconvenience this has been causing you and thank you for the additional information and screenshots.
>
> Using the instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the underlying hardware it is currently using and I can see it appears to have a noisy neighbor leading to the higher "await" time on that particular device. Since most AWS services are multi-tenant, situations can arise where one customer's resources have the potential to impact the performance of a different customer's resources that reside on the same underlying hardware (a "noisy neighbor"). While these occurrences are rare, they are nonetheless inconvenient and I am very sorry for any impact it has created.
>
> I have also looked into the initial instance referred to when the case was created ("i-xxxxxxx") and cannot see any existing issues (neighboring or otherwise) causing any I/O performance impact; however, at the time the case was created, evidence on our end suggests there was a noisy neighbor then as well. Can you verify if you are still experiencing above-average "await" times on this instance?
>
> If you would like to mitigate the impact of encountering "noisy neighbors", you can look into our Dedicated Instance option; Dedicated Instances launch on hardware dedicated to only a single customer (though this can feasibly lead to a situation where a customer is their own noisy neighbor). However, this is an option available only to instances that are being launched into a VPC and may require modification of the architecture of your use case. I understand the instances belonging to your cluster in question have been launched into EC2-Classic, I just wanted to bring this to your attention as a possible solution. You can read more about Dedicated Instances here:
> http://aws.amazon.com/dedicated-instances/
>
> Again, I am very sorry for the performance impact you have been experiencing due to having noisy neighbors. We understand the frustration and are always actively working to increase capacity so the effects of noisy neighbors are lessened. I hope this information has been useful and if you have any additional questions whatsoever, please do not hesitate to ask!"
>
> To conclude, the only solution, other than moving to a VPC with Dedicated Instances, is to replace this instance with a new one and hope not to get other "noisy neighbors"...
> I hope that will help someone.
>
> Philippe
>
>
> 2013/11/28 Philippe DUPONT <pdup...@teads.tv>
> Hi,
>
> We have a Cassandra cluster of 28 nodes.
> Each one is an EC2 m1.xlarge based on the DataStax AMI, with 4 volumes in RAID 0.
>
> Here is the ticket we opened with Amazon support:
>
> "This RAID is created using the DataStax public AMI: ami-b2212dc6. Sources are also available here: https://github.com/riptano/ComboAMI
>
> As you can see in the attached screenshot (http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently one of the volumes gets fully used (100%) while the 3 others stay at low utilization.
>
> Because of this, the node becomes slow and the whole Cassandra cluster is impacted. We are losing data due to write failures, and losing availability for our customers.
>
> It was in this state for one hour, and we decided to restart it.
>
> We have already removed 3 other instances because of this same issue."
> (see other screenshots)
> http://imageshack.com/a/img824/2391/s7q3.jpg
> http://imageshack.com/a/img10/556/zzk8.jpg
>
> Amazon support took a close look at the instance as well as its underlying hardware for any potential health issues, and both seem to be healthy.
>
> Has anyone already experienced something like this?
>
> Should I contact the AMI author instead?
>
> Thanks a lot,
>
> Philippe.
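On the steal question above: a quick way to watch it over time, rather than eyeballing top, is to sample the aggregate "cpu" line in /proc/stat. A minimal sketch, assuming a Linux guest and Python 3 (the value corresponds to the %st column in top and the %steal column of iostat -c):

#!/usr/bin/env python3
# Rough CPU steal-time sampler based on the aggregate "cpu" line in /proc/stat.
# Assumes a Linux guest; reports the same figure top shows as %st.
import time

def read_cpu():
    # fields after "cpu": user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    steal = fields[7] if len(fields) > 7 else 0
    return steal, sum(fields[:8])

prev_steal, prev_total = read_cpu()
while True:
    time.sleep(5)
    steal, total = read_cpu()
    d_total = total - prev_total
    pct = 100.0 * (steal - prev_steal) / d_total if d_total else 0.0
    print(f"steal: {pct:.1f}%")
    prev_steal, prev_total = steal, total

If this stays near zero while one device's await climbs, that matches the "noisy neighbours for IO only" picture above.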