Hi again, I have more information on this case:

We investigated the affected nodes further and found await problems on one of the 4 disks in the RAID: http://imageshack.com/a/img824/2391/s7q3.jpg

Here is the iostat output of the node: http://imageshack.us/a/img7/7282/qq3w.png

You can see that the write and read throughput are exactly the same on the 4 disks of the instance, so the raid0 striping itself looks fine. Yet the global await, r_await and w_await are 3 to 5 times higher on the xvde disk than on the other disks.

We reported this to Amazon support, and here is their answer:

"
Hello,

I deeply apologize for any inconvenience this has been causing you and thank you for the additional information and screenshots.

Using the instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the underlying hardware it is currently using and I can see it appears to have a noisy neighbor leading to the higher "await" time on that particular device. Since most AWS services are multi-tenant, situations can arise where one customer's resource has the potential to impact the performance of a different customer's resource that reside on the same underlying hardware (a "noisy neighbor"). While these occurrences are rare, they are nonetheless inconvenient and I am very sorry for any impact it has created.

I have also looked into the initial instance referred to when the case was created ("i-xxxxxxx") and cannot see any existing issues (neighboring or otherwise) as to any I/O performance impacts; however, at the time the case was created, evidence on our end suggests there was a noisy neighbor then as well. Can you verify if you are still experiencing above average "await" times on this instance?

If you would like to mitigate the impact of encountering "noisy neighbors", you can look into our Dedicated Instance option; Dedicated Instances launch on hardware dedicated to only a single customer (though this can feasibly lead to a situation where a customer is their own noisy neighbor). However, this is an option available only to instances that are being launched into a VPC and may require modification of the architecture of your use-case. I understand the instances belonging to your cluster in question have been launched into EC2-Classic, I just wanted to bring this your attention as a possible solution. You can read more about Dedicated Instances here: http://aws.amazon.com/dedicated-instances/

Again, I am very sorry for the performance impact you have been experiencing due to having noisy neighbors. We understand the frustration and are always actively working to increase capacity so the effects of noisy neighbors is lessened. I hope this information has been useful and if you have any additional questions whatsoever, please do not hesitate to ask!
"

To conclude, the only other solution, short of moving to a VPC with Dedicated Instances, is to replace this instance with a new one and hope it does not end up with other "noisy neighbors"...

I hope this helps someone.

Philippe

2013/11/28 Philippe DUPONT <pdup...@teads.tv>

> Hi,
>
> We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge
> based on the DataStax AMI with 4 storage volumes in raid0 mode.
>
> Here is the ticket we opened with Amazon support:
>
> "This raid is created using the datastax public AMI : ami-b2212dc6.
> Sources are also available here : https://github.com/riptano/ComboAMI
>
> As you can see in the screenshot attached (
> http://imageshack.com/a/img854/4592/xbqc.jpg) randomly but frequently
> one of the storage volumes gets fully used (100%) while the 3 others stay
> at low use.
>
> Because of this, the node becomes slow and the whole cassandra cluster is
> impacted.
> We are losing data due to failed writes, and losing availability for our
> customers.
>
> It was in this state for one hour, and we decided to restart it.
>
> We already removed 3 other instances because of this same issue."
> (see other screenshots)
> http://imageshack.com/a/img824/2391/s7q3.jpg
> http://imageshack.com/a/img10/556/zzk8.jpg
>
> Amazon support took a close look at the instance as well as its
> underlying hardware for any potential health issues and both seem to be
> healthy.
>
> Has anyone already experienced something like this?
>
> Should I contact the AMI author instead?
>
> Thanks a lot,
>
> Philippe.
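
P.S. In case it helps anyone check their own nodes for the same symptom (one disk of the raid0 with a much higher await than the others), here is a rough Python sketch of the comparison. It is only an illustration, not something we run in production: the function name slow_disks, the threshold factor, the device names other than xvde, and the sample numbers are all assumptions made up for the example. The await values would have to be copied from the await column of iostat.

```python
from statistics import median


def slow_disks(await_ms, factor=3.0):
    """Return devices whose await is at least `factor` times the median
    await of the *other* devices in the array.

    await_ms: dict mapping device name -> await in milliseconds,
              e.g. copied by hand from the await column of iostat.
    factor:   hypothetical threshold; 3.0 simply mirrors the
              "3 to 5 times" gap seen on xvde, it is not a recommendation.
    """
    flagged = []
    for dev, value in await_ms.items():
        others = [v for d, v in await_ms.items() if d != dev]
        if not others:
            continue  # a single disk has nothing to be compared against
        baseline = median(others)
        if baseline > 0 and value >= factor * baseline:
            flagged.append(dev)
    return sorted(flagged)


# Made-up numbers for illustration only (not read from the screenshots).
sample = {"xvdb": 4.0, "xvdc": 5.0, "xvdd": 4.5, "xvde": 18.0}
print(slow_disks(sample))
```

Comparing each disk against the median of the others, rather than against a fixed number of milliseconds, is a deliberate choice here: in a raid0 all members receive the same load, so a healthy array should show similar await on every disk regardless of how busy the node is.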