Hi Aaron,

As you can see in the picture, there is not much steal in iostat, and top
shows the same thing: https://imageshack.com/i/0jm4jyp

Philippe
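For anyone who wants to double-check steal numbers without eyeballing top,
here is a minimal sketch (not from this thread) that computes %steal over a
short interval directly from /proc/stat. It assumes the standard Linux layout
of the first "cpu" line, where steal is the 8th counter.

#!/usr/bin/env python
# Hypothetical helper (not part of the thread): measure %steal over a short
# interval by sampling the aggregate "cpu" line in /proc/stat twice.
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu user nice system idle iowait irq softirq steal ..."
        return [int(x) for x in f.readline().split()[1:]]

def steal_percent(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is steal
    return 100.0 * steal / total if total else 0.0

if __name__ == "__main__":
    print("steal over 5s: %.2f%%" % steal_percent())

A value consistently near zero matches what top's %st column shows.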

2013/12/10 Aaron Morton <aa...@thelastpickle.com>

> Thanks for the update Philip, other people have reported high await on a
> single volume previously, but I don’t think it’s been blamed on noisy
> neighbours. It’s interesting that you can have noisy neighbours for IO only.
>
> Out of interest, was there much steal reported in top or iostat?
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 6/12/2013, at 4:42 am, Philippe Dupont <pdup...@teads.tv> wrote:
>
> Hi again,
>
> I have much more information on this case:
>
> We did further investigation on the affected nodes and found an await
> problem on one of the 4 disks in the RAID:
> http://imageshack.com/a/img824/2391/s7q3.jpg
>
> Here is the iostat output from the node:
> http://imageshack.us/a/img7/7282/qq3w.png
>
> You can see that the read and write throughput are exactly the same on the
> 4 disks of the instance, so the RAID0 itself looks fine. Yet the global
> await, r_await and w_await are 3 to 5 times higher on the xvde disk than on
> the other disks.
>
> We reported this to Amazon support, and here is their answer:
>
> "Hello,
> I deeply apologize for any inconvenience this has been causing you and
> thank you for the additional information and screenshots. Using the
> instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the
> underlying hardware it is currently using and I can see it appears to have
> a noisy neighbor leading to the higher "await" time on that particular
> device. Since most AWS services are multi-tenant, situations can arise
> where one customer's resource has the potential to impact the performance
> of a different customer's resource that resides on the same underlying
> hardware (a "noisy neighbor"). While these occurrences are rare, they are
> nonetheless inconvenient and I am very sorry for any impact it has created.
> I have also looked into the initial instance referred to when the case was
> created ("i-xxxxxxx") and cannot see any existing issues (neighboring or
> otherwise) causing any I/O performance impact; however, at the time the
> case was created, evidence on our end suggests there was a noisy neighbor
> then as well. Can you verify if you are still experiencing above-average
> "await" times on this instance? If you would like to mitigate the impact of
> encountering "noisy neighbors", you can look into our Dedicated Instance
> option; Dedicated Instances launch on hardware dedicated to only a single
> customer (though this can feasibly lead to a situation where a customer is
> their own noisy neighbor). However, this is an option available only to
> instances that are being launched into a VPC and may require modification
> of the architecture of your use case. I understand the instances belonging
> to your cluster in question have been launched into EC2-Classic, I just
> wanted to bring this to your attention as a possible solution. You can read
> more about Dedicated Instances here:
> http://aws.amazon.com/dedicated-instances/
> Again, I am very sorry for the performance impact you have been
> experiencing due to having noisy neighbors. We understand the frustration
> and are always actively working to increase capacity so the effects of
> noisy neighbors are lessened.
> I hope this information has been useful and if you have any additional
> questions whatsoever, please do not hesitate to ask!"
>
> To conclude, short of moving to a VPC and Dedicated Instances, the only
> other solution is to replace this instance with a new one and hope not to
> get another "noisy neighbor"...
> I hope this will help someone.
>
> Philippe
>
>
> 2013/11/28 Philippe DUPONT <pdup...@teads.tv>
>
>> Hi,
>>
>> We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge
>> based on the DataStax AMI, with 4 volumes in RAID0.
>>
>> Here is the ticket we opened with Amazon support:
>>
>> "This RAID is created using the DataStax public AMI: ami-b2212dc6.
>> Sources are also available here: https://github.com/riptano/ComboAMI
>>
>> As you can see in the attached screenshot
>> (http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently
>> one of the volumes gets fully used (100%) while the 3 others stay at low
>> utilization.
>>
>> Because of this, the node becomes slow and the whole Cassandra cluster is
>> impacted. We are losing data due to write failures, and availability for
>> our customers suffers.
>>
>> It was in this state for one hour, and we decided to restart it.
>>
>> We already removed 3 other instances because of this same issue."
>> (see other screenshots)
>> http://imageshack.com/a/img824/2391/s7q3.jpg
>> http://imageshack.com/a/img10/556/zzk8.jpg
>>
>> Amazon support took a close look at the instance as well as its
>> underlying hardware for any potential health issues, and both seem to be
>> healthy.
>>
>> Has anyone already experienced something like this?
>>
>> Should I contact the AMI author instead?
>>
>> Thanks a lot,
>>
>> Philippe.
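For anyone hitting the same symptom, here is a minimal monitoring sketch (not
from this thread; the xvdb-xvde device names and the 5-second interval are
assumptions) that samples /proc/diskstats twice and prints an approximate
await and %util per RAID member, which is enough to spot one volume lagging
the other three the way xvde did above.

#!/usr/bin/env python
# Hypothetical check (not part of the thread): compare per-device average wait
# and utilisation across the RAID0 members by sampling /proc/diskstats.
import time

DEVICES = ["xvdb", "xvdc", "xvdd", "xvde"]  # assumed RAID0 member names

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            p = line.split()
            if p[2] in DEVICES:
                # reads completed, ms reading, writes completed, ms writing,
                # ms spent doing I/O
                stats[p[2]] = (int(p[3]), int(p[6]), int(p[7]),
                               int(p[10]), int(p[12]))
    return stats

def report(interval=5.0):
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    for dev in DEVICES:
        r, rms, w, wms, io_ms = [b - a for a, b in zip(before[dev], after[dev])]
        ios = r + w
        await_ms = (rms + wms) / float(ios) if ios else 0.0
        util = 100.0 * io_ms / (interval * 1000.0)
        print("%s  await=%.1f ms  util=%.0f%%" % (dev, await_ms, util))

if __name__ == "__main__":
    report()

During a bad period this should show roughly equal throughput on all four
devices but a much higher await (and %util) on the noisy one, matching what
iostat -x reports in the screenshots.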