Hi Aaron,

As you can see in the picture, there is not much steal in iostat, and top
shows the same thing: https://imageshack.com/i/0jm4jyp

Philippe
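For anyone who wants to double-check steal numbers without eyeballing top,
here is a minimal sketch (not from this thread) that computes %steal over a
short interval directly from /proc/stat. It assumes the standard Linux layout
of the first "cpu" line, where steal is the 8th counter.

#!/usr/bin/env python
# Hypothetical helper (not part of the thread): measure %steal over a short
# interval by sampling the aggregate "cpu" line in /proc/stat twice.
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu user nice system idle iowait irq softirq steal ..."
        return [int(x) for x in f.readline().split()[1:]]

def steal_percent(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is steal
    return 100.0 * steal / total if total else 0.0

if __name__ == "__main__":
    print("steal over 5s: %.2f%%" % steal_percent())

A value consistently near zero matches what top's %st column shows.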

2013/12/10 Aaron Morton <aa...@thelastpickle.com>

> Thanks for the update Philip, other people have reported high await on a
> single volume previously, but I don’t think it’s been blamed on noisy
> neighbours. It’s interesting that you can have noisy neighbours for IO only.
>
> Out of interest, was there much steal reported in top or iostat?
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 6/12/2013, at 4:42 am, Philippe Dupont <pdup...@teads.tv> wrote:
>
> Hi again,
>
> I have much more information on this case:
>
> We did further investigation on the affected nodes and found an await
> problem on one of the 4 disks in the RAID:
> http://imageshack.com/a/img824/2391/s7q3.jpg
>
> Here is the iostat output from the node:
> http://imageshack.us/a/img7/7282/qq3w.png
>
> You can see that the read and write throughput are exactly the same on the
> 4 disks of the instance, so the RAID0 itself looks fine. Yet the global
> await, r_await and w_await are 3 to 5 times higher on the xvde disk than on
> the other disks.
>
> We reported this to Amazon support, and here is their answer:
>
> "Hello,
> I deeply apologize for any inconvenience this has been causing you and
> thank you for the additional information and screenshots. Using the
> instance you based your "iostat" on ("i-xxxxxxxx"), I have looked into the
> underlying hardware it is currently using and I can see it appears to have
> a noisy neighbor leading to the higher "await" time on that particular
> device. Since most AWS services are multi-tenant, situations can arise
> where one customer's resource has the potential to impact the performance
> of a different customer's resource that resides on the same underlying
> hardware (a "noisy neighbor"). While these occurrences are rare, they are
> nonetheless inconvenient and I am very sorry for any impact it has created.
> I have also looked into the initial instance referred to when the case was
> created ("i-xxxxxxx") and cannot see any existing issues (neighboring or
> otherwise) causing any I/O performance impact; however, at the time the
> case was created, evidence on our end suggests there was a noisy neighbor
> then as well. Can you verify if you are still experiencing above-average
> "await" times on this instance? If you would like to mitigate the impact of
> encountering "noisy neighbors", you can look into our Dedicated Instance
> option; Dedicated Instances launch on hardware dedicated to only a single
> customer (though this can feasibly lead to a situation where a customer is
> their own noisy neighbor). However, this is an option available only to
> instances that are being launched into a VPC and may require modification
> of the architecture of your use case. I understand the instances belonging
> to your cluster in question have been launched into EC2-Classic, I just
> wanted to bring this to your attention as a possible solution. You can read
> more about Dedicated Instances here:
> http://aws.amazon.com/dedicated-instances/
> Again, I am very sorry for the performance impact you have been
> experiencing due to having noisy neighbors. We understand the frustration
> and are always actively working to increase capacity so the effects of
> noisy neighbors are lessened.
> I hope this information has been useful and if you have any additional
> questions whatsoever, please do not hesitate to ask!"
>
> To conclude, short of moving to a VPC and Dedicated Instances, the only
> other solution is to replace this instance with a new one and hope not to
> get another "noisy neighbor"...
> I hope this will help someone.
>
> Philippe
>
>
> 2013/11/28 Philippe DUPONT <pdup...@teads.tv>
>
>> Hi,
>>
>> We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge
>> based on the DataStax AMI, with 4 volumes in RAID0.
>>
>> Here is the ticket we opened with Amazon support:
>>
>> "This RAID is created using the DataStax public AMI: ami-b2212dc6.
>> Sources are also available here: https://github.com/riptano/ComboAMI
>>
>> As you can see in the attached screenshot
>> (http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently
>> one of the volumes gets fully used (100%) while the 3 others stay at low
>> utilization.
>>
>> Because of this, the node becomes slow and the whole Cassandra cluster is
>> impacted. We are losing data due to write failures, and availability for
>> our customers suffers.
>>
>> It was in this state for one hour, and we decided to restart it.
>>
>> We already removed 3 other instances because of this same issue."
>> (see other screenshots)
>> http://imageshack.com/a/img824/2391/s7q3.jpg
>> http://imageshack.com/a/img10/556/zzk8.jpg
>>
>> Amazon support took a close look at the instance as well as its
>> underlying hardware for any potential health issues, and both seem to be
>> healthy.
>>
>> Has anyone already experienced something like this?
>>
>> Should I contact the AMI author instead?
>>
>> Thanks a lot,
>>
>> Philippe.
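For anyone hitting the same symptom, here is a minimal monitoring sketch (not
from this thread; the xvdb-xvde device names and the 5-second interval are
assumptions) that samples /proc/diskstats twice and prints an approximate
await and %util per RAID member, which is enough to spot one volume lagging
the other three the way xvde did above.

#!/usr/bin/env python
# Hypothetical check (not part of the thread): compare per-device average wait
# and utilisation across the RAID0 members by sampling /proc/diskstats.
import time

DEVICES = ["xvdb", "xvdc", "xvdd", "xvde"]  # assumed RAID0 member names

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            p = line.split()
            if p[2] in DEVICES:
                # reads completed, ms reading, writes completed, ms writing,
                # ms spent doing I/O
                stats[p[2]] = (int(p[3]), int(p[6]), int(p[7]),
                               int(p[10]), int(p[12]))
    return stats

def report(interval=5.0):
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    for dev in DEVICES:
        r, rms, w, wms, io_ms = [b - a for a, b in zip(before[dev], after[dev])]
        ios = r + w
        await_ms = (rms + wms) / float(ios) if ios else 0.0
        util = 100.0 * io_ms / (interval * 1000.0)
        print("%s  await=%.1f ms  util=%.0f%%" % (dev, await_ms, util))

if __name__ == "__main__":
    report()

During a bad period this should show roughly equal throughput on all four
devices but a much higher await (and %util) on the noisy one, matching what
iostat -x reports in the screenshots.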