I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80-95% range.  I 
restarted Cassandra once while a node was in this situation, and it took 45 
minutes to start (primarily reading SSTables).  Typically, a node starts 
in about 5 minutes.
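When the aggregate I/O wait climbs like that on a striped volume, per-device numbers are often more telling than the total. Here's a rough sketch of the kind of check I mean (my own, not from any official tooling; it assumes sysstat's `iostat -x` is installed and parses the header to locate the `await` column rather than hardcoding a position):

```shell
# Report per-device await from `iostat -x` output so a single slow
# ephemeral drive stands out from its healthy siblings.
per_device_await() {
    awk '
        # Header line: find which column holds "await".
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) if ($i == "await") col = i
            next
        }
        # Data lines for sdX/xvdX devices: print device and its await.
        col && NF >= col && $1 ~ /^(xv|s)d/ {
            printf "%s await=%sms\n", $1, $col
        }'
}

# Example: iostat -x 60 1 | per_device_await
```

If one of the four drives shows an await an order of magnitude above the others, that would point at a degraded ephemeral disk dragging the whole stripe down.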

Thanks,
-Mike
 
On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

> Hello,
> 
> We've done some additional monitoring, and I think we have more information.  
> We've been collecting vmstat information every minute, attempting to catch a 
> node with issues.
> 
> So, it appears that the Cassandra node runs fine.  Then suddenly, without 
> any correlation to any event I can identify, the I/O wait time goes way 
> up, and stays up indefinitely.  Even non-Cassandra I/O activities (such as 
> snapshots and backups) start causing large I/O wait times when they typically 
> would not.  Before an issue, we would typically see I/O wait times of 3-4% 
> with very few processes blocked on I/O.  Once this issue manifests itself, 
> I/O wait times for the same activities jump to 30-40% with many blocked 
> processes.  The I/O wait times do go back down when there is literally no 
> activity.
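Since the vmstat samples are being collected every minute anyway, a small filter can flag the moment the wait time jumps. A sketch of one way to do it (my own, assuming vmstat's default column layout, where "wa" is the 16th field of each data line; adjust if your vmstat differs):

```shell
# Flag vmstat samples whose I/O-wait ("wa") column exceeds a threshold.
# Reads vmstat output on stdin; threshold defaults to 30 percent.
flag_high_iowait() {
    threshold="${1:-30}"
    awk -v max="$threshold" '
        NR > 2 && $16 + 0 > max {    # NR > 2 skips the two header lines
            printf "sample %d: iowait %s%%\n", NR - 2, $16
        }'
}

# Example: vmstat 60 | flag_high_iowait 30
```

Run against the archived per-minute captures, this would pinpoint the first sample where the node crossed from the normal 3-4% range into the 30-40% range.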
> 
> -  Updating the node to the latest Amazon Linux patches and rebooting the 
> instance doesn't correct the issue.
> -  Backing up the node, and replacing the instance does correct the issue.  
> I/O wait times return to normal.
> 
> One relatively recent change we've made is upgrading to m1.xlarge instances, 
> which have 4 ephemeral drives available.  We create a logical volume from the 
> 4 drives with the idea that we should be able to get increased I/O 
> throughput.  When we ran m1.large instances, we had the same setup, although 
> it used only 2 ephemeral drives.  We chose LVM over mdadm because 
> we were having issues getting mdadm to create the RAID volume reliably on 
> restart (and research showed that this was a common problem).  LVM just 
> worked (and had worked for months before this upgrade).
> 
> For reference, this is the script we used to create the logical volume:
> 
> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
> sleep 2
> mkfs.xfs /dev/mnt_vg/mnt_lv
> sleep 3
> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
> sleep 3
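One side note on that script, offered as a suggestion rather than anything from the thread: the volume is striped with `-i 4 -I 256K`, but `mkfs.xfs` is run with no stripe geometry, so XFS may not align allocation with the stripe (recent mkfs.xfs versions often detect LVM geometry automatically, but passing it explicitly removes the doubt). A tiny helper to derive the matching flags from the lvcreate parameters:

```shell
# Derive mkfs.xfs stripe-alignment flags (-d su=...,sw=...) from the
# LVM stripe geometry: su = stripe size per drive, sw = number of stripes.
xfs_stripe_opts() {
    stripes="$1"        # number of stripes (lvcreate -i)
    stripe_kb="$2"      # stripe size in KiB (lvcreate -I)
    printf -- "-d su=%sk,sw=%s\n" "$stripe_kb" "$stripes"
}

# Example, matching the lvcreate call above:
#   mkfs.xfs $(xfs_stripe_opts 4 256) /dev/mnt_vg/mnt_lv
```

Misaligned stripe geometry wouldn't explain a sudden degradation on a previously healthy node, but it's cheap to rule out when rebuilding the next one.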
> 
> Another tidbit... thus far (and this may be only a coincidence), we've only 
> had to replace DB nodes within a single availability zone within us-east.  
> Other availability zones in the same region have yet to show an issue.
> 
> It looks like I'm going to need to replace a third DB node today.  Any advice 
> would be appreciated.
> 
> Thanks,
> -Mike
> 
> 
> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
> 
>> Thanks.
>> 
>> We weren't monitoring this value when the issue occurred, and this 
>> particular issue hasn't appeared for a couple of days (knock on wood).  
>> Will keep an eye out, though.
>> 
>> -Mike
>> 
>> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>> 
>>> top command? st : time stolen from this vm by the hypervisor
>>> 
>>> jason
>>> 
>>> 
>>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mthero...@yahoo.com> 
>>> wrote:
>>> Sorry, not sure what CPU steal is :)
>>> 
>>> I have the AWS console with detailed monitoring enabled... things seem to 
>>> track close to the minute, so I can see the CPU load go to 0... then jump 
>>> at about the minute Cassandra reports the dropped messages.
>>> 
>>> -Mike
>>> 
>>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>> 
>>>>> The messages appear right after the node "wakes up".
>>>> Are you tracking CPU steal ? 
>>>> 
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Consultant
>>>> New Zealand
>>>> 
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>> 
>>>> On 25/04/2013, at 4:15 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>>> 
>>>>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux <mthero...@yahoo.com> 
>>>>> wrote:
>>>>>> Another related question.  Once we see messages being dropped on one 
>>>>>> node, our cassandra client appears to see this, reporting errors.  We 
>>>>>> use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients 
>>>>>> would see an error?  If only one node reports an error, shouldn't the 
>>>>>> consistency level prevent the client from seeing an issue?
>>>>> 
>>>>> If the client is talking to a broken/degraded coordinator node, RF/CL
>>>>> are unable to protect it from RPCTimeout. If it is unable to
>>>>> coordinate the request in a timely fashion, your clients will get
>>>>> errors.
>>>>> 
>>>>> =Rob
>>>> 
>>> 
>>> 
>> 
> 
