To be clear, that's the solution to one of the causes of this issue. The log message is very general, and just means that a disk access thread has gone a long time (15 seconds, in this case) without checking in (so usually it's been stuck inside a read/write syscall for >= 15 seconds). Other causes include simple overload of the OSDs in question, or a broken local filesystem, or...
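
If it helps, here's a rough sketch of how to narrow that down; the OSD id and log paths below are just examples for a default installation, so adjust them for your setup:

  # Which OSDs on this host are hitting the heartbeat timeout, and how often?
  grep -c "had timed out after" /var/log/ceph/ceph-osd.*.log

  # The 15-second threshold should correspond to the op thread timeout;
  # check the current values via the admin socket (osd.3 is just an example id):
  ceph daemon osd.3 config show | grep op_thread_timeout

  # While the timeouts are firing, check whether the underlying disk is
  # saturated or throwing errors:
  iostat -x 1 5
  dmesg | tail -n 50

If only a couple of OSDs log the timeouts and their disks look saturated or are throwing errors, that points at overload or a failing device rather than something cluster-wide.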
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 12, 2014 at 1:59 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
> Can you check and see if swap is being used on your OSD servers when this
> happens, and even better, use something like collectl or another tool to
> look for major page faults?
>
> If you see anything like this, you may want to tweak swappiness to be
> lower (say 10).
>
> Mark
>
> On 06/12/2014 03:17 PM, Xu (Simon) Chen wrote:
>> I've done some more tracing. It looks like the high IO wait in the VMs is
>> somewhat correlated with some OSDs having a high number of in-flight ops
>> (ceph admin socket, dump_ops_in_flight).
>>
>> When in_flight_ops is high, I see something like this in the OSD log:
>> 2014-06-12 19:57:24.572338 7f4db6bdf700  1 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
>>
>> Any ideas why this happens?
>>
>> Thanks.
>> -Simon
>>
>> On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>> On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
>>>> 1) I did check iostat on all OSDs, and iowait seems normal.
>>>> 2) ceph -w shows no correlation between high IO wait and high IOPS.
>>>> Sometimes the reverse is true: when IO wait is high (since it's a
>>>> cluster-wide thing), the overall Ceph IOPS drops too.
>>>
>>> Not sure if you are doing it yet, but you may want to look at the
>>> statistics the OSDs can provide via the admin socket, especially
>>> outstanding operations and dump_historic_ops. If you look at these for
>>> all of your OSDs, you can start getting a feel for whether any specific
>>> OSDs are slow and, if so, what the slow ops are hanging up on.
>>>
>>>> 3) We have collectd running in the VMs, and that's how we identified
>>>> the frequent high IO wait. This happens even for lightly used VMs.
>>>>
>>>> Thanks.
>>>> -Simon
>>>>
>>>> On Thu, Jun 12, 2014 at 9:26 AM, David <da...@visions.se> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> Did you check iostat on the OSDs to check their utilization? What does
>>>>> your ceph -w say - perhaps you're maxing out your cluster's IOPS?
>>>>> Also, are you running any monitoring of your VMs' iostats? We've often
>>>>> found some culprits overusing IO.
>>>>>
>>>>> Kind Regards,
>>>>> David Majchrzak
>>>>>
>>>>> On 12 Jun 2014, at 15:22, Xu (Simon) Chen <xche...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> We have two similar Ceph deployments, but one of them is having
>>>>>> trouble: VMs running with Ceph-provided block devices are seeing
>>>>>> frequent high IO wait, every few minutes, usually 15-20%, but as high
>>>>>> as 60-70%. This is cluster-wide and not correlated with the VMs' IO
>>>>>> load. We turned on the rbd cache and enabled writeback in qemu, but
>>>>>> the problem persists. Setting nodeep-scrub doesn't help either.
>>>>>>
>>>>>> Without providing any of our probably wrong theories, any ideas on
>>>>>> how to troubleshoot?
>>>>>>
>>>>>> Thanks.
>>>>>> -Simon
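
For reference, checking the things Mark suggests above could look roughly like this; the OSD ids and intervals are only examples, and the exact commands depend on what's installed on your OSD hosts:

  # Is the host dipping into swap (si/so columns), and are there major
  # page faults (majflt/s)? vmstat is everywhere; sar needs sysstat.
  vmstat 1 5
  sar -B 1 5

  # Lower swappiness if swap is actually being touched:
  sysctl vm.swappiness=10

  # Watch in-flight ops per OSD via the admin socket while the VMs see
  # high IO wait (loop over whichever OSDs run on this host):
  for id in 0 1 2; do
      echo "osd.$id:"
      ceph daemon osd.$id dump_ops_in_flight | grep num_ops
  done

  # The slowest recent ops, and which stage they spent their time in:
  ceph daemon osd.0 dump_historic_ops

If the in-flight count spikes on the same few OSDs every time the VMs stall, dump_historic_ops on those OSDs should show where the slow requests are spending their time.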
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com