To be clear, that's the solution to one of the causes of this issue. The log message is very general, and just means that a disk access thread has gone a long time (15 seconds, in this case) without checking in (so usually it's been stuck inside a read/write syscall for >= 15 seconds). Other causes include simple overload of the OSDs in question, or a broken local filesystem, or...
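
If it helps, here's a rough sketch of how to narrow that down; the OSD id and log paths below are just examples for a default installation, so adjust them for your setup:

  # Which OSDs on this host are hitting the heartbeat timeout, and how often?
  grep -c "had timed out after" /var/log/ceph/ceph-osd.*.log

  # The 15-second threshold should correspond to the op thread timeout;
  # check the current values via the admin socket (osd.3 is just an example id):
  ceph daemon osd.3 config show | grep op_thread_timeout

  # While the timeouts are firing, check whether the underlying disk is
  # saturated or throwing errors:
  iostat -x 1 5
  dmesg | tail -n 50

If only a couple of OSDs log the timeouts and their disks look saturated or are throwing errors, that points at overload or a failing device rather than something cluster-wide.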
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Jun 12, 2014 at 1:59 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
> Can you check and see if swap is being used on your OSD servers when this
> happens, and even better, use something like collectl or another tool to
> look for major page faults?
>
> If you see anything like this, you may want to tweak swappiness to be
> lower (say 10).
>
> Mark
>
> On 06/12/2014 03:17 PM, Xu (Simon) Chen wrote:
>> I've done some more tracing. It looks like the high IO wait in the VMs is
>> somewhat correlated with some OSDs having a high number of in-flight ops
>> (ceph admin socket, dump_ops_in_flight).
>>
>> When in_flight_ops is high, I see something like this in the OSD log:
>> 2014-06-12 19:57:24.572338 7f4db6bdf700  1 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
>>
>> Any ideas why this happens?
>>
>> Thanks.
>> -Simon
>>
>> On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>> On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
>>>> 1) I did check iostat on all OSDs, and iowait seems normal.
>>>> 2) ceph -w shows no correlation between high IO wait and high IOPS.
>>>> Sometimes the reverse is true: when IO wait is high (since it's a
>>>> cluster-wide thing), the overall Ceph IOPS drops too.
>>>
>>> Not sure if you are doing it yet, but you may want to look at the
>>> statistics the OSDs can provide via the admin socket, especially
>>> outstanding operations and dump_historic_ops. If you look at these for
>>> all of your OSDs, you can start getting a feel for whether any specific
>>> OSDs are slow and, if so, what the slow ops are hanging up on.
>>>
>>>> 3) We have collectd running in the VMs, and that's how we identified
>>>> the frequent high IO wait. This happens even for lightly used VMs.
>>>>
>>>> Thanks.
>>>> -Simon
>>>>
>>>> On Thu, Jun 12, 2014 at 9:26 AM, David <da...@visions.se> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> Did you check iostat on the OSDs to check their utilization? What does
>>>>> your ceph -w say - perhaps you're maxing out your cluster's IOPS?
>>>>> Also, are you running any monitoring of your VMs' iostats? We've often
>>>>> found some culprits overusing IO.
>>>>>
>>>>> Kind Regards,
>>>>> David Majchrzak
>>>>>
>>>>> On 12 Jun 2014, at 15:22, Xu (Simon) Chen <xche...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> We have two similar Ceph deployments, but one of them is having
>>>>>> trouble: VMs running with Ceph-provided block devices are seeing
>>>>>> frequent high IO wait, every few minutes, usually 15-20%, but as high
>>>>>> as 60-70%. This is cluster-wide and not correlated with the VMs' IO
>>>>>> load. We turned on the rbd cache and enabled writeback in qemu, but
>>>>>> the problem persists. Setting nodeep-scrub doesn't help either.
>>>>>>
>>>>>> Without providing any of our probably wrong theories, any ideas on
>>>>>> how to troubleshoot?
>>>>>>
>>>>>> Thanks.
>>>>>> -Simon
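
For reference, checking the things Mark suggests above could look roughly like this; the OSD ids and intervals are only examples, and the exact commands depend on what's installed on your OSD hosts:

  # Is the host dipping into swap (si/so columns), and are there major
  # page faults (majflt/s)? vmstat is everywhere; sar needs sysstat.
  vmstat 1 5
  sar -B 1 5

  # Lower swappiness if swap is actually being touched:
  sysctl vm.swappiness=10

  # Watch in-flight ops per OSD via the admin socket while the VMs see
  # high IO wait (loop over whichever OSDs run on this host):
  for id in 0 1 2; do
      echo "osd.$id:"
      ceph daemon osd.$id dump_ops_in_flight | grep num_ops
  done

  # The slowest recent ops, and which stage they spent their time in:
  ceph daemon osd.0 dump_historic_ops

If the in-flight count spikes on the same few OSDs every time the VMs stall, dump_historic_ops on those OSDs should show where the slow requests are spending their time.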
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com