I was able to collect dump data during a slow request. This time I saw that it coincided with high load average and high iowait, so I'm continuing to watch. Today it was on two particular OSDs, while yesterday it was on other OSDs. In the dumps of these two OSDs I can see that operations get stuck in queued_for_pg: there is a long gap between the queued_for_pg and reached_pg events.
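In case it is useful: dump_historic_ops keeps only the last 20 ops for 600 seconds by default; the osd_op_history_size and osd_op_history_duration options control this and can be changed through the admin socket. This is roughly what I run now to catch the slow ops (a sketch: osd.3 and the 30-second threshold are just placeholders for my setup, and the jq filter assumes the luminous JSON layout shown below):

    # keep more ops, for longer, so they are still there when I look
    ceph daemon osd.3 config set osd_op_history_size 200
    ceph daemon osd.3 config set osd_op_history_duration 3600

    # show only slow ops, together with their queued_for_pg / reached_pg times
    ceph daemon osd.3 dump_historic_ops | \
      jq '.ops[]
          | select(.duration > 30)
          | {description, duration,
             events: [.type_data.events[]
                      | select(.event == "queued_for_pg" or .event == "reached_pg")]}'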
"description": "osd_op(client.13057605.0:51528 17.15 17:a93a5511:::notify.2:head [watch ping cookie 94259433737472] snapc 0=[] ondisk+write+known_if_redirected e10936)", "initiated_at": "2017-10-20 12:34:29.134946", "age": 484.314936, "duration": 55.421058, "type_data": { "flag_point": "started", "client_info": { "client": "client.13057605", "client_addr": "10.192.1.78:0/3748652520", "tid": 51528 }, "events": [ { "time": "2017-10-20 12:34:29.134946", "event": "initiated" }, { "time": "2017-10-20 12:34:29.135075", "event": "queued_for_pg" }, { "time": "2017-10-20 12:35:24.555957", "event": "reached_pg" }, { "time": "2017-10-20 12:35:24.555978", "event": "started" }, { "time": "2017-10-20 12:35:24.556004", "event": "done" } ] } }, I've read thread http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021588.html . Very similar problem, can it be connected to Proxmox? I have quite old version of proxmox-ve: 4.4-80, and ceph jewel clients on pve nodes. С уважением, Ухина Ольга Моб. тел.: 8(905)-566-46-62 2017-10-20 11:05 GMT+03:00 Ольга Ухина <olga.uh...@gmail.com>: > Hi! Thanks for your help. > How can I increase interval of history for command ceph daemon osd.<id> > dump_historic_ops? It shows only for several minutes. > I see slow requests on random osds each time and on different hosts (there > are three). As I see in logs the problem doesn't relate to scrubbing. > > Regards, > Olga Ukhina > > > 2017-10-20 4:42 GMT+03:00 Brad Hubbard <bhubb...@redhat.com>: > >> I guess you have both read and followed >> http://docs.ceph.com/docs/master/rados/troubleshooting/troub >> leshooting-osd/?highlight=backfill#debugging-slow-requests >> >> What was the result? >> >> On Fri, Oct 20, 2017 at 2:50 AM, J David <j.david.li...@gmail.com> wrote: >> > On Wed, Oct 18, 2017 at 8:12 AM, Ольга Ухина <olga.uh...@gmail.com> >> wrote: >> >> I have a problem with ceph luminous 12.2.1. >> >> […] >> >> I have slow requests on different OSDs on random time (for example at >> night, >> >> but I don’t see any problems at the time of problem >> >> […] >> >> 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 : >> cluster >> >> [WRN] Health check update: 49 slow requests are blocked > 32 sec >> >> (REQUEST_SLOW) >> > >> > This looks almost exactly like what we have been experiencing, and >> > your use-case (Proxmox client using rbd) is the same as ours as well. >> > >> > Unfortunately we were not able to find the source of the issue so far, >> > and haven’t gotten much feedback from the list. Extensive testing of >> > every component has ruled out any hardware issue we can think of. >> > >> > Originally we thought our issue was related to deep-scrub, but that >> > now appears not to be the case, as it happens even when nothing is >> > being deep-scrubbed. Nonetheless, although they aren’t the cause, >> > they definitely make the problem much worse. So you may want to check >> > to see if deep-scrub operations are happening at the times where you >> > see issues and (if so) whether the OSDs participating in the >> > deep-scrub are the same ones reporting slow requests. >> > >> > Hopefully you have better luck finding/fixing this than we have! It’s >> > definitely been a very frustrating issue for us. >> > >> > Thanks! >> > _______________________________________________ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> >> -- >> Cheers, >> Brad >> > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com