On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar <mmokh...@petasan.org> wrote:
> On 2018-03-02 07:54, Alex Gorbachev wrote:
>> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonst...@gmail.com> wrote:
>>> Blocked requests and slow requests are synonyms in ceph. They are 2 names for the exact same thing.
>>>
>>> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <a...@iss-integration.com> wrote:
>>>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonst...@gmail.com> wrote:
>>>>> `ceph health detail` should show you more information about the slow requests. If the output is too much stuff, you can grep out for blocked or something. It should tell you which OSDs are involved, how long they have been slow, etc. The default is for them to show '> 32 sec', but that may very well be much longer, and `ceph health detail` will show that.
>>>>
>>>> Hi David,
>>>>
>>>> Thank you for the reply. Unfortunately, the health detail only shows blocked requests. This seems to be related to a compression setting on the pool; there is nothing in the OSD logs.
>>>>
>>>> I replied to another compression thread. This makes sense, since compression is new, and in the past all such issues were reflected in the OSD logs and related to either the network or OSD hardware.
>>>>
>>>> Regards,
>>>> Alex
>>>>
>>>>> On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <a...@iss-integration.com> wrote:
>>>>>> Is there a switch to turn on the display of specific OSD issues? Or does the below indicate a generic problem, e.g. the network, and not any specific OSD?
>>>>>>
>>>>>> 2018-02-28 18:09:36.438300 7f6dead56700 0 mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56% total 15997 MB, used 6154 MB, avail 9008 MB
>>>>>> 2018-02-28 18:09:41.477216 7f6dead56700 0 log_channel(cluster) log [WRN] : Health check failed: 73 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>>>>> 2018-02-28 18:09:47.552669 7f6dead56700 0 log_channel(cluster) log [WRN] : Health check update: 74 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>>>>> 2018-02-28 18:09:53.794882 7f6de8551700 0 mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
>>
>> I was wrong that the pool compression does not matter; an uncompressed pool also generates these slow messages.
>>
>> The question is why there is no subsequent message identifying the specific OSDs, as in Jewel and prior, like this example from RH:
>>
>> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>>
>> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
>>
>> In comparison, my Luminous cluster only shows the general slow/blocked message:
>>
>> 2018-03-01 21:52:54.237270 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check failed: 116 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:00.282721 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check update: 66 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:08.534244 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check update: 5 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:10.382510 7f7e419e3700 0 log_channel(cluster) log [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are blocked > 32 sec)
>> 2018-03-01 21:53:10.382546 7f7e419e3700 0 log_channel(cluster) log [INF] : Cluster is now healthy
>>
>> So where are the details?
>>
>> Thanks,
>> Alex
>
> Hi Alex,
>
> I agree, the health detail should display OSD info.
>
> Regarding the slow requests: ESX is very sensitive to I/O delay; this could be due to cluster overload or a disk going bad. If you have a monitoring tool, do you see high load spikes (such as disk % busy) on your disks? Generally it is good (if you can) to simulate the maximum client workload plus concurrent recovery and scrubbing, and to monitor your resources to see if they get overloaded. Most if not all issues with ESX will go away if you have enough hardware. I suspect this is what you are seeing, especially since the vMotion and compression could be the extra load that triggered the issue; in that case the problem is not specific to a particular OSD. The other possibility is that a particular disk is going bad, which the monitoring tool should be able to identify.
>
> Note that vMotion generates an I/O pattern that keeps hitting the same OSD with concurrent small writes; if you can use rbd striping, creating 64k stripes may help. A controller with write-back cache will also help.
>
> Maged

Thank you, Maged. This does seem to be related to compression. I did two things: I turned off compression on the pool and tried vMotion to the same image, which failed. Then I made another image, and that time it worked. So maybe some compressed parts of the old image introduced the slowness. To be fair, I also set all Arecas to 75% buffer, so I will be testing more today to see if that is the case. All OSDs look healthy.

Will update, but yes, it is odd that there is no OSD detail displayed with slow requests. May need a tracker for that.
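In the meantime, in case it helps anyone else chasing the same thing, these are the kinds of commands that should surface more detail about slow requests on Luminous. The OSD id below is only a placeholder for whichever OSD you suspect:

  # the health check summary; grep narrows the output to the slow request lines
  ceph health detail | grep -i -A 10 REQUEST_SLOW

  # per-OSD commit/apply latency, to help spot a struggling disk
  ceph osd perf

  # on the node hosting a suspect OSD (osd.12 is only an example id),
  # the admin socket lists the individual ops and what they are waiting on
  ceph daemon osd.12 dump_ops_in_flight
  ceph daemon osd.12 dump_historic_ops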
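On the compression side, assuming this is BlueStore pool-level compression, the per-pool settings can be checked and switched off roughly like this (the pool name is a placeholder):

  # inspect the current per-pool compression settings (these may report
  # "not set" if the pool just inherits the global defaults)
  ceph osd pool get rbd-esx compression_mode
  ceph osd pool get rbd-esx compression_algorithm

  # stop compressing new writes on that pool
  ceph osd pool set rbd-esx compression_mode none

Note that changing the mode only affects new writes; data already written stays compressed until it is rewritten, which would be consistent with the old image still misbehaving after the change while a freshly created image works.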
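And regarding the striping suggestion above: an image with a smaller stripe unit can be created along these lines. The pool and image names are made up, and 64 KB across 16 objects is just one possible layout; it is also worth confirming that whatever client ends up mapping the image supports the striping feature, since krbd historically has not handled fancy striping:

  # 64 KB stripe unit spread across 16 objects, instead of the default
  # layout that fills one 4 MB object at a time
  rbd create rbd-esx/esx-vmotion-test --size 100G \
      --stripe-unit 65536 --stripe-count 16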
Best regards,
Alex

--
Alex Gorbachev
Storcium

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com