Hi, I took a snapshot of MDS.0's logs. We have five active MDS in total, each one reporting laggy OSDs/clients, but I cannot find anything related to that in the log snippet. Anyhow, I uploaded the log for your reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.
This is what ceph status looks like after a couple of days. This is not normal:
HEALTH_WARN 55 client(s) laggy due to laggy OSDs
            8 clients failing to respond to capability release
            1 clients failing to advance oldest client/flush tid
            5 MDSs report slow requests

(The 55 clients are actually "just" 11 unique client IDs, but each MDS makes its own report.)
mon_osd_laggy_halflife is not configured on our cluster, so it's at the default of 3600.
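For reference, the reset threshold that follows from the default halflife can be computed directly (a quick sketch; the factor of 48 is the grace-interval multiplier described further down in the thread):

```shell
# The OSD's laggy parameters only reset once the relevant interval
# exceeds mon_osd_laggy_halflife * 48. With the default halflife:
halflife=3600   # check yours with: ceph config get osd mon_osd_laggy_halflife
echo $((halflife * 48))   # 172800 seconds, i.e. 48 hours
```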
Janek

On 20/09/2023 13:17, Dhairya Parmar wrote:
Hi Janek,

The PR Venky mentioned makes use of the OSD's laggy parameters (laggy_interval and laggy_probability) to determine whether an OSD is laggy. These laggy parameters are reset to 0 if the interval between the last modification to the OSDMap and the timestamp at which the OSD was marked down exceeds the grace interval threshold, which is `mon_osd_laggy_halflife * 48`. mon_osd_laggy_halflife is configurable and defaults to 3600, so the laggy parameters are only reset to 0 once that interval exceeds 172800. I'd recommend taking a look at what your configured value is (using `ceph config get osd mon_osd_laggy_halflife`).

There is also a "hack" to reset the parameters manually (*not recommended, just for info*): set mon_osd_laggy_weight to 1 using `ceph config set osd mon_osd_laggy_weight 1` and reboot the OSD(s) that are being reported as laggy, and you will see the lagginess go away.

*Dhairya Parmar*
Associate Software Engineer, CephFS
Red Hat Inc. <https://www.redhat.com/>
dpar...@redhat.com

On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar <vshan...@redhat.com> wrote:

Hey Janek,

I took a closer look at the various places where the MDS would consider a client laggy, and it seems like a wide variety of reasons are taken into consideration, not all of which might be a reason to defer client eviction, so the warning is a bit misleading. I'll post a PR for this. In the meantime, could you share the debug logs stated in my previous email?

On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar <vshan...@redhat.com> wrote:

> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
>
>> Hi Venky,
>>
>> As I said: there are no laggy OSDs. The maximum ping I have for any OSD in ceph osd perf is around 60 ms (just a handful, probably aging disks). The vast majority of OSDs have ping times of less than 1 ms. Same for the host machines, yet I'm still seeing this message.
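The reset condition described above can be sketched in a few lines (illustrative Python only, not actual Ceph code; all names here are hypothetical):

```python
# Sketch of the laggy-parameter reset condition: laggy_interval and
# laggy_probability drop back to 0 once the gap between the last OSDMap
# modification and the OSD's down timestamp exceeds halflife * 48.
MON_OSD_LAGGY_HALFLIFE = 3600  # default value, in seconds

def should_reset_laggy_params(osdmap_mtime: float, down_stamp: float) -> bool:
    grace = MON_OSD_LAGGY_HALFLIFE * 48  # 172800 s with the default
    return (osdmap_mtime - down_stamp) > grace

print(should_reset_laggy_params(200_000, 0))  # True: gap exceeds 172800 s
print(should_reset_laggy_params(100_000, 0))  # False: still within grace
```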
>> It seems that the affected hosts are usually the same, but I have absolutely no clue why.
>
> It's possible that you are running into a bug which does not clear the laggy clients list that the MDS sends to the monitors via beacons. Could you help us out with debug MDS logs (by setting debug_mds=20) from the active MDS for around 15-20 seconds and share the logs, please? Also reset the log level once done, since it can hurt performance.
>
> # ceph config set mds.<> debug_mds 20
>
> and reset via
>
> # ceph config rm mds.<> debug_mds
>
>> Janek
>>
>> On 19/09/2023 12:36, Venky Shankar wrote:
>>
>> Hi Janek,
>>
>> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
>>
>>> Thanks! However, I still don't really understand why I am seeing this.
>>
>> This is due to a change that was merged recently in Pacific:
>>
>> https://github.com/ceph/ceph/pull/52270
>>
>> The MDS no longer evicts laggy clients if the OSDs report as laggy. Laggy OSDs can cause CephFS clients to not flush dirty data (during cap revokes by the MDS), thereby showing up as laggy and getting evicted by the MDS. This behaviour was changed, and therefore you get warnings that some clients are laggy, but they are not evicted since the OSDs are laggy.
>>
>>> The first time I had this, one of the clients was a remote user dialling in via VPN, which could indeed be laggy. But I am also seeing it from neighbouring hosts that are on the same physical network with reliable ping times way below 1 ms. How is that considered laggy?
>>
>> Are some of your OSDs reporting laggy?
>> This can be checked via `perf dump`:
>>
>> # ceph tell mds.<> perf dump
>>
>> (search for op_laggy/osd_laggy)
>>
>>> On 18/09/2023 18:07, Laura Flores wrote:
>>>
>>> Hi Janek,
>>>
>>> There was some documentation added about it here:
>>> https://docs.ceph.com/en/pacific/cephfs/health-messages/
>>>
>>> There is a description of what it means, and it's tied to an MDS configurable.
>>>
>>> On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
>>>
>>>> Hey all,
>>>>
>>>> Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
>>>>
>>>> 10 client(s) laggy due to laggy OSDs
>>>>
>>>> ceph health detail shows it as:
>>>>
>>>> [WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
>>>>     mds.***(mds.3): Client *** is laggy; not evicted because some OSD(s) is/are laggy
>>>>     more of this...
>>>>
>>>> When I restart the client(s) or the affected MDS daemons, the message goes away and then comes back after a while. ceph osd perf does not list any laggy OSDs (a few with 10-60 ms ping, but overwhelmingly < 1 ms), so I'm at a total loss as to what this even means.
>>>>
>>>> I have never seen this message before, nor was I able to find anything about it. Do you have any idea what this message actually means and how I can get rid of it?
>>>>
>>>> Thanks
>>>> Janek
>>>
>>> --
>>> Laura Flores
>>> She/Her/Hers
>>> Software Engineer, Ceph Storage <https://ceph.io>
>>> Chicago, IL
>>> lflo...@ibm.com | lflo...@redhat.com
>>> M: +17087388804
>>
>> --
>> Cheers,
>> Venky
>
> --
> Cheers,
> Venky

--
Cheers,
Venky
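The op_laggy/osd_laggy counters mentioned above can be pulled out of a `perf dump` with a short script (illustrative Python over a hypothetical, abridged sample; the real dump has many more sections and the exact layout may differ):

```python
import json

# Hypothetical, abridged `ceph tell mds.<id> perf dump` output.
sample = '{"mds_sessions": {}, "objecter": {"op_laggy": 0, "osd_laggy": 3}}'

def find_laggy_counters(dump: dict) -> dict:
    """Recursively collect any op_laggy/osd_laggy counters in the dump."""
    found = {}
    for key, value in dump.items():
        if isinstance(value, dict):
            found.update(find_laggy_counters(value))
        elif key in ("op_laggy", "osd_laggy"):
            found[key] = value
    return found

print(find_laggy_counters(json.loads(sample)))
# {'op_laggy': 0, 'osd_laggy': 3}
```

Non-zero values would indicate that the MDS considers some operations or OSDs laggy.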
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io