On Wed, Aug 22, 2018 at 6:46 AM Eugen Block <ebl...@nde.ag> wrote:
> Hello *,
>
> we have an issue with a Luminous cluster (all 12.2.5, except one on
> 12.2.7) for RBD (OpenStack) and CephFS. This is the osd tree:
>
> host1:~ # ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
> -1        22.57602 root default
> -4         1.81998     host host5
> 14    hdd  0.90999         osd.14      up  0.84999 0.50000
> 15    hdd  0.90999         osd.15      up  0.84999 0.50000
> -2         6.27341     host host1
>  1    hdd  0.92429         osd.1       up  1.00000 1.00000
>  4    hdd  0.92429         osd.4       up  1.00000 1.00000
>  6    hdd  0.92429         osd.6       up  1.00000 1.00000
> 13    hdd  0.92429         osd.13      up  1.00000 1.00000
> 16    hdd  0.92429         osd.16      up  1.00000 1.00000
> 18    hdd  0.92429         osd.18      up  1.00000 1.00000
> 10    ssd  0.72769         osd.10      up  1.00000 1.00000
> -3         6.27341     host host2
>  2    hdd  0.92429         osd.2       up  1.00000 1.00000
>  5    hdd  0.92429         osd.5       up  1.00000 1.00000
>  7    hdd  0.92429         osd.7       up  1.00000 1.00000
> 12    hdd  0.92429         osd.12      up  1.00000 1.00000
> 17    hdd  0.92429         osd.17      up  1.00000 1.00000
> 19    hdd  0.92429         osd.19      up  1.00000 1.00000
>  9    ssd  0.72769         osd.9       up  1.00000 1.00000
> -5         4.57043     host host3
>  0    hdd  0.92429         osd.0       up  1.00000 1.00000
>  3    hdd  0.92429         osd.3       up  1.00000 1.00000
>  8    hdd  0.92429         osd.8       up  1.00000 1.00000
> 11    hdd  0.92429         osd.11      up  1.00000 1.00000
> 20    ssd  0.87329         osd.20      up  1.00000       0
> -16        3.63879     host host4
> 21    hdd  0.90970         osd.21      up  1.00000       0
> 22    hdd  0.90970         osd.22      up  1.00000       0
> 23    hdd  0.90970         osd.23      up  1.00000       0
> 24    hdd  0.90970         osd.24      up  1.00000       0
>
> A couple of weeks ago a new host was added to the cluster (host4),
> containing four bluestore OSDs (HDD) with block.db on LVM (SSD). All
> went well and the cluster was in HEALTH_OK state for some time.
>
> Then suddenly we experienced flapping OSDs, first on host3 (MON, MGR,
> OSD) for a single OSD (OSD.20 on SSD). Later host4 (OSD only) started
> flapping, too; this time all four OSDs (OSD.21 - OSD.24) were
> affected, and only after two reboots did the node come back up.
>
> We found segfaults from safe_timer and were pretty sure that the
> cluster was hit by [1]; it all sounded very much like our experience.
> That's why we upgraded the new host to 12.2.7 first and held off on
> the other nodes in case other issues came up. Two days later the same
> host was flapping again, but without a segfault or any other trace of
> the cause. We started to suspect that the segfault was a result of the
> flapping, not its cause.
>
> Since the flapping seems impossible to predict, we don't have debug
> logs for those OSDs, and the usual logs don't reveal anything out of
> the ordinary. The cluster has been healthy again for 5 days now.
>
> Then I found some clients (CephFS mounted for home directories and
> shared storage for compute nodes) reporting this multiple times:
>
> ---cut here---
> [Mi Aug 22 10:31:33 2018] libceph: osd21 down
> [Mi Aug 22 10:31:33 2018] libceph: osd22 down
> [Mi Aug 22 10:31:33 2018] libceph: osd23 down
> [Mi Aug 22 10:31:33 2018] libceph: osd24 down
> [Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x10000 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd21 up
> [Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x10000 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd22 up
> [Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x10000 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd24 up
> [Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x10000 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd23 up
> ---cut here---
>
> This output repeats about 20 times per OSD (except for osd20, with
> only one occurrence). But there's no health warning, no trace of this
> in the logs, no flapping (yet?), as if nothing had happened. Since
> these are the OSDs that were affected by the flapping, there has to
> be a connection, but I can't seem to find it.
>
> Why isn't there anything in the logs related to these dmesg events?
> Why would a client report OSDs down if they haven't been? We checked
> the disks for errors and searched for network issues, but found no
> hint of anything going wrong.
>
So, this is actually just noisy logging from the client processing an
OSDMap. That should probably be turned down, as it's not really an
indicator of...anything...as far as I can tell.
-Greg

> Can anyone shed some light on this? Can these client messages somehow
> affect the OSD/MON communication in such a way that the MON starts
> reporting OSDs down, too? The OSDs then report themselves up and then
> the flapping begins?
> How can I find the cause for these reports?
>
> If there's any more information I can provide, please let me know.
>
> Any insights are highly appreciated!
>
> Regards,
> Eugen
>
> [1] http://tracker.ceph.com/issues/23352
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
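For anyone who wants to quantify what the clients actually saw, a quick
way is to tally those libceph lines per OSD. The following is only a
minimal sketch, assuming the kernel log lines look exactly like the
excerpt quoted above and are available in a plain file (the path is just
an example, e.g. /var/log/kern.log or a saved `dmesg` dump):

---cut here---
#!/usr/bin/env python3
# Tally libceph "osdN up"/"osdN down" transitions per OSD from a kernel log.
# Sketch only: the log path and line format are assumptions based on the
# dmesg excerpt quoted above.
import re
import sys
from collections import Counter

# Matches lines such as: "[Mi Aug 22 10:31:33 2018] libceph: osd21 down"
EVENT_RE = re.compile(r"libceph: (osd\d+) (up|down)\b")

def tally(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = EVENT_RE.search(line)
            if match:
                counts[match.groups()] += 1   # key: (osd name, state)
    return counts

if __name__ == "__main__":
    # Usage: ./flap_count.py /var/log/kern.log
    for (osd, state), count in sorted(tally(sys.argv[1]).items()):
        print(f"{osd}: {state} x{count}")
---cut here---

Comparing those counts and timestamps with what the cluster itself logged
should help confirm the reading above: the messages track OSDMap
processing on the client rather than real failures.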
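To address the "why is there nothing in the logs" question more directly,
the client-side events can be cross-checked against the monitors' cluster
log, which records when an OSD is actually marked down or boots. Again a
sketch only: the default log location /var/log/ceph/ceph.log and the
"marked down"/"boot"/"failed" wording are assumptions and may differ per
release and deployment, so adjust path and patterns as needed:

---cut here---
#!/usr/bin/env python3
# Print cluster-log entries that mark the suspect OSDs down (or record a
# boot/failure). Sketch only: log path and message wording are assumptions.
import re
import sys

CLUSTER_LOG = "/var/log/ceph/ceph.log"   # default location on a MON (assumption)
OSDS = {"osd.20", "osd.21", "osd.22", "osd.23", "osd.24"}
PATTERN = re.compile(r"(osd\.\d+).*?(marked (?:itself )?down|boot|failed)")

def scan(path):
    with open(path, errors="replace") as log:
        for line in log:
            match = PATTERN.search(line)
            if match and match.group(1) in OSDS:
                yield line.rstrip()

if __name__ == "__main__":
    for entry in scan(sys.argv[1] if len(sys.argv) > 1 else CLUSTER_LOG):
        print(entry)
---cut here---

If the cluster log shows no down/boot events at the times the clients
printed those messages, that supports Greg's point: the dmesg output
reflects map processing on the client, not fresh flapping.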