Frédéric, see if the number of objects in the pool is decreasing with `ceph df [detail]`.
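For example, something along these lines (the 60-second interval is just an example):

    # per-pool and global usage, including object counts
    ceph df detail

    # re-run it periodically and compare the object counts over time
    watch -n 60 ceph df detail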
Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:

> It's been over a week now and the whole cluster keeps flapping; it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted has been gone for a while now.)
>
> In fact, there is a lot of I/O activity on the servers where the OSDs
> go down.
>
> Regards,
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On behalf of* Webert de Souza Lima
> *Sent:* 31 July 2018 16:25
> *To:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
> The pool deletion might have triggered a lot of I/O operations on the
> disks, and the OSD processes might be too busy to respond to heartbeats,
> so the mons mark them as down due to no response.
>
> Check also the OSD logs to see if the daemons are actually crashing and
> restarting, and check disk I/O usage (e.g. with iostat).
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
>
> Hi Everyone,
>
> I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 nodes, each with 12 OSDs (1 HDD -> 1 OSD), and
> we have SSDs for the journals.
>
> After I deleted the large pool, the cluster started flapping on all OSDs.
> OSDs are marked down and then marked up again, as follows:
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs degraded, 223 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs degraded, 220 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=xxxx,host=xxxx) (8 reporters from different host after 54.650576 >= grace 54.300663)
> 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
> 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
> 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
> 2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot
> 2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
> 2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
> 2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> On the OSDs that failed, the logs are full of this kind of message:
>
> 2018-07-31 03:41:28.789681 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:28.945710 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:28.946263 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:28.994397 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:28.994443 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:29.023356 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:29.023415 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:29.102909 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
> 2018-07-31 03:41:29.102917 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> At first it seemed like a network issue, but we haven't changed a thing
> on the network and this cluster has been fine for months.
>
> I can't figure out what is happening at the moment; any help would be
> greatly appreciated!
>
> Regards,
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
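Regarding the earlier suggestion in the thread to check the OSD logs and disk I/O: a minimal sketch of what that could look like, assuming the default log location and systemd-managed daemons (osd.18 is just an example id taken from the logs above):

    # look for crashes, aborts or suicide timeouts in one OSD's log
    grep -E 'suicide|signal|abort' /var/log/ceph/ceph-osd.18.log | tail -n 50

    # see whether systemd has been restarting that daemon
    systemctl status ceph-osd@18

    # extended per-device I/O statistics, refreshed every 5 seconds
    iostat -x 5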
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com