ceph osd tree down # shows the down osds
ceph osd tree out # shows the out osds
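
I think "ceph osd tree" also accepts several states at once, so
something like this should list the osds that are down but still in
(i.e. the ones holding PGs down) -- untested from here:

ceph osd tree down in # down AND in osds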

There is no "active/inactive" state on an osd.

You can force an individual osd to do a soft restart with "ceph osd
down <osdid>" -- this marks it down in the osdmap, which makes it
re-register with the mons and recontact its osd peers. If that doesn't
help, restart the osd process itself. Do this with a few osds at first,
just to make sure it helps rather than hurts.
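
For example (osd ids taken from your log -- adjust to whatever "ceph
osd tree down" shows):

ceph osd down 40              # soft kick: osd.40 should re-register within seconds
ceph osd down 60
systemctl restart ceph-osd@40 # on the osd's host, only if the soft kick didn't help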

You can also raise "mon_osd_report_timeout" (which defaults to 900s)
-- that's the timeout after which the mons mark your osds down when no
beacon has arrived.
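
For example (untested here; remember to drop it back to the default
once the osds are stable again):

ceph config set mon mon_osd_report_timeout 3600 # 1 hour instead of 15 minutes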

-- dan



On Tue, May 5, 2020 at 5:40 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> looking at an older thread, I found that "OSDs do not send beacons if they 
> are not active". Is there any way to activate an OSD manually? Or check which 
> ones are inactive?
>
> Also, I looked at this here:
>
> [root@gnosis ~]# ceph mon feature ls
> all features
>         supported: [kraken,luminous,mimic,osdmap-prune]
>         persistent: [kraken,luminous,mimic,osdmap-prune]
> on current monmap (epoch 3)
>         persistent: [kraken,luminous,mimic,osdmap-prune]
>         required: [kraken,luminous,mimic,osdmap-prune]
>
> Our fs-clients report jewel as their release. Should I do something about 
> that?
>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 05 May 2020 17:35:33
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Ceph meltdown, need help
>
> OK those requires look correct.
>
> While the pgs are inactive there will be no client IO, so there's
> nothing to pause at this point. In general, I would evict those
> misbehaving clients with ceph tell mds.* client evict id=<id>
>
> For now, keep nodown and noout, let all the PGs get active again. You
> might need to mark some in, if they don't automatically come back in.
> Let the PGs recover, then once the MDSs report no slow ops you can
> consider taking the cephfs offline while the PGs heal fully.
>
> If the "no beacon" messages continue, you need to keep investigating
> why the beacons aren't being sent.
> (also as a workaround you can set a much higher timeout).
>
>
> -- dan
>
>
> On Tue, May 5, 2020 at 5:30 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Thanks! Here it is:
> >
> > [root@gnosis ~]# ceph osd dump | grep require
> > require_min_compat_client jewel
> > require_osd_release mimic
> >
> > It looks like we had an extremely aggressive job running on our cluster, 
> > completely flooding everything with small I/O. I think the cluster built up 
> > a huge backlog and is/was really busy trying to serve the IO. It lost 
> > beacons/heartbeats in the process or they got too old.
> >
> > Is there a way to pause client I/O?
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <d...@vanderster.com>
> > Sent: 05 May 2020 17:25:56
> > To: Frank Schilder
> > Cc: ceph-users
> > Subject: Re: [ceph-users] Ceph meltdown, need help
> >
> > Hi,
> >
> > The osds are getting marked down due to this:
> >
> > 2020-05-05 15:18:42.893964 mon.ceph-01 mon.0 192.168.32.65:6789/0
> > 292689 : cluster [INF] osd.40 marked down after no beacon for
> > 903.781033 seconds
> > 2020-05-05 15:18:42.894009 mon.ceph-01 mon.0 192.168.32.65:6789/0
> > 292690 : cluster [INF] osd.60 marked down after no beacon for
> > 903.780916 seconds
> > 2020-05-05 15:18:42.894075 mon.ceph-01 mon.0 192.168.32.65:6789/0
> > 292691 : cluster [INF] osd.170 marked down after no beacon for
> > 903.780957 seconds
> > 2020-05-05 15:18:42.894108 mon.ceph-01 mon.0 192.168.32.65:6789/0
> > 292692 : cluster [INF] osd.244 marked down after no beacon for
> > 903.780661 seconds
> > 2020-05-05 15:18:42.894159 mon.ceph-01 mon.0 192.168.32.65:6789/0
> > 292693 : cluster [INF] osd.283 marked down after no beacon for
> > 903.780998 seconds
> >
> > You're right to set nodown and noout, while trying to understand why
> > the beacon is not being sent.
> >
> > Can you show the output of `ceph osd dump | grep require` ?
> > (I vaguely recall that after a mimic upgrade you need to flip some
> > switch to enable the beacon sending...)
> >
> > --
> > Dan
> >
> >
> >
> > On Tue, May 5, 2020 at 4:42 PM Frank Schilder <fr...@dtu.dk> wrote:
> > >
> > > Dear Dan,
> > >
> > > thank you for your fast response. Please find the log of the first OSD 
> > > that went down and the ceph.log with these links:
> > >
> > > https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
> > > https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
> > >
> > > I can collect more osd logs if this helps.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Dan van der Ster <d...@vanderster.com>
> > > Sent: 05 May 2020 16:25:31
> > > To: Frank Schilder
> > > Cc: ceph-users
> > > Subject: Re: [ceph-users] Ceph meltdown, need help
> > >
> > > Hi Frank,
> > >
> > > Could you share any ceph-osd logs and also the ceph.log from a mon to
> > > see why the cluster thinks all those osds are down?
> > >
> > > Simply marking them up isn't going to help, I'm afraid.
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Tue, May 5, 2020 at 4:12 PM Frank Schilder <fr...@dtu.dk> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > a lot of OSDs crashed in our cluster. Mimic 13.2.8. Current status 
> > > > included below. All daemons are running, no OSD process crashed. Can I 
> > > > start marking OSDs in and up to get them back talking to each other?
> > > >
> > > > Please advise on next steps. Thanks!!
> > > >
> > > > [root@gnosis ~]# ceph status
> > > >   cluster:
> > > >     id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
> > > >     health: HEALTH_WARN
> > > >             2 MDSs report slow metadata IOs
> > > >             1 MDSs report slow requests
> > > >             nodown,noout,norecover flag(s) set
> > > >             125 osds down
> > > >             3 hosts (48 osds) down
> > > >             Reduced data availability: 2221 pgs inactive, 1943 pgs 
> > > > down, 190 pgs peering, 13 pgs stale
> > > >             Degraded data redundancy: 5134396/500993581 objects 
> > > > degraded (1.025%), 296 pgs degraded, 299 pgs undersized
> > > >             9622 slow ops, oldest one blocked for 2913 sec, daemons 
> > > > [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]...
> > > >  have slow ops.
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> > > >     mgr: ceph-02(active), standbys: ceph-03, ceph-01
> > > >     mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
> > > >     osd: 288 osds: 90 up, 215 in; 230 remapped pgs
> > > >          flags nodown,noout,norecover
> > > >
> > > >   data:
> > > >     pools:   10 pools, 2545 pgs
> > > >     objects: 62.61 M objects, 144 TiB
> > > >     usage:   219 TiB used, 1.6 PiB / 1.8 PiB avail
> > > >     pgs:     1.729% pgs unknown
> > > >              85.540% pgs not active
> > > >              5134396/500993581 objects degraded (1.025%)
> > > >              1796 down
> > > >              226  active+undersized+degraded
> > > >              147  down+remapped
> > > >              140  peering
> > > >              65   active+clean
> > > >              44   unknown
> > > >              38   undersized+degraded+peered
> > > >              38   remapped+peering
> > > >              17   active+undersized+degraded+remapped+backfill_wait
> > > >              12   stale+peering
> > > >              12   active+undersized+degraded+remapped+backfilling
> > > >              4    active+undersized+remapped
> > > >              2    remapped
> > > >              2    undersized+degraded+remapped+peered
> > > >              1    stale
> > > >              1    undersized+degraded+remapped+backfilling+peered
> > > >
> > > >   io:
> > > >     client:   26 KiB/s rd, 206 KiB/s wr, 21 op/s rd, 50 op/s wr
> > > >
> > > > [root@gnosis ~]# ceph health detail
> > > > HEALTH_WARN 2 MDSs report slow metadata IOs; 1 MDSs report slow 
> > > > requests; nodown,noout,norecover flag(s) set; 125 osds down; 3 hosts 
> > > > (48 osds) down; Reduced data availability: 2219 pgs inactive, 1943 pgs 
> > > > down, 188 pgs peering, 13 pgs stale; Degraded data redundancy: 
> > > > 5214696/500993589 objects degraded (1.041%), 298 pgs degraded, 299 pgs 
> > > > undersized; 9788 slow ops, oldest one blocked for 2953 sec, daemons 
> > > > [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]...
> > > >  have slow ops.
> > > > MDS_SLOW_METADATA_IO 2 MDSs report slow metadata IOs
> > > >     mdsceph-08(mds.0): 100+ slow metadata IOs are blocked > 30 secs, 
> > > > oldest blocked for 2940 secs
> > > >     mdsceph-12(mds.0): 1 slow metadata IOs are blocked > 30 secs, 
> > > > oldest blocked for 2942 secs
> > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > >     mdsceph-08(mds.0): 100 slow requests are blocked > 30 secs
> > > > OSDMAP_FLAGS nodown,noout,norecover flag(s) set
> > > > OSD_DOWN 125 osds down
> > > >     osd.0 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) 
> > > > is down
> > > >     osd.6 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.7 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.8 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.16 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.18 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.19 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.21 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.31 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.37 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.38 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) 
> > > > is down
> > > >     osd.48 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.51 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) 
> > > > is down
> > > >     osd.53 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) 
> > > > is down
> > > >     osd.55 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) 
> > > > is down
> > > >     osd.62 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.67 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.72 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) 
> > > > is down
> > > >     osd.75 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.78 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.79 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.80 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.81 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.82 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14)
> > > >  is down
> > > >     osd.83 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.88 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.89 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.92 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.93 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.95 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.96 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.97 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.100 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.104 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.105 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.107 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.108 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.109 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.111 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14)
> > > >  is down
> > > >     osd.113 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.114 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.116 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.117 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.119 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.122 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.123 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.124 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.125 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.126 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.128 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.131 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.134 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.139 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.140 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.141 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.145 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.149 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.151 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.152 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.153 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.154 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14)
> > > >  is down
> > > >     osd.155 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.156 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.157 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) 
> > > > is down
> > > >     osd.159 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) 
> > > > is down
> > > >     osd.161 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.162 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.164 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.165 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.166 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.167 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14)
> > > >  is down
> > > >     osd.171 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.172 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) 
> > > > is down
> > > >     osd.174 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.176 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.177 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13)
> > > >  is down
> > > >     osd.179 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.182 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-06) 
> > > > is down
> > > >     osd.183 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) 
> > > > is down
> > > >     osd.184 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.186 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.187 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.190 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14)
> > > >  is down
> > > >     osd.191 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15)
> > > >  is down
> > > >     osd.194 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.195 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.196 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.199 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.200 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.201 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.202 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.203 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.204 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.208 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.210 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08)
> > > >  is down
> > > >     osd.212 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.213 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.214 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.215 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10)
> > > >  is down
> > > >     osd.216 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.218 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09)
> > > >  is down
> > > >     osd.219 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11)
> > > >  is down
> > > >     osd.221 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12)
> > > >  is down
> > > >     osd.224 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16)
> > > >  is down
> > > >     osd.226 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17)
> > > >  is down
> > > >     osd.228 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) 
> > > > is down
> > > >     osd.230 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) 
> > > > is down
> > > >     osd.233 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) 
> > > > is down
> > > >     osd.236 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) 
> > > > is down
> > > >     osd.238 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.247 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) 
> > > > is down
> > > >     osd.248 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.254 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.256 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) 
> > > > is down
> > > >     osd.259 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.260 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) 
> > > > is down
> > > >     osd.262 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) 
> > > > is down
> > > >     osd.266 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.267 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) 
> > > > is down
> > > >     osd.272 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) 
> > > > is down
> > > >     osd.274 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) 
> > > > is down
> > > >     osd.275 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) 
> > > > is down
> > > >     osd.276 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) 
> > > > is down
> > > >     osd.281 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) 
> > > > is down
> > > >     osd.285 
> > > > (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) 
> > > > is down
> > > > OSD_HOST_DOWN 3 hosts (48 osds) down
> > > >     host ceph-11 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 
> > > > osds) is down
> > > >     host ceph-10 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 
> > > > osds) is down
> > > >     host ceph-13 
> > > > (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 
> > > > osds) is down
> > > > PG_AVAILABILITY Reduced data availability: 2219 pgs inactive, 1943 pgs 
> > > > down, 188 pgs peering, 13 pgs stale
> > > >     pg 14.513 is stuck inactive for 1681.564244, current state down, 
> > > > last acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,143,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.514 is down, acting 
> > > > [193,2147483647,2147483647,2147483647,2147483647,118,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.515 is down, acting 
> > > > [2147483647,2147483647,2147483647,211,133,135,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.516 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647]
> > > >     pg 14.517 is down, acting 
> > > > [2147483647,2147483647,5,2147483647,2147483647,2147483647,2147483647,2147483647,61,112]
> > > >     pg 14.518 is down, acting 
> > > > [2147483647,198,2147483647,2147483647,2147483647,2147483647,4,185,2147483647,2147483647]
> > > >     pg 14.519 is down, acting 
> > > > [2147483647,2147483647,68,2147483647,2147483647,2147483647,2147483647,185,2147483647,94]
> > > >     pg 14.51a is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,101,2147483647]
> > > >     pg 14.51b is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,197,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.51c is down, acting 
> > > > [193,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,197]
> > > >     pg 14.51d is down, acting 
> > > > [2147483647,2147483647,61,2147483647,77,2147483647,2147483647,2147483647,112,2147483647]
> > > >     pg 14.51e is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,193,2147483647,2147483647]
> > > >     pg 14.51f is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,94,2147483647,2147483647]
> > > >     pg 14.520 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,207,2147483647,101,133,2147483647]
> > > >     pg 14.521 is down, acting 
> > > > [205,2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,193]
> > > >     pg 14.522 is down, acting 
> > > > [101,2147483647,2147483647,11,197,2147483647,136,94,2147483647,2147483647]
> > > >     pg 14.523 is down, acting 
> > > > [2147483647,2147483647,2147483647,118,2147483647,71,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.524 is down, acting 
> > > > [2147483647,111,2147483647,2147483647,2147483647,8,2147483647,112,2147483647,2147483647]
> > > >     pg 14.525 is down, acting 
> > > > [2147483647,2147483647,2147483647,142,2147483647,61,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.526 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,61,193,2147483647,2147483647,2147483647]
> > > >     pg 14.527 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,109,2147483647,2147483647]
> > > >     pg 14.528 is down, acting 
> > > > [2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,2147483647,2147483647]
> > > >     pg 14.529 is down, acting 
> > > > [2147483647,112,2147483647,2147483647,2147483647,2147483647,185,2147483647,118,2147483647]
> > > >     pg 14.52a is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,136,2147483647,135,2147483647,2147483647]
> > > >     pg 14.52b is down, acting 
> > > > [2147483647,2147483647,2147483647,112,142,211,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.52c is down, acting 
> > > > [185,2147483647,198,2147483647,118,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.52d is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,5,2147483647,2147483647,2147483647]
> > > >     pg 14.52e is down, acting 
> > > > [71,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647]
> > > >     pg 14.52f is down, acting 
> > > > [198,2147483647,2147483647,2147483647,2147483647,11,2147483647,2147483647,118,2147483647]
> > > >     pg 14.530 is down, acting 
> > > > [142,2147483647,2147483647,2147483647,133,2147483647,2147483647,2147483647,2147483647,112]
> > > >     pg 14.531 is down, acting 
> > > > [2147483647,142,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.532 is down, acting 
> > > > [135,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,118]
> > > >     pg 14.533 is down, acting 
> > > > [2147483647,77,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.534 is down, acting 
> > > > [2147483647,2147483647,2147483647,185,118,2147483647,2147483647,207,2147483647,2147483647]
> > > >     pg 14.535 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,142,133,2147483647]
> > > >     pg 14.536 is down, acting 
> > > > [2147483647,11,2147483647,2147483647,136,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.537 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,77,2147483647]
> > > >     pg 14.538 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647,2147483647]
> > > >     pg 14.539 is down, acting 
> > > > [2147483647,2147483647,2147483647,198,2147483647,2147483647,4,2147483647,2147483647,2147483647]
> > > >     pg 14.53a is down, acting 
> > > > [2147483647,11,136,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.53b is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.53c is down, acting 
> > > > [2147483647,2147483647,2147483647,71,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.53d is down, acting 
> > > > [2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647,2147483647,136]
> > > >     pg 14.53e is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,112,185]
> > > >     pg 14.53f is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647]
> > > >     pg 14.540 is down, acting 
> > > > [205,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647,112,77]
> > > >     pg 14.541 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,197,211,2147483647,2147483647,2147483647]
> > > >     pg 14.542 is down, acting 
> > > > [112,2147483647,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.543 is down, acting 
> > > > [111,2147483647,2147483647,2147483647,2147483647,101,2147483647,2147483647,2147483647,2147483647]
> > > >     pg 14.544 is down, acting 
> > > > [4,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205]
> > > >     pg 14.545 is down, acting 
> > > > [2147483647,2147483647,2147483647,2147483647,2147483647,142,5,2147483647,2147483647,2147483647]
> > > > PG_DEGRADED Degraded data redundancy: 5214696/500993589 objects 
> > > > degraded (1.041%), 298 pgs degraded, 299 pgs undersized
> > > >     pg 1.29 is stuck undersized for 2075.633328, current state 
> > > > active+undersized+degraded, last acting [253,258]
> > > >     pg 1.2a is stuck undersized for 1642.864920, current state 
> > > > active+undersized+degraded, last acting [252,255]
> > > >     pg 1.2b is stuck undersized for 2355.149928, current state 
> > > > active+undersized+degraded+remapped+backfill_wait, last acting [240,268]
> > > >     pg 1.2c is stuck undersized for 1459.277329, current state 
> > > > active+undersized+degraded, last acting [241,273]
> > > >     pg 1.2d is stuck undersized for 803.339131, current state 
> > > > undersized+degraded+peered, last acting [282]
> > > >     pg 2.25 is active+undersized+degraded, acting 
> > > > [253,2147483647,2147483647,258,261,273,277,243]
> > > >     pg 2.28 is stuck undersized for 803.340163, current state 
> > > > active+undersized+degraded, last acting 
> > > > [282,241,246,2147483647,273,252,2147483647,268]
> > > >     pg 2.29 is stuck undersized for 803.341160, current state 
> > > > active+undersized+degraded, last acting 
> > > > [240,258,277,264,2147483647,2147483647,271,250]
> > > >     pg 2.2a is stuck undersized for 1447.684978, current state 
> > > > active+undersized+degraded+remapped+backfilling, last acting 
> > > > [252,270,2147483647,261,2147483647,255,287,264]
> > > >     pg 2.2e is stuck undersized for 2030.849944, current state 
> > > > active+undersized+degraded, last acting 
> > > > [264,2147483647,251,245,257,286,261,258]
> > > >     pg 2.51 is stuck undersized for 1459.274671, current state 
> > > > active+undersized+degraded+remapped+backfilling, last acting 
> > > > [270,2147483647,2147483647,265,241,243,240,252]
> > > >     pg 2.52 is stuck undersized for 2030.850897, current state 
> > > > active+undersized+degraded+remapped+backfilling, last acting 
> > > > [240,2147483647,270,265,269,280,278,2147483647]
> > > >     pg 2.53 is stuck undersized for 1459.273517, current state 
> > > > active+undersized+degraded, last acting 
> > > > [261,2147483647,280,282,2147483647,245,243,241]
> > > >     pg 2.61 is stuck undersized for 2075.633140, current state 
> > > > active+undersized+degraded+remapped+backfilling, last acting 
> > > > [269,2147483647,258,286,270,255,2147483647,264]
> > > >     pg 2.62 is stuck undersized for 803.340577, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,253,258,2147483647,250,287,264,284]
> > > >     pg 2.66 is stuck undersized for 803.341231, current state 
> > > > active+undersized+degraded, last acting 
> > > > [264,280,265,255,257,269,2147483647,270]
> > > >     pg 2.6c is stuck undersized for 963.369539, current state 
> > > > active+undersized+degraded, last acting 
> > > > [286,269,278,251,2147483647,273,2147483647,280]
> > > >     pg 2.70 is stuck undersized for 873.662725, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,268,255,273,253,265,278,2147483647]
> > > >     pg 2.74 is stuck undersized for 2075.632312, current state 
> > > > active+undersized+degraded+remapped+backfilling, last acting 
> > > > [240,242,2147483647,245,243,269,2147483647,265]
> > > >     pg 3.24 is stuck undersized for 1570.800184, current state 
> > > > active+undersized+degraded, last acting [235,263]
> > > >     pg 3.25 is stuck undersized for 733.673503, current state 
> > > > undersized+degraded+peered, last acting [232]
> > > >     pg 3.28 is stuck undersized for 2610.307886, current state 
> > > > active+undersized+degraded, last acting [263,84]
> > > >     pg 3.2a is stuck undersized for 1214.710839, current state 
> > > > active+undersized+degraded, last acting [181,232]
> > > >     pg 3.2b is stuck undersized for 2075.630671, current state 
> > > > active+undersized+degraded, last acting [63,144]
> > > >     pg 3.52 is stuck undersized for 1570.777598, current state 
> > > > active+undersized+degraded, last acting [158,237]
> > > >     pg 3.54 is stuck undersized for 1350.257189, current state 
> > > > active+undersized+degraded, last acting [239,74]
> > > >     pg 3.55 is stuck undersized for 2592.642531, current state 
> > > > active+undersized+degraded, last acting [157,233]
> > > >     pg 3.5a is stuck undersized for 2075.608257, current state 
> > > > undersized+degraded+peered, last acting [168]
> > > >     pg 3.5c is stuck undersized for 733.674836, current state 
> > > > active+undersized+degraded, last acting [263,234]
> > > >     pg 3.5d is stuck undersized for 2610.307220, current state 
> > > > active+undersized+degraded, last acting [180,84]
> > > >     pg 3.5e is stuck undersized for 1710.756037, current state 
> > > > undersized+degraded+peered, last acting [146]
> > > >     pg 3.61 is stuck undersized for 1080.210021, current state 
> > > > active+undersized+degraded, last acting [168,239]
> > > >     pg 3.62 is stuck undersized for 831.217622, current state 
> > > > active+undersized+degraded, last acting [84,263]
> > > >     pg 3.63 is stuck undersized for 733.674204, current state 
> > > > active+undersized+degraded, last acting [263,232]
> > > >     pg 3.65 is stuck undersized for 1570.790824, current state 
> > > > active+undersized+degraded, last acting [63,84]
> > > >     pg 3.66 is stuck undersized for 733.682973, current state 
> > > > undersized+degraded+peered, last acting [63]
> > > >     pg 3.68 is stuck undersized for 1570.624462, current state 
> > > > active+undersized+degraded, last acting [229,148]
> > > >     pg 3.69 is stuck undersized for 1350.316213, current state 
> > > > undersized+degraded+peered, last acting [235]
> > > >     pg 3.6b is stuck undersized for 783.813654, current state 
> > > > undersized+degraded+peered, last acting [63]
> > > >     pg 3.6c is stuck undersized for 783.819083, current state 
> > > > undersized+degraded+peered, last acting [229]
> > > >     pg 3.6f is stuck undersized for 2610.321349, current state 
> > > > active+undersized+degraded, last acting [232,158]
> > > >     pg 3.72 is stuck undersized for 1350.358149, current state 
> > > > active+undersized+degraded, last acting [229,74]
> > > >     pg 3.73 is stuck undersized for 1570.788310, current state 
> > > > undersized+degraded+peered, last acting [234]
> > > >     pg 11.20 is stuck undersized for 733.682510, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,239,87,2147483647,158,237,63,76]
> > > >     pg 11.26 is stuck undersized for 1914.334332, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,237,2147483647,263,158,148,181,180]
> > > >     pg 11.2d is stuck undersized for 1350.365988, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,2147483647,73,229,86,158,169,84]
> > > >     pg 11.54 is stuck undersized for 1914.398125, current state 
> > > > active+undersized+degraded, last acting 
> > > > [231,169,2147483647,229,84,85,237,63]
> > > >     pg 11.5b is stuck undersized for 2047.980719, current state 
> > > > active+undersized+degraded, last acting 
> > > > [86,237,168,263,144,1,229,2147483647]
> > > >     pg 11.5e is stuck undersized for 873.643661, current state 
> > > > active+undersized+degraded, last acting 
> > > > [181,2147483647,229,158,231,1,169,2147483647]
> > > >     pg 11.62 is stuck undersized for 1144.491696, current state 
> > > > active+undersized+degraded, last acting 
> > > > [2147483647,85,235,74,63,234,181,2147483647]
> > > >     pg 11.6f is stuck undersized for 873.646628, current state 
> > > > active+undersized+degraded, last acting 
> > > > [234,3,2147483647,158,180,63,2147483647,181]
> > > > SLOW_OPS 9788 slow ops, oldest one blocked for 2953 sec, daemons 
> > > > [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]...
> > > >  have slow ops.
> > > >
> > > >
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
