> Hi David,
>
> Yes, health detail outputs all the errors etc. and recovery/backfill is
> going on, it is just taking time: 25% misplaced and 1.5% degraded.
>
> I can list out the pools and see sizes etc.
>
> My main problem is that I have no client IO from a read perspective: I
> cannot start VMs in OpenStack, and ceph -w shows that rd is zero most of
> the time.
>
> I can restart the nodes and the OSDs reconnect and come in correctly.
>
> I was chasing a missing object but was able to correct that by setting
> the CRUSH weight to 0.0 on the secondary.
>
> Kind regards
>
> Lee
>
> On Sun, 2 Sep 2018, 13:43 David C <dcsysengin...@gmail.com> wrote:
>
>> Does "ceph health detail" work?
>> Have you manually confirmed the OSDs on the nodes are working?
>> What was the replica size of the pools?
>> Are you seeing any progress with the recovery?
>>
>> On Sun, Sep 2, 2018 at 9:42 AM Lee <lqui...@gmail.com> wrote:
>>
>>> Running 0.94.5 as part of an OpenStack environment; our Ceph setup is
>>> 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our
>>> hosting environment: 1 OSD node failed (offline, with the journal SSD
>>> dead), leaving 2 nodes running correctly. 2 hours later a second OSD
>>> node failed, complaining of read/write errors to the physical drives; I
>>> assume this was a heat issue, as when rebooted it came back online OK
>>> and Ceph started to repair itself. We have since brought the first
>>> failed node back on by replacing the SSD and recreating the journals,
>>> hoping it would all repair. Our pools are min 2 repl.
>>>
>>> The problem we have is that client IO (read) is totally blocked, and
>>> when I query the stuck PGs it just hangs.
>>>
>>> For example, the check version command just errors with "Error EINTR:
>>> problem getting command descriptions from" on various OSDs, so I cannot
>>> even query the inactive PGs.
>>>
>>> root@node31-a4:~# ceph -s
>>>     cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
>>>      health HEALTH_WARN
>>>             83 pgs backfill
>>>             2 pgs backfill_toofull
>>>             3 pgs backfilling
>>>             48 pgs degraded
>>>             1 pgs down
>>>             31 pgs incomplete
>>>             1 pgs recovering
>>>             29 pgs recovery_wait
>>>             1 pgs stale
>>>             48 pgs stuck degraded
>>>             31 pgs stuck inactive
>>>             1 pgs stuck stale
>>>             148 pgs stuck unclean
>>>             17 pgs stuck undersized
>>>             17 pgs undersized
>>>             599 requests are blocked > 32 sec
>>>             recovery 111489/4697618 objects degraded (2.373%)
>>>             recovery 772268/4697618 objects misplaced (16.440%)
>>>             recovery 1/2171314 unfound (0.000%)
>>>      monmap e5: 3 mons at {bc07s12-a7=172.27.16.11:6789/0,bc07s13-a7=172.27.16.21:6789/0,bc07s14-a7=172.27.16.15:6789/0}
>>>             election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
>>>      osdmap e18727: 25 osds: 25 up, 25 in; 90 remapped pgs
>>>       pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
>>>             16783 GB used, 6487 GB / 23270 GB avail
>>>             111489/4697618 objects degraded (2.373%)
>>>             772268/4697618 objects misplaced (16.440%)
>>>             1/2171314 unfound (0.000%)
>>>                 1639 active+clean
>>>                   66 active+remapped+wait_backfill
>>>                   30 incomplete
>>>                   25 active+recovery_wait+degraded
>>>                   15 active+undersized+degraded+remapped+wait_backfill
>>>                    4 active+recovery_wait+degraded+remapped
>>>                    4 active+clean+scrubbing
>>>                    2 active+remapped+wait_backfill+backfill_toofull
>>>                    1 down+incomplete
>>>                    1 active+remapped+backfilling
>>>                    1 active+clean+scrubbing+deep
>>>                    1 stale+active+undersized+degraded
>>>                    1 active+undersized+degraded+remapped+backfilling
>>>                    1 active+degraded+remapped+backfilling
>>>                    1 active+recovering+degraded
>>> recovery io 29385 kB/s, 7 objects/s
>>>   client io 5877 B/s wr, 1 op/s
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
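The checks discussed in the thread map roughly onto the commands below on a
Hammer (0.94.x) cluster. This is only a sketch: the OSD and PG identifiers
(osd.12, 17.1a) are placeholders, not values taken from this cluster.

    # cluster-wide state and the per-PG detail David asks about
    ceph health detail
    ceph osd tree
    ceph df

    # list the stuck PGs, then query an individual one
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    ceph pg 17.1a query

    # the workaround Lee describes: take an OSD out of data placement
    # by zeroing its CRUSH weight, so its PGs backfill elsewhere
    ceph osd crush reweight osd.12 0.0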
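"Error EINTR: problem getting command descriptions" generally means the target
OSD never answered the request, which fits the blocked client IO described
above. When ceph tell and ceph pg <pgid> query hang like that, the daemons can
usually still be inspected locally over their admin sockets. A rough sketch,
assuming the default socket layout under /var/run/ceph and using osd.7 as a
placeholder:

    # run on the OSD host itself
    ceph daemon osd.7 version                # is the daemon responsive at all?
    ceph daemon osd.7 dump_ops_in_flight     # which requests is it sitting on?
    ceph daemon osd.7 dump_historic_ops      # recent slow operations

    # equivalent long form, pointing at the socket directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_ops_in_flight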