Hi David,

Yes, health detail outputs all the errors etc. and recovery/backfill is going on, just taking time: 25% misplaced and 1.5% degraded.
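For reference, a minimal way to keep an eye on that recovery from the CLI (assuming the admin keyring is available on the node the commands are run from) is roughly:

  ceph health detail       # lists each degraded/misplaced PG and the blocked requests
  ceph -w                  # streams the cluster log, including recovery/backfill progress
  watch -n 10 'ceph -s'    # re-polls the one-page summary every 10 seconds

These are all read-only and safe to run while backfill is still in progress.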
I can list out the pools and see sizes etc. My main problem is that I have no client IO from a read perspective: I cannot start VMs in OpenStack, and ceph -w shows that rd is zero most of the time. I can restart the nodes and the OSDs reconnect and come in correctly. I was chasing a missing object, but was able to correct that by setting the crush weight to 0.0 on the secondary.

Kind regards,
Lee

On Sun, 2 Sep 2018, 13:43 David C, <dcsysengin...@gmail.com> wrote:

> Does "ceph health detail" work?
> Have you manually confirmed the OSDs on the nodes are working?
> What was the replica size of the pools?
> Are you seeing any progress with the recovery?
>
> On Sun, Sep 2, 2018 at 9:42 AM Lee <lqui...@gmail.com> wrote:
>
>> Running 0.94.5 as part of an OpenStack environment. Our Ceph setup is
>> 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our
>> hosting environment: one OSD node failed (offline, with the journal SSD
>> dead), leaving us with 2 nodes running correctly. Two hours later a
>> second OSD node failed, complaining of read/write errors to the
>> physical drives; I assume this was a heat issue, as when rebooted it
>> came back online OK and Ceph started to repair itself. We have since
>> brought the first failed node back on by replacing the SSD and
>> recreating the journals, hoping it would all repair. Our pools are
>> min 2 replicas.
>>
>> The problem we have is that client IO (read) is totally blocked, and
>> when I query the stuck PGs it just hangs.
>>
>> For example, the check version command just errors with "Error EINTR:
>> problem getting command descriptions from" on various OSDs, so I cannot
>> even query the inactive PGs.
>>
>> root@node31-a4:~# ceph -s
>>     cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
>>      health HEALTH_WARN
>>             83 pgs backfill
>>             2 pgs backfill_toofull
>>             3 pgs backfilling
>>             48 pgs degraded
>>             1 pgs down
>>             31 pgs incomplete
>>             1 pgs recovering
>>             29 pgs recovery_wait
>>             1 pgs stale
>>             48 pgs stuck degraded
>>             31 pgs stuck inactive
>>             1 pgs stuck stale
>>             148 pgs stuck unclean
>>             17 pgs stuck undersized
>>             17 pgs undersized
>>             599 requests are blocked > 32 sec
>>             recovery 111489/4697618 objects degraded (2.373%)
>>             recovery 772268/4697618 objects misplaced (16.440%)
>>             recovery 1/2171314 unfound (0.000%)
>>      monmap e5: 3 mons at {bc07s12-a7=172.27.16.11:6789/0,bc07s13-a7=172.27.16.21:6789/0,bc07s14-a7=172.27.16.15:6789/0}
>>             election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
>>      osdmap e18727: 25 osds: 25 up, 25 in; 90 remapped pgs
>>       pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
>>             16783 GB used, 6487 GB / 23270 GB avail
>>             111489/4697618 objects degraded (2.373%)
>>             772268/4697618 objects misplaced (16.440%)
>>             1/2171314 unfound (0.000%)
>>                 1639 active+clean
>>                   66 active+remapped+wait_backfill
>>                   30 incomplete
>>                   25 active+recovery_wait+degraded
>>                   15 active+undersized+degraded+remapped+wait_backfill
>>                    4 active+recovery_wait+degraded+remapped
>>                    4 active+clean+scrubbing
>>                    2 active+remapped+wait_backfill+backfill_toofull
>>                    1 down+incomplete
>>                    1 active+remapped+backfilling
>>                    1 active+clean+scrubbing+deep
>>                    1 stale+active+undersized+degraded
>>                    1 active+undersized+degraded+remapped+backfilling
>>                    1 active+degraded+remapped+backfilling
>>                    1 active+recovering+degraded
>>   recovery io 29385 kB/s, 7 objects/s
>>   client io 5877 B/s wr, 1 op/s
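As a rough sketch of the commands discussed above (the OSD and PG ids below are placeholders, not values taken from this thread):

  ceph pg dump_stuck inactive            # list the PGs reported as stuck inactive
  ceph pg 2.1f query                     # query one PG; this hangs if its primary OSD is not answering
  ceph tell osd.* version                # the "check version" command; EINTR here usually means an OSD is not responding to admin commands
  ceph daemon osd.12 dump_ops_in_flight  # run on that OSD's host: show what the blocked requests are waiting on
  ceph osd crush reweight osd.12 0.0     # presumably the "crush weight to 0.0" fix described above, applied to the secondary's OSD

These are all standard Hammer-era (0.94.x) commands; the crush reweight is the only one that changes cluster state, so double-check the OSD id against the actual cluster before running it.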