Last two questions:

1) I have used other systems in the past. In case of split brain or other serious problems, they let me choose which copy was the "good" one and then kept working. Is there a way to tell Ceph that everything is OK? This morning, after recovery, I again have 19 incomplete pgs.

2) Where can I find paid support? I mean someone who logs in to my cluster and tells Ceph that everything is active+clean.

Thanks,
Mario
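(For question 1, a minimal command sketch for inspecting which pgs are incomplete and why, assuming a reasonably recent Ceph CLI; the pg id 2.18 below is a placeholder, not taken from this cluster:

    # Show which pgs are unhealthy and which OSDs each one maps to
    ceph health detail
    ceph pg dump_stuck inactive

    # Query one incomplete pg to see its peering state and which OSDs it is
    # probing for data (replace 2.18 with a pg id from the output above)
    ceph pg 2.18 query
)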
On Wed, 29 Jun 2016 at 16:08, Mario Giammarco <mgiamma...@gmail.com> wrote:

> This time, at the end of the recovery procedure you described, most pgs
> were active+clean and 20 pgs were incomplete.
> After that, when trying to use the cluster, I got "request blocked more
> than" and no vm can start.
> I know that something happened after the broken disk, probably a server
> reboot. I am investigating.
> But even if I find the origin of the problem, it will not help in finding
> a solution now.
> So I am using my time to repair the pool, only to save the production
> data, and I will throw away the rest.
> Now, after marking all pgs as complete with ceph_objectstore_tool, I see
> that:
>
> 1) ceph has put out three hdds (I suppose due to scrub, but that is only
> my idea, I will check the logs) BAD
> 2) it is recovering the degraded and misplaced objects GOOD
> 3) vms are not usable yet BAD
> 4) I see some pgs in state down+peering (I hope that is not BAD)
>
> Regarding 1), how can I put those three hdds back into the cluster?
> Should I remove them from crush and start again?
> Can I tell ceph that they are not bad?
>
> Mario
>
> On Wed, 29 Jun 2016 at 15:34, Lionel Bouton <lionel+c...@bouton.name>
> wrote:
>
>> Hi,
>>
>> On 29/06/2016 12:00, Mario Giammarco wrote:
>> > Now the problem is that ceph has put out two disks because scrub has
>> > failed (I think it is not a disk fault but due to mark-complete)
>>
>> There is something odd going on. I've only seen deep-scrub failing (i.e.
>> detecting an inconsistency and marking the pg accordingly), so I'm not
>> sure what happens in the case of a "simple" scrub failure, but what
>> should not happen is the whole OSD going down on a scrub or deep-scrub
>> failure, which you seem to imply did happen.
>> Do you have logs for these two failures giving a hint at what happened
>> (probably /var/log/ceph/ceph-osd.<n>.log)? Any kernel log pointing to
>> hardware failure(s) around the time these events happened?
>>
>> Another point: you said that you had one disk "broken". Usually ceph
>> handles this case in the following manner:
>> - the OSD detects the problem and commits suicide (unless it's
>> configured to ignore IO errors, which is not the default),
>> - your cluster is then in degraded state with one OSD down/in,
>> - after a timeout (several minutes), Ceph decides that the OSD won't
>> come up again soon and marks the OSD "out" (so one OSD down/out),
>> - as the OSD is out, crush adapts pg positions based on the remaining
>> available OSDs and brings all degraded pgs back to the clean state by
>> creating the missing replicas while moving pgs around. You see a lot of
>> IO and many pgs in wait_backfill/backfilling states at this point,
>> - when all is done the cluster is back to HEALTH_OK.
>>
>> When your disk was broken and you waited 24 hours, how far along this
>> process was your cluster?
>>
>> Best regards,
>>
>> Lionel
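Regarding the question in the quoted message about putting the three hdds back into the cluster: if the disks themselves are healthy, there should normally be no need to remove them from crush; restarting the daemons and marking them "in" again is usually enough. A rough sketch, assuming a systemd-based install (the osd ids 3, 7 and 11 are placeholders):

    # See which OSDs are currently down and/or out
    ceph osd tree

    # Restart the daemons and mark them back in (use the real ids from the
    # output above; on non-systemd installs use the ceph init script instead)
    systemctl start ceph-osd@3 ceph-osd@7 ceph-osd@11
    ceph osd in 3 7 11

    # Optionally keep ceph from marking more OSDs out while recovery runs;
    # remember to unset it afterwards
    ceph osd set noout
    ceph osd unset noout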
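And for Lionel's questions about the OSD logs and how far along the failure-handling sequence the cluster is, a few commands that may help answer them (osd id 5 is a placeholder; adjust paths to the actual installation):

    # Look for the assert or I/O error that made an OSD stop
    tail -n 200 /var/log/ceph/ceph-osd.5.log
    grep -iE 'error|assert|abort' /var/log/ceph/ceph-osd.5.log | tail

    # Check the kernel log for disk problems around the same time
    dmesg | grep -iE 'ata|sd[a-z]|i/o error'

    # Watch where the cluster is in the down/out -> backfill -> HEALTH_OK
    # sequence described above
    ceph -s
    ceph -w
    ceph pg dump_stuck unclean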