Last two questions:

1) I have used other systems in the past. In case of split brain or other serious problems, they let me choose which copy was the "good" one and then kept working. Is there a way to tell Ceph that everything is OK? This morning, after recovery, I again have 19 incomplete pgs.

2) Where can I find paid support? I mean someone who logs in to my cluster and tells Ceph that everything is active+clean.

Thanks,
Mario
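(For question 1, a minimal command sketch for inspecting which pgs are incomplete and why, assuming a reasonably recent Ceph CLI; the pg id 2.18 below is a placeholder, not taken from this cluster:

    # Show which pgs are unhealthy and which OSDs each one maps to
    ceph health detail
    ceph pg dump_stuck inactive

    # Query one incomplete pg to see its peering state and which OSDs it is
    # probing for data (replace 2.18 with a pg id from the output above)
    ceph pg 2.18 query
)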
On Wed, 29 Jun 2016 at 16:08, Mario Giammarco <mgiamma...@gmail.com> wrote:

> This time, at the end of the recovery procedure you described, most pgs
> were active+clean and 20 pgs were incomplete.
> After that, when trying to use the cluster, I got "request blocked more
> than" and no vm can start.
> I know that something happened after the broken disk, probably a server
> reboot. I am investigating.
> But even if I find the origin of the problem, it will not help in finding
> a solution now.
> So I am using my time to repair the pool, only to save the production
> data, and I will throw away the rest.
> Now, after marking all pgs as complete with ceph_objectstore_tool, I see
> that:
>
> 1) ceph has put out three hdds (I suppose due to scrub, but that is only
> my idea, I will check the logs) BAD
> 2) it is recovering the degraded and misplaced objects GOOD
> 3) vms are not usable yet BAD
> 4) I see some pgs in state down+peering (I hope that is not BAD)
>
> Regarding 1), how can I put those three hdds back into the cluster?
> Should I remove them from crush and start again?
> Can I tell ceph that they are not bad?
>
> Mario
>
> On Wed, 29 Jun 2016 at 15:34, Lionel Bouton <lionel+c...@bouton.name>
> wrote:
>
>> Hi,
>>
>> On 29/06/2016 12:00, Mario Giammarco wrote:
>> > Now the problem is that ceph has put out two disks because scrub has
>> > failed (I think it is not a disk fault but due to mark-complete)
>>
>> There is something odd going on. I've only seen deep-scrub failing (i.e.
>> detecting an inconsistency and marking the pg accordingly), so I'm not
>> sure what happens in the case of a "simple" scrub failure, but what
>> should not happen is the whole OSD going down on a scrub or deep-scrub
>> failure, which you seem to imply did happen.
>> Do you have logs for these two failures giving a hint at what happened
>> (probably /var/log/ceph/ceph-osd.<n>.log)? Any kernel log pointing to
>> hardware failure(s) around the time these events happened?
>>
>> Another point: you said that you had one disk "broken". Usually ceph
>> handles this case in the following manner:
>> - the OSD detects the problem and commits suicide (unless it's
>> configured to ignore IO errors, which is not the default),
>> - your cluster is then in degraded state with one OSD down/in,
>> - after a timeout (several minutes), Ceph decides that the OSD won't
>> come up again soon and marks the OSD "out" (so one OSD down/out),
>> - as the OSD is out, crush adapts pg positions based on the remaining
>> available OSDs and brings all degraded pgs back to the clean state by
>> creating the missing replicas while moving pgs around. You see a lot of
>> IO and many pgs in wait_backfill/backfilling states at this point,
>> - when all is done the cluster is back to HEALTH_OK.
>>
>> When your disk was broken and you waited 24 hours, how far along this
>> process was your cluster?
>>
>> Best regards,
>>
>> Lionel
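Regarding the question in the quoted message about putting the three hdds back into the cluster: if the disks themselves are healthy, there should normally be no need to remove them from crush; restarting the daemons and marking them "in" again is usually enough. A rough sketch, assuming a systemd-based install (the osd ids 3, 7 and 11 are placeholders):

    # See which OSDs are currently down and/or out
    ceph osd tree

    # Restart the daemons and mark them back in (use the real ids from the
    # output above; on non-systemd installs use the ceph init script instead)
    systemctl start ceph-osd@3 ceph-osd@7 ceph-osd@11
    ceph osd in 3 7 11

    # Optionally keep ceph from marking more OSDs out while recovery runs;
    # remember to unset it afterwards
    ceph osd set noout
    ceph osd unset noout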
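And for Lionel's questions about the OSD logs and how far along the failure-handling sequence the cluster is, a few commands that may help answer them (osd id 5 is a placeholder; adjust paths to the actual installation):

    # Look for the assert or I/O error that made an OSD stop
    tail -n 200 /var/log/ceph/ceph-osd.5.log
    grep -iE 'error|assert|abort' /var/log/ceph/ceph-osd.5.log | tail

    # Check the kernel log for disk problems around the same time
    dmesg | grep -iE 'ata|sd[a-z]|i/o error'

    # Watch where the cluster is in the down/out -> backfill -> HEALTH_OK
    # sequence described above
    ceph -s
    ceph -w
    ceph pg dump_stuck unclean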