Re: [ceph-users] ceph-osd not starting after network related issues

Paul Emmerich Wed, 03 Jul 2019 06:48:20 -0700

For anyone reading this in the future from a Google search: please don't
set osd_find_best_info_ignore_history_les unless you know exactly what you
are doing.
That's a really dangerous option and should be a last resort. It will
almost definitely lead to some data loss or inconsistencies (lost writes).


However, it is unfortunately sometimes required to do that when running
with min_size 1 (which you also should never do if you care about your
data).
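
As a rough sketch of how one might check these settings, assuming a
replicated pool: the pool name "rbd" and the helper function names below
are placeholders, not taken from this thread. The last helper reads the
option through the OSD's admin socket, so it has to run on the host that
carries that OSD.

```shell
#!/bin/sh
# Sketch only: "rbd" and the function names are hypothetical placeholders.

# Print the replication settings of one pool.
show_pool_replication() {
    pool="$1"
    ceph osd pool get "$pool" size       # number of replicas kept
    ceph osd pool get "$pool" min_size   # replicas required before I/O is served
}

# Move a pool to 3 replicas, with at least 2 required for I/O.
# Raising size triggers backfill, so expect data movement.
set_pool_3_2() {
    pool="$1"
    ceph osd pool set "$pool" size 3
    ceph osd pool set "$pool" min_size 2
}

# Show whether the dangerous option is currently active on one OSD.
# Uses the admin socket, so run it on the host where osd.<id> lives.
show_ignore_history_les() {
    osd_id="$1"
    ceph daemon "osd.$osd_id" config get osd_find_best_info_ignore_history_les
}

# Example calls (not executed here):
#   show_pool_replication rbd
#   set_pool_3_2 rbd
#   show_ignore_history_les 12
```

If the option was enabled as a last resort, the third helper is a quick way
to confirm it has been switched back off once the PGs have recovered.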


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 3, 2019 at 8:52 AM Ian Coetzee <[email protected]> wrote:

> Hi All,
>
> Some feedback on my end. I managed to recover the "lost data" from one of
> the other OSDs. It seems my initial summary was a bit off: the PGs were
> replicated, and Ceph just wanted to confirm that the objects were still
> relevant.
>
> For future reference, I basically marked the OSD as lost:
>
> > ceph osd lost <id>
>
> Then the PGs went into an incomplete state.
>
> After that I temporarily set an option on the OSDs to ignore the history
> (osd_find_best_info_ignore_history_les). Got the info from
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html
>
> After that Ceph was happy and started to rebalance the cluster. Phew,
> crisis averted.
>
> This failure did however convince me to increase our replication from 2:1
> to 3:2 (size:min_size), sacrificing usable space for reliability.
>
> Now I need to report on what happened, which I am still not sure about,
> as SMART does not show any sector errors. I might as well run badblocks
> and see if it detects anything.
>
> As always, I am open to other suggestions as to where to look for other
> clues on what went wrong.
>
> Kind regards
>
> On Mon, 1 Jul 2019 at 09:31, Ian Coetzee <[email protected]> wrote:
>
>> Hi Guys,
>>
>> This is a cross-post from the proxmox ML.
>>
>> This morning I had a bit of a big boo-boo on our production system.
>>
>> After a very sudden network outage somewhere during the night, one of my
>> ceph-osds is no longer starting up.
>>
>> If I try and start it manually, I get a very spectacular failure, see
>> link.
>>
>> https://www.jacklin.co.za/zerobin/?04e2dcd13ab8dfc8#zKCISUvAm4o/6mnLmyu+8fSS1VumC65XaETt/dD7rn0=
>>
>> As near as I can tell, it seems to be asserting that a file exists; I
>> have yet to determine which file that would be. Any pointers are welcome,
>> as well as any other ideas to get the OSD back. For some reason there is
>> data on the OSD that was not replicated to my other OSDs, so I cannot
>> just re-init this OSD as some of the posts I could find suggest.
>>
>> Kind regards
>>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
