For anyone reading this in the future from a Google search: please don't set osd_find_best_info_ignore_history_les unless you know exactly what you are doing. That's a really dangerous option and should be a last resort. It will almost definitely lead to some data loss or inconsistencies (lost writes).
However, it is unfortunately sometimes required to do that when running with min_size 1 (which you also should never do if you care about your data).

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 3, 2019 at 8:52 AM Ian Coetzee <c...@iancoetzee.za.net> wrote:

> Hi All,
>
> Some feedback on my end. I managed to recover the "lost data" from one of
> the other OSDs. Seems like my initial summary was a bit off, in that the
> PGs were replicated; Ceph just wanted to confirm that the objects were
> still relevant.
>
> For future reference, I basically marked the OSD as lost:
>
>     ceph osd lost <id>
>
> Then the PGs went into an incomplete state.
>
> After that I temporarily set an option on the OSDs to ignore the history
> (osd_find_best_info_ignore_history_les). Got the info from
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html
>
> After that Ceph was happy and started to rebalance the cluster. Phew,
> crisis averted.
>
> This failure did however convince me to increase our cluster size from 2:1
> to 3:2, sacrificing usable space for reliability.
>
> Now I need to give feedback on what happened. This is what I am still not
> sure about, as SMART does not show any sector errors. I might as well start
> a badblocks run and see if I detect anything in there.
>
> As always, I am open to other suggestions as to where to look for other
> clues on what went wrong.
>
> Kind regards
>
> On Mon, 1 Jul 2019 at 09:31, Ian Coetzee <c...@iancoetzee.za.net> wrote:
>
>> Hi Guys,
>>
>> This is a cross-post from the proxmox ML.
>>
>> This morning I have a bit of a big boo-boo on our production system.
>>
>> After a very sudden network outage somewhere during the night, one of my
>> ceph-osds is no longer starting up.
>>
>> If I try and start it manually, I get a very spectacular failure, see
>> link.
>>
>> https://www.jacklin.co.za/zerobin/?04e2dcd13ab8dfc8#zKCISUvAm4o/6mnLmyu+8fSS1VumC65XaETt/dD7rn0=
>>
>> As near as I can tell, it seems to be asserting whether a file exists. I
>> have yet to determine which file that would be. Any pointers are welcome,
>> as well as any other ideas to get the OSD back. For some reason there is
>> data on the OSD that was not replicated to my other OSDs, so I cannot
>> just re-init this OSD as some of the posts I could find suggest.
>>
>> Kind regards
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
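
For anyone who lands here and wants the quoted steps in one place, below is a rough sketch of the sequence as a script that only prints the commands and never runs them, given how dangerous the option is. Several things here are assumptions and not from the thread: the OSD ids and pool name are placeholders, the option is shown being set at runtime through `ceph tell ... injectargs` (the quoted message does not say how it was set), and "2:1 to 3:2" is read as pool size/min_size.

```shell
#!/usr/bin/env bash
# Prints (does not run) the recovery sequence from the quoted message.
# OSD_ID, PEER_OSD and POOL are hypothetical placeholders, not values from the thread.
set -euo pipefail

OSD_ID=12      # the OSD that no longer starts (placeholder)
PEER_OSD=3     # a surviving OSD that holds the incomplete PGs (placeholder)
POOL=rbd       # pool to re-size afterwards (placeholder)
OPT=osd_find_best_info_ignore_history_les

# Print one command of the plan, prefixed with "+ ".
plan() { printf '+ %s\n' "$*"; }

recovery_plan() {
  # 1. Declare the dead OSD lost, then list PGs that are stuck inactive
  #    (the thread reports them going incomplete at this point).
  plan ceph osd lost "$OSD_ID" --yes-i-really-mean-it
  plan ceph pg dump_stuck inactive

  # 2. Last resort: tell a surviving OSD to ignore the history check.
  #    Assumption: set at runtime via injectargs; a re-peer or OSD restart
  #    may be needed before it takes effect.
  plan ceph tell "osd.$PEER_OSD" injectargs "--$OPT=true"

  # 3. Once the PGs are active again, turn the option back off.
  plan ceph tell "osd.$PEER_OSD" injectargs "--$OPT=false"

  # 4. Move the pool from size 2 / min_size 1 to size 3 / min_size 2.
  plan ceph osd pool set "$POOL" size 3
  plan ceph osd pool set "$POOL" min_size 2
}

recovery_plan
```

Printing the plan keeps the destructive steps a deliberate, manual act. Step 2 in particular should only be copied out and run after reading the warning at the top of this message.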