[ceph-users] ceph-osd not starting after network related issues

2019-07-01 Thread Ian Coetzee
Hi Guys,

This is a cross-post from the proxmox ML.

This morning I had a bit of a big boo-boo on our production system.

After a very sudden network outage sometime during the night, one of my
ceph-osd daemons is no longer starting up.

If I try to start it manually, I get a very spectacular failure; see the link:
https://www.jacklin.co.za/zerobin/?04e2dcd13ab8dfc8#zKCISUvAm4o/6mnLmyu+8fSS1VumC65XaETt/dD7rn0=

As near as I can tell, it seems to be failing an assertion about whether a
file exists, but I have yet to determine which file that would be. Any
pointers are welcome, as well as any other ideas for getting the OSD back.
For some reason there is data on this OSD that was not replicated to my
other OSDs, so I cannot just re-init it as some of the posts I could find
suggest.

Kind regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd not starting after network related issues

2019-07-02 Thread Ian Coetzee
Hi All,

Some feedback from my end. I managed to recover the "lost data" from one of
the other OSDs. It seems my initial summary was a bit off: the PGs were
replicated, and Ceph just wanted to confirm that the objects were still
relevant.

For future reference, I basically marked the OSD as lost:

> ceph osd lost 

Then the PGs went into an incomplete state.

After that I temporarily set an option on the OSDs to ignore the history
(osd_find_best_info_ignore_history_les). Got the info from
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/017270.html
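
The steps above can be sketched as a small dry-run script. This is an
illustration under assumptions, not the exact commands used in this thread:
the OSD id 12 is a placeholder (the real id is not given), the
--yes-i-really-mean-it confirmation and the use of "ceph tell ... injectargs"
are assumptions about how the steps would typically be applied, and the
thread does not confirm whether the option takes effect at runtime or needs
an OSD restart. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_plan <osd-id>: osd-id is a placeholder for the failed OSD.
recover_plan() {
    osd_id="$1"
    # 1. Mark the dead OSD as lost (destructive, so Ceph asks for confirmation).
    run ceph osd lost "$osd_id" --yes-i-really-mean-it
    # 2. Check which PGs are now reported as incomplete.
    run ceph health detail
    # 3. Temporarily let peering ignore the last_epoch_started history.
    run ceph tell 'osd.*' injectargs --osd_find_best_info_ignore_history_les=true
    # 4. Revert the option once the PGs have peered again.
    run ceph tell 'osd.*' injectargs --osd_find_best_info_ignore_history_les=false
}

recover_plan 12
```

Keeping the revert in the same plan is deliberate: the option is only meant
to be set temporarily, as described above.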

After that Ceph was happy and started to rebalance the cluster. Phew,
crisis averted.

This failure did, however, convince me to increase our cluster size from 2:1
to 3:2, sacrificing usable space for reliability.
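
Assuming "2:1" and "3:2" refer to the pool's size and min_size (replica
count and minimum replicas needed to serve I/O), the change could look
roughly like the dry-run sketch below. The pool name "rbd" is a
placeholder, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# resize_plan <pool>: pool is a placeholder name, not taken from the thread.
resize_plan() {
    pool="$1"
    # Keep three copies of every object ...
    run ceph osd pool set "$pool" size 3
    # ... and require at least two of them to be available for I/O.
    run ceph osd pool set "$pool" min_size 2
}

resize_plan rbd
```

Raising size triggers backfill to create the extra replica, so usable
capacity drops to roughly a third of raw capacity.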

Now I need to give feedback on what happened, and this is what I am still
not sure about, as SMART does not show any sector errors. I might as well
start a badblocks run and see if it detects anything.
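
A disk check along those lines might look like the dry-run sketch below.
The device path /dev/sdX is a placeholder, and the choice of flags
(smartctl -a for the full SMART report, badblocks -sv for a read-only scan
with progress) is an assumption rather than what was actually run. The
script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# disk_check_plan <device>: device is a placeholder for the OSD's disk.
disk_check_plan() {
    dev="$1"
    # Full SMART report: attributes, error log and self-test log.
    run smartctl -a "$dev"
    # Read-only surface scan (the badblocks default), showing progress.
    run badblocks -sv "$dev"
}

disk_check_plan /dev/sdX
```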

As always, I am open to other suggestions as to where to look for clues on
what went wrong.

Kind regards
