We have a Hammer cluster that experienced a similar power failure and ended
up corrupting our monitors' leveldb stores. I am still trying to repair ours,
but I can give you a few tips that seem to help.

1.) I would copy the database off somewhere safe right away. Just
opening it seems to change it.
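For example, here is roughly what that looks like as a hedged sketch (paths
assume the default locations, /root/backup is just a placeholder destination,
and the stop command depends on your init system; upstart syntax shown for
Ubuntu/Hammer):

# Stop the daemon so the store is quiescent before copying.
stop ceph-mon id=$(hostname)

# Copy the mon store and an OSD's omap directory somewhere safe, preserving
# attributes. /root/backup is an arbitrary example destination.
mkdir -p /root/backup
cp -a /var/lib/ceph/mon/ceph-$(hostname)/store.db /root/backup/mon-store.db
cp -a /var/lib/ceph/osd/ceph-1/current/omap /root/backup/osd-1-omap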

2.) Check out the ceph-test tools (ceph-objectstore-tool, ceph-kvstore-tool,
ceph-osdmap-tool, etc.). They let you list the keys/data in your
OSD leveldb, possibly export them, and get some bearings on what you need to
do to recover your map.
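For example (a sketch only; the exact syntax differs a bit between releases,
and the daemon has to be stopped before you poke at its store):

# List the keys in an OSD's omap leveldb. Newer releases want the backend
# type spelled out, e.g. "ceph-kvstore-tool leveldb <path> list".
ceph-kvstore-tool /var/lib/ceph/osd/ceph-1/current/omap list

# List the PGs and objects the OSD's filestore knows about.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --journal-path /var/lib/ceph/osd/ceph-1/journal --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --journal-path /var/lib/ceph/osd/ceph-1/journal --op list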


3.) I am making a few assumptions here: a.) you are using replication
for your pools, and b.) you are using either S3 or RBD, not CephFS.
From here, worst case, chances are your data is recoverable without the OSD and
monitor leveldb stores, so long as the rest of the data is okay. (The actual
rados objects are spread across each OSD in
'/var/lib/ceph/osd/ceph-*/current/blah_head'.)

If you use RBD there is a tool out there that lets you recover your RBD
images: https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
We only use S3, but recovery there seems to be doable as well:

As an example, we have a 9MB file that was stored in Ceph. I ran a find
across all of the OSDs in my cluster and compiled a list of files:

find /var/lib/ceph/osd/ceph-*/current/ -type f -iname \*this_is_my_File\.gzip\*

From here I ended up with a list that looks like the following:

This is the head. It's usually the bucket.id\file__head__

default.20283.1\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam__head_CA57D598__1
[_A_]\[_B_].[_C_]

default.20283.1\u\umultipart\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1__head_C338075C__1
[_A_]\[_D_]\[_B_].[_C_]

And for each of those you'll have matching shadow files:

default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u1__head_02F05634__1
[_A_]\[_E_]\[_B_].[_C_]

Here is another part of the multipart (this file only had one multipart part,
and we use multipart uploads for all files larger than 5MB):

default.20283.1\u\ushadow\ud975ef9e-c7b1-42c5-938b-d746fc2c7996\sC1635.TCGA-DJ-A3UP-10A-01D-A22D-08.1.bam.2\sYDDf8Qip4tn5YxQWfOmTt5fgm7o9Tw6.1\u2__head_1EA07BDF__1
[_A_]\[_E_]\[_B_].[_C_]

Notice the different part number (\u2 instead of \u1) at the end of that name.

A is the bucket.id and is the same for every object in the same bucket.
Even if you don't know what your bucket's id is, you should be able to figure
out which is which with good certainty after you review your list.

B is our object name. We generate UUIDs for each object, so I can't be
certain how much of this is Ceph and how much is us, but the tail of your
object name should exist and be the same across all of your parts.

C is the suffix for each object. You may have suffixes like the ones above.

D is your upload chunks.

E is your shadow chunks for each part of the multipart (I think).

I'm sure it's much more complicated than that, but that's what worked for
me. From here I just scanned through all of my OSDs, slowly pulled all of
the individual parts via ssh, and concatenated them to their respective
files. So far the md5 sums match the md5s we took of the files before
uploading them to Ceph in the first place. A rough sketch of that loop is below.
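This is a hypothetical sketch only: the host names, object name, and paths are
placeholders, and the concatenation order in particular is something you must
work out from your own listing (head first, then each part/shadow chunk in
part-number order, one replica of each) before trusting the result:

#!/bin/bash
# Hypothetical sketch: pull one object's pieces off every OSD node over ssh
# and stitch them back together. Adapt hosts, object name and paths.
OSD_HOSTS="osd1 osd2 osd3"
OBJECT="this_is_my_File.gzip"
OUT="/recovery/${OBJECT}"

# Build a "host path" list of every on-disk piece of the object.
: > parts.txt
for host in $OSD_HOSTS; do
    ssh "$host" "find /var/lib/ceph/osd/ceph-*/current/ -type f -iname '*${OBJECT}*'" \
        | sed "s/^/${host} /" >> parts.txt
done

# Review parts.txt by hand, copy one replica of each piece in the right order
# into parts.ordered.txt, then concatenate them and check the checksum.
while read -r host path; do
    ssh -n "$host" "cat \"$path\"" >> "$OUT"
done < parts.ordered.txt

md5sum "$OUT"   # compare against the checksum taken before the upload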

We have a Python tool to do this, but it's kind of specific to us. I can ask
the author and see if I can post a gist of the code if that helps. Please
let me know.



I can't speak for CephFS, unfortunately, as we do not use it, but I wouldn't
be surprised if it is similar. If you set up ssh keys across all of your
OSD nodes, you should be able to export all of the data to another
server/cluster/etc.
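For instance, something as simple as this (the hosts and the destination are
placeholders) will mirror each OSD's data directory to a recovery box once
the keys are in place:

# Pull every OSD's filestore contents over to a recovery server.
for host in osd1 osd2 osd3; do
    rsync -aH "${host}:/var/lib/ceph/osd/" "/recovery/${host}/"
done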


I am working on trying to rebuild the leveldb store for our monitors with the
correct keys/values, but I have a feeling this is going to be a long way off.
I wouldn't be surprised if the leveldb structure for the mon database is
similar to the OSD omap database.
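If it helps, the same ceph-kvstore-tool from point 2 can be pointed at the
mon's store as well (work on the copy, not the live one; the path below is my
earlier example backup location, and the caveat about syntax differing between
releases applies here too):

# Dump the keys that survived in the mon's leveldb store.
ceph-kvstore-tool /root/backup/mon-store.db list | less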

On Wed, Aug 17, 2016 at 4:54 PM, Dan Jakubiec <dan.jakub...@gmail.com>
wrote:

> Hi Wido,
>
> Thank you for the response:
>
> > On Aug 17, 2016, at 16:25, Wido den Hollander <w...@42on.com> wrote:
> >
> >
> >> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec <
> dan.jakub...@gmail.com>:
> >>
> >>
> >> Hello, we have a Ceph cluster with 8 OSD that recently lost power to
> all 8 machines.  We've managed to recover the XFS filesystems on 7 of the
> machines, but the OSD service is only starting on 1 of them.
> >>
> >> The other 5 machines all have complaints similar to the following:
> >>
> >>      2016-08-17 09:32:15.549588 7fa2f4666800 -1
> filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb :
> Corruption: 6 missing files; e.g.: /var/lib/ceph/osd/ceph-1/
> current/omap/042421.ldb
> >>
> >> How can we repair the leveldb to allow the OSDs to startup?
> >>
> >
> > My first question would be: How did this happen?
> >
> > What hardware are you using underneath? Is there a RAID controller which
> is not flushing properly? Since this should not happen during a power
> failure.
> >
>
> Each OSD drive is connected to an onboard hardware RAID controller and
> configured in RAID 0 mode as individual virtual disks.  The RAID controller
> is an LSI 3108.
>
> I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine)
> did not survive the power outage.
>
> We did have some problems with the stock Ubuntu xfs_repair (3.1.9) seg
> faulting, which eventually we overcame by building a newer version of
> xfs_repair (4.7.0).  But it did finally repair clean.
>
> We actually have some different errors on other OSDs.  A few of them are
> failing with "Missing map in load_pgs" errors.  But generally speaking it
> appears to be missing files of various types causing different kinds of
> failures.
>
> I'm really nervous now about the OSD's inability to start with any
> inconsistencies and no repair utilities (that I can find).  Any advice on
> how to recover?
>
> > I don't know the answer to your question, but lost files are not good.
> >
> > You might find them in a lost+found directory if XFS repair worked?
> >
>
> Sadly this directory is empty.
>
> -- Dan
>
> > Wido
> >
> >> Thanks,
> >>
> >> -- Dan J
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
