Hi Pierre,

Unfortunately it looks like we had a bug in 0.82 that could lead to
journal corruption of the sort you're seeing here.  A new journal
format was added, and on the first start after an update the MDS would
re-write the journal to the new format.  This should only have been
happening on the single active MDS for a given rank, but it was
actually being done by standby-replay MDS daemons too.  As a result,
if there were standby-replay daemons configured, they could try to
rewrite the journal at the same time as the active MDS, resulting in a
corrupt journal.

In your case, I think the probability of the condition occurring was
increased by the OSD issues you were having, because at some earlier
stage the rewrite process had been stopped partway through.  Without
standby MDSs this would be recovered from cleanly, but with the
standbys in play the danger of corruption is high while the journal is
in the partly-rewritten state.

The ticket is here: http://tracker.ceph.com/issues/8811
The candidate fix is here: https://github.com/ceph/ceph/pull/2115

If you have recent backups then I would suggest recreating the
filesystem and restoring from backups.  You can also try using the
"cephfs-journal-tool journal reset" command, which will wipe out the
journal entirely, losing the most recent writes to the filesystem and
potentially leaving some stray objects in the data pool.
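
In case it helps, here is a rough sketch of the reset path as a script.
It uses only the inspect, export and reset subcommands already mentioned
in this thread.  The DRY_RUN switch, the run/recover_journal helpers and
the backup file name are my own illustrative choices, and I am assuming
the MDS daemons are stopped before it is run for real.  By default it
only prints what it would do.

```shell
#!/bin/sh
# Sketch only, not an official procedure. Assumes all MDS daemons are stopped.
# DRY_RUN=1 (the default here) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
BACKUP="${BACKUP:-ceph-journal.bin}"   # hypothetical backup file name

# Helper (illustrative): echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_journal() {
    # 1. Report the journal's integrity; this step does not modify anything.
    run cephfs-journal-tool journal inspect
    # 2. Try to keep a copy of the damaged journal. The export can fail on
    #    a corrupt header, so carry on even if it does.
    run cephfs-journal-tool journal export "$BACKUP" \
        || echo "export failed; continuing without a backup"
    # 3. Destructive: wipe the journal, losing the most recent writes.
    run cephfs-journal-tool journal reset
}

recover_journal
```

Setting DRY_RUN=0 makes the script execute the commands rather than
print them; only do that once you have decided that losing the journal
contents is acceptable.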

Sorry that this has bitten you.  Even though 0.82 was not a named
release, this was a pretty nasty bug to let out there, and I'm going to
improve our automated tests in this area.


On Wed, Jul 16, 2014 at 11:57 PM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:
> On 16/07/2014 22:40, Gregory Farnum wrote:
>> On Wed, Jul 16, 2014 at 6:21 AM, Pierre BLONDEAU
>> <pierre.blond...@unicaen.fr> wrote:
>>> Hi,
>>> After the repair process, I have:
>>> 1926 active+clean
>>>     2 active+clean+inconsistent
>>> These two PGs seem to be on the same OSD (#34):
>>> # ceph pg dump | grep inconsistent
>>> dumped all in format plain
>>> 0.2e    4       0       0       0       8388660 4       4
>>> active+clean+inconsistent       2014-07-16 11:39:43.819631      9463'4
>>> 438411:133968   [34,4]  34      [34,4]  34      9463'4  2014-07-16
>>> 04:52:54.417333      9463'4  2014-07-11 09:29:22.041717
>>> 0.1ed   5       0       0       0       8388623 10      10
>>> active+clean+inconsistent       2014-07-16 11:39:45.820142      9712'10
>>> 438411:144792   [34,2]  34      [34,2]  34      9712'10 2014-07-16
>>> 09:12:44.742488      9712'10 2014-07-10 21:57:11.345241
>>> Could this explain why my MDS won't start? If I remove (or shut down)
>>> this OSD, would that solve my problem?
>> You want to figure out why they're inconsistent (if they're still
>> going inconsistent, or maybe just need to be repaired), but this
>> shouldn't be causing your MDS troubles.
>> Can you dump the MDS journal and put it somewhere accessible? (You can
>> use ceph-post-file to upload it.) John has been trying to reproduce
>> this crash but hasn't succeeded yet.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> Hi,
> I tried to run:
> cephfs-journal-tool journal export ceph-journal.bin 2> cephfs-journal-tool.log
> But the program crashed. I uploaded the log file:
> e069c6ac-3cb4-4a52-8950-da7c600e2b01
> There is a mistake in
> http://ceph.com/docs/master/cephfs/cephfs-journal-tool/ in "Example: journal
> inspect". The correct syntax seems to be:
> # cephfs-journal-tool  journal inspect
> 2014-07-17 00:54:14.155382 7ff89d239780 -1 Header is invalid (inconsistent
> offsets)
> Overall journal integrity: DAMAGED
> Header could not be decoded
> Regards
> --
> ----------------------------------------------
> Administrateur Systèmes & réseaux
> Université de Caen
> Laboratoire GREYC, Département d'informatique
> tel     : 02 31 56 75 42
> bureau  : Campus 2, Science 3, 406
> ----------------------------------------------
ceph-users mailing list
