Hi Goncalo,

In the end we ascertained that the assert was caused by reading corrupt
data in the MDS journal.  We followed the sections at the following link
(http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/) in order, down
to and including "MDS table wipes" (wiping only the "session" table in the
final step).  This resolved the problem we had with our MDS asserting.
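
For anyone else who hits this, the sequence from that document worked out
to roughly the commands below.  This is a sketch rather than a recipe:
backup.bin is just a placeholder file name, and the exact invocations
should be checked against the documentation for your release before
running anything.

    cephfs-journal-tool journal export backup.bin
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset
    cephfs-table-tool all reset session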

We have also run a CephFS scrub to validate the data (ceph daemon mds.0
scrub_path / recursive repair), which has resulted in a "metadata damage
detected" health warning.  The scrub appears to read every object in the
CephFS RADOS pools (anecdotally, the scan of the data pool completed much
faster than the scan of the metadata pool itself).

We are now working with the output of "ceph tell mds.0 damage ls", and
looking at the following mailing list post as a starting point for
proceeding with that:
http://ceph-users.ceph.narkive.com/EfFTUPyP/how-to-fix-the-mds-damaged-issue
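
For reference, the commands involved look roughly like the following (a
sketch only: the damage IDs come from the "damage ls" output, and our
assumption is that an entry should only be removed once the underlying
metadata has been repaired or deliberately discarded):

    ceph tell mds.0 damage ls
    ceph tell mds.0 damage rm <damage_id>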

Chris

On Fri, 9 Dec 2016 at 19:26 Goncalo Borges <goncalo.bor...@sydney.edu.au>
wrote:

> Hi Sean, Rob.
>
> I saw on the tracker that you were able to resolve the MDS assert by
> manually cleaning the corrupted metadata. Since I am also hitting that
> issue, and I suspect that I will face an MDS assert of the same type
> sooner or later, could you please explain a bit further what operations
> you performed to clean up the problem?
> Cheers
> Goncalo
> ________________________________________
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rob
> Pickerill [r.picker...@gmail.com]
> Sent: 09 December 2016 07:13
> To: Sean Redmond; John Spray
> Cc: ceph-users
> Subject: Re: [ceph-users] CephFS FAILED
> assert(dn->get_linkage()->is_null())
>
> Hi John / All
>
> Thank you for the help so far.
>
> To add a further point to Sean's previous email, I see this log entry
> before the assertion failure:
>
>     -6> 2016-12-08 15:47:08.483700 7fb133dca700 12
> mds.0.cache.dir(1000a453344) remove_dentry [dentry
> #100/stray9/1000a453344/config [2,head] auth NULL (dversion lock) v=540
> inode=0 0x55e8664fede0]
>     -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In
> function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700
> time 2016-12-08 15:47:08.483704
> mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>
> And I can cross-reference this with the omap keys on that object:
>
> root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys
> 1000a453344.00000000
> 1470734502_head
> config_head
>
> Would we also need to clean up this object, and if so, is there a safe way
> we can do this?
>
> Rob
>
> On Thu, 8 Dec 2016 at 19:58 Sean Redmond <sean.redmo...@gmail.com> wrote:
> Hi John,
>
> Thanks for your pointers.  I have extracted the omap keys and omap values
> for an object I found in the metadata pool called '600.00000000' and
> dropped them at the location below:
>
> https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0
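>
> For reference, those were pulled out with the rados omap commands, along
> the lines of the following (with <metadata pool> standing in for our
> actual metadata pool name):
>
>     rados -p <metadata pool> listomapkeys 600.00000000
>     rados -p <metadata pool> listomapvals 600.00000000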
>
> Could you explain how it is possible to identify stray directory fragments?
>
> Thanks
>
> On Thu, Dec 8, 2016 at 6:30 PM, John Spray <jsp...@redhat.com> wrote:
> On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond <sean.redmo...@gmail.com> wrote:
> > Hi,
> >
> > We had no changes going on with the ceph pools or ceph servers at the
> > time.
> >
> > We have, however, been hitting this in the last week and it may be related:
> >
> > http://tracker.ceph.com/issues/17177
>
> Oh, okay -- so you've got corruption in your metadata pool as a result
> of hitting that issue, presumably.
>
> I think in the past people have managed to get past this by taking
> their MDSs offline and manually removing the omap entries in their
> stray directory fragments (i.e. using the `rados` cli on the objects
> starting "600.").
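>
> As a rough sketch only (assuming the metadata pool is called "metadata",
> the damaged dentry lives in the stray dirfrag object 600.00000000, and
> backup.key is just a placeholder file name; substitute your own pool,
> object and key, stop the MDS first, and keep a copy of anything removed):
>
>     rados -p metadata listomapkeys 600.00000000
>     rados -p metadata getomapval 600.00000000 <dentry key>_head backup.key
>     rados -p metadata rmomapkey 600.00000000 <dentry key>_head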
>
> John
>
>
>
> > Thanks
> >
> > On Thu, Dec 8, 2016 at 3:34 PM, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond <sean.redmo...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have a CephFS cluster that is currently unable to start the MDS
> >> > server, as it is hitting an assert; an extract from the MDS log is
> >> > below, and any pointers are welcome:
> >> >
> >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> >
> >> > 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077 handle_mds_map
> >> > state change up:rejoin --> up:active
> >> > 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done
> >> > -- successful recovery!
> >> > 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
> >> > 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster
> >> > recovered.
> >> > 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function
> >> > 'void CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time
> >> > 2016-12-08 14:50:19.494508
> >> > mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
> >> >
> >> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> > const*)+0x80) [0x55f0f789def0]
> >> >  2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55f0f76666c0]
> >> >  3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9) [0x55f0f75e7799]
> >> >  4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f0f75e7cf2]
> >> >  5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f0f753b30d]
> >> >  6: (MDSInternalContextBase::complete(int)+0x18b) [0x55f0f76e93db]
> >> >  7: (MDSRank::_advance_queues()+0x6a7) [0x55f0f749bf27]
> >> >  8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f0f749c45a]
> >> >  9: (()+0x770a) [0x7f7da6bdc70a]
> >> >  10: (clone()+0x6d) [0x7f7da509d82d]
> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> > needed to interpret this.
> >>
> >> Last time someone had this issue they had tried to create a filesystem
> >> using pools that had another filesystem's old objects in:
> >> http://tracker.ceph.com/issues/16829
> >>
> >> What was going on on your system before you hit this?
> >>
> >> John
> >>
> >> > Thanks
> >> >
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
