Hi Zheng,

Thanks, that made me realise I had forgotten to remove some 'temporary-key' omap keys left over from the inconsistency issue I had. Once those were removed, the MDS started again.
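
In case someone else hits this: the stray keys can be listed and removed with the rados CLI, along these lines (the pool and object names below are placeholders for my setup; the dirfrag object is the one reported in the mds log):

    # list the omap keys of the affected dirfrag object
    rados -p cephfs_metadata listomapkeys <dirfrag object>
    # remove the leftover key that does not match the <name>_<snapid> format
    rados -p cephfs_metadata rmomapkey <dirfrag object> temporary-key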

Thanks again!

Kenneth

On 12/10/2019 04:26, Yan, Zheng wrote:


On Sat, Oct 12, 2019 at 1:10 AM Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:

    Hi all,

    After solving some pg inconsistency problems, my fs is still in
    trouble. My MDSs are crashing with this error:


    >     -5> 2019-10-11 19:02:55.375 7f2d39f10700  1 mds.1.564276 rejoin_start
    >     -4> 2019-10-11 19:02:55.385 7f2d3d717700  5 mds.beacon.mds01 received beacon reply up:rejoin seq 5 rtt 1.01
    >     -3> 2019-10-11 19:02:55.495 7f2d39f10700  1 mds.1.564276 rejoin_joint_start
    >     -2> 2019-10-11 19:02:55.505 7f2d39f10700  5 mds.mds01 handle_mds_map old map epoch 564279 <= 564279, discarding
    >     -1> 2019-10-11 19:02:55.695 7f2d33f04700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: In function 'static void dentry_key_t::decode_helper(std::string_view, std::string&, snapid_t&)' thread 7f2d33f04700 time 2019-10-11 19:02:55.703343
    > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: 1229: FAILED ceph_assert(i != string::npos)
    >
    >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
    >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f2d43393046]
    >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f2d43393214]
    >  3: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ecbaa8]
    >  4: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
    >  5: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
    >  6: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
    >  7: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
    >  8: (()+0x7dd5) [0x7f2d41262dd5]
    >  9: (clone()+0x6d) [0x7f2d3ff1302d]
    >
    >      0> 2019-10-11 19:02:55.695 7f2d33f04700 -1 *** Caught signal (Aborted) **
    >  in thread 7f2d33f04700 thread_name:fn_anonymous
    >
    >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
    >  1: (()+0xf5d0) [0x7f2d4126a5d0]
    >  2: (gsignal()+0x37) [0x7f2d3fe4b2c7]
    >  3: (abort()+0x148) [0x7f2d3fe4c9b8]
    >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f2d43393095]
    >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f2d43393214]
    >  6: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ecbaa8]
    >  7: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
    >  8: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
    >  9: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
    >  10: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
    >  11: (()+0x7dd5) [0x7f2d41262dd5]
    >  12: (clone()+0x6d) [0x7f2d3ff1302d]
    >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    >
    > [root@mds02 ~]# ceph -s
    >   cluster:
    >     id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
    >     health: HEALTH_WARN
    >             1 filesystem is degraded
    >             insufficient standby MDS daemons available
    >             1 MDSs behind on trimming
    >             1 large omap objects
    >
    >   services:
    >     mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
    >     mgr: mds02(active, since 3w), standbys: mds01, mds03
    >     mds: ceph_fs:2/2 {0=mds02=up:rejoin,1=mds01=up:rejoin(laggy or crashed)}
    >     osd: 535 osds: 533 up, 529 in
    >
    >   data:
    >     pools:   3 pools, 3328 pgs
    >     objects: 376.32M objects, 673 TiB
    >     usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
    >     pgs:     3315 active+clean
    >              12   active+clean+scrubbing+deep
    >              1    active+clean+scrubbing
    >
    Does anyone have an idea where to go from here? ☺


It looks like the omap for a dirfrag is corrupted. Please check the mds log (with debug_mds = 10) to find which omap it is. Basically, all omap keys of a dirfrag should be in the format xxxx_xxxx.
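
A sketch of what that check could look like (assuming the metadata pool is named 'cephfs_metadata'; take the dirfrag object name from the mds log):

    # raise the mds debug level to see which dirfrag fetch triggers the assert
    ceph tell mds.mds01 injectargs '--debug_mds 10'
    # list the dirfrag's omap keys and print any that lack the '_' separator
    rados -p cephfs_metadata listomapkeys <dirfrag object> | grep -v '_'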


    Thanks!

    K


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
