On Mon, Feb 14, 2022 at 5:33 PM Arnaud MARTEL
<arnaud.mar...@i2bc.paris-saclay.fr> wrote:
>
> Hi Venky,
>
> Thanks a lot for your answer. I needed to reduce the number of running MDS
> daemons before setting debug_mds to 20 (roughly the command shown below),
> but now I was able to reproduce the crash and generate the full logfile.
> You can download it with the following link: 
> https://mycore.core-cloud.net/index.php/s/hBQA5ym3Jkh1O7l
> The logfile is not too big, so I didn't remove any lines and you will have
> the full trace. The crash caused by the unlink is at line 39484...
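>
> For reference, I collapsed the cluster to a single active MDS with something
> like the following (the filesystem name "cephfs" is just a placeholder for
> my actual filesystem):
>
>   # keep only one active MDS so a single daemon carries the debug logging
>   ceph fs set cephfs max_mds 1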
>
> I don't know how many files are "invalid" (at least 10-20), nor whether
> there is a way to identify all of them.
>
> Do you need anything else??

Nope. Thanks for the logs.

>
> Kind regards
> Arnaud
>
> ----- Mail original -----
> De: "Venky Shankar" <vshan...@redhat.com>
> À: "arnaud martel" <arnaud.mar...@i2bc.paris-saclay.fr>
> Cc: "ceph-users" <ceph-users@ceph.io>
> Envoyé: Vendredi 11 Février 2022 15:03:04
> Objet: Re: [ceph-users] MDS crash when unlink file
>
> Hi Arnaud,
>
> On Fri, Feb 11, 2022 at 2:42 PM Arnaud MARTEL
> <arnaud.mar...@i2bc.paris-saclay.fr> wrote:
> >
> > Hi,
> >
> > MDSs are crashing on my production cluster when trying to unlink some
> > files and I need help :-).
> > Looking into the log files, I identified some associated files and ran a
> > scrub on the parent directory with the force,repair,recursive options
> > (roughly the command shown below). No errors were detected, but the
> > problem persists.
> > "ceph -s" and "ceph health detail" show no errors or warnings, and my
> > main question is: what are my next steps?
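> >
> > The scrub was started with something like this ("cephfs" and the path are
> > placeholders for the actual filesystem name and parent directory):
> >
> >   ceph tell mds.cephfs:0 scrub start /path/to/parent recursive,repair,force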
> >
> >
> >
> > -3> 2022-02-11T08:36:20.647+0000 7fa372dba700 4 mds.0.server handle_client_request client_request(client.3422129:6687 unlink #0x10002191acc/gpt2_L-4_H-768_trained_pre-20_1_checkpoint_24_norm-2_norm-None_temporal-shifting-0_84_hidden-layer-0-1-2-3-4.o459077 2022-02-11T08:36:20.647472+0000 caller_uid=0, caller_gid=0{0,1001,90590,90596,90602,90610,90619,90620,90627,90636,}) v4
> > -2> 2022-02-11T08:36:20.647+0000 7fa36bdac700 5 mds.0.log _submit_thread 9994621415698~1111 : EOpen [metablob 0x10002191acc, 1 dirs], 1 open files
> > -1> 2022-02-11T08:36:20.654+0000 7fa372dba700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7fa372dba700 time 2022-02-11T08:36:20.649556+0000
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Server.cc: 7503: FAILED ceph_assert(in->first <= straydn->first)
> >
> > ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific 
> > (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x158) [0x7fa37b7decce]
> > 2: /usr/lib64/ceph/libceph-common.so.2(+0x276ee8) [0x7fa37b7deee8]
> > 3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, 
> > CDentry*)+0x106a) [0x55e4bf43331a]
> > 4: 
> > (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x4d9) 
> > [0x55e4bf437fe9]
> > 5: 
> > (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xefb)
> >  [0x55e4bf44e82b]
> > 6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) 
> > [0x55e4bf5044b3]
> > 7: (MDSContext::complete(int)+0x56) [0x55e4bf6c0906]
> > 8: (MDSCacheObject::finish_waiting(unsigned long, int)+0xce) 
> > [0x55e4bf6e26be]
> > 9: (Locker::eval_gather(SimpleLock*, bool, bool*, std::vector<MDSContext*, 
> > std::allocator<MDSContext*> >*)+0x13d6) [0x55e4bf594f66]
> > 10: (Locker::handle_file_lock(ScatterLock*, boost::intrusive_ptr<MLock 
> > const> const&)+0xed1) [0x55e4bf5a3241]
> > 11: (Locker::handle_lock(boost::intrusive_ptr<MLock const> const&)+0x1b3) 
> > [0x55e4bf5a3db3]
> > 12: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0xb4) 
> > [0x55e4bf5a7fe4]
> > 13: (MDSRank::handle_message(boost::intrusive_ptr<Message const> 
> > const&)+0xbcc) [0x55e4bf3bf38c]
> > 14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, 
> > bool)+0x7bb) [0x55e4bf3c19eb]
> > 15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> 
> > const&)+0x55) [0x55e4bf3c1fe5]
> > 16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x128) 
> > [0x55e4bf3b1f28]
> > 17: (DispatchQueue::entry()+0x126a) [0x7fa37ba1c4da]
> > 18: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fa37bacce21]
> > 19: /lib64/libpthread.so.0(+0x814a) [0x7fa37a7c514a]
> > 20: clone()
>
> You are hitting this bug: https://tracker.ceph.com/issues/38452. It
> seems to happen when an inode field gets corrupted.
>
> Would it be possible to set "debug mds = 20", trigger the crash, and share
> the log? The logs can get huge, so you may want to share an upload link if
> you are fine with that.
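>
> Something along these lines should do it (just a sketch; adjust it to how
> you normally manage config on your cluster):
>
>   # raise the MDS debug level, reproduce the failing unlink, collect the
>   # active MDS's log, then restore the default verbosity
>   ceph config set mds debug_mds 20
>   ceph config set mds debug_mds 1/5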
>
> There is a workaround mentioned in the above tracker; however, it requires
> running specific commands against certain metadata objects.
>
> >
> >
> > Arnaud
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> --
> Cheers,
> Venky
>


-- 
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
