Hello Justin, Please do:
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

Then wait for a crash. Please upload the log.

To restore your file system:

ceph config set mds mds_abort_on_newly_corrupt_dentry false

Let the MDS purge the strays and then try:

ceph config set mds mds_abort_on_newly_corrupt_dentry true

On Tue, May 23, 2023 at 7:04 PM Justin Li <justin...@deakin.edu.au> wrote:
>
> Hi Patrick,
>
> Sorry to keep bothering you, but I found that the MDS service keeps crashing
> even though the cluster shows the MDS as up. I have attached another log from
> the MDS server eowyn below. I look forward to hearing more insights. Thanks a
> lot.
>
> https://drive.google.com/file/d/1nD_Ks7fNGQp0GE5Q_x8M57HldYurPhuN/view?usp=sharing
>
> MDS crashed:
> root@eowyn:~# systemctl status ceph-mds@eowyn
> ● ceph-mds@eowyn.service - Ceph metadata server daemon
>      Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
>      Active: failed (Result: signal) since Wed 2023-05-24 08:55:12 AEST; 24s ago
>     Process: 44349 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id eowyn --setuser ceph --setgroup ceph (code=kill>
>    Main PID: 44349 (code=killed, signal=ABRT)
>
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Scheduled restart job, restart counter is at 3.
> May 24 08:55:12 eowyn systemd[1]: Stopped Ceph metadata server daemon.
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Start request repeated too quickly.
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Failed with result 'signal'.
> May 24 08:55:12 eowyn systemd[1]: Failed to start Ceph metadata server daemon.
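
The status output above shows that the unit hit systemd's start limit ("Start request repeated too quickly"), so systemd will probably not restart the daemon by itself once the config change is in place. A minimal sketch of the sequence, assuming the unit name `ceph-mds@eowyn` from that output. The `run` wrapper, the `DRY_RUN` variable and the function name are hypothetical helpers for previewing the commands and are not part of Ceph or systemd:

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run() and restart_mds_after_start_limit() are
# hypothetical helpers. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_mds_after_start_limit() {
    unit="$1"   # assumption: e.g. ceph-mds@eowyn, as in the status output

    # Stop the MDS from aborting on newly corrupt dentries (command from this thread).
    run ceph config set mds mds_abort_on_newly_corrupt_dentry false

    # Clear the failed state, which also resets systemd's start rate limit counter.
    run systemctl reset-failed "$unit"

    # Start the daemon again.
    run systemctl start "$unit"
}
```

Calling `restart_mds_after_start_limit ceph-mds@eowyn` prints the three commands; setting `DRY_RUN=0` would execute them, which should only be done on the MDS host after checking that the unit name matches.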
>
> Part of MDS log on eowyn (MDS server):
>     -2> 2023-05-24T08:55:11.854+1000 7f1f8ee93700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x100/stray0/1005480d3ac [19ce,head] auth (dversion lock) pv=2154265085 v=2154265074 ino=0x1005480d3ac state=1342177316 | purging=1 0x55b04517ca00]
>     -1> 2023-05-24T08:55:11.858+1000 7f1f8ee93700 -1 /build/ceph-16.2.13/src/mds/CDentry.cc: In function 'bool CDentry::check_corruption(bool)' thread 7f1f8ee93700 time 2023-05-24T08:55:11.858329+1000
> /build/ceph-16.2.13/src/mds/CDentry.cc: 697: ceph_abort_msg("abort() called")
>
>  ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f1f99404495]
>  2: (CDentry::check_corruption(bool)+0x86b) [0x55b02652991b]
>  3: (StrayManager::_purge_stray_purged(CDentry*, bool)+0xc64) [0x55b026480ed4]
>  4: (MDSContext::complete(int)+0x61) [0x55b026601471]
>  5: (MDSIOContextBase::complete(int)+0x4fc) [0x55b026601b9c]
>  6: (Finisher::finisher_thread_entry()+0x19d) [0x7f1f994b8c6d]
>  7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1f99146609]
>  8: clone()
>
>
> Justin Li
> Senior Technical Officer
> School of Information Technology
> Faculty of Science, Engineering and Built Environment
> For ICT Support please see https://www.deakin.edu.au/sebeicthelp
>
> Deakin University
> Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
> +61 3 9246 8932
> justin...@deakin.edu.au
> http://www.deakin.edu.au/
> Deakin University CRICOS Provider Code 00113B
>
> Important Notice: The contents of this email are intended solely for the
> named addressee and are confidential; any unauthorised use, reproduction or
> storage of the contents is expressly prohibited.
> If you have received this email in error, please delete it and any
> attachments immediately and advise the sender by return email or telephone.
>
> Deakin University does not warrant that this email and any attachments are
> error or virus free.
>
> -----Original Message-----
> From: Justin Li
> Sent: Wednesday, May 24, 2023 8:25 AM
> To: Patrick Donnelly <pdonn...@redhat.com>
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
>
> Sorry Patrick, my last email was restricted because of its attachment size.
> Here is a link for you to download the log. Thanks.
> https://drive.google.com/drive/folders/1bV_X7vyma_-gTfLrPnEV27QzsdmgyK4g?usp=sharing
>
> -----Original Message-----
> From: Justin Li
> Sent: Wednesday, May 24, 2023 8:21 AM
> To: Patrick Donnelly <pdonn...@redhat.com>
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
>
> Hi Patrick,
>
> I attached two logs here. Those two servers are one of the monitors and MDSs.
> Let me know if you need more logs. Thanks.
>
> -----Original Message-----
> From: Patrick Donnelly <pdonn...@redhat.com>
> Sent: Wednesday, May 24, 2023 7:35 AM
> To: Justin Li <justin...@deakin.edu.au>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] [Help appreciated] ceph mds damaged
>
> Hello Justin,
>
> On Tue, May 23, 2023 at 4:55 PM Justin Li <justin...@deakin.edu.au> wrote:
> >
> > Dear All,
> >
> > After an unsuccessful upgrade to Pacific, the MDS daemons went offline and
> > could not be brought back up. I checked the MDS log and found the messages
> > below. Cluster info is included below as well. I would appreciate it if
> > anyone could point me in the right direction.
> > Thanks.
> >
> > MDS log:
> >
> > 2023-05-24T06:21:36.831+1000 7efe56e7d700  1 mds.0.cache.den(0x600 1005480d3b2) loaded already corrupt dentry: [dentry #0x100/stray0/1005480d3b2 [19ce,head] rep@0,-2.0 NULL (dversion lock) pv=0 v=2154265030 ino=(nil) state=0 0x556433addb80]
> >
> >     -5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage notify_dentry Damage to dentries in fragment * of ino 0x600is fatal because it is a system directory for this rank
> >
> >     -4> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco set_want_state: up:active -> down:damaged
> >
> >     -3> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco Sending beacon down:damaged seq 5339
> >
> >     -2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient: _send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
> >
> >     -1> 2023-05-24T06:21:37.659+1000 7efe60690700  5 mds.beacon.posco received beacon reply down:damaged seq 5339 rtt 0.827966
> >
> >      0> 2023-05-24T06:21:37.659+1000 7efe56e7d700  1 mds.posco respawn!
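
The log excerpts in this thread name each affected entry after the literal text `[dentry `. As a rough sketch, the affected stray paths could be collected from a full MDS log with grep and sed. The helper name `list_corrupt_dentries` is hypothetical, and the pattern assumes that each log entry sits on one line in the log file and follows the format of the excerpts quoted here:

```shell
#!/bin/sh
# Hypothetical helper: print the unique dentry paths that the MDS reported
# as corrupt ("loaded already corrupt dentry" or "newly corrupt dentry").
# Assumes one log entry per line, formatted like the excerpts in this thread.
list_corrupt_dentries() {
    grep 'corrupt dentry' "$1" \
        | sed -n 's/.*\[dentry \(#[^ ]*\) .*/\1/p' \
        | sort -u
}
```

For example, `list_corrupt_dentries /var/log/ceph/ceph-mds.eowyn.log` would list one path per line; that log path is an assumption based on the default naming and may differ on your hosts.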
> >
> > Cluster info:
> > root@ceph-1:~# ceph -s
> >   cluster:
> >     id:     e2b93a76-2f97-4b34-8670-727d6ac72a64
> >     health: HEALTH_ERR
> >             1 filesystem is degraded
> >             1 filesystem is offline
> >             1 mds daemon damaged
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
> >     mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
> >     mds: 0/1 daemons up, 3 standby
> >     osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
> >
> >   data:
> >     volumes: 0/1 healthy, 1 recovering; 1 damaged
> >     pools:   4 pools, 4161 pgs
> >     objects: 230.30M objects, 276 TiB
> >     usage:   836 TiB used, 460 TiB / 1.3 PiB avail
> >     pgs:     4138 active+clean
> >              13   active+clean+scrubbing
> >              10   active+clean+scrubbing+deep
> >
> >
> > root@ceph-1:~# ceph health detail
> > HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
> > [WRN] FS_DEGRADED: 1 filesystem is degraded
> >     fs cephfs is degraded
> > [ERR] MDS_ALL_DOWN: 1 filesystem is offline
> >     fs cephfs is offline because no MDS is active for it.
> > [ERR] MDS_DAMAGE: 1 mds daemon damaged
> >     fs cephfs mds.0 is damaged
>
> Do you have a complete log you can share? Try:
>
> https://docs.ceph.com/en/quincy/man/8/ceph-post-file/
>
> To get your upgrade to complete, you may set:
>
> ceph config set mds mds_go_bad_corrupt_dentry false
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io