Hello Justin,

Please raise the MDS debug levels:
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

Then wait for the next crash and upload the resulting MDS log.
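
The debug steps above can be scripted. The following is only a sketch, not an official procedure: the function names and the CEPH override variable are hypothetical, introduced here so the commands can be previewed with CEPH=echo before they touch a cluster. It assumes that removing the overrides with "ceph config rm" is an acceptable way to return both options to their defaults once the log has been captured.

```shell
#!/bin/sh
# Sketch only. CEPH is a hypothetical override: set CEPH=echo to print
# the commands instead of running them against the cluster.
CEPH="${CEPH:-ceph}"

# Raise MDS and messenger debug levels, as requested in this thread.
raise_mds_debug() {
    $CEPH config set mds debug_mds 20
    $CEPH config set mds debug_ms 1
}

# Assumption: dropping the overrides returns both options to their defaults.
reset_mds_debug() {
    $CEPH config rm mds debug_mds
    $CEPH config rm mds debug_ms
}
```

A possible flow: call raise_mds_debug, wait for the crash, upload the log (ceph-post-file is one option), then call reset_mds_debug so the MDS stops logging at level 20.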

To restore your file system:

ceph config set mds mds_abort_on_newly_corrupt_dentry false

Let the MDS purge the strays, then re-enable the check:

ceph config set mds mds_abort_on_newly_corrupt_dentry true
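
The restore sequence above could be wrapped as follows. This is a rough sketch under stated assumptions: the function names and the CEPH override variable are hypothetical, and the stray_count helper assumes the MDS perf counters expose a "num_strays" value in their JSON output (for example from "ceph daemon mds.eowyn perf dump" run on the MDS host). Please verify that against your own release before relying on it.

```shell
#!/bin/sh
# Sketch only. CEPH is a hypothetical override: set CEPH=echo to print
# the commands instead of running them against the cluster.
CEPH="${CEPH:-ceph}"

# Step 1: stop the MDS from aborting on newly corrupt dentries.
allow_stray_purge() {
    $CEPH config set mds mds_abort_on_newly_corrupt_dentry false
}

# Step 2 (helper): read perf-dump JSON on stdin and print the first
# "num_strays" value. Assumption: the counter appears under that name.
stray_count() {
    sed -n 's/.*"num_strays"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Step 3: turn the corruption check back on once the strays are purged.
restore_corruption_abort() {
    $CEPH config set mds mds_abort_on_newly_corrupt_dentry true
}
```

One way to use it: run allow_stray_purge, pipe the perf dump through stray_count every so often until the number stops falling, then run restore_corruption_abort.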

On Tue, May 23, 2023 at 7:04 PM Justin Li <justin...@deakin.edu.au> wrote:
>
> Hi Patrick,
>
> Sorry to keep bothering you, but I found that the MDS service keeps crashing
> even though the cluster shows the MDS as up. I have linked another log from
> the MDS server (eowyn) below. I look forward to hearing more insights. Thanks a lot.
>
> https://drive.google.com/file/d/1nD_Ks7fNGQp0GE5Q_x8M57HldYurPhuN/view?usp=sharing
>
> MDS crashed:
> root@eowyn:~# systemctl status  ceph-mds@eowyn
> ● ceph-mds@eowyn.service - Ceph metadata server daemon
>      Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
>      Active: failed (Result: signal) since Wed 2023-05-24 08:55:12 AEST; 24s ago
>     Process: 44349 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id eowyn --setuser ceph --setgroup ceph (code=kill>
>    Main PID: 44349 (code=killed, signal=ABRT)
>
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Scheduled restart job, restart counter is at 3.
> May 24 08:55:12 eowyn systemd[1]: Stopped Ceph metadata server daemon.
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Start request repeated too quickly.
> May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Failed with result 'signal'.
> May 24 08:55:12 eowyn systemd[1]: Failed to start Ceph metadata server daemon.
>
>
> Part of MDS log on eowyn (MDS server):
>    -2> 2023-05-24T08:55:11.854+1000 7f1f8ee93700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x100/stray0/1005480d3ac [19ce,head] auth (dversion lock) pv=2154265085 v=2154265074 ino=0x1005480d3ac state=1342177316 | purging=1 0x55b04517ca00]
>     -1> 2023-05-24T08:55:11.858+1000 7f1f8ee93700 -1 /build/ceph-16.2.13/src/mds/CDentry.cc: In function 'bool CDentry::check_corruption(bool)' thread 7f1f8ee93700 time 2023-05-24T08:55:11.858329+1000
> /build/ceph-16.2.13/src/mds/CDentry.cc: 697: ceph_abort_msg("abort() called")
>
>  ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f1f99404495]
>  2: (CDentry::check_corruption(bool)+0x86b) [0x55b02652991b]
>  3: (StrayManager::_purge_stray_purged(CDentry*, bool)+0xc64) [0x55b026480ed4]
>  4: (MDSContext::complete(int)+0x61) [0x55b026601471]
>  5: (MDSIOContextBase::complete(int)+0x4fc) [0x55b026601b9c]
>  6: (Finisher::finisher_thread_entry()+0x19d) [0x7f1f994b8c6d]
>  7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1f99146609]
>  8: clone()
>
>
>
>
> Justin Li
> Senior Technical Officer
> School of Information Technology
> Faculty of Science, Engineering and Built Environment
> For ICT Support please see https://www.deakin.edu.au/sebeicthelp
>
>
> Deakin University
> Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
> +61 3 9246 8932
> justin...@deakin.edu.au
> http://www.deakin.edu.au/
> Deakin University CRICOS Provider Code 00113B
>
> Important Notice: The contents of this email are intended solely for the 
> named addressee and are confidential; any unauthorised use, reproduction or 
> storage of the contents is expressly prohibited. If you have received this 
> email in error, please delete it and any attachments immediately and advise 
> the sender by return email or telephone.
>
> Deakin University does not warrant that this email and any attachments are 
> error or virus free.
>
> -----Original Message-----
> From: Justin Li
> Sent: Wednesday, May 24, 2023 8:25 AM
> To: Patrick Donnelly <pdonn...@redhat.com>
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
>
> Sorry Patrick, my last email was blocked because of the attachment size. Here
> is a link where you can download the log. Thanks.
> https://drive.google.com/drive/folders/1bV_X7vyma_-gTfLrPnEV27QzsdmgyK4g?usp=sharing
>
>
> -----Original Message-----
> From: Justin Li
> Sent: Wednesday, May 24, 2023 8:21 AM
> To: Patrick Donnelly <pdonn...@redhat.com>
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] [Help appreciated] ceph mds damaged
>
> Hi Patrick,
>
> I have attached two logs here. Both servers run a monitor and an MDS.
> Let me know if you need more logs. Thanks.
>
>
> -----Original Message-----
> From: Patrick Donnelly <pdonn...@redhat.com>
> Sent: Wednesday, May 24, 2023 7:35 AM
> To: Justin Li <justin...@deakin.edu.au>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] [Help appreciated] ceph mds damaged
>
> Hello Justin,
>
> On Tue, May 23, 2023 at 4:55 PM Justin Li <justin...@deakin.edu.au> wrote:
> >
> > Dear All,
> >
> > After an unsuccessful upgrade to Pacific, the MDS daemons went offline and
> > could not be brought back up. I checked the MDS log and found the messages
> > below; the cluster info follows as well. I would appreciate it if anyone
> > could point me in the right direction. Thanks.
> >
> >
> > MDS log:
> >
> > 2023-05-24T06:21:36.831+1000 7efe56e7d700  1 mds.0.cache.den(0x600
> > 1005480d3b2) loaded already corrupt dentry: [dentry
> > #0x100/stray0/1005480d3b2 [19ce,head] rep@0,-2.0
> > NULL (dversion lock) pv=0 v=2154265030 ino=(nil) state=0
> > 0x556433addb80]
> >
> >     -5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage
> > notify_dentry Damage to dentries in fragment * of ino 0x600is fatal
> > because it is a system directory for this rank
> >
> >     -4> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco
> > set_want_state: up:active -> down:damaged
> >
> >     -3> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco
> > Sending beacon down:damaged seq 5339
> >
> >     -2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient:
> > _send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
> >
> >     -1> 2023-05-24T06:21:37.659+1000 7efe60690700  5 mds.beacon.posco
> > received beacon reply down:damaged seq 5339 rtt 0.827966
> >
> >      0> 2023-05-24T06:21:37.659+1000 7efe56e7d700  1 mds.posco respawn!
> >
> >
> > Cluster info:
> > root@ceph-1:~# ceph -s
> >   cluster:
> >     id:     e2b93a76-2f97-4b34-8670-727d6ac72a64
> >     health: HEALTH_ERR
> >             1 filesystem is degraded
> >             1 filesystem is offline
> >             1 mds daemon damaged
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
> >     mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
> >     mds: 0/1 daemons up, 3 standby
> >     osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
> >
> >   data:
> >     volumes: 0/1 healthy, 1 recovering; 1 damaged
> >     pools:   4 pools, 4161 pgs
> >     objects: 230.30M objects, 276 TiB
> >     usage:   836 TiB used, 460 TiB / 1.3 PiB avail
> >     pgs:     4138 active+clean
> >              13   active+clean+scrubbing
> >              10   active+clean+scrubbing+deep
> >
> >
> >
> > root@ceph-1:~# ceph health detail
> > HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
> > [WRN] FS_DEGRADED: 1 filesystem is degraded
> >     fs cephfs is degraded
> > [ERR] MDS_ALL_DOWN: 1 filesystem is offline
> >     fs cephfs is offline because no MDS is active for it.
> > [ERR] MDS_DAMAGE: 1 mds daemon damaged
> >     fs cephfs mds.0 is damaged
>
> Do you have a complete log you can share? Try:
>
> https://docs.ceph.com/en/quincy/man/8/ceph-post-file/
>
> To get your upgrade to complete, you may set:
>
> ceph config set mds mds_go_bad_corrupt_dentry false
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
