On 5/16/23 00:33, Frank Schilder wrote:
Dear Xiubo,

I uploaded the cache dump, the MDS log and the dmesg log containing the 
snaptrace dump to

ceph-post-file: 763955a3-7d37-408a-bbe4-a95dc687cd3f

Okay, thanks.


Sorry, I forgot to add user and description this time.
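For the record, ceph-post-file accepts a user and a description as options, roughly like this (a sketch; the description text is only an example):

    ceph-post-file -u "Frank Schilder" -d "MDS cache dump, MDS log, dmesg snaptrace dump" <files...>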

A question about troubleshooting. I'm pretty sure I know the path where the error is 
located. Would a "ceph tell mds.1 scrub start / recursive repair" be able to 
discover and fix broken snaptraces? If not, I'm awaiting further instructions.

I'm not very sure.

I haven't checked this in detail yet.
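For reference, a recursive repair scrub like the one mentioned above is usually started and monitored along these lines (a sketch only; "con-fs2" is taken from the mds_namespace mount option quoted further down, the fsname:rank addressing and comma-separated scrub options follow the upstream docs, and whether such a scrub actually detects broken snaptraces is exactly the open question here):

    # start a recursive scrub with repair from the filesystem root on rank 0
    ceph tell mds.con-fs2:0 scrub start / recursive,repair
    # check the progress of the running scrub
    ceph tell mds.con-fs2:0 scrub status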

Thanks



Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiu...@redhat.com>
Sent: Friday, May 12, 2023 3:44 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system


On 5/12/23 20:27, Frank Schilder wrote:
Dear Xiubo and others.

I have never heard about that option until now. How do I check that, and how do 
I disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.
$ mount|grep ceph
I get

MON-IPs:SRC on DST type ceph 
(rw,relatime,name=con-fs2-rit-pfile,secret=<hidden>,noshare,acl,mds_namespace=con-fs2,_netdev)

so async dirop seems disabled.
Yeah.
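For context, the kernel client controls async dirops with the wsync/nowsync mount options; if the feature were enabled, the mount line would be expected to show "nowsync", roughly like this (a sketch, not the actual output):

    MON-IPs:SRC on DST type ceph (rw,relatime,name=con-fs2-rit-pfile,...,nowsync,...)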


Yeah, the kclient just received a corrupted snaptrace from the MDS.
So the first thing is to fix the corrupted snaptrace issue in cephfs 
and then continue.
Ooookaaayyyy. I will take it as a compliment that you seem to assume I know how 
to do that. The documentation gives 0 hits. Could you please provide me with 
instructions on what to look for and/or what to do first?
There is no documentation about this as far as I know.

If possible, you can parse the above corrupted snap message to check what 
exactly is corrupted.
I haven't had a chance to do that.
Again, how would I do that? Is there some documentation and what should I 
expect?
Currently there is no easy way to do this as far as I know; last time I
parsed the corrupted binary data into the corresponding message manually.

And then we could know what exactly has happened for the snaptrace.


It seems you didn't enable the 'osd blocklist' cephx auth cap for mon:
I can't find anything about an osd blocklist client auth cap in the documentation. Is 
this something that came after octopus? Our caps are as shown in the documentation for a 
ceph fs client (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
"allow r":

          caps mds = "allow rw path=/shares"
          caps mon = "allow r"
          caps osd = "allow rw tag cephfs data=con-fs2"
Yeah, it seems the 'osd blocklist' cap was disabled. As I remember, if
enabled it should be something like:

caps mon = "allow r, allow command \"osd blocklist\""
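If the cap needs to be added, it can be set with "ceph auth caps" (a sketch; the client name client.con-fs2-rit-pfile is only inferred from the name= mount option above and may differ, and the existing mds/osd caps must be repeated because the command replaces all caps for the entity):

    ceph auth caps client.con-fs2-rit-pfile \
        mds "allow rw path=/shares" \
        mon "allow r, allow command \"osd blocklist\"" \
        osd "allow rw tag cephfs data=con-fs2"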

I checked that, but by reading the code I couldn't tell what had caused the MDS 
crash.
Something seems to have corrupted the metadata in cephfs.
He wrote something about an invalid xattrib (empty value). It would be really helpful to 
get a clue how to proceed. I managed to dump the MDS cache with the critical inode in 
cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20 
during a crash caused by an "mds dump inode" command. Would this contain 
something interesting? I can also pull the rados objects out and can upload all of these 
files.
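For completeness, the kind of commands used to collect that material look roughly like this (a sketch; the MDS rank, inode number, object name and the metadata pool name "con-fs2-metadata" are placeholders/assumptions):

    # raise MDS debug logging before reproducing the crash
    ceph tell mds.0 config set debug_mds 20
    # dump the MDS cache to a file on the MDS host
    ceph tell mds.0 dump cache /tmp/mds-cache.txt
    # the command that triggers the crash for the affected inode
    ceph tell mds.0 dump inode <inode-number>
    # pull a raw metadata object out of the metadata pool
    rados -p con-fs2-metadata get <object-name> /tmp/object.bin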
Yeah, possibly. Where are the logs?


I managed to track the problem down to a specific folder with a few files (I'm not sure 
if this coincides with the snaptrace issue, we might have 2 issues here). I made a copy 
of the folder and checked that an "mds dump inode" for the copy does not crash 
the MDS. I then moved the folders for which this command causes a crash to a different 
location outside the mounts. Do you think this will help? I'm wondering whether, after taking 
our daily snapshot tomorrow, we will end up in the degraded situation again.

I really need instructions for how to check what is broken without an MDS crash 
and then how to fix it.
First we need to know where the corrupted metadata is.

I think the MDS debug logs and the above corrupted snaptrace could help. We
need to parse that corrupted binary data.

Thanks

