[ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-11 Thread Gerdriaan Mulder
As a data point: we've been running Octopus (solely for CephFS) on 
Ubuntu 20.04 with 5.4.0(-122) for some time now, with packages from 
download.ceph.com.


On 11/05/2023 07.12, Szabo, Istvan (Agoda) wrote:

I can answer my own question: even the official Ubuntu repo ships Octopus by 
default, so it certainly works with kernel 5.

https://packages.ubuntu.com/focal/allpackages


-Original Message-
From: Szabo, Istvan (Agoda) 
Sent: Thursday, May 11, 2023 11:20 AM
To: Ceph Users 
Subject: [ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5

Hi,

The Octopus documentation recommends kernel 4; however, we changed our test 
cluster yesterday from CentOS 7/8 to Ubuntu 20.04.6 LTS with kernel 5.4.0-148 
and it seems to be working. I just want to make sure there aren't any caveats 
before I move to prod.

Thank you



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-11 Thread Ilya Dryomov
On Thu, May 11, 2023 at 7:13 AM Szabo, Istvan (Agoda)
 wrote:
>
> I can answer my own question: even the official Ubuntu repo ships Octopus by
> default, so it certainly works with kernel 5.
>
> https://packages.ubuntu.com/focal/allpackages
>
>
> -Original Message-
> From: Szabo, Istvan (Agoda) 
> Sent: Thursday, May 11, 2023 11:20 AM
> To: Ceph Users 
> Subject: [ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5
>
> Hi,
>
> The Octopus documentation recommends kernel 4; however, we changed our test
> cluster yesterday from CentOS 7/8 to Ubuntu 20.04.6 LTS with kernel 5.4.0-148
> and it seems to be working. I just want to make sure there aren't any caveats
> before I move to prod.

Hi Istvan,

Note that on https://docs.ceph.com/en/octopus/start/os-recommendations/
it starts with:

> If you are using the kernel client to map RBD block devices or mount CephFS,
> the general advice is to use a “stable” or “longterm maintenance” kernel
> series provided by either http://kernel.org or your Linux distribution on any
> client hosts.

The recommendation for 4.x kernels follows that just as a precaution
against folks opting to stick to something older.  If your distribution
provides 5.x or 6.x stable kernels, by all means use them!

A word of caution though: Octopus was EOLed last year.  Please consider
upgrading your cluster to a supported release -- preferably Quincy since
Pacific is scheduled to go EOL sometime this year too.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Frank Schilder
Dear Xiubo,

thanks for your reply.

> BTW, did you enable the async dirop ? Currently this is disabled by
> default in 4.18.0-486.el8.x86_64.

I have never heard about that option until now. How do I check that, and how do 
I disable it if necessary?

I'm in meetings pretty much all day and will try to send some more info later.

> Could you reproduce this by enabling the mds debug logs ?

Not right now. Our users are annoyed enough already. I first need to figure out 
how to move the troublesome inode somewhere else where I might be able to do 
something. The boot message shows up on this one file server every time. Is 
there any information about what dir/inode might be causing the issue? How 
could I reproduce this without affecting the users, say, by re-creating the 
same condition somewhere else? Any hints are appreciated.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Thursday, May 11, 2023 3:45 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

Hey Frank,

On 5/10/23 21:44, Frank Schilder wrote:
> The kernel message that shows up on boot on the file server in text format:
>
> May 10 13:56:59 rit-pfile01 kernel: WARNING: CPU: 3 PID: 34 at 
> fs/ceph/caps.c:689 ceph_add_cap+0x53e/0x550 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: Modules linked in: ceph libceph 
> dns_resolver nls_utf8 isofs cirrus drm_shmem_helper intel_rapl_msr iTCO_wdt 
> intel_rapl_common iTCO_vendor_support drm_kms_helper syscopyarea sysfillrect 
> sysimgblt fb_sys_fops pcspkr joydev virtio_net drm i2c_i801 net_failover 
> virtio_balloon failover lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc 
> sr_mod cdrom sg xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci 
> libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console 
> virtio_scsi dm_mirror dm_region_hash dm_log dm_mod fuse
> May 10 13:56:59 rit-pfile01 kernel: CPU: 3 PID: 34 Comm: kworker/3:0 Not 
> tainted 4.18.0-486.el8.x86_64 #1
> May 10 13:56:59 rit-pfile01 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 
> 1.16.0-3.module_el8.7.0+3346+68867adb 04/01/2014
> May 10 13:56:59 rit-pfile01 kernel: Workqueue: ceph-msgr ceph_con_workfn 
> [libceph]
> May 10 13:56:59 rit-pfile01 kernel: RIP: 0010:ceph_add_cap+0x53e/0x550 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: Code: c0 48 c7 c7 c0 69 7f c0 e8 6c 4c 72 
> c3 0f 0b 44 89 7c 24 04 e9 7e fc ff ff 44 8b 7c 24 04 e9 68 fe ff ff 0f 0b e9 
> c9 fc ff ff <0f> 0b e9 0a fe ff ff 0f 0b e9 12 fe ff ff 0f 0b 66 90 0f 1f 44 
> 00
> May 10 13:56:59 rit-pfile01 kernel: RSP: 0018:a4d000d87b48 EFLAGS: 
> 00010217
> May 10 13:56:59 rit-pfile01 kernel: RAX:  RBX: 
> 0005 RCX: dead0200
> May 10 13:56:59 rit-pfile01 kernel: RDX: 92d7d7f6e7d0 RSI: 
> 92d7d7f6e7d0 RDI: 92d7d7f6e7c8
> May 10 13:56:59 rit-pfile01 kernel: RBP: 92d7c5588970 R08: 
> 92d7d7f6e7d0 R09: 0001
> May 10 13:56:59 rit-pfile01 kernel: R10: 92d80078cbb8 R11: 
> 92c0 R12: 0155
> May 10 13:56:59 rit-pfile01 kernel: R13: 92d80078cbb8 R14: 
> 92d80078cbc0 R15: 0001
> May 10 13:56:59 rit-pfile01 kernel: FS:  () 
> GS:92d937d8() knlGS:
> May 10 13:56:59 rit-pfile01 kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> May 10 13:56:59 rit-pfile01 kernel: CR2: 7f74435b9008 CR3: 
> 0001099fa000 CR4: 003506e0
> May 10 13:56:59 rit-pfile01 kernel: Call Trace:
> May 10 13:56:59 rit-pfile01 kernel: ceph_handle_caps+0xdf2/0x1780 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: mds_dispatch+0x13a/0x670 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: ceph_con_process_message+0x79/0x140 
> [libceph]
> May 10 13:56:59 rit-pfile01 kernel: ? calc_signature+0xdf/0x110 [libceph]
> May 10 13:56:59 rit-pfile01 kernel: ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
> May 10 13:56:59 rit-pfile01 kernel: ceph_con_workfn+0x329/0x680 [libceph]
> May 10 13:56:59 rit-pfile01 kernel: process_one_work+0x1a7/0x360
> May 10 13:56:59 rit-pfile01 kernel: worker_thread+0x30/0x390
> May 10 13:56:59 rit-pfile01 kernel: ? create_worker+0x1a0/0x1a0
> May 10 13:56:59 rit-pfile01 kernel: kthread+0x134/0x150
> May 10 13:56:59 rit-pfile01 kernel: ? set_kthread_struct+0x50/0x50
> May 10 13:56:59 rit-pfile01 kernel: ret_from_fork+0x35/0x40
> May 10 13:56:59 rit-pfile01 kernel: ---[ end trace 84e4b3694bbe9fde ]---

BTW, did you enable the async dirop ? Currently this is disabled by
default in 4.18.0-486.el8.x86_64.

The async dirop is buggy and we have hit a very similar bug to the one above, 
please see https://tracker.ceph.com/issues/55857. This is a race between
the client requests and dir migration in the MDS and it was fixed
a long time ago.

If you didn't enable the async dirop then it should be a different issue.
But I guess this should a

[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Frank Schilder
Dear Gregory,

> I would start by looking at what xattrs exist and if there's an obvious bad
> one, deleting it.

I can't see any obvious bad ones, and I also can't just delete them; they are 
required for ACLs. I'm not convinced that any of the xattrs that can be dumped 
with 'getfattr -d -m ".*"' is the culprit, they all look fine:

# getfattr -d -m ".*" 
/mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal\ 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
security.NTACL=encoded-data-removed
security.selinux="system_u:object_r:cephfs_t:s0"
system.posix_acl_access=encoded-data-removed
user.DOSATTRIB=encoded-data-removed
user.SAMBA_PAI=encoded-data-removed

How can I inspect the file object including all hidden xattrs, for example, all 
the ceph.-xattrs? There ought to be some rados+decode way of doing that. Would 
the attrib name be in the OPS list dumped on MDS crash?
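
Something along these lines is what I have in mind, just as an untested sketch 
(assuming the usual <inode-hex>.00000000 naming of the head object, using the 
inode 0x20011d3e5cb from the MDS warning as an example, and with the data pool 
name as a placeholder):

# list RADOS-level xattrs on the file's head object in the data pool
rados -p cephfs_data listxattr 20011d3e5cb.00000000
# dump and decode the backtrace xattr
rados -p cephfs_data getxattr 20011d3e5cb.00000000 parent > /tmp/parent.bin
ceph-dencoder type inode_backtrace_t import /tmp/parent.bin decode dump_json

As far as I understand, the user xattrs themselves live in the MDS metadata 
rather than on the data objects, so this probably only shows things like the 
backtrace and layout.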

I would be grateful for any pointer you can provide.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Wednesday, May 10, 2023 6:18 PM
To: Gregory Farnum
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Hi Gregory,

using the more complicated rados way, I found the path. I assume you are 
referring to attribs I can read with getfattr. The output of a dump is:

# getfattr -d 
/mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal\ 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
user.DOSATTRIB=0sAAAFAAURIIfMCneZfdkB
user.SAMBA_PAI=0sAgSEDwABgYYejLcxAAAC/wABgYYejLcxABAAlFExABABlFExABAALPgoABABLPgoABAADEUvABABDEUvABAAllExABABllExABAAE9AqABABE9AqAA==

#
An empty line is part of the output. These look all right to me. Can you tell 
me what I should look at? I will probably reply tomorrow, my time for today is 
almost up.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Gregory Farnum 
Sent: Wednesday, May 10, 2023 4:37 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

On Wed, May 10, 2023 at 7:33 AM Frank Schilder  wrote:
>
> Hi Gregory,
>
> thanks for your reply. Yes, I forgot, I can also inspect the rados head 
> object. My bad.
>
> The empty xattr might come from a crash of the SAMBA daemon. We export to 
> windows and this uses xattrs extensively to map to windows ACLs. It might be 
> possible that a crash at an inconvenient moment left an object in this state. 
> Do you think this is possible? Would it be possible to repair that?

I'm still a little puzzled that it's possible for the system to get
into this state, so we probably will need to generate some bugfixes.
And it might just be the dump function is being naughty. But I would
start by looking at what xattrs exist and if there's an obvious bad
one, deleting it.
-Greg

>
> I will report back what I find with the low-level access. Need to head home 
> now ...
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, May 10, 2023 4:26 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: mds dump inode crashes file system
>
> This is a very strange assert to be hitting. From a code skim my best
> guess is the inode somehow has an xattr with no value, but that's just
> a guess and I've no idea how it would happen.
> Somebody recently pointed you at the (more complicated) way of
> identifying an inode path by looking at its RADOS object and grabbing
> the backtrace, which ought to let you look at the file in-situ.
> -Greg
>
>
> On Wed, May 10, 2023 at 6:37 AM Frank Schilder  wrote:
> >
> > For the "mds dump inode" command I could find the crash in the log; see 
> > below. Most of the log contents is the past OPS dump from the 3 MDS 
> > restarts that happened. It contains the 1 last OPS before the crash and 
> > I can upload the log if someone can use it. The crash stack trace somewhat 
> > truncated for readability:
> >
> > 2023-05-10T12:54:53.142+0200 7fe971ca6700  1 mds.ceph-23 Updating MDS map 
> > to version 892464 from mon.4
> > 2023-05-10T13:39:50.962+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] 
> > : client.205899841 isn't responding to mclientcaps(revoke), ino 
> > 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 61.705410 
> > seconds ago
> 

[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Frank Schilder
Dear Gregory,

sorry, forgot one question: I would like to create an intact copy of the file 
and move the "broken" one to a different location, hopefully being able to 
reproduce and debug the issue with the moved one. I need to preserve all info 
that samba attached to the file. If I do a "cp -p" to include the xattrs in the 
copy command, will this create a clean copy or likely reproduce the issue?
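
For concreteness, what I would run is roughly this (assuming GNU coreutils, 
where plain "-p" does not preserve xattrs and "--preserve=xattr" is needed; the 
target directory is just a placeholder):

cp --preserve=mode,ownership,timestamps,xattr 19_IMBLAB.EGRID /mnt/cephfs/some/other/dir/
getfattr -d -m ".*" /mnt/cephfs/some/other/dir/19_IMBLAB.EGRID   # verify the xattrs survived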

I compared the visible xattrs with those of another file in the same folder and 
they have the same values. Nothing obvious at this point.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, May 11, 2023 12:34 PM
To: Gregory Farnum
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Dear Gregory,

> I would start by looking at what xattrs exist and if there's an obvious bad
> one, deleting it.

I can't see any obvious bad ones, and I also can't just delete them; they are 
required for ACLs. I'm not convinced that any of the xattrs that can be dumped 
with 'getfattr -d -m ".*"' is the culprit, they all look fine:

# getfattr -d -m ".*" 
/mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal\ 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
security.NTACL=encoded-data-removed
security.selinux="system_u:object_r:cephfs_t:s0"
system.posix_acl_access=encoded-data-removed
user.DOSATTRIB=encoded-data-removed
user.SAMBA_PAI=encoded-data-removed

How can I inspect the file object including all hidden xattrs, for example, all 
the ceph.-xattrs? There ought to be some rados+decode way of doing that. Would 
the attrib name be in the OPS list dumped on MDS crash?

I would be grateful for any pointer you can provide.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Wednesday, May 10, 2023 6:18 PM
To: Gregory Farnum
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Hi Gregory,

using the more complicated rados way, I found the path. I assume you are 
referring to attribs I can read with getfattr. The output of a dump is:

# getfattr -d 
/mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal\ 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/shares/rit-oil/Projects/CSP/Chalk/CSP1.A.03/99_Personal 
folders/Eugenio/Tests/Eclipse/19_imbLab/19_IMBLAB.EGRID
user.DOSATTRIB=0sAAAFAAURIIfMCneZfdkB
user.SAMBA_PAI=0sAgSEDwABgYYejLcxAAAC/wABgYYejLcxABAAlFExABABlFExABAALPgoABABLPgoABAADEUvABABDEUvABAAllExABABllExABAAE9AqABABE9AqAA==

#
An empty line is part of the output. These look all right to me. Can you tell 
me what I should look at? I will probably reply tomorrow, my time for today is 
almost up.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Gregory Farnum 
Sent: Wednesday, May 10, 2023 4:37 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

On Wed, May 10, 2023 at 7:33 AM Frank Schilder  wrote:
>
> Hi Gregory,
>
> thanks for your reply. Yes, I forgot, I can also inspect the rados head 
> object. My bad.
>
> The empty xattr might come from a crash of the SAMBA daemon. We export to 
> windows and this uses xattrs extensively to map to windows ACLs. It might be 
> possible that a crash at an inconvenient moment left an object in this state. 
> Do you think this is possible? Would it be possible to repair that?

I'm still a little puzzled that it's possible for the system to get
into this state, so we probably will need to generate some bugfixes.
And it might just be the dump function is being naughty. But I would
start by looking at what xattrs exist and if there's an obvious bad
one, deleting it.
-Greg

>
> I will report back what I find with the low-level access. Need to head home 
> now ...
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, May 10, 2023 4:26 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: mds dump inode crashes file system
>
> This is a very strange assert to be hitting. From a code skim my best
> guess is the inode somehow has an xattr with no value, but that's just
> a guess and I've no idea how it would happen.
> Somebody recently pointed you at the (more complicated) way of
> identifying an inode path by looking at its RADOS object and grabbing
> t

[ceph-users] Re: Octopus on Ubuntu 20.04.6 LTS with kernel 5

2023-05-11 Thread Anthony D'Atri
As a krbd client user, I believe that 5.4 also introduces better support for RBD 
features, including fast-diff.
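
If memory serves, that means images with object-map/fast-diff enabled can be 
mapped by krbd without stripping those features first; on an existing image 
that would be something like (pool/image names are placeholders, of course):

rbd feature enable mypool/myimage object-map fast-diff
rbd map mypool/myimage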

> On May 11, 2023, at 3:59 AM, Gerdriaan Mulder  wrote:
> 
> As a data point: we've been running Octopus (solely for CephFS) on Ubuntu 
> 20.04 with 5.4.0(-122) for some time now, with packages from 
> download.ceph.com.
> 
>> On 11/05/2023 07.12, Szabo, Istvan (Agoda) wrote:
>> I can answer my own question: even the official Ubuntu repo ships Octopus by
>> default, so it certainly works with kernel 5.
>> https://packages.ubuntu.com/focal/allpackages
>> -Original Message-
>> From: Szabo, Istvan (Agoda) 
>> Sent: Thursday, May 11, 2023 11:20 AM
>> To: Ceph Users 
>> Subject: [ceph-users] Octopus on Ubuntu 20.04.6 LTS with kernel 5
>> Hi,
>> The Octopus documentation recommends kernel 4; however, we changed our test
>> cluster yesterday from CentOS 7/8 to Ubuntu 20.04.6 LTS with kernel 5.4.0-148
>> and it seems to be working. I just want to make sure there aren't any caveats
>> before I move to prod.
>> Thank you
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Frank Schilder
Dear Xiubo,

please see also my previous e-mail about the async dirop config.

I have a bit more log output from dmesg on the file server here: 
https://pastebin.com/9Y0EPgDD . This covers a reboot after the one in my 
previous e-mail as well as another failure at the end. When I checked around 
16:30 the mount point was inaccessible again with "stale file handle". Please 
note the "wrong peer at address" messages in the log; it seems that a number of 
issues come together here. These threads are actually all related to this file 
server and the observations we are making now:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/MSB5TIG42XAFNG2CKILY5DZWIMX6C5CO/
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F/

You mentioned directory migration in the MDS; I guess you mean migrating a 
directory fragment between MDSes? This should not happen: all these directories 
are statically pinned to a rank. An MDS may split/merge directory fragments, 
but they stay at the same MDS all the time. This is confirmed by running a 
"dump inode" on directories under a pin. Only one MDS reports back that it has 
the dir inode in its cache, so I think the static pinning works as expected.
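
For reference, the checks I mean are along these lines (the path, MDS name and 
inode number are just examples/placeholders from our setup):

# confirm the export pin on the directory
getfattr -n ceph.dir.pin /mnt/cephfs/shares/rit-oil
# ask a specific MDS whether it has the dir inode in its cache
ceph tell mds.ceph-23 dump inode <dir-inode-number>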

It would be great if you could also look at Greg's reply, maybe you have 
something I could look at to find the cause of the crash during the mds dump 
inode command.

Thanks a lot and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, May 11, 2023 12:26 PM
To: Xiubo Li; ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Dear Xiubo,

thanks for your reply.

> BTW, did you enable the async dirop ? Currently this is disabled by
> default in 4.18.0-486.el8.x86_64.

I have never heard about that option until now. How do I check that, and how do 
I disable it if necessary?

I'm in meetings pretty much all day and will try to send some more info later.

> Could you reproduce this by enabling the mds debug logs ?

Not right now. Our users are annoyed enough already. I first need to figure out 
how to move the troublesome inode somewhere else where I might be able to do 
something. The boot message shows up on this one file server every time. Is 
there any information about what dir/inode might be causing the issue? How 
could I reproduce this without affecting the users, say, by re-creating the 
same condition somewhere else? Any hints are appreciated.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Thursday, May 11, 2023 3:45 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

Hey Frank,

On 5/10/23 21:44, Frank Schilder wrote:
> The kernel message that shows up on boot on the file server in text format:
>
> May 10 13:56:59 rit-pfile01 kernel: WARNING: CPU: 3 PID: 34 at 
> fs/ceph/caps.c:689 ceph_add_cap+0x53e/0x550 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: Modules linked in: ceph libceph 
> dns_resolver nls_utf8 isofs cirrus drm_shmem_helper intel_rapl_msr iTCO_wdt 
> intel_rapl_common iTCO_vendor_support drm_kms_helper syscopyarea sysfillrect 
> sysimgblt fb_sys_fops pcspkr joydev virtio_net drm i2c_i801 net_failover 
> virtio_balloon failover lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc 
> sr_mod cdrom sg xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci 
> libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console 
> virtio_scsi dm_mirror dm_region_hash dm_log dm_mod fuse
> May 10 13:56:59 rit-pfile01 kernel: CPU: 3 PID: 34 Comm: kworker/3:0 Not 
> tainted 4.18.0-486.el8.x86_64 #1
> May 10 13:56:59 rit-pfile01 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 
> 1.16.0-3.module_el8.7.0+3346+68867adb 04/01/2014
> May 10 13:56:59 rit-pfile01 kernel: Workqueue: ceph-msgr ceph_con_workfn 
> [libceph]
> May 10 13:56:59 rit-pfile01 kernel: RIP: 0010:ceph_add_cap+0x53e/0x550 [ceph]
> May 10 13:56:59 rit-pfile01 kernel: Code: c0 48 c7 c7 c0 69 7f c0 e8 6c 4c 72 
> c3 0f 0b 44 89 7c 24 04 e9 7e fc ff ff 44 8b 7c 24 04 e9 68 fe ff ff 0f 0b e9 
> c9 fc ff ff <0f> 0b e9 0a fe ff ff 0f 0b e9 12 fe ff ff 0f 0b 66 90 0f 1f 44 
> 00
> May 10 13:56:59 rit-pfile01 kernel: RSP: 0018:a4d000d87b48 EFLAGS: 
> 00010217
> May 10 13:56:59 rit-pfile01 kernel: RAX:  RBX: 
> 0005 RCX: dead0200
> May 10 13:56:59 rit-pfile01 kernel: RDX: 92d7d7f6e7d0 RSI: 
> 92d7d7f6e7d0 RDI: 92d7d7f6e7c8
> May 10 13:56:59 rit-pfile01 kernel: RBP: 92d7c5588970 R08: 
> 92d7d7f6e7d0 R09: 0001
> May 10 13:56:59 rit-pfile01 kernel: R10: 92d80078cbb8 R11: 
> 92c0 R12: 0155
> May 10 13:56:59 rit-pfile01 kernel: R13: 92d80078cbb8 R14: 
> 92d80078cbc0 R15: 0001
> May 10 13:56:59 rit-pfile01 kernel: F

[ceph-users] Discussion thread for Known Pacific Performance Regressions

2023-05-11 Thread Mark Nelson

Hi Everyone,

This email was originally posted to d...@ceph.io, but Marc mentioned that 
he thought this would be useful to post on the user list so I'm 
re-posting here as well.


David Orman mentioned in the CLT meeting this morning that there are a 
number of people on the mailing list asking about performance 
regressions in Pacific+ vs older releases.  I want to document a couple 
of the bigger ones that we know about for the community's benefit.  I 
want to be clear that Pacific does have a number of performance 
improvements over previous releases, and we do have tests showing 
improvement relative to nautilus (especially RBD on NVMe drives).  Some 
of these regressions are going to have a bigger effect for some users 
than others.  Having said that, let's get into them.



** Regression #1: RocksDB Log File Recycling **

Effects: more metadata updates to the underlying FS, higher 
write amplification (observed by Digital Ocean), and slower performance, 
especially when the WAL device is saturated.



When bluestore was created back in 2015 Sage implemented an optimization 
in RocksDB that allowed WAL log files to be recycled. The idea is that 
instead of deleting logs when they are flushed, rocksdb can simply reuse 
them.  The benefit here is that it allows records to be written and 
fadatasync can be called without touching the inode for every IO.  Sage 
did a pretty good job of explaining the benefit in the PR available here:


https://github.com/facebook/rocksdb/pull/746


After much discussion, that PR was merged and received a couple of bug 
fixes over the years:


Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313

Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900


In 2020, the RocksDB folks discovered there is a fundamental flaw in the 
way that the original PR works.  It turns out that the feature to 
recycle log files is incompatible with RocksDB's kPointInTimeRecovery, 
kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes.  
One of the later PR's included a very good and concise description of 
the problem:


"The two features are naturally incompatible. WAL recycling expects the 
recovery to succeed upon encountering a corrupt record at the point 
where new data ends and recycled data remains at the tail. However, 
WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon 
encountering any such corrupt record, as it cannot differentiate between 
this and a real corruption, which would cause committed updates to be 
truncated."



More background discussion on the RocksDB side available in these PRs 
and comments:


https://github.com/facebook/rocksdb/pull/6351

https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284

https://github.com/facebook/rocksdb/pull/7252

https://github.com/facebook/rocksdb/pull/7271


On the Ceph side, there was a PR to try to re-enable the old behavior 
which we rejected as unsafe based on the analysis by the RocksDB folks 
(which we agree with):


https://github.com/ceph/ceph/pull/36579

Sage also commented about a potential way forward:

https://github.com/ceph/ceph/pull/36579#issuecomment-870884583

"tbh I think the best approach would be to create a new WAL file format 
that (1) is 4k block aligned and (2) has a header for each block that 
indicates the generation # for that log file (so we can see whether what 
we read is from a previous pass or corruption). That would be a fair bit 
of effort, though."



On a side note, Igor also tried to disable WAL file recycling as a 
backport to Octopus but was thwarted by a BlueFS bug.  That PR was 
eventually reverted, leaving the old (dangerous!) behavior in 
place:


https://github.com/ceph/ceph/pull/45040

https://github.com/ceph/ceph/pull/47053


The gist of it is that releases of Ceph older than Pacific are 
benefiting from the speed improvement of log file recycling but may be 
vulnerable to the issue as described above.  This is likely one of the 
more impactful regressions that people upgrading to Pacific or later 
releases are seeing.
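
For anyone wondering what their own cluster is set to, the rocksdb option 
string can be checked with something like:

ceph config show-with-defaults osd.0 | grep bluestore_rocksdb_options

and looking for recycle_log_file_num; keep in mind that on Pacific and later 
releases recycling is effectively off either way, as described above.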



Josh Baergen from Digital Ocean followed up that there is a slew of 
additional information on this issue in the following tracker as well:


https://tracker.ceph.com/issues/58530



** Regression #1 Potential Fixes **

Josh Baergen also mentioned that the write-amplification effect that was 
observed due to this issue is mitigated by 
https://github.com/ceph/ceph/pull/48915, which was merged into 16.2.11 
back in December.  That however does not improve write IOPS amplification.


Beyond that, we could follow Sage's idea and try to implement a new WAL 
file format.  The risks here are that it could be a lot of work and we 
don't know if there is really any appetite on the RocksDB side to merge 
something like this upstream.  My personal take is that we're already 
kind of abusing the RocksDB WAL for short lived PG log updates an

[ceph-users] Re: Radosgw multisite replication issues

2023-05-11 Thread Casey Bodley
On Tue, May 9, 2023 at 3:11 PM Tarrago, Eli (RIS-BCT)
 wrote:
>
> East and West Clusters have been upgraded to quincy, 17.2.6.
>
> We are still seeing replication failures. Deep diving the logs, I found the 
> following interesting items.
>
> What is the best way to continue to troubleshoot this?

the curl timeouts make it look like a networking issue. can you
reproduce these issues with normal s3 clients against the west zone
endpoints?

if it's not the network itself, it could also be that the remote
radosgws have saturated their rgw_max_concurrent_requests, so are slow
to start processing accepted connections. as you're probably aware,
multisite replication sends a lot of requests to /admin/log to poll
for changes. if the remote radosgw is slow to process those, this
could be the result. there are two separate perf counters you might
consult to check for this:

on the remote (west) radosgws, there's a perf counter called "qactive"
that you could query (either from the radosgw admin socket, or via
'ceph daemon perf') for comparison against the configured
rgw_max_concurrent_requests

on the local (east) radosgws, there's a set of perf counters under
"data-sync-from-{zone}" that track polling errors and latency

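for example (the admin socket paths and daemon names below are placeholders;
adjust to your deployment):

# on a west radosgw host
ceph daemon /var/run/ceph/ceph-client.rgw.west01.asok perf dump | grep qactive
ceph config get client.rgw rgw_max_concurrent_requests

# on an east radosgw host
ceph daemon /var/run/ceph/ceph-client.rgw.east01.asok perf dump | grep -A 12 data-sync-from-rgw-west
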
> What is the curl attempting to fetch, but failing to obtain?
>
> -
> root@east01:~# radosgw-admin bucket sync --bucket=ceph-bucket 
> --source-zone=rgw-west run
> 2023-05-09T15:22:43.582+ 7f197d7fa700  0 WARNING: curl operation 
> timed out, network average transfer speed less than 1024 Bytes per second 
> during 300 seconds.
> 2023-05-09T15:22:43.582+ 7f1a48dd9e40  0 data sync: ERROR: failed 
> to fetch bucket index status

this error would correspond to a request like "GET
/admin/log/?type=bucket-instance&bucket-instance={instance id}&info",
sent to one of the west zone endpoints (http://west01.example.net:8080
etc). if you retry the command, you should be able to find such a
request in one of the west zone's radosgw logs. if you raise 'debug
rgw' level to 4 or more, that op would be logged as
'bucket_index_log_info'
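
to raise the log level temporarily, something along these lines should do it,
and then grep the west radosgw logs for 'bucket_index_log_info':

ceph config set client.rgw debug_rgw 4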

> 2023-05-09T15:22:43.582+ 7f1a48dd9e40  0 
> RGW-SYNC:bucket[ceph-bucket:ddd66ab8-0417---.93706683.1:119<-ceph-bucket:ddd66ab8-0417---.93706683.93706683.1:119]:
>  ERROR: init sync on bucket failed, retcode=-5
> 2023-05-09T15:24:54.652+ 7f197d7fa700  0 WARNING: curl operation 
> timed out, network average transfer speed less than 1024 Bytes per second 
> during 300 seconds.
> 2023-05-09T15:27:05.725+ 7f197d7fa700  0 WARNING: curl operation 
> timed out, network average transfer speed less than 1024 Bytes per second 
> during 300 seconds.
> -
>
> radosgw-admin bucket sync --bucket=ceph-bucket-prd info
>   realm 98e0e391- (rgw-blobs)
>   zonegroup 0e0faf4e- (WestEastCeph)
>zone ddd66ab8- (rgw-east)
>  bucket :ceph-bucket[ddd66ab8-.93706683.1])
>
> source zone b2a4a31c-
>  bucket :ceph-bucket[ddd66ab8-.93706683.1])
> root@bctlpmultceph01:~# radosgw-admin bucket sync 
> --bucket=ceph-bucket status
>   realm 98e0e391- (rgw-blobs)
>   zonegroup 0e0faf4e- (WestEastCeph)
>zone ddd66ab8- (rgw-east)
>  bucket :ceph-bucket[ddd66ab8.93706683.1])
>
> source zone b2a4a31c- (rgw-west)
>   source bucket :ceph-bucket[ddd66ab8-.93706683.1])
> full sync: 0/120 shards
> incremental sync: 120/120 shards
> bucket is behind on 112 shards
> behind shards: 
> [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,77,78,80,81,82,83,84,85,86,89,90,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119]
>
>
> -
>
>
> 2023-05-09T15:46:21.069+ 7f1fc7fff700  0 WARNING: curl operation timed 
> out, network average transfer speed less than 1024 Bytes per second during 
> 300 seconds.
> 2023-05-09T15:46:21.069+ 7f20b12b8700  0 WARNING: curl operation timed 
> out, network average transfer speed less than 1024 Bytes per second during 
> 300 seconds.
> 2023-05-09T15:46:21.069+ 7f20b12b8700  0 WARNING: curl operation timed 
> out, network average transfer speed less than 1024 Bytes per second during 
> 300 seconds.
> 2023-05-09T15:46:21.069+ 7f20b12b8700  0 WARNING: curl operation timed 
> out, network average transfer speed less than 1024 Bytes per second during 
> 300 seconds.
> 2023-05-09T15:46:21.069+ 7f20857f2700  0 rgw async rados processor: 
> store->fetch_remote_obj() returned r=-5

these errors would correspond to GetObject requests, and show up as
's3:get_obj' in the radosgw log


> 2023-05

[ceph-users] ceph status is warning with "266 pgs not deep-scrubbed in time"

2023-05-11 Thread Louis Koo
ceph version is: v16.2.10
But I have already disabled the ratio check with "ceph config set mon 
mon_warn_pg_not_deep_scrubbed_ratio 0".


The config shows:
[root@smd-node01 deeproute]# ceph config get mon 
mon_warn_pg_not_deep_scrubbed_ratio
0.00
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 17.2.6 Dashboard/RGW Signature Mismatch

2023-05-11 Thread Ondřej Kukla
Hello,

I have found out that the issue seems to be in this change - 
https://github.com/ceph/ceph/pull/47207

When I commented out the change and replaced it with the previous value, the 
dashboard works as expected.

Ondrej
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recovering from OSD with corrupted DB

2023-05-11 Thread Jessica Sol
Hey all,

I recently had two OSDs fail. The first one I just removed and expected
replication to fix things for me. Replication froze, and I restarted the OSDs
after seeing heartbeat failures. That allowed replication to resume, but one
OSD's RocksDB became corrupted, showing this error when I try to bring it
up:

rocksdb: verify_sharding unable to list column families: Corruption:
CURRENT file does not end with newline
bluestore(/var/lib/ceph/osd/ceph-1) _open_db erroring opening db:
osd.1 0 OSD:init: unable to mount object store
 ** ERROR: osd init failed: (5) Input/output error

I found a relevant bug report[1] with a workaround, but it requires
ceph-bluestore-tool to have bluefs-import, which was only added in 16.2.11,
I'm on 16.2.9 right now. Is it safe to do a patch upgrade from 16.2.9 to
16.2.11 without having an OSD running? Or an alternative way of running the
tool without upgrading? Is the tool even safe to run on data from 16.2.9?
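
For what it's worth, the alternative I had in mind was running the newer tool 
from a 16.2.11 container against the stopped OSD instead of upgrading the 
cluster, e.g. first checking that the subcommand exists (image name and paths 
are guesses for my setup):

podman run --rm --entrypoint ceph-bluestore-tool quay.io/ceph/ceph:v16.2.11 --help | grep bluefs-import

and then invoking it with /dev and the OSD data directory bind-mounted into 
the container. But I don't know whether that is considered safe against 
16.2.9 on-disk data.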

Thanks,
Jess

[1] https://tracker.ceph.com/issues/47330
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CFP NOW OPEN: Ceph Days Vancouver Co-located with OpenInfra Summit

2023-05-11 Thread Mike Perez
Hi everyone,

I invite you to Ceph Days Vancouver on June 15, co-located with the OpenInfra 
Summit. Ceph Days are one-day events dedicated to multiple breakout and BoF 
sessions with a wide range of topics around Ceph.

You can receive a nice discount to attend Ceph Days and the OpenInfra Summit if 
you register before May 12!

https://ceph.io/en/community/events/2023/ceph-days-vancouver/

Our Call for proposals is open until May 17, and we have the following 
suggested topics:

- Ceph operations, management, and development
- New and proposed Ceph features, development status
- Ceph development roadmap
- Best practices
- Ceph use-cases, solution architectures, and user experiences
- Ceph performance and optimization
- Platform Integrations
  - Kubernetes, OpenShift
  - OpenStack (Cinder, Manila, etc.)
  - Spark
- Multi-site and multi-cluster data services
- Persistent memory, ZNS SSDs, SMR HDDs, DPUs, and other new hardware 
technologies
- Storage management, monitoring, and deployment automation
- Experiences deploying and operating Ceph in production and/or at scale
- Small-scale or edge deployments
- Long-term, archival storage
- Data compression, deduplication, and storage optimization
- Developer processes, tools, challenges
- Ceph testing infrastructure, tools
- Ceph community issues, outreach, and project governance
- Ceph documentation, training, and learner experience

CFP: https://survey.zohopublic.com/zs/TVCCCQ

See you all there soon!

--
Mike Perez
Community Manager
Ceph Foundation
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] multisite synchronization and multipart uploads

2023-05-11 Thread Yixin Jin
Hi folks,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] multisite sync and multipart uploads

2023-05-11 Thread Yixin Jin
Hi guys,

With the Quincy release, does anyone know how multisite sync deals with multipart 
uploads? I mean the part objects of incomplete multipart uploads. Are those 
objects also synced over, either with full sync or incremental sync? I did a 
quick experiment and noticed that these objects are not synced over. Is this 
intentional, or is it a defect?
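
For reference, the quick experiment was roughly: start a multipart upload on 
the primary zone, upload a couple of parts without completing it, and then 
compare the two zones, e.g. with something like (bucket and endpoints are 
placeholders):

aws s3api list-multipart-uploads --bucket test-bucket --endpoint-url http://zone-a.example.com:8080
aws s3api list-multipart-uploads --bucket test-bucket --endpoint-url http://zone-b.example.com:8080

The in-progress parts never show up on the secondary zone.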

Thanks,
Yixin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multisite sync and multipart uploads

2023-05-11 Thread Casey Bodley
sync doesn't distinguish between multipart and regular object uploads.
once a multipart upload completes, sync will replicate it as a single
object using an s3 GetObject request

replicating the parts individually would have some benefits. for
example, when sync retries are necessary, we might only have to resend
one part instead of the entire object. but it's far simpler to
replicate objects in a single atomic step

On Thu, May 11, 2023 at 1:07 PM Yixin Jin  wrote:
>
> Hi guys,
>
> With the Quincy release, does anyone know how multisite sync deals with multipart
> uploads? I mean the part objects of incomplete multipart uploads. Are those
> objects also synced over, either with full sync or incremental sync? I did a
> quick experiment and noticed that these objects are not synced over. Is this
> intentional, or is it a defect?
>
> Thanks,
> Yixin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lua scripting in the rados gateway

2023-05-11 Thread Yuval Lifshitz
following PRs should be addressing the issues (feel free to review):
the lib64 problem on centos8: https://github.com/ceph/ceph/pull/51453
missing dependencies on cephadm image:
https://github.com/ceph/ceph-container/pull/2117
"script put" documentation: https://github.com/ceph/ceph/pull/51422

On Wed, May 10, 2023 at 12:20 PM Yuval Lifshitz  wrote:

> thanks Thomas!
> opened this tracker: https://tracker.ceph.com/issues/59697 should cover
> the missing dependencies for luarocks on the centos8 container (feel free
> to add anything missing there...).
> still trying to figure out the lib64 issue you found.
> regarding the "script put" issue - I will add that to the lua
> documentation page.
>
> On Tue, May 9, 2023 at 11:09 PM Thomas Bennett  wrote:
>
>> Hi Yuval,
>>
>> Just a follow up on this.
>>
>> An issue I’ve just resolved is getting scripts into the cephadm shell. As
>> it turns out (I didn’t know this), the host file system is
>> mounted into the cephadm shell at /rootfs/.
>>
>> So I've been editing a /tmp/preRequest.lua on my host and then running:
>>
>> cephadm shell radosgw-admin script put
>> --infile=/rootfs/tmp/preRequest.lua --context=preRequest
>>
>> This injects the lua script into the pre request context.
>>
>> Cheers,
>> Tom
>>
>> On Fri, 28 Apr 2023 at 15:19, Thomas Bennett  wrote:
>>
>>> Hey Yuval,
>>>
>>> No problem. It was interesting to me to figure out how it all fits
>>> together and works.  Thanks for opening an issue on the tracker.
>>>
>>> Cheers,
>>> Tom
>>>
>>> On Thu, 27 Apr 2023 at 15:03, Yuval Lifshitz 
>>> wrote:
>>>
 Hi Thomas,
 Thanks for the detailed info!
 RGW lua scripting was never tested in a cephadm deployment :-(
 Opened a tracker: https://tracker.ceph.com/issues/59574 to make sure
 this would work out of the box.

 Yuval


 On Tue, Apr 25, 2023 at 10:25 PM Thomas Bennett 
 wrote:

> Hi ceph users,
>
> I've been trying out the lua scripting for the rados gateway (thanks
> Yuval).
>
> As in my previous email I mentioned that there is an error when trying
> to
> load the luasocket module. However, I thought it was a good time to
> report
> on my progress.
>
> My 'hello world' example below is called *test.lua* below includes the
> following checks:
>
>1. Can I write to the debug log?
>2. Can I use the lua socket package to do something stupid but
>intersting, like connect to a webservice?
>
> Before you continue reading this, you might need to know that I run all
> ceph processes in a *CentOS Stream release 8 *container deployed using
> ceph
> orchestrator running *Ceph v17.2.5*, so please view the information
> below
> in that context.
>
> For anyone looking for a reference, I suggest going to the ceph lua
> rados
> gateway documentation at radosgw/lua-scripting
> .
>
> There are two new switches you need to know about in the radosgw-admin:
>
>- *script* -> loading your lua script
>- *script-package* -> loading supporting packages for your script - i.e.
>luasocket in this case.
>
> For a basic setup, you'll need to have a few dependencies in your
> containers:
>
>- cephadm container: requires luarocks (I've checked the code - it
> runs
>a luarocks search command)
>- radosgw container: requires luarocks, gcc, make,  m4, wget (wget
> just
>in case).
>
> To achieve the above, I updated the container image for our running
> system.
> I needed to do this because I needed to redeploy the rados gateway
> container to inject the lua script packages into the radosgw runtime
> process. This will start with a fresh container based on the global
> config
> *container_image* setting on your running system.
>
> For us this is currently captured in *quay.io/tsolo/ceph:v17.2.5-3
> * and included the following extra
> steps (including installing the lua dev from an rpm because there is no
> centos package in yum):
> yum install luarocks gcc make wget m4
> rpm -i
>
> https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/lua-devel-5.3.4-12.el8.x86_64.rpm
>
> You will notice that I've included a compiler and compiler support
> into the
> image. This is because luarocks on the radosgw needs to compile luasocket
> (the
> package I want to install). This will happen at start time when the
> radosgw
> is restarted from ceph orch.
>
> In the cephadm container I still need to update our cephadm shell so I
> need
> to install luarocks by hand:
> yum install luarocks
>
> Then set the updated image to use:
> ceph config set global container_image quay.io/tsolo/ceph:v17.2.5-3
>
>

[ceph-users] Re: docker restarting lost all managers accidentally

2023-05-11 Thread Ben
Along the path you mentioned, it is fixed by changing the owner of
/var/lib/ceph from root to 167:167. The cluster was deployed with a non-root
user, and the file permissions are in a bit of a mess. After the change,
systemctl daemon-reload and restart brings it up.
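
Concretely, what I ran was roughly the following (the systemd unit name 
depends on your fsid and daemon id, so treat it as a placeholder):

chown -R 167:167 /var/lib/ceph
systemctl daemon-reload
systemctl restart ceph-<fsid>@mgr.<host>.<id>.service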

For another manager on the bootstrap host, the journal log complains as follows:
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 monclient: keyring not found
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: failed to load
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: (5) Input/output error
May 11 16:52:50 h15w bash[1434578]: debug 2023-05-11T08:52:50.858+
7f6b9bba5000 -1 auth: error parsing file
/var/lib/ceph/mgr/ceph-h15w.vuhzxy/keyring: error setting modifi>


The keyring file holds a base64 key string; after restoring the original
content, the mgr is up as well. There seems to be something inconsistent in
bootstrapping a cluster.

Thank you all for help. It is now normal again.

Adam King  于2023年5月11日周四 01:33写道:

> in /var/lib/ceph// on the host with that mgr
> reporting the error, there should be a unit.run file that shows what is
> being done to start the mgr as well as a few files that get mounted into
> the mgr on startup, notably the "config" and "keyring" files. That config
> file should include the mon host addresses. E.g.
>
> [root@vm-01 ~]# cat
> /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
> # minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
> [global]
> fsid = 5a72983c-ef57-11ed-a389-525400e42d74
> mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:
> 192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:
> 192.168.122.97:3300/0,v1:192.168.122.97:6789/0]
>
> The first thing I'd do is probably make sure that array of addresses is
> correct.
>
> Then you could probably check the keyring file as well and see if it
> matches up with what you get running "ceph auth get ".
> E.g. here
>
> [root@vm-01 ~]# cat
> /var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
>
> the key matches with
>
> [ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
> [mgr.vm-01.ilfvis]
> key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
> caps mds = "allow *"
> caps mon = "profile mgr"
> caps osd = "allow *"
>
> I wouldn't post them for obvious reasons (these are just on a test cluster
> I'll tear back down so it's fine for me) but those are the first couple
> things I'd check. You could also try to make adjustments directly to the
> unit.run file if you have other things you'd like to try.
>
> On Wed, May 10, 2023 at 11:09 AM Ben  wrote:
>
>> Hi,
>> This cluster is deployed by cephadm 17.2.5,containerized.
>> It ends up in this(no active mgr):
>> [root@8cd2c0657c77 /]# ceph -s
>>   cluster:
>> id: ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
>> health: HEALTH_WARN
>> 6 hosts fail cephadm check
>> no active mgr
>> 1/3 mons down, quorum h18w,h19w
>> Degraded data redundancy: 781908/2345724 objects degraded
>> (33.333%), 101 pgs degraded, 209 pgs undersized
>>
>>   services:
>> mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
>> mgr: no daemons active (since 5h)
>> mds: 1/1 daemons up, 1 standby
>> osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
>> rgw: 2 daemons active (2 hosts, 1 zones)
>>
>>   data:
>> volumes: 1/1 healthy
>> pools:   8 pools, 209 pgs
>> objects: 781.91k objects, 152 GiB
>> usage:   312 GiB used, 54 TiB / 55 TiB avail
>> pgs: 781908/2345724 objects degraded (33.333%)
>>  108 active+undersized
>>  101 active+undersized+degraded
>>
>> I checked the h20w, there is a manager container running with log:
>>
>> debug 2023-05-10T12:43:23.315+ 7f5e152ec000  0 monclient(hunting):
>> authenticate timed out after 300
>>
>> debug 2023-05-10T12:48:23.318+ 7f5e152ec000  0 monclient(hunting):
>> authenticate timed out after 300
>>
>> debug 2023-05-10T12:53:23.318+ 7f5e152ec000  0 monclient(hunting):
>> authenticate timed out after 300
>>
>> debug 2023-05-10T12:58:23.319+ 7f5e152ec000  0 monclient(hunting):
>> authenticate timed out after

[ceph-users] Re: client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc

2023-05-11 Thread Xiubo Li


On 5/10/23 19:35, Frank Schilder wrote:

Hi Xiubo.


IMO evicting the corresponding client could also resolve this issue
instead of restarting the MDS.

Yes, it can get rid of the stuck caps release request, but it will also make 
any process accessing the file system crash. After a client eviction we usually 
have to reboot the server to get everything back clean. An MDS restart would 
achieve this in a transparent way and when replaying the journal execute the 
pending caps recall successfully without making processes crash - if there 
wasn't the wrong peer issue.

As far as I can tell, the operation is stuck in the MDS because its never 
re-scheduled/re-tried or checked if the condition still exists (the client 
still holds the caps requested). An MDS restart re-schedules all pending 
operations and then it succeeds. In every ceph version so far there were 
examples where hand-shaking between a client and an MDS had small flaws. For 
situations like that I would really like to have a light-weight MDS daemon 
command to force a re-schedule/re-play without having to restart the entire MDS 
and reconnect all its clients from scratch.


Okay, I saw your another thread about this.

Thanks




It would be great to have light-weight tools available to rectify such simple 
conditions in an as non-disruptive as possible way.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Wednesday, May 10, 2023 4:01 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), 
pending pAsLsXsFsc issued pAsLsXsFsc


On 5/9/23 16:23, Frank Schilder wrote:

Dear Xiubo,

both issues will cause problems, the one reported in the subject 
(https://tracker.ceph.com/issues/57244) and the potential follow-up on MDS 
restart 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F).
 Either one will cause compute jobs on our HPC cluster to hang and users will 
need to run the jobs again. Our queues are full, so not very popular to loose 
your spot.

The process in D-state is a user process. Interestingly it is often possible to 
kill it despite the D-state (if one can find the process) and the stuck recall 
gets resolved. If I restart the MDS, the stuck process might continue working, 
but we run a significant risk of other processed getting stuck due to the 
libceph/MDS wrong peer issue. We actually have these kind of messages

[Mon Mar  6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at 
address
[Mon Mar  6 13:05:18 2023] libceph: wrong peer, want 
192.168.32.87:6801/-223958753, got
192.168.32.87:6801/-1572619386

all over the HPC cluster and each of them means that some files/dirs are 
inaccessible on the compute node and jobs either died or are/got stuck there. 
Every MDS restart bears the risk of such events happening and with many nodes 
this probability approaches 1 - every time we restart an MDS jobs get stuck.

I have a reproducer for an instance of https://tracker.ceph.com/issues/57244. 
Unfortunately, this is a big one that I would need to pack into a container. I 
was not able to reduce it to something small, it seems to depend on a very 
specific combination of codes with certain internal latencies between threads 
that trigger a race.

It sounds like you have a patch for https://tracker.ceph.com/issues/57244, 
although it's not linked from the tracker item.

IMO evicting the corresponding client could also resolve this issue
instead of restarting the MDS.

Have you tried this ?

Thanks

- Xiubo


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Friday, May 5, 2023 2:40 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] client isn't responding to mclientcaps(revoke), 
pending pAsLsXsFsc issued pAsLsXsFsc


On 5/1/23 17:35, Frank Schilder wrote:

Hi all,

I think we might be hitting a known problem 
(https://tracker.ceph.com/issues/57244). I don't want to fail the mds yet, 
because we have troubles with older kclients that miss the mds restart and hold 
on to cache entries referring to the killed instance, leading to hanging jobs 
on our HPC cluster.

Will this cause any issue in your case ?


I have seen this issue before and there was a process in D-state that 
dead-locked itself. Usually, killing this process succeeded and resolved the 
issue. However, this time I can't find such a process.

BTW, what's the D-state process ? A ceph one ?

Thanks


The tracker mentions that one can delete the file/folder. I have the inode 
number, but really don't want to start a find on a 1.5PB file system. Is there 
a better way to find what path is causing the issue (ask the MDS directly, look 
at a cache dump, or similar)? Is there an alternative to deletion or MDS fail?

Thanks and best regards,
=
Fr

[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Xiubo Li


On 5/11/23 18:26, Frank Schilder wrote:

Dear Xiubo,

thanks for your reply.


BTW, did you enable the async dirop ? Currently this is disabled by
default in 4.18.0-486.el8.x86_64.

I have never heard about that option until now. How do I check that, and how do 
I disable it if necessary?

I'm in meetings pretty much all day and will try to send some more info later.


$ mount|grep ceph
192.168.0.104:40636,192.168.0.104:40638,192.168.0.104:40640:/ on 
/mnt/kcephfs type ceph 
(rw,relatime,wsync,fsid=b10bc0bf-101c-48af-a7a4-13b7208532c9,acl)


You will see the ",wsync" string if enabled. Or it's disabled.
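
To force it off explicitly you can pass the option at mount time, roughly like 
this (assuming your kernel accepts the wsync/nowsync options; the monitor 
addresses and credentials below are from my test setup, adjust as needed):

# wsync = synchronous dirops (async dirop off), nowsync = async dirop on
umount /mnt/kcephfs
mount -t ceph 192.168.0.104:40636,192.168.0.104:40638,192.168.0.104:40640:/ /mnt/kcephfs -o name=admin,secretfile=/etc/ceph/admin.secret,wsync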




Could you reproduce this by enabling the mds debug logs ?

Not right now. Our users are annoyed enough already. I first need to figure out 
how to move the troublesome inode somewhere else where I might be able to do 
something. The boot message shows up on this one file server every time. Is 
there any information about what dir/inode might be causing the issue? How 
could I reproduce this without affecting the users, say, by re-creating the 
same condition somewhere else? Any hints are appreciated.


This is not easy to reproduce; I just concluded that by reading the ceph 
and kceph code. In theory you can reproduce this by making the directory 
migrate during stress IOs.


Thanks



Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Thursday, May 11, 2023 3:45 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

Hey Frank,

On 5/10/23 21:44, Frank Schilder wrote:

The kernel message that shows up on boot on the file server in text format:

May 10 13:56:59 rit-pfile01 kernel: WARNING: CPU: 3 PID: 34 at 
fs/ceph/caps.c:689 ceph_add_cap+0x53e/0x550 [ceph]
May 10 13:56:59 rit-pfile01 kernel: Modules linked in: ceph libceph 
dns_resolver nls_utf8 isofs cirrus drm_shmem_helper intel_rapl_msr iTCO_wdt 
intel_rapl_common iTCO_vendor_support drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops pcspkr joydev virtio_net drm i2c_i801 net_failover 
virtio_balloon failover lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc 
sr_mod cdrom sg xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci 
libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console 
virtio_scsi dm_mirror dm_region_hash dm_log dm_mod fuse
May 10 13:56:59 rit-pfile01 kernel: CPU: 3 PID: 34 Comm: kworker/3:0 Not 
tainted 4.18.0-486.el8.x86_64 #1
May 10 13:56:59 rit-pfile01 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.16.0-3.module_el8.7.0+3346+68867adb 04/01/2014
May 10 13:56:59 rit-pfile01 kernel: Workqueue: ceph-msgr ceph_con_workfn 
[libceph]
May 10 13:56:59 rit-pfile01 kernel: RIP: 0010:ceph_add_cap+0x53e/0x550 [ceph]
May 10 13:56:59 rit-pfile01 kernel: Code: c0 48 c7 c7 c0 69 7f c0 e8 6c 4c 72 c3 0f 
0b 44 89 7c 24 04 e9 7e fc ff ff 44 8b 7c 24 04 e9 68 fe ff ff 0f 0b e9 c9 fc ff ff 
<0f> 0b e9 0a fe ff ff 0f 0b e9 12 fe ff ff 0f 0b 66 90 0f 1f 44 00
May 10 13:56:59 rit-pfile01 kernel: RSP: 0018:a4d000d87b48 EFLAGS: 00010217
May 10 13:56:59 rit-pfile01 kernel: RAX:  RBX: 0005 
RCX: dead0200
May 10 13:56:59 rit-pfile01 kernel: RDX: 92d7d7f6e7d0 RSI: 92d7d7f6e7d0 
RDI: 92d7d7f6e7c8
May 10 13:56:59 rit-pfile01 kernel: RBP: 92d7c5588970 R08: 92d7d7f6e7d0 
R09: 0001
May 10 13:56:59 rit-pfile01 kernel: R10: 92d80078cbb8 R11: 92c0 
R12: 0155
May 10 13:56:59 rit-pfile01 kernel: R13: 92d80078cbb8 R14: 92d80078cbc0 
R15: 0001
May 10 13:56:59 rit-pfile01 kernel: FS:  () 
GS:92d937d8() knlGS:
May 10 13:56:59 rit-pfile01 kernel: CS:  0010 DS:  ES:  CR0: 
80050033
May 10 13:56:59 rit-pfile01 kernel: CR2: 7f74435b9008 CR3: 0001099fa000 
CR4: 003506e0
May 10 13:56:59 rit-pfile01 kernel: Call Trace:
May 10 13:56:59 rit-pfile01 kernel: ceph_handle_caps+0xdf2/0x1780 [ceph]
May 10 13:56:59 rit-pfile01 kernel: mds_dispatch+0x13a/0x670 [ceph]
May 10 13:56:59 rit-pfile01 kernel: ceph_con_process_message+0x79/0x140 
[libceph]
May 10 13:56:59 rit-pfile01 kernel: ? calc_signature+0xdf/0x110 [libceph]
May 10 13:56:59 rit-pfile01 kernel: ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
May 10 13:56:59 rit-pfile01 kernel: ceph_con_workfn+0x329/0x680 [libceph]
May 10 13:56:59 rit-pfile01 kernel: process_one_work+0x1a7/0x360
May 10 13:56:59 rit-pfile01 kernel: worker_thread+0x30/0x390
May 10 13:56:59 rit-pfile01 kernel: ? create_worker+0x1a0/0x1a0
May 10 13:56:59 rit-pfile01 kernel: kthread+0x134/0x150
May 10 13:56:59 rit-pfile01 kernel: ? set_kthread_struct+0x50/0x50
May 10 13:56:59 rit-pfile01 kernel: ret_from_fork+0x35/0x40
May 10 13:56:59 rit-pfile01 kernel: ---[ end trace 84e4b3694bbe9fde ]---

BTW, did you enable the async dirop ? Currently this is disabled by
default in 4.18.0-486.el8.x86_64.

[ceph-users] Re: mds dump inode crashes file system

2023-05-11 Thread Xiubo Li


On 5/11/23 20:12, Frank Schilder wrote:

Dear Xiubo,

please see also my previous e-mail about the async dirop config.

I have a bit more log output from dmesg on the file server 
here: https://pastebin.com/9Y0EPgDD .


[Wed May 10 16:03:06 2023] ceph: corrupt snap message from mds1
[Wed May 10 16:03:06 2023] header: : 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] header: 0010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] header: 0020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00
[Wed May 10 16:03:06 2023] header: 0030: 00 98 0d 60 93 ...`.
[Wed May 10 16:03:06 2023] front: : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 0010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..q8
[Wed May 10 16:03:06 2023] front: 0020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8..q8
[Wed May 10 16:03:06 2023] front: 0030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8..q8
[Wed May 10 16:03:06 2023] front: 0040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8..q8
[Wed May 10 16:03:06 2023] front: 0050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8..q8
[Wed May 10 16:03:06 2023] front: 0060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8..q8
[Wed May 10 16:03:06 2023] front: 0070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8j.-.
[Wed May 10 16:03:06 2023] front: 0080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 0090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] front: 00f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00
[Wed May 10 16:03:06 2023] footer: : ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..B.ca
[Wed May 10 16:03:06 2023] footer: 0010: 7b 4b 5d 2d 05 {K]-.
[Wed May 10 16:03:06 2023] ceph: ceph_do_invalidate_pages: inode 1001a2d116a.fffe is shut down

Yeah, the kclient just received a corrupted snaptrace from the MDS.

So the first thing is that you need to fix the corrupted snaptrace issue in 
cephfs and then continue.
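
The thread does not spell out how to repair the snaptrace. A generic first step 
often suggested for CephFS metadata inconsistencies is a recursive scrub from 
rank 0; whether it catches this particular corruption is not established here 
(the file system name is a placeholder):

$ ceph tell mds.cephfs:0 scrub start / recursive,repair
$ ceph tell mds.cephfs:0 scrub status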


If possible you can parse the above corrupted snap message to check what 
exactly is corrupted. I haven't had a chance to do that.


It seems you didn't enable the 'osd blocklist' cephx auth cap for mon (a 
sketch of the cap follows the log lines below):

[Wed May 10 16:03:06 2023] ceph: update_snap_trace error -5
[Wed May 10 16:03:06 2023] ceph: ceph_update_snap_trace failed to blocklist (3)192.168.48.142:0: -13
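
A sketch of what granting that cap could look like; the client name and the 
mds/osd caps below are placeholders, and since "ceph auth caps" replaces all 
caps, the existing mds/osd caps must be restated exactly:

$ ceph auth get client.pfile01    # check the current caps first
$ ceph auth caps client.pfile01 \
    mon 'allow r, allow command "osd blocklist"' \
    mds 'allow rw' \
    osd 'allow rw tag cephfs data=cephfs'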




This covers a reboot after the one in my previous e-mail as well as another fail at the end. When I 
checked around 16:30 the mount point was inaccessible again with "stale file handle". 
Please note the "wrong peer at address" messages in the log; it seems that a number of 
issues come together here. These threads are actually all related to this file server and the 
observations we make now:


Since the kclient received a corrupted snaptrace, it is expected that the 
filesystem becomes inaccessible on that mount.




https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/MSB5TIG42XAFNG2CKILY5DZWIMX6C5CO/
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F/

You mentioned directory migration in the MDS; I guess you mean migrating a directory 
fragment between MDSes? This should not happen: all these directories are statically 
pinned to a rank. An MDS may split/merge directory fragments, but they stay at the same 
MDS all the time. This is confirmed by running a "dump inode" on directories 
under a pin: only one MDS reports back that it has the dir inode in its cache, so I think 
the static pinning works as expected.
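
For reference, those checks roughly correspond to the following; rank, path and 
inode number are placeholders, and as this thread shows, "dump inode" may be 
risky on an affected MDS:

$ getfattr -n ceph.dir.pin /mnt/kcephfs/somedir     # show the export pin on a directory
$ ceph tell mds.1 get subtrees                      # subtrees currently owned by rank 1
$ ceph tell mds.1 dump inode 1099511627776          # ask rank 1 whether it caches this inode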


Yeah, the balancer will do that. If not, this should be a different issue.




It would be great if you could also look at Greg's reply; maybe you have 
something I could look at to find the cause of the crash during the mds dump 
inode command.


I checked that, but by reading the code I couldn't figure out what had caused the 
MDS crash. There