[ceph-users] Question about erasure coding on cephfs

2024-03-02 Thread Erich Weiler
Hi Y'all, We have a new ceph cluster online that looks like this: md-01 : monitor, manager, mds md-02 : monitor, manager, mds md-03 : monitor, manager store-01 : twenty 30TB NVMe OSDs store-02 : twenty 30TB NVMe OSDs The cephfs storage is using erasure coding at 4:2. The crush domain is set t
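For reference, a minimal sketch of how a 4+2 erasure-coded data pool is typically attached to CephFS; the profile, pool, and filesystem names and the failure domain shown here are illustrative, not taken from this cluster:

  # Define a 4+2 erasure code profile (the failure domain here is an assumption)
  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=osd
  # Create the data pool with that profile; CephFS needs EC overwrites enabled
  ceph osd pool create cephfs_data_ec erasure ec-4-2
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  # Attach it to the filesystem as an additional data pool
  ceph fs add_data_pool cephfs cephfs_data_ec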

[ceph-users] Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler
Hi Y'all, I'm seeing this warning via 'ceph -s' (this is on Reef): # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 3 clients failing to advance oldest client/flush tid 1 MDSs report slow requests 1 MDSs behind on t
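For anyone hitting the same warning, a rough sketch of how to see which clients it refers to (the MDS daemon name below is only an example taken from later in this thread):

  # Expand the warning to see the client IDs and MDS involved
  ceph health detail
  # List sessions on the affected MDS and inspect each client's tids/caps
  ceph tell mds.slugfs.pr-md-01.xdtppo session ls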

[ceph-users] Re: Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler
Ok! Thank you. Is there a way to tell which client is slow? > On Mar 25, 2024, at 9:06 PM, David Yang wrote: > > It is recommended to disconnect the client first and then observe > whether the cluster's slow requests recover. > > Erich Weiler wrote on Tue, Mar 26, 2024 at 0

[ceph-users] Re: Clients failing to advance oldest client?

2024-03-26 Thread Erich Weiler
ersion you're on? -- *Dhairya Parmar* Associate Software Engineer, CephFS IBM, Inc. On Tue, Mar 26, 2024 at 2:32 AM Erich Weiler <wei...@soe.ucsc.edu> wrote: Hi Y'all, I'm seeing this warning via 'ceph -s' (this is on Reef):

[ceph-users] CephFS filesystem mount tanks on some nodes?

2024-03-26 Thread Erich Weiler
Hi All, We have a CephFS filesystem where we are running Reef on the servers (OSD/MDS/MGR/MON) and Quincy on the clients. Every once in a while, one of the clients will stop allowing access to my CephFS filesystem, the error being "permission denied" while trying to access the filesystem on tha

[ceph-users] MDS Behind on Trimming...

2024-03-27 Thread Erich Weiler
Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 MDSs report slow requests 1 MDSs behind on trimming
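A rough sketch of commands for looking into "behind on trimming" warnings like this one (the daemon name is an example from this thread, not a prescribed fix):

  # How many journal segments the MDS may keep before it must trim
  ceph config get mds mds_log_max_segments
  # Requests currently stuck on the MDS, which often hold up trimming
  ceph tell mds.slugfs.pr-md-01.xdtppo dump_blocked_ops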

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
21:28, Xiubo Li wrote: On 3/28/24 04:03, Erich Weiler wrote: Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
ls id=99445 Here is how to map inode ID to the path: ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler wrote: Here are some of the MDS logs: Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster) log [WRN] : slow request 511.7
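Putting the pieces together, a small sketch of the workflow described above (rank 0 and the inode ID are taken from the message itself):

  # Dump the slow/in-flight requests and note the inode IDs they reference
  ceph tell mds.0 dump_ops_in_flight
  # Map an inode ID to its path
  ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path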

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
n this will clear (I've done it before). But it just comes back. Often somewhere in the same directory /private/groups/shapirolab/brock/...[something]. -erich On 3/28/24 10:11 AM, Erich Weiler wrote: Here are some of the MDS logs: Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
: Hello Erich, Does the workload, by any chance, involve rsync? It is unfortunately well-known for triggering such issues. A workaround is to export the directory via NFS and run rsync against the NFS mount instead of directly against CephFS. On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote:

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Erich Weiler
rakov wrote: Hello Erich, Does the workload, by any chance, involve rsync? It is unfortunately well-known for triggering such issues. A workaround is to export the directory via NFS and run rsync against the NFS mount instead of directly against CephFS. On Fri, Mar 29, 2024 at 4:58 AM Erich Weile

[ceph-users] Multiple MDS Daemon needed?

2024-04-07 Thread Erich Weiler
Hi All, We have a slurm cluster with 25 clients, each with 256 cores, each mounting a cephfs filesystem as their main storage target. The workload can be heavy at times. We have two active MDS daemons and one standby. A lot of the time everything is healthy but we sometimes get warnings ab
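For context, the number of active MDS ranks and desired standbys is set per filesystem; a minimal sketch (the filesystem name is a placeholder):

  # Run two active MDS ranks for the filesystem
  ceph fs set cephfs max_mds 2
  # Ask for one standby to remain available for failover
  ceph fs set cephfs standby_count_wanted 1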

[ceph-users] Re: MDS Behind on Trimming...

2024-04-07 Thread Erich Weiler
Xiubo On 3/28/24 04:03, Erich Weiler wrote: Hi All, I've been battling this for a while and I'm not sure where to go from here. I have a Ceph health warning as such: # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 1 MDSs repor

[ceph-users] Re: MDS Behind on Trimming...

2024-04-07 Thread Erich Weiler
://tracker.ceph.com/issues/62123) as Xiubo suggested? Thanks again, Erich > On Apr 7, 2024, at 9:00 PM, Alexander E. Patrakov wrote: > > Hi Erich, > >> On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler wrote: >> >> Hi Xiubo, >> >>> Thanks for your logs, a

[ceph-users] Re: MDS Behind on Trimming...

2024-04-09 Thread Erich Weiler
Does that mean it could be the lock order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested? I have raised one PR to fix the lock order issue; if possible please give it a try to see whether it resolves this issue. Thank you! Yeah, this issue is happening every couple of days now. It

[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
I have raised one PR to fix the lock order issue; if possible please give it a try to see whether it resolves this issue. That's great! When do you think that will be available? Thank you! Yeah, this issue is happening every couple of days now. It just happened again today and I got more MDS dumps.

[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
I guess we are specifically using the "centos-ceph-reef" repository, and it looks like the latest version in that repo is 18.2.2-1.el9s. Will this fix appear in 18.2.2-2.el9s or something like that? I don't know how often the release cycle updates the repos...? On 4/11/

[ceph-users] Re: MDS Behind on Trimming...

2024-04-11 Thread Erich Weiler
Or... Maybe the fix will first appear in the "centos-ceph-reef-test" repo that I see? Is that how RedHat usually does it? On 4/11/24 10:30, Erich Weiler wrote: I guess we are specifically using the "centos-ceph-reef" repository, and it looks like the latest version in t

[ceph-users] How to make config changes stick for MDS?

2024-04-16 Thread Erich Weiler
Hi All, I'm having a crazy time getting config items to stick on my MDS daemons. I'm running Reef 18.2.1 on RHEL 9 and the daemons are running in podman; I used cephadm to deploy the daemons. I can adjust the config items at runtime, like so: ceph tell mds.slugfs.pr-md-01.xdtppo config set
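For readers with the same question, a hedged sketch of the usual distinction: 'ceph tell ... config set' only changes the running daemon, while values stored with 'ceph config set' persist across restarts (the option and value below are placeholders, not recommendations):

  # Persist a setting for all MDS daemons in the cluster config database
  ceph config set mds mds_cache_memory_limit 17179869184
  # Or scope it to one daemon (name taken from the message above)
  ceph config set mds.slugfs.pr-md-01.xdtppo mds_cache_memory_limit 17179869184
  # Check what the running daemon actually uses
  ceph config show mds.slugfs.pr-md-01.xdtppo mds_cache_memory_limit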

[ceph-users] Question about PR merge

2024-04-17 Thread Erich Weiler
Hello, We are tracking PR #56805: https://github.com/ceph/ceph/pull/56805 And the resolution of this item would potentially fix a pervasive and ongoing issue that needs daily attention in our cephfs cluster. I was wondering if it would be included in 18.2.3 which I *think* should be release

[ceph-users] Re: Question about PR merge

2024-04-17 Thread Erich Weiler
Have you already shared information about this issue? Please do if not. I am working with Xiubo Li and providing debugging information - in progress! I was wondering if it would be included in 18.2.3 which I *think* should be released soon? Is there any way of knowing if that is true? Thi

[ceph-users] Re: MDS Behind on Trimming...

2024-04-19 Thread Erich Weiler
't know. Or maybe it's the lock issue you've been working on. I guess I can test the lock order fix when it's available to test. -erich On 4/19/24 7:26 AM, Erich Weiler wrote: So I woke up this morning and checked the blocked_ops again, there were 150 of them. But the age

[ceph-users] Stuck in replay?

2024-04-22 Thread Erich Weiler
Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). I tried to restart one of the active daemons to unstick a bunch of blocked requests, and the standby went into 'replay' for a very long time, then RAM on that MDS serve

[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler
? The mds process is taking up 22GB right now and starting to swap my server, so maybe it somehow is too large. On 4/22/24 11:17 AM, Erich Weiler wrote: Hi All, We have a somewhat serious situation where we have a cephfs filesystem (18.2.1), and 2 active MDSs (one standby). I tried to rest

[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler
:37 AM, Sake Ceph wrote: Just a question: is it possible to block or disable all clients? Just to prevent load on the system. Kind regards, Sake On 22-04-2024 20:33 CEST, Erich Weiler wrote: I also see this from 'ceph health detail': # ceph health detail HEALTH_WARN 1 fil

[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Erich Weiler
no idea the MDS daemon could require that much RAM. -erich On 4/22/24 11:41 AM, Erich Weiler wrote: possibly, but it would be pretty time consuming and difficult... Is it maybe a RAM issue since my MDS RAM is filling up? Should I maybe bring up another MDS on another server with a huge amount of
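A short sketch of commands that can help keep an eye on MDS memory and replay state in a situation like this (informational only, not a fix):

  # Current cache memory target for the MDS daemons, in bytes
  ceph config get mds mds_cache_memory_limit
  # Watch MDS states (replay, resolve, active) and overall health
  ceph fs status
  ceph health detail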

[ceph-users] cache pressure?

2024-04-23 Thread Erich Weiler
So I'm trying to figure out ways to reduce the number of warnings I'm getting and I'm thinking about the one "client failing to respond to cache pressure". Is there maybe a way to tell a client (or all clients) to reduce the amount of cache it uses or to release caches quickly? Like, all the
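There are MDS-side knobs that influence how many caps clients hold and how aggressively the MDS recalls them; a hedged sketch (the values are placeholders, not tuning advice):

  # Upper bound on the caps a single client session may hold
  ceph config set mds mds_max_caps_per_client 500000
  # Maximum caps the MDS will ask a client to release per recall
  ceph config set mds mds_recall_max_caps 30000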

[ceph-users] Re: [EXTERN] cache pressure?

2024-04-26 Thread Erich Weiler
    "**/.conda/**": true,     "**/.local/**": true,     "**/.nextflow/**": true,     "**/work/**": true   } } ~/.vscode-server/data/Machine/settings.json To monitor and find processes with watcher you may use inotify-info <https://github.com/mikesart/i

[ceph-users] Re: cache pressure?

2024-04-26 Thread Erich Weiler
As Dietmar said, VS Code may cause this. Quite funny to read, actually, because we've been dealing with this issue for over a year, and yesterday was the very first time Ceph complained about a client and we saw VS Code's remote stuff running. Coincidence. I'm holding my breath that the vscode

[ceph-users] Re: [EXTERN] cache pressure?

2024-04-27 Thread Erich Weiler
odules/*/**": true, "**/.cache/**": true, "**/.conda/**": true, "**/.local/**": true, "**/.nextflow/**": true, "**/work/**": true, "**/cephfs/**": true } } On 4/27/24 12:24 AM, Dietmar Rieder wrote: Hi
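If the inotify-info tool mentioned earlier in the thread isn't handy, a rough shell equivalent for spotting which processes hold inotify instances (a sketch; needs root to see other users' processes):

  # Count inotify file descriptors per PID by inspecting /proc
  find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
    | cut -d/ -f3 | sort | uniq -c | sort -rn | head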

[ceph-users] Re: MDS Behind on Trimming...

2024-04-29 Thread Erich Weiler
. Thanks - Xiubo On 4/19/24 23:55, Erich Weiler wrote: Hi Xiubo, Never mind, I was wrong, most of the blocked ops were 12 hours old. Ugh. I restarted the MDS daemon to clear them. I just reset to having one active MDS instead of two; let's see if that makes a difference. I am beginning to t

[ceph-users] 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler
Hi All, For a while now I've been using 'ceph fs status' to show current MDS active servers, filesystem status, etc. I recently took down my MDS servers and added RAM to them (one by one, so the filesystem stayed online). After doing that with my four MDS servers (I had two active and two s

[ceph-users] Re: 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler
upgrades. I’ll have to check my notes to see if I wrote anything down for that. But try a mgr failover first, that could help. Quoting Erich Weiler: Hi All, For a while now I've been using 'ceph fs status' to show current MDS active servers, filesystem status, etc. I recently to
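Since 'ceph fs status' is served by the manager's status module, the mgr failover suggested above boils down to something like this (a sketch):

  # Confirm the status module is present/enabled
  ceph mgr module ls
  # Fail the active mgr so a standby takes over
  ceph mgr fail            # or: ceph mgr fail <active-mgr-name>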

[ceph-users] Re: 'ceph fs status' no longer works?

2024-05-02 Thread Erich Weiler
Message --- *Subject: *[ceph-users] Re: 'ceph fs status' no longer works? *From: *"Erich Weiler" <wei...@soe.ucsc.edu> *To: *"Eugen Block" <ebl...@nde.ag>, ceph-users@ceph.io *Date: *02-05-2024 21:05

[ceph-users] Re: [EXTERN] Re: cache pressure?

2024-05-07 Thread Erich Weiler
solved. -erich On 5/7/24 6:55 AM, Dietmar Rieder wrote: On 4/26/24 23:51, Erich Weiler wrote: As Dietmar said, VS Code may cause this. Quite funny to read, actually, because we've been dealing with this issue for over a year, and yesterday was the very first time Ceph complained abou

[ceph-users] Adding new OSDs - also adding PGs?

2024-06-04 Thread Erich Weiler
Hi All, I'm going to be adding a bunch of OSDs to our cephfs cluster shortly (increasing the total size by 50%). We're on reef, and will be deploying using the cephadm method, and the OSDs are exactly the same size and disk type as the current ones. So, after adding the new OSDs, my underst
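A short sketch of how PG counts are usually revisited after an expansion like this (the pool name is a placeholder; with the autoscaler on, no manual change may be needed):

  # See the autoscaler's view of each pool after the new OSDs join
  ceph osd pool autoscale-status
  # If managing pg_num by hand, raise it on the data pool
  ceph osd pool set cephfs_data_ec pg_num 512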