Hi Y'all,
We have a new ceph cluster online that looks like this:
md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs
The cephfs storage is using erasure coding at 4:2. The crush domain is
set t
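For reference, a sketch of how such a pool gets created (not necessarily the exact commands used here; with only two OSD hosts the 4:2 EC failure domain presumably has to be osd rather than host, and all names below are placeholders):

ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=osd
ceph osd pool create cephfs_data_ec erasure ec-4-2
ceph osd pool set cephfs_data_ec allow_ec_overwrites true   # required for CephFS on an EC pool
ceph fs add_data_pool <fs_name> cephfs_data_ec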
Hi Y'all,
I'm seeing this warning via 'ceph -s' (this is on Reef):
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
3 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming
Ok! Thank you. Is there a way to tell which client is slow?
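(For the archives, one way to narrow this down, with a placeholder MDS daemon name:)

ceph health detail                       # lists the session IDs behind the warning
ceph tell mds.<daemon-name> session ls   # shows each session's client_metadata (hostname etc.)

Matching the IDs from health detail against session ls usually points at the offending hosts.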
> On Mar 25, 2024, at 9:06 PM, David Yang wrote:
>
> It is recommended to disconnect the client first and then observe
> whether the cluster's slow requests recover.
>
> Erich Weiler wrote on Tue, Mar 26, 2024 at 0
version you're on?
--
*Dhairya Parmar*
Associate Software Engineer, CephFS
IBM, Inc.
On Tue, Mar 26, 2024 at 2:32 AM Erich Weiler <wei...@soe.ucsc.edu> wrote:
Hi Y'all,
I'm seeing this warning via 'ceph -s' (this is on Reef):
Hi All,
We have a CephFS filesystem where we are running Reef on the servers
(OSD/MDS/MGR/MON) and Quincy on the clients.
Every once in a while, one of the clients will stop allowing access to
my CephFS filesystem, the error being "permission denied" while trying to
access the filesystem on that client.
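One thing worth checking in that state (a sketch; the mount point is just an example) is whether the client was evicted and blocklisted, since an evicted kernel client typically returns "permission denied" until it is remounted:

ceph osd blocklist ls    # evicted clients are listed here by address
umount -f /mnt/cephfs    # on the affected host, then remount (fstab entry assumed)
mount /mnt/cephfs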
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning like this:
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 MDSs report slow requests
1 MDSs behind on trimming
21:28, Xiubo Li wrote:
On 3/28/24 04:03, Erich Weiler wrote:
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning like this:
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
ls id=99445
Here is how to map inode ID to the path:
ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path
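The inode numbers themselves come out of the blocked-op dumps, e.g. (daemon name is a placeholder):

ceph tell mds.<daemon-name> dump_blocked_ops     # blocked requests with their ages and inodes
ceph tell mds.<daemon-name> dump_ops_in_flight   # everything currently in flight

and each inode can then be fed into the 'dump inode' command above.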
On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler wrote:
Here are some of the MDS logs:
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 511.7
then this will clear (I've done it before). But
it just comes back. Often somewhere in the same directory
/private/groups/shapirolab/brock/...[something].
-erich
On 3/28/24 10:11 AM, Erich Weiler wrote:
Here are some of the MDS logs:
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 511.7
Hello Erich,
Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.
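With a cephadm-managed cluster the NFS re-export can be done with the built-in ganesha service, roughly like this (a sketch; cluster ID, placement, pseudo path and mount point are placeholders, and the exact flags can vary between releases):

ceph nfs cluster create mynfs "1 <host>"
ceph nfs export create cephfs --cluster-id mynfs --pseudo-path /export \
    --fsname <fs_name> --path /private/groups/shapirolab
mount -t nfs <host>:/export /mnt/nfs-export   # then point rsync at /mnt/nfs-export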
On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote:
Alexander E. Patrakov wrote:
Hello Erich,
Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.
On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote:
Hi All,
We have a slurm cluster with 25 clients, each with 256 cores, each
mounting a cephfs filesystem as their main storage target. The workload
can be heavy at times.
We have two active MDS daemons and one standby. A lot of the time
everything is healthy but we sometimes get warnings ab
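(For context, the two-active layout is just the standard max_mds setting, e.g. with a placeholder filesystem name:)

ceph fs set <fs_name> max_mds 2   # two active ranks; remaining daemons stay standby
ceph fs status                    # shows which daemons hold ranks 0 and 1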
Xiubo
On 3/28/24 04:03, Erich Weiler wrote:
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning like this:
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 MDSs report slow requests
Does that mean it could be the lock order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested?
Thanks again,
Erich
> On Apr 7, 2024, at 9:00 PM, Alexander E. Patrakov wrote:
>
> Hi Erich,
>
>> On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler wrote:
>>
>> Hi Xiubo,
>>
>>> Thanks for your logs, a
Does that mean it could be the lock order bug
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?
I have raised a PR to fix the lock order issue; if possible, please
give it a try and see whether it resolves this issue.
Thank you! Yeah, this issue is happening every couple days now. It
I have raised a PR to fix the lock order issue; if possible, please
give it a try and see whether it resolves this issue.
That's great! When do you think that will be available?
Thank you! Yeah, this issue is happening every couple days now. It
just happened again today and I got more MDS dumps.
I guess we are specifically using the "centos-ceph-reef" repository, and
it looks like the latest version in that repo is 18.2.2-1.el9s. Will
this fix appear in 18.2.2-2.el9s or something like that? I don't know
how often the release cycle updates the repos...?
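(For what it's worth, the versions a repo currently carries can be checked directly:)

dnf repolist | grep -i ceph             # confirm which ceph repos are enabled
dnf --showduplicates list ceph-common   # list every version those repos offer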
On 4/11/
Or... Maybe the fix will first appear in the "centos-ceph-reef-test"
repo that I see? Is that how Red Hat usually does it?
On 4/11/24 10:30, Erich Weiler wrote:
I guess we are specifically using the "centos-ceph-reef" repository, and
it looks like the latest version in that repo is 18.2.2-1.el9s.
Hi All,
I'm having a crazy time getting config items to stick on my MDS daemons.
I'm running Reef 18.2.1 on RHEL 9 and the daemons are running in
podman; I used cephadm to deploy the daemons.
I can adjust the config items at runtime, like so:
ceph tell mds.slugfs.pr-md-01.xdtppo config set
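For reference, the difference between the two paths (a sketch; the option and value below are only examples, not the setting actually being changed): 'ceph tell ... config set' is a runtime override on one daemon and is lost when the container restarts, whereas 'ceph config set' stores the value in the mon config database and survives restarts.

ceph tell mds.slugfs.pr-md-01.xdtppo config set mds_cache_memory_limit 17179869184   # runtime only
ceph config set mds mds_cache_memory_limit 17179869184   # persistent, applies to all MDS daemons
ceph config get mds mds_cache_memory_limit               # verify what the config store holds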
Hello,
We are tracking PR #56805:
https://github.com/ceph/ceph/pull/56805
And the resolution of this item would potentially fix a pervasive and
ongoing issue that needs daily attention in our cephfs cluster. I was
wondering if it would be included in 18.2.3, which I *think* should be
released soon?
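(Once the release is tagged, one way to check for yourself, assuming a local clone of the ceph repo; the SHA is a placeholder for the PR's merge commit:)

git clone https://github.com/ceph/ceph.git && cd ceph
git tag --contains <merge-commit-sha> | grep '^v18'   # every v18.x tag that already contains the fix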
Have you already shared information about this issue? Please do if not.
I am working with Xiubo Li and providing debugging information - in
progress!
I was
wondering if it would be included in 18.2.3, which I *think* should be
released soon? Is there any way of knowing if that is true?
Thi
't know.
Or maybe it's the lock issue you've been working on. I guess I can test
the lock order fix when it's available to test.
-erich
On 4/19/24 7:26 AM, Erich Weiler wrote:
So I woke up this morning and checked the blocked_ops again; there were
150 of them. But the age
Hi All,
We have a somewhat serious situation where we have a cephfs filesystem
(18.2.1), and 2 active MDSs (one standby). I tried to restart one of
the active daemons to unstick a bunch of blocked requests, and the
standby went into 'replay' for a very long time, then RAM on that MDS
serve
? The mds process is taking up 22GB right now and starting to push my
server into swap, so maybe it has somehow grown too large.
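(For reference, the knob that bounds this is mds_cache_memory_limit; it is a target rather than a hard cap, so the process can still overshoot it. A sketch, with an example value and a placeholder daemon name:)

ceph config set mds mds_cache_memory_limit 8589934592   # e.g. an 8 GiB cache target
ceph tell mds.<daemon-name> cache status                # current cache usage vs. the limit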
On 4/22/24 11:17 AM, Erich Weiler wrote:
Hi All,
We have a somewhat serious situation where we have a cephfs filesystem
(18.2.1), and 2 active MDSs (one standby). I tried to rest
:37 AM, Sake Ceph wrote:
Just a question: is it possible to block or disable all clients? Just to
prevent load on the system.
Kind regards,
Sake
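(A couple of options that exist for this, as a sketch; the filesystem name is a placeholder and refuse_client_session assumes Reef:)

ceph fs set <fs_name> refuse_client_session true   # reject new client sessions for now
ceph tell mds.0 client ls                          # list current sessions
ceph tell mds.0 client evict id=<session-id>       # evict a specific client
ceph fs set <fs_name> down true                    # heavier hammer: take the filesystem offline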
On 22-04-2024 20:33 CEST, Erich Weiler wrote:
I also see this from 'ceph health detail':
# ceph health detail
HEALTH_WARN 1 fil
no idea the MDS daemon could require that much RAM.
-erich
On 4/22/24 11:41 AM, Erich Weiler wrote:
Possibly, but it would be pretty time-consuming and difficult...
Is it maybe a RAM issue, since my MDS RAM is filling up? Should I maybe
bring up another MDS on another server with a huge amount of
So I'm trying to figure out ways to reduce the number of warnings I'm
getting and I'm thinking about the one "client failing to respond to
cache pressure".
Is there maybe a way to tell a client (or all clients) to reduce the
amount of cache it uses or to release caches quickly? Like, all the
"**/.conda/**": true,
"**/.local/**": true,
"**/.nextflow/**": true,
"**/work/**": true
}
}
~/.vscode-server/data/Machine/settings.json
To monitor and find the processes holding watchers you may use inotify-info
(https://github.com/mikesart/inotify-info).
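Along the same lines, the MDS can report which sessions hold the most caps, which is usually the client behind the cache-pressure warning (a sketch; daemon name is a placeholder and field names are as of Reef):

ceph tell mds.<daemon-name> session ls | \
  jq -r '.[] | "\(.num_caps)\t\(.id)\t\(.client_metadata.hostname // "?")"' | sort -rn | head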
As Dietmar said, VS Code may cause this. Quite funny to read, actually,
because we've been dealing with this issue for over a year, and
yesterday was the very first time Ceph complained about a client and we
saw VS Code's remote stuff running. Coincidence.
I'm holding my breath that the vscode
"**/node_modules/*/**": true,
"**/.cache/**": true,
"**/.conda/**": true,
"**/.local/**": true,
"**/.nextflow/**": true,
"**/work/**": true,
"**/cephfs/**": true
}
}
On 4/27/24 12:24 AM, Dietmar Rieder wrote:
Hi
Thanks
- Xiubo
On 4/19/24 23:55, Erich Weiler wrote:
Hi Xiubo,
Never mind, I was wrong; most of the blocked ops were 12 hours old. Ugh.
I restarted the MDS daemon to clear them.
I just reset to having one active MDS instead of two, let's see if
that makes a difference.
I am beginning to t
Hi All,
For a while now I've been using 'ceph fs status' to show current MDS
active servers, filesystem status, etc. I recently took down my MDS
servers and added RAM to them (one by one, so the filesystem stayed
online). After doing that with my four MDS servers (I had two active
and two s
upgrades. I’ll have to check my
notes to see if I wrote anything down for that. But try a mgr failover
first; that could help.
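For reference, a mgr failover is just (no argument needed; the active mgr steps down and a standby takes over):

ceph mgr fail     # fail over to a standby mgr
ceph mgr stat     # confirm which mgr is active now
ceph fs status    # then retry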
Quoting Erich Weiler:
Hi All,
For a while now I've been using 'ceph fs status' to show current MDS
active servers, filesystem status, etc. I recently to
Message ---
*Subject: *[ceph-users] Re: 'ceph fs status' no longer works?
*From: *"Erich Weiler" <wei...@soe.ucsc.edu>
*To: *"Eugen Block" <ebl...@nde.ag>,
ceph-users@ceph.io
*Date: *02-05-2024 21:05
solved.
-erich
On 5/7/24 6:55 AM, Dietmar Rieder wrote:
On 4/26/24 23:51, Erich Weiler wrote:
As Dietmar said, VS Code may cause this. Quite funny to read,
actually, because we've been dealing with this issue for over a year,
and yesterday was the very first time Ceph complained about a client
and we saw VS Code's remote stuff running.
Hi All,
I'm going to be adding a bunch of OSDs to our cephfs cluster shortly
(increasing the total size by 50%). We're on Reef, and will be
deploying using the cephadm method, and the OSDs are exactly the same
size and disk type as the current ones.
So, after adding the new OSDs, my underst
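(For what it's worth, a sketch of how the expansion can be staged so backfill stays controlled; host and device names are placeholders, and under Reef's mClock scheduler the backfill limit is only honored with the override flag set:)

ceph osd set norebalance                         # optional: hold rebalancing while OSDs are created
ceph orch daemon add osd store-03:/dev/nvme0n1   # repeat per device, or use an OSD service spec
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 1          # throttle backfill
ceph osd unset norebalance                       # let the data move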