Hello
We are experiencing an issue where our ceph MDS gobbles up 500G of RAM, gets
OOM-killed by the kernel, restarts, and repeats. We have 3 MDS daemons on different
machines, and all are exhibiting this behavior. We are running the following
versions (from Docker):
* ceph/daemon:v3.2.1-stable-3
We decided to go ahead and try truncating the journal, but before we did, we
wanted to back it up. However, there are ridiculous values in the header. It
can't write a journal this large because (I presume) my ext4 filesystem can't
seek to this position in the (sparse) file.
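For reference, the journal inspection and backup attempt looks roughly like this (a sketch; the filesystem name `cephfs` and rank `0` are assumptions, check yours with `ceph fs ls`):

```shell
# Inspect the journal header first -- this is where the bogus
# (ridiculously large) offset values showed up for us.
cephfs-journal-tool --rank=cephfs:0 header get

# Export the journal to a file before doing anything destructive.
# With a corrupt header the export may fail or try to produce a huge
# sparse file, which is what we hit on ext4.
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
```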
I would not be
* Mark rank 0 MDS as failed
* Reset the FS (yes, I really mean it)
* Restart MDSes
* Finally get some sleep
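The steps above, expressed as commands (a sketch from memory; the filesystem name `cephfs` is an assumption, and `fs reset` is a last-resort disaster-recovery step that discards MDS map state, so don't run this casually):

```shell
# Truncate the unrecoverable journal.
cephfs-journal-tool --rank=cephfs:0 journal reset

# Mark rank 0 MDS as failed, then reset the filesystem map.
ceph mds fail 0
ceph fs reset cephfs --yes-i-really-mean-it

# Restart the MDS daemons (in our case, the Docker containers running
# ceph/daemon) and watch the cluster come back.
ceph -s
```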
If anybody has any idea what may have caused this situation, I am keenly
interested. If not, hopefully I at least helped someone else.
____
From: Pickett, Neale T
ith versions, we'll update all
those to 12.2.10 today :)
From: Yan, Zheng
Sent: Tuesday, April 2, 2019 20:26
To: Sergey Malinin
Cc: Pickett, Neale T; ceph-users
Subject: Re: [ceph-users] MDS allocates all memory (>500G) replaying,
OOM-killed, repeat
Hello, ceph-users.
Our mds servers keep segfaulting from a failed assertion, and for the first
time I can't find anyone else who's posted about this problem. None of them are
able to stay up, so our cephfs is down.
We recently had to truncate the journal log after an upgrade to nautilus, and
I have created an anonymized crash log at
https://pastebin.ubuntu.com/p/YsVXQQTBCM/ in the hopes that it can help someone
understand what's leading to our MDS outage.
Thanks in advance for any assistance.
From: Pickett, Neale T
Sent: Thursday, Octob
Last week I asked about a rogue inode that was causing ceph-mds to segfault
during replay. We didn't get any suggestions from this list, so we have been
familiarizing ourselves with the ceph source code, and have added the following
patch:
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -7
like an inode problem to me, but I have completely run out of
ideas, so I will do nothing more to ceph as I anxiously hope I am not fired for
this 14-days-and-counting outage while awaiting a reply from the list.
Thank you very much!
Neale
From: Patrick Donnelly
) one. And somehow
it can handle hard links, possibly (we don't have many, or any, of these).
Thanks very much for your help. This has been fascinating.
Neale
From: Patrick Donnelly
Sent: Monday, October 28, 2019 12:52
To: Pickett, Neale T
Cc: ceph-users