I had two MDS nodes. One was still active, but the other was stuck rejoining, which already caused the FS to hang (i.e. it was down, yes). Since at first I thought this was the old cache size bug, I deleted the open files objects, and when that didn't seem to have an effect, I tried restarting the
Quoting Janek Bevendorff (janek.bevendo...@uni-weimar.de):
> Update: turns out I just had to wait for an hour. The MDSs were sending
> Beacons regularly, so the MONs didn't try to kill them and instead let
> them finish doing whatever they were doing.
>
> Unlike the other bug where the number of o
Update: turns out I just had to wait for an hour. The MDSs were sending
Beacons regularly, so the MONs didn't try to kill them and instead let
them finish doing whatever they were doing.
Unlike the other bug where the number of open files outgrows what the
MDS can handle, this incident allowed "se
Hi, my MDS failed again, but this time I cannot recover it by deleting
the mds*_openfiles.0 object. The startup behaviour is also different.
Both inode count and cache size stay at zero while the MDS is replaying.
When I set the MDS log level to 7, I get tons of these messages:
2020-01-06 11:59:
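For reference, the commands involved look roughly like this; the metadata
pool name "cephfs_metadata" and the daemon name "mds.node1" are
placeholders, not necessarily what my cluster uses:

  # raise the log level of a running MDS to 7
  ceph tell mds.node1 config set debug_mds 7
  # list the per-rank open file table objects in the metadata pool
  rados -p cephfs_metadata ls | grep openfiles
  # delete the rank-0 open file table object (only while that rank is down)
  rados -p cephfs_metadata rm mds0_openfiles.0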
Have you already tried to adjust "mds_cache_memory_limit" and/or run
"ceph tell mds.* cache drop"? I really wonder how the MDS copes with
millions of caps.
I played with the cache size, yeah. I kind of need a large cache,
otherwise everything is just slow and I'm constantly getting cac
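For completeness, the two knobs from the previous mail translate to
roughly the following at runtime (the 64 GiB value is only an example,
not a recommendation):

  # set the MDS cache memory limit, value in bytes (here 64 GiB)
  ceph config set mds mds_cache_memory_limit 68719476736
  # ask all MDS daemons to drop/trim their cache
  ceph tell mds.* cache drop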
Hi Janek,
Quoting Janek Bevendorff (janek.bevendo...@uni-weimar.de):
> Hey Patrick,
>
> I just wanted to give you some feedback about how 14.2.5 is working for me.
> I've had the chance to test it for a day now and overall, the experience is
> much better, although not perfect (perhaps far from i
Hey Patrick,
I just wanted to give you some feedback about how 14.2.5 is working for
me. I've had the chance to test it for a day now and overall, the
experience is much better, although not perfect (perhaps far from it).
I have two active MDS (I figured that'd spread the metadata load a
li
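Side note: running two active ranks is just a matter of raising max_mds;
"cephfs" below is a placeholder for the actual filesystem name:

  ceph fs set cephfs max_mds 2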
> You set mds_beacon_grace ?
Yes, as I said. It seemed to have no effect, or at least none that I
could see. The kick timeout seemed random after all. I even set it to
something ridiculous like 1800 and the MDSs were still timed out.
Sometimes they got to 20M inodes, sometimes only to a few 100k.
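For reference, this is roughly how I applied it (1800 is the value
mentioned above, in seconds; I set it globally so that both the MONs and
the MDSs see the same number, though I am not sure that is strictly
required):

  # grace period before a laggy MDS is considered failed and replaced
  ceph config set global mds_beacon_grace 1800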
On Thu, Dec 5, 2019 at 10:31 AM Janek Bevendorff wrote:
>
> I had similar issues again today. Some users were trying to train a
> neural network on several million files, resulting in enormous cache
> sizes. Due to my custom cap recall and decay rate settings, the MDSs
> were able to withstand the
I had similar issues again today. Some users were trying to train a
neural network on several million files, resulting in enormous cache
sizes. Due to my custom cap recall and decay rate settings, the MDSs
were able to withstand the load for quite some time, but at some point
the active rank crashed
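The "custom cap recall and decay rate settings" are along these lines;
the exact numbers below are examples rather than the values I actually
run with:

  # let the MDS recall more caps per client and decay the recall
  # counter faster than the defaults
  ceph config set mds mds_recall_max_caps 10000
  ceph config set mds mds_recall_max_decay_rate 1.0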
The fix has been merged into master and will be backported soon.
Amazing, thanks!
I've also done testing in a large cluster to confirm the issue you found.
Using multiple processes to create files as fast as possible in a
single client reliably reproduced the issue. The MDS cannot recall
c
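A minimal reproducer for that kind of load from a single client could
look something like this (a sketch, assuming the file system is mounted
at /mnt/cephfs):

  # 16 parallel writers, each creating 100k small files in its own directory
  for i in $(seq 16); do
    mkdir -p /mnt/cephfs/stress/$i
    ( cd /mnt/cephfs/stress/$i && for j in $(seq 100000); do touch f$j; done ) &
  done
  wait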
Hi Janek,
On Tue, Aug 6, 2019 at 11:25 AM Janek Bevendorff wrote:
> > Here are tracker tickets to resolve the issues you encountered:
> >
> > https://tracker.ceph.com/issues/41140
> > https://tracker.ceph.com/issues/41141
The fix has been merged into master and will be backported soon. I've also
I've been copying happily for days now (not very fast, but the MDSs were
stable), but eventually the MDSs started flapping again due to large
cache sizes (they are being killed after 11M inodes). I could solve the
problem by temporarily increasing the cache size in order to allow them
to rejoin,
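The temporary workaround looks roughly like this (the numbers are
examples only, "mds.node1" is again a placeholder, and I revert the
limit once the rank is active and stable again):

  # watch replay/rejoin progress and cache usage (run on the MDS host)
  ceph fs status
  ceph daemon mds.node1 cache status
  # temporarily raise the cache limit so rejoin can finish (96 GiB)
  ceph config set mds mds_cache_memory_limit 103079215104
  # ...and put it back down afterwards (e.g. 16 GiB)
  ceph config set mds mds_cache_memory_limit 17179869184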