I second that you do not have nearly enough RAM in these servers, and I don't think you have at least 72 CPU cores either, which means you again fall short of the minimum recommendation for the number of OSDs you have, let alone everything else. I would suggest you start by moving your MDS daemons off of these nodes, as they'll be the most memory-hungry and problematic of the remaining services. It would probably also make sense to move the mon and mgr daemons to the new host as well.
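For anyone wanting to sanity-check their own sizing, a rough back-of-the-envelope sketch follows. The per-daemon figures are my assumptions (the BlueStore osd_memory_target default of 4 GiB per OSD, the 1 GiB mds_cache_memory_limit default plus overhead during recovery, a couple of GiB for mon/mgr), and the OSD count is hypothetical; plug in your actual numbers.

# Rough per-host memory budget for colocated Ceph daemons (sketch only).
osds_per_host = 24                 # hypothetical - use your actual count
osd_gib = osds_per_host * 4        # ~4 GiB per BlueStore OSD (osd_memory_target default)
mds_gib = 1 + 2                    # mds_cache_memory_limit default plus rejoin/recovery overhead
mon_gib = 2                        # rough allowance for a mon
mgr_gib = 1                        # rough allowance for a mgr

total_gib = osd_gib + mds_gib + mon_gib + mgr_gib
print(f"OSDs:        {osd_gib} GiB")
print(f"MDS:         {mds_gib} GiB")
print(f"mon + mgr:   {mon_gib + mgr_gib} GiB")
print(f"Recommended: {total_gib} GiB RAM, plus headroom for recovery spikes")

With the hypothetical 24 OSDs that already comes to roughly 102 GiB before any headroom, which is why colocating the MDS on the same box tends to hurt first.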
On Sun, Aug 19, 2018, 8:01 AM Christian Wuerdig <christian.wuer...@gmail.com> wrote:

> It should be added though that you're running at only 1/3 of the
> recommended RAM usage for the OSD setup alone - not to mention that
> you also co-host MON, MGR and MDS daemons on there. The next time you
> run into an issue - in particular with OSD recovery - you may be in a
> pickle again and then it might not be so easy to get going.
>
> On Fri, 17 Aug 2018 at 02:48, Jonathan Woytek <woy...@dryrose.com> wrote:
> >
> > On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum <gfar...@redhat.com> wrote:
> > > Do note that while this works and is unlikely to break anything, it's
> > > not entirely ideal. The MDS was trying to probe the size and mtime of
> > > any files which were opened by clients that have since disappeared. By
> > > removing that list of open files, it can't do that any more, so you
> > > may have some inaccurate metadata about individual file sizes or
> > > mtimes.
> >
> > Understood, and thank you for the additional details. However, when
> > the difference is having a working filesystem, or having a filesystem
> > permanently down because the ceph-mds rejoin is impossible to
> > complete, I'll accept the risk involved. I'd prefer to see the rejoin
> > process able to proceed without chewing up memory until the machine
> > deadlocks on itself, but I don't yet know enough about the internals
> > of the rejoin process to even attempt to comment on how that could be
> > done. Ideally, it seems like flushing the current recovery/rejoin
> > status periodically and monitoring memory usage during recovery would
> > help to fix the problem. From what I could see, ceph-mds just
> > continued to allocate memory as it processed every open handle, and
> > never released any of it until it was killed.
> >
> > jonathan
> > --
> > Jonathan Woytek
> > http://www.dryrose.com
> > KB3HOZ
> > PGP: 462C 5F50 144D 6B09 3B65 FCE8 C1DC DEC4 E8B6 AABC
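On Jonathan's point about monitoring memory usage during recovery: short of changes inside the MDS itself, an external watcher is easy to bolt on. Below is a minimal standalone sketch (nothing Ceph-internal) that polls the ceph-mds resident set from /proc and flags it before the box starts swapping; the pgrep lookup and the threshold are assumptions to adjust for your host.

#!/usr/bin/env python3
# Sketch only: watch ceph-mds RSS during rejoin and warn near a chosen limit.
import re
import subprocess
import time

WARN_BYTES = 48 * 1024**3   # hypothetical threshold - tune to your RAM
POLL_SECONDS = 30

def mds_pid():
    # pgrep -o returns the oldest matching PID, if any.
    out = subprocess.run(["pgrep", "-o", "ceph-mds"], capture_output=True, text=True)
    return int(out.stdout.strip()) if out.returncode == 0 else None

def rss_bytes(pid):
    # VmRSS in /proc/<pid>/status is reported in kB.
    with open(f"/proc/{pid}/status") as f:
        m = re.search(r"VmRSS:\s+(\d+)\s+kB", f.read())
    return int(m.group(1)) * 1024 if m else 0

while True:
    pid = mds_pid()
    if pid is None:
        print("no ceph-mds process found")
    else:
        rss = rss_bytes(pid)
        flag = "  <-- approaching limit" if rss > WARN_BYTES else ""
        print(f"ceph-mds pid {pid}: RSS {rss / 1024**3:.1f} GiB{flag}")
    time.sleep(POLL_SECONDS)

It won't stop the rejoin from ballooning, but it at least gives you a chance to intervene before the machine deadlocks on itself.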
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com