Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread Jonathan Woytek
On Sun, Aug 19, 2018 at 9:29 AM David Turner wrote: > I second that you do not have nearly enough RAM in these servers and I > don't think you have at least 72 CPU cores either which means you again don't > have the minimum recommendation for the number of OSDs you have, let alone > everything else. I

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread David Turner
I second that you do not have nearly enough RAM in these servers and I don't think you have at least 72 CPU cores either which means you again don't have the minimum recommendation for the number of OSDs you have, let alone everything else. I would suggest you start by moving your MDS daemons off of the

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread Christian Wuerdig
It should be added though that you're running at only 1/3 of the recommended RAM for the OSD setup alone - not to mention that you also co-host MON, MGR and MDS daemons on there. The next time you run into an issue - in particular with OSD recovery - you may be in a pickle again and then it m

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Jonathan Woytek
On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum wrote: > Do note that while this works and is unlikely to break anything, it's > not entirely ideal. The MDS was trying to probe the size and mtime of > any files which were opened by clients that have since disappeared. By > removing that list of o

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Gregory Farnum
On Thu, Aug 16, 2018 at 8:58 AM, Jonathan Woytek wrote: > This did the trick! THANK YOU! > > After starting with the mds_wipe_sessions set and after removing the > mds*_openfiles.0 entries in the metadata pool, mds started almost > immediately and went to active. I verified that the filesystem cou

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Jonathan Woytek
This did the trick! THANK YOU! After starting with the mds_wipe_sessions set and after removing the mds*_openfiles.0 entries in the metadata pool, mds started almost immediately and went to active. I verified that the filesystem could mount again, shut down mds, removed the wipe sessions setting,

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 11:02 PM Yan, Zheng wrote: > On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek > wrote: > > > > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic > (stable) > > > > > > Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have > multiple acti

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek wrote: > > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable) > > Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have multiple active mds) from the metadata pool of your filesystem. Records in these files a
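For anyone following along, the object names in the suggestion above can be sketched as a dry run. This is only an illustration, not something posted in the thread: the pool name `cephfs_metadata` and the rank count are placeholders you would replace with your own values, and the script only prints `rados` commands rather than running them. Note that Gregory Farnum's later reply in this thread says removing these objects works but is not entirely ideal.

```shell
#!/bin/sh
# Dry-run sketch: print the rados commands that would remove the per-rank
# open-file table objects. Nothing is deleted by this script itself.

# ASSUMPTION: both values are placeholders, not taken from the thread.
METADATA_POOL="${METADATA_POOL:-cephfs_metadata}"
ACTIVE_MDS_RANKS="${ACTIVE_MDS_RANKS:-1}"

# Emit one object name per active rank: mds0_openfiles.0, mds1_openfiles.0, ...
openfiles_objects() {
    ranks="$1"
    rank=0
    while [ "$rank" -lt "$ranks" ]; do
        printf 'mds%s_openfiles.0\n' "$rank"
        rank=$((rank + 1))
    done
}

# Print (do not execute) one removal command per object.
print_rm_commands() {
    pool="$1"
    ranks="$2"
    openfiles_objects "$ranks" | while read -r obj; do
        printf 'rados -p %s rm %s\n' "$pool" "$obj"
    done
}

print_rm_commands "$METADATA_POOL" "$ACTIVE_MDS_RANKS"
```

Printing the commands first gives you a chance to check the pool and object names (for example against `rados -p <pool> ls`) before removing anything.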

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable) On Wed, Aug 15, 2018 at 10:51 PM, Yan, Zheng wrote: > On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek wrote: >> >> Actually, I missed it--I do see the wipe start, wipe done in the log. >> However, it is still doing v

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek wrote: > > Actually, I missed it--I do see the wipe start, wipe done in the log. > However, it is still doing verify_diri_backtrace, as described > previously. > which version of mds do you use? > jonathan > > On Wed, Aug 15, 2018 at 10:42 PM, Jon

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Actually, I missed it--I do see the wipe start, wipe done in the log. However, it is still doing verify_diri_backtrace, as described previously. jonathan On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek wrote: > On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng wrote: >> How many client reconnected

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng wrote: > How many clients reconnected when mds restarts? The issue is likely > because reconnected clients held too many inodes, mds was opening > these inodes in rejoin state. Try starting mds with option > mds_wipe_sessions = true. The option makes m
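As a rough illustration of the quoted suggestion, the option could be expressed as a ceph.conf fragment. The sketch below only prints such a fragment. Placing it under an `[mds]` section is my assumption, and the helper name `wipe_sessions_fragment` is made up for this example. A later message in the thread mentions removing the setting again once the mds had gone active.

```shell
#!/bin/sh
# Sketch: print a ceph.conf fragment that sets mds_wipe_sessions.
# ASSUMPTION: the [mds] section placement and this helper name are
# illustrative; the thread only names the option and its value.
wipe_sessions_fragment() {
    printf '[mds]\nmds_wipe_sessions = %s\n' "$1"
}

# Print the fragment for review; nothing is written to any config file.
wipe_sessions_fragment true
```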

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Wed, Aug 15, 2018 at 11:44 PM Jonathan Woytek wrote: > > Hi list people. I was asking a few of these questions in IRC, too, but > figured maybe a wider audience could see something that I'm missing. > > I'm running a four-node cluster with cephfs and the kernel-mode driver as the > primary ac

[ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Hi list people. I was asking a few of these questions in IRC, too, but figured maybe a wider audience could see something that I'm missing. I'm running a four-node cluster with cephfs and the kernel-mode driver as the primary access method. Each node has 72 * 10TB OSDs, for a total of 288 OSDs. Ea