Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread Jonathan Woytek
On Sun, Aug 19, 2018 at 9:29 AM David Turner wrote: > I second that you do not have nearly enough RAM in these servers and I > don't think you have at least 72 CPU cores either, which means you again don't > have the minimum recommendation for the number of OSDs you have, let alone > everything else. I

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread David Turner
I second that you do not have nearly enough RAM in these servers and I don't think you have at least 72 CPU cores either, which means you again don't have the minimum recommendation for the number of OSDs you have, let alone everything else. I would suggest you start by moving your MDS daemons off of the

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-19 Thread Christian Wuerdig
It should be added though that you're running at only 1/3 of the recommended RAM usage for the OSD setup alone - not to mention that you also co-host MON, MGR and MDS daemons on there. The next time you run into an issue - in particular with OSD recovery - you may be in a pickle again and then it m

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Jonathan Woytek
On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum wrote: > Do note that while this works and is unlikely to break anything, it's > not entirely ideal. The MDS was trying to probe the size and mtime of > any files which were opened by clients that have since disappeared. By > removing that list of o

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Gregory Farnum
On Thu, Aug 16, 2018 at 8:58 AM, Jonathan Woytek wrote: > This did the trick! THANK YOU! > > After starting with the mds_wipe_sessions set and after removing the > mds*_openfiles.0 entries in the metadata pool, mds started almost > immediately and went to active. I verified that the filesystem cou

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-16 Thread Jonathan Woytek
This did the trick! THANK YOU! After starting with the mds_wipe_sessions set and after removing the mds*_openfiles.0 entries in the metadata pool, mds started almost immediately and went to active. I verified that the filesystem could mount again, shut down mds, removed the wipe sessions setting,

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 11:02 PM Yan, Zheng wrote: > On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek > wrote: > > > > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic > (stable) > > > > > > Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have > multiple acti

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek wrote: > > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable) > > Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have multiple active mds) from metadata pool of your filesystem. Records in these files a
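
A minimal sketch of the suggestion above, for readers skimming this thread: it only builds the `rados rm` command lines for the per-rank `mdsN_openfiles.0` objects and prints them, without running anything. The pool name `cephfs_metadata`, the rank count of 2, and the helper name `openfiles_removal_commands` are assumptions for illustration and do not come from the thread; check the actual metadata pool name and number of active MDS ranks on your own cluster first.

```python
import shlex


def openfiles_removal_commands(metadata_pool, active_ranks):
    """Build one `rados rm` argv per active MDS rank (nothing is executed).

    metadata_pool: name of the filesystem's metadata pool (assumed by caller).
    active_ranks:  number of active MDS ranks (rank 0 -> mds0_openfiles.0, ...).
    """
    commands = []
    for rank in range(active_ranks):
        obj = f"mds{rank}_openfiles.0"
        commands.append(["rados", "-p", metadata_pool, "rm", obj])
    return commands


if __name__ == "__main__":
    # Hypothetical values: pool "cephfs_metadata", two active MDS ranks.
    for argv in openfiles_removal_commands("cephfs_metadata", 2):
        # Print the command for review instead of running it.
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: as noted later in the thread, removing these objects works but is not entirely ideal, so the commands are worth reviewing by hand before they are run with the MDS stopped.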

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable) On Wed, Aug 15, 2018 at 10:51 PM, Yan, Zheng wrote: > On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek wrote: >> >> Actually, I missed it--I do see the wipe start, wipe done in the log. >> However, it is still doing v

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek wrote: > > Actually, I missed it--I do see the wipe start, wipe done in the log. > However, it is still doing verify_diri_backtrace, as described > previously. > which version of mds do you use? > jonathan > > On Wed, Aug 15, 2018 at 10:42 PM, Jon

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Actually, I missed it--I do see the wipe start, wipe done in the log. However, it is still doing verify_diri_backtrace, as described previously. jonathan On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek wrote: > On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng wrote: >> How many client reconnected

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng wrote: > How many clients reconnected when mds restarts? The issue is likely > because reconnected clients held too many inodes, mds was opening > these inodes in rejoin state. Try starting mds with option > mds_wipe_sessions = true. The option makes m

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Wed, Aug 15, 2018 at 11:44 PM Jonathan Woytek wrote: > > Hi list people. I was asking a few of these questions in IRC, too, but > figured maybe a wider audience could see something that I'm missing. > > I'm running a four-node cluster with cephfs and the kernel-mode driver as the > primary ac

[ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Hi list people. I was asking a few of these questions in IRC, too, but figured maybe a wider audience could see something that I'm missing. I'm running a four-node cluster with cephfs and the kernel-mode driver as the primary access method. Each node has 72 * 10TB OSDs, for a total of 288 OSDs. Ea

Re: [ceph-users] mds stuck in rejoin

2013-09-16 Thread Gregory Farnum
Awesome, glad a simple upgrade fixed it for you. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 16, 2013 at 6:18 AM, Serge Slipchenko wrote: > Hi, > > Digging the web I have found similar symptoms > http://tracker.ceph.com/issues/6087 > I have found that my cep

Re: [ceph-users] mds stuck in rejoin

2013-09-16 Thread Serge Slipchenko
Hi, Digging the web I have found similar symptoms http://tracker.ceph.com/issues/6087 I have found that my ceph-mds isn't updated and still is 0.67.2 that doesn't have MDS patch. After update to 0.67.3 MDS stabilized. I am terribly sorry, but I hope that my bad experience will help someone. On M

Re: [ceph-users] mds stuck in rejoin

2013-09-15 Thread Gregory Farnum
What's the output of "ceph -s", and have you tried running the MDS with any logging enabled that we can check out? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Sun, Sep 15, 2013 at 8:24 AM, Serge Slipchenko wrote: > Hi, > > I'm testing ceph 0.67.3 (408cd61584c72c0d97b774

[ceph-users] mds stuck in rejoin

2013-09-15 Thread Serge Slipchenko
Hi, I'm testing ceph 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a) under load. My configuration has 2 mds, 3 mon and 16 osd - mon and mds are on separate servers, osd distributed on 8 servers. 3 servers with several processes read and write via libcephfs. Restart of active mds leads to infini