Hi list people. I was asking a few of these questions in IRC, too, but
figured maybe a wider audience could see something that I'm missing.

I'm running a four-node cluster with cephfs and the kernel-mode driver as
the primary access method. Each node has 72 * 10TB OSDs, for a total of 288
OSDs. Each system has about 256GB of memory. These systems are dedicated to
the ceph service and run no other workloads. The cluster is configured with
every machine participating as MON, MGR, and MDS. Data is stored in replica
2 mode. The data are files of varying sizes, most under 5MB. Files are
named with their SHA256 hash, and are divided into subdirectories based on
the first few octets (example: files/a1/a13f/a13f25....). The current set
of files occupies about 100TB (200TB accounting for replication).
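
To make the naming scheme concrete, here is a rough sketch (Python, purely
illustrative; the helper and root names are not our actual ingest code) of
how a file's path is derived from its SHA256 hash:

import hashlib
import os

def shard_path(root, data):
    # Name the file after the hex digest of its contents.
    digest = hashlib.sha256(data).hexdigest()
    # Two prefix levels: the first octet (2 hex chars) and the first two
    # octets (4 hex chars), e.g. files/a1/a13f/a13f25...
    return os.path.join(root, digest[:2], digest[:4], digest)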

Early this week, we started seeing some network issues that were causing
OSDs to become unavailable for short periods of time. It was long enough to
get logged by syslog, but not long enough to trigger a persistent warning
or error state in ceph status. Conditions continued to degrade until two of
the four nodes fell off the network entirely, and the cluster started
trying to migrate data off their OSDs en masse. After the network
stabilized a short while later, the OSDs were all shown as up and OK, ceph
seemed to recover cleanly, and it stopped trying to migrate data. In the
process of getting the network stable, though, the two nodes that had
fallen off the network had to be rebooted.

When all four nodes were back online and talking to each other, I noticed
that the MDS was in "up:rejoin", and after a period of time it would eat
all of the available memory and swap on whichever system was primary. It
would eventually either get killed off by the system due to memory usage,
or get so slow that the monitors would drop it and pick another MDS as
primary. This cycle would repeat.
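
(As an aside, the flapping is visible in the fsmap from ceph status; an
illustrative watcher, sketched here in Python and not necessarily how I was
observing it, would be something like the following.)

import json
import subprocess
import time

# Poll cluster status and print the fsmap, which shows which daemon
# currently holds the rank and what state it is in.
while True:
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    print(json.loads(out).get("fsmap"))
    time.sleep(5)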

I added more swap to one system (160GB of swap total), and brought down the
MDS service on the other three nodes, forcing the rejoin operations to
occur on the node with added swap. I also turned up debugging to see what
it was actually doing. This was then allowed to run for about 14 hours
overnight. When I arrived this morning, the system was still up, but
severely lagged. Nearly all swap had been used, and the system had
difficulty responding to commands. Out of options, I killed the process and
watched as it tried to shut down cleanly, hoping to preserve as much of the
work it had done as possible. I restarted it; it seemed to do more in
replay, and then re-entered rejoin, which is still running and giving no
hint of finishing anytime soon.
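
What I did amounts to roughly the following, sketched in Python around the
CLI for brevity (the standby hostnames are placeholders and the exact
commands may not match what I actually ran; ta-g17 is the node that kept
its MDS up):

import subprocess

def run(cmd):
    # Print and execute one command, stopping on failure.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Stop the MDS daemons on the other three nodes so the rank has to come up
# on the node with the extra swap.
for host in ("node-b", "node-c", "node-d"):  # placeholder hostnames
    run(["ssh", host, "systemctl", "stop", "ceph-mds.target"])

# Turn up MDS debug logging on the surviving daemon to watch the rejoin.
run(["ceph", "tell", "mds.ta-g17", "injectargs", "--debug_mds=10"])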

The rejoin traffic I'm seeing in the MDS log looks like this:

2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.ino(0x100000108aa) verify_diri_backtrace
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) _fetched header 274 bytes 2323 keys for [dir 0x100000108aa /files-by-sha256/1c/1cc4/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x561738166a00]
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) _fetched version 59738838
2018-08-15 11:39:21.726 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.727 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.898 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon up:rejoin seq 377 rtt 1.400594
2018-08-15 11:39:24.564 7f9c752a5700 10 mds.beacon.ta-g17 _send up:rejoin seq 378
2018-08-15 11:39:25.503 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon up:rejoin seq 378 rtt 0.907796
2018-08-15 11:39:26.565 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) auth_unpin by 0x561738166a00 on [dir 0x100000108aa /files-by-sha256/1c/1cc4/ [2,head] auth v=59738838 cv=59738838/59738838 state=1073741825|complete f(v0 m2018-08-14 07:52:06.764154 2323=2323+0) n(v0 rc2018-08-14 07:52:06.764154 b3161079403 2323=2323+0) hs=2323+0,ss=0+0 | child=1 waiter=1 authpin=0 0x561738166a00] count now 0 + 0
2018-08-15 11:39:26.706 7f9c73aa2700  7 mds.0.13676 mds has 1 queued contexts
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676 0x5617cd27a790
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676  finish 0x5617cd27a790
2018-08-15 11:39:26.723 7f9c7229f700 10 MDSIOContextBase::complete: 21C_IO_Dir_OMAP_Fetched
2018-08-15 11:39:26.723 7f9c7229f700 10 mds.0.cache.ino(0x100000020f7) verify_diri_backtrace
2018-08-15 11:39:26.738 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) _fetched header 274 bytes 1899 keys for [dir 0x100000020f7 /files-by-sha256/a7/a723/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x5617351cbc00]
2018-08-15 11:39:26.792 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) _fetched version 59752211
2018-08-15 11:39:26.792 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:26.811 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:27.908 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) auth_unpin by 0x5617351cbc00 on [dir 0x100000020f7 /files-by-sha256/a7/a723/ [2,head] auth v=59752211 cv=59752211/59752211 state=1073741825|complete f(v0 m2018-08-14 08:14:21.893249 1899=1899+0) n(v0 rc2018-08-14 08:14:21.893249 b2658734443 1899=1899+0) hs=1899+0,ss=0+0 | child=1 waiter=1 authpin=0 0x5617351cbc00] count now 0 + 0
2018-08-15 11:39:27.962 7f9c7229f700 10 MDSIOContextBase::complete: 21C_IO_Dir_OMAP_Fetched


I am at the point where I'd prefer to get this filesystem up sooner rather
than later. There was likely some data in transit to the filesystem when
the outage occurred (possibly as many as a few thousand files being
created), but I'm willing to lose that data and let our processes re-create
it once we detect that it's missing.

Is there anything I can do to make this more efficient, or to help the
rejoin complete so the MDS comes back online?

jonathan
--
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
