Hello, I've been using CephFS for quite a while now, and am very happy with it. However, I'm experiencing an issue that's quite hard to debug.
On almost every server where CephFS is mounted, the mount becomes unusable after updating Ceph (this has happened three times now, each time right after a Ceph update). When attempting to access the CephFS mount, I get a 'Permission denied' error:

root@cephmgr:~# cd /mnt/cephfs
-bash: cd: /mnt/cephfs: Permission denied

[root@da0 ~]# ls /home
ls: cannot access /home: Permission denied

root@http-cap01:~# ls /var/www/vhosts
ls: cannot access '/var/www/vhosts': Permission denied

What strikes me as odd is that the mount keeps working fine on some machines, while I get the above error on others, even though the server configurations are identical.

Here's a log from a client machine where the CephFS mount 'broke' after updating Ceph:

--
Jun 12 00:56:47 http-cap01 kernel: [4163998.163498] ceph: mds0 caps stale
Jun 12 01:00:20 http-cap01 kernel: [4164210.581767] ceph: mds0 caps went stale, renewing
Jun 12 01:00:20 http-cap01 kernel: [4164210.581771] ceph: mds0 caps stale
Jun 12 01:00:45 http-cap01 kernel: [4164236.434456] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:3]:6816 socket closed (con state OPEN)
Jun 12 01:00:46 http-cap01 kernel: [4164236.980990] libceph: reset on mds0
Jun 12 01:00:46 http-cap01 kernel: [4164236.980996] ceph: mds0 closed our session
Jun 12 01:00:46 http-cap01 kernel: [4164236.980997] ceph: mds0 reconnect start
Jun 12 01:00:46 http-cap01 kernel: [4164237.035294] ceph: mds0 reconnect denied
Jun 12 01:00:46 http-cap01 kernel: [4164237.036990] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:3]:6816 socket closed (con state NEGOTIATING)
Jun 12 01:00:47 http-cap01 kernel: [4164237.972853] ceph: mds0 rejected session
Jun 12 01:05:15 http-cap01 kernel: [4164506.065665] libceph: mon0 [fdb7:b01e:7b8e:0:10:10:10:1]:6789 session lost, hunting for new mon
Jun 12 01:05:15 http-cap01 kernel: [4164506.068613] libceph: mon1 [fdb7:b01e:7b8e:0:10:10:10:2]:6789 session established
Jun 12 01:06:47 http-cap01 kernel: [4164597.858261] libceph: mon1 [fdb7:b01e:7b8e:0:10:10:10:2]:6789 socket closed (con state OPEN)
Jun 12 01:06:47 http-cap01 kernel: [4164597.858323] libceph: mon1 [fdb7:b01e:7b8e:0:10:10:10:2]:6789 session lost, hunting for new mon
Jun 12 01:06:47 http-cap01 kernel: [4164597.864745] libceph: mon2 [fdb7:b01e:7b8e:0:10:10:10:3]:6789 session established
Jun 12 01:23:02 http-cap01 kernel: [4165572.915743] ceph: mds0 reconnect start
Jun 12 01:23:02 http-cap01 kernel: [4165572.918197] ceph: mds0 reconnect denied
Jun 12 01:23:15 http-cap01 kernel: [4165586.526195] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:2]:6817 socket closed (con state NEGOTIATING)
Jun 12 01:23:16 http-cap01 kernel: [4165586.992411] ceph: mds0 rejected session
--

Note the "ceph: mds0 reconnect denied".
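As far as I can tell, once the MDS has closed the session and denies the reconnect, the kernel client cannot revive the stale superblock, so the only way to get an affected machine working again seems to be a fresh mount. A minimal sketch, assuming the standard kernel client; the client name and secretfile path are placeholders for whatever the real setup uses:

# force-unmount the dead mount (add -l if processes still hold files open)
umount -f /mnt/cephfs
# remount, which establishes a new MDS session
mount -t ceph [fdb7:b01e:7b8e:0:10:10:10:1]:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret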
Log from a machine where the CephFS mount kept working fine:

--
Jun 12 01:08:26 http-hlp02 kernel: [3850613.358329] libceph: mon0 [fdb7:b01e:7b8e:0:10:10:10:1]:6789 socket closed (con state OPEN)
Jun 12 01:08:26 http-hlp02 kernel: [3850613.358418] libceph: mon0 [fdb7:b01e:7b8e:0:10:10:10:1]:6789 session lost, hunting for new mon
Jun 12 01:08:26 http-hlp02 kernel: [3850613.369597] libceph: mon1 [fdb7:b01e:7b8e:0:10:10:10:2]:6789 session established
Jun 12 01:09:50 http-hlp02 kernel: [3850697.708357] libceph: osd9 [fdb7:b01e:7b8e:0:10:10:10:1]:6806 socket closed (con state OPEN)
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709897] libceph: osd0 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709898] libceph: osd1 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709899] libceph: osd6 down
Jun 12 01:09:50 http-hlp02 kernel: [3850697.709899] libceph: osd9 down
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673357] libceph: osd9 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673378] libceph: osd6 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673394] libceph: osd0 up
Jun 12 01:12:37 http-hlp02 kernel: [3850864.673402] libceph: osd1 up
Jun 12 01:14:30 http-hlp02 kernel: [3850977.916749] libceph: wrong peer, want [fdb7:b01e:7b8e:0:10:10:10:2]:6808/434742, got [fdb7:b01e:7b8e:0:10:10:10:2]:6808/19887
Jun 12 01:14:30 http-hlp02 kernel: [3850977.916765] libceph: osd10 [fdb7:b01e:7b8e:0:10:10:10:2]:6808 wrong peer at address
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918368] libceph: osd4 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918369] libceph: osd5 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918370] libceph: osd7 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918370] libceph: osd10 down
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918401] libceph: osd10 up
Jun 12 01:14:30 http-hlp02 kernel: [3850977.918406] libceph: osd7 up
Jun 12 01:14:34 http-hlp02 kernel: [3850981.985720] libceph: osd5 up
Jun 12 01:14:34 http-hlp02 kernel: [3850981.985742] libceph: osd4 up
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177469] libceph: osd2 down
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177471] libceph: osd3 down
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177471] libceph: osd8 down
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177472] libceph: osd11 down
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177498] libceph: osd8 up
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177504] libceph: osd11 up
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177513] libceph: osd2 up
Jun 12 01:19:56 http-hlp02 kernel: [3851304.177520] libceph: osd3 up
Jun 12 01:23:02 http-hlp02 kernel: [3851489.639561] ceph: mds0 reconnect start
Jun 12 01:23:02 http-hlp02 kernel: [3851490.116179] ceph: mds0 reconnect success
Jun 12 01:23:04 http-hlp02 kernel: [3851491.649324] ceph: mds0 recovery completed
--

On the MDS with IP fdb7:b01e:7b8e:0:10:10:10:2, I see:

--
2019-06-12 01:23:02.373348 7f54e3dbc700 1 mds.0.server session is closed, ignoring reconnect, sending close
2019-06-12 01:23:02.373372 7f54e3dbc700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.69299376 [fc00:0:0:1::3 0]:0/1701301202 (session is closed)
2019-06-12 01:23:02.373506 7f54e3dbc700 1 mds.0.server session is closed, ignoring reconnect, sending close
2019-06-12 01:23:02.373524 7f54e3dbc700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.69209334 [fc00:0:0:1::2 8]:0/3335161502 (session is closed)
--

The question is: why is one machine's reconnect denied ("ceph: mds0 reconnect denied") while the other machine reconnects just fine ("ceph: mds0 reconnect success")? Can anyone point me in the right direction?
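In case it helps with debugging, the session state can also be inspected on the MDS side through its admin socket. A sketch of what I could gather there; "mds.a" is a placeholder for the actual daemon name on the MDS host:

# on the MDS host: list the client sessions the MDS currently knows about
ceph daemon mds.a session ls
# cluster-wide MDS state
ceph mds stat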
Software versions on all machines:

ceph-fs-common/stable,stable,now 10.2.11-2 amd64
Kernel: 4.19.0-0.bpo.4-amd64
OS: Debian GNU/Linux 9.9 (stretch)

--
William Edwards
Tuxis Internet Engineering