Hey Zheng, I've been in the #ceph irc channel all day about this.
We did that: we set max_mds back to 1, but instead of stopping mds 1 we ran
"ceph mds rmfailed 1". Running "ceph mds stop 1" now produces:

# ceph mds stop 1
Error EEXIST: mds.1 not active (???)

Our mds is stuck in the resolve state and will not come back. We then tried to
roll back the mds map to the epoch just before we set max_mds to 2, but that
command crashes all but one of our monitors and never completes.

We do not know what to do at this point. If there is a way to get the mds back
up just so we can back it up, we're okay with rebuilding. We just need the
data back.

Mike C

On Thu, Jan 14, 2016 at 3:33 PM, Yan, Zheng <uker...@gmail.com> wrote:
> On Fri, Jan 15, 2016 at 3:28 AM, Mike Carlson <m...@bayphoto.com> wrote:
> > Thank you for the reply Zheng.
> >
> > We tried setting mds bal frag to true, but the end result was less than
> > desirable. All nfs and smb clients could no longer browse the share; they
> > would hang on any directory with more than a few hundred files.
> >
> > We then tried to back out the active/active mds change. No luck: stopping
> > one of the mds's (mds 1) prevented us from mounting the cephfs filesystem.
> >
> > So we failed and removed the secondary MDS, and now our primary mds is
> > stuck in a "resolve" state:
> >
> > # ceph -s
> >     cluster cabd1728-2eca-4e18-a581-b4885364e5a4
> >      health HEALTH_WARN
> >             clock skew detected on mon.lts-mon
> >             mds cluster is degraded
> >             Monitor clock skew detected
> >      monmap e1: 4 mons at
> > {lts-mon=10.5.68.236:6789/0,lts-osd1=10.5.68.229:6789/0,lts-osd2=10.5.68.230:6789/0,lts-osd3=10.5.68.203:6789/0}
> >             election epoch 1282, quorum 0,1,2,3 lts-osd3,lts-osd1,lts-osd2,lts-mon
> >      mdsmap e7892: 1/2/1 up {0=lts-mon=up:resolve}
> >      osdmap e10183: 102 osds: 101 up, 101 in
> >       pgmap v6714309: 4192 pgs, 7 pools, 31748 GB data, 23494 kobjects
> >             96188 GB used, 273 TB / 367 TB avail
> >                 4188 active+clean
> >                    4 active+clean+scrubbing+deep
> >
> > Now we are really down for the count. We cannot get our MDS back up in an
> > active state and none of our data is accessible.
>
> You can't remove an active mds this way. You need to:
>
> 1. make sure all active mds are running
> 2. run 'ceph mds set max_mds 1'
> 3. run 'ceph mds stop 1'
>
> Step 3 changes the second mds's state to stopping. Wait a while and the
> second mds will go to the standby state. Occasionally the second MDS can get
> stuck in the stopping state. If that happens, restart all MDS daemons, then
> repeat step 3.
>
> Regards
> Yan, Zheng
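(Writing this down for the archives: my reading of the sequence Zheng describes
above, as it would look on our cluster. Rank 1 is our second MDS, and the
parenthetical notes are my own interpretation, so treat this as a rough sketch
rather than anything authoritative.)

# ceph mds stat              (confirm both ranks are actually up:active before touching anything)
# ceph mds set max_mds 1     (shrink back to a single active MDS)
# ceph mds stop 1            (the step we replaced with "rmfailed"; rank 1 should go stopping, then standby)
# ceph mds stat              (re-check until only rank 0 remains active)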
> >
> > On Wed, Jan 13, 2016 at 7:05 PM, Yan, Zheng <uker...@gmail.com> wrote:
> >>
> >> On Thu, Jan 14, 2016 at 3:37 AM, Mike Carlson <m...@bayphoto.com> wrote:
> >> > Hey Greg,
> >> >
> >> > The inconsistent view is only over nfs/smb on top of our /ceph mount.
> >> >
> >> > When I look directly at the /ceph mount (which is using the cephfs
> >> > kernel module), everything looks fine.
> >> >
> >> > It is possible that this issue just went unnoticed before, and that it
> >> > being an infernalis problem is a red herring. That said, it is oddly
> >> > coincidental that we only just started seeing issues.
> >>
> >> This looks like a seekdir bug in the kernel client; could you try a 4.0+
> >> kernel?
> >>
> >> Besides, do you enable "mds bal frag" for ceph-mds?
> >>
> >> Regards
> >> Yan, Zheng
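(A note on the kernel-client theory: one way we could probably rule it out is
to mount the same tree with ceph-fuse next to the kernel mount and compare the
listings. This is only a sketch; it assumes ceph-fuse is installed, that the
client.admin keyring is in /etc/ceph on the test box, and /mnt/cephfs-fuse is
just a scratch mountpoint I made up.)

# mkdir -p /mnt/cephfs-fuse
# ceph-fuse -m lts-mon:6789 /mnt/cephfs-fuse
# ls /mnt/cephfs-fuse/BD/xmlExport/ | wc -l      (compare with the kernel mount at /ceph)
# echo 2 > /proc/sys/vm/drop_caches              (on the kernel-mount box: drops dentries/inodes, a lighter test than a full unmount/remount)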
> >> > On Wed, Jan 13, 2016 at 11:30 AM, Gregory Farnum <gfar...@redhat.com> wrote:
> >> >>
> >> >> On Wed, Jan 13, 2016 at 11:24 AM, Mike Carlson <m...@bayphoto.com> wrote:
> >> >> > Hello.
> >> >> >
> >> >> > Since we upgraded to Infernalis, we have noticed a severe problem
> >> >> > with cephfs when we have it shared over Samba and NFS.
> >> >> >
> >> >> > Directory listings are showing an inconsistent view of the files:
> >> >> >
> >> >> > $ ls /lts-mon/BD/xmlExport/ | wc -l
> >> >> > 100
> >> >> > $ sudo umount /lts-mon
> >> >> > $ sudo mount /lts-mon
> >> >> > $ ls /lts-mon/BD/xmlExport/ | wc -l
> >> >> > 3507
> >> >> >
> >> >> > The only workaround I have found is un-mounting and re-mounting the
> >> >> > nfs share; that seems to clear it up. Samba behaves the same way.
> >> >> > I'd post the listing here, but it's thousands of lines; I can add
> >> >> > additional details on request.
> >> >> >
> >> >> > This happened after our upgrade to infernalis. Is it possible the
> >> >> > MDS is in an inconsistent state?
> >> >>
> >> >> So this didn't happen to you until after you upgraded? Are you seeing
> >> >> missing files when looking at cephfs directly, or only over the
> >> >> NFS/Samba re-exports? Are you also sharing Samba by re-exporting the
> >> >> kernel cephfs mount?
> >> >>
> >> >> Zheng, any ideas about kernel issues which might cause this or be
> >> >> more visible under infernalis?
> >> >> -Greg
> >> >>
> >> >> > We have cephfs mounted on a server using the built-in cephfs kernel
> >> >> > module:
> >> >> >
> >> >> > lts-mon:6789:/ /ceph ceph name=admin,secretfile=/etc/ceph/admin.secret,noauto,_netdev
> >> >> >
> >> >> > We are running all of our ceph nodes on Ubuntu 14.04 LTS. Samba is
> >> >> > up to date (4.1.6), and we export nfsv3 to linux and freebsd
> >> >> > systems. All seem to exhibit the same behavior.
> >> >> >
> >> >> > system info:
> >> >> >
> >> >> > # uname -a
> >> >> > Linux lts-osd1 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> >> >> > root@lts-osd1:~# lsb_release -a
> >> >> > No LSB modules are available.
> >> >> > Distributor ID: Ubuntu
> >> >> > Description:    Ubuntu 14.04.3 LTS
> >> >> > Release:        14.04
> >> >> > Codename:       trusty
> >> >> >
> >> >> > package info:
> >> >> >
> >> >> > # dpkg -l | grep ceph
> >> >> > ii  ceph            9.2.0-1trusty  amd64  distributed storage and file system
> >> >> > ii  ceph-common     9.2.0-1trusty  amd64  common utilities to mount and interact with a ceph storage cluster
> >> >> > ii  ceph-fs-common  9.2.0-1trusty  amd64  common utilities to mount and interact with a ceph file system
> >> >> > ii  ceph-mds        9.2.0-1trusty  amd64  metadata server for the ceph distributed file system
> >> >> > ii  libcephfs1      9.2.0-1trusty  amd64  Ceph distributed file system client library
> >> >> > ii  python-ceph     9.2.0-1trusty  amd64  Meta-package for python libraries for the Ceph libraries
> >> >> > ii  python-cephfs   9.2.0-1trusty  amd64  Python libraries for the Ceph libcephfs library
> >> >> >
> >> >> > What is interesting is that a directory or file will not show up in
> >> >> > a listing, but if we access it directly, it is clearly there:
> >> >> >
> >> >> > # ls -al | grep SCHOOL
> >> >> > # ls -alnd SCHOOL667055
> >> >> > drwxrwsr-x 1 21695 21183 2962751438 Jan 13 09:33 SCHOOL667055
> >> >> >
> >> >> > Any tips are appreciated!
> >> >> >
> >> >> > Thanks,
> >> >> > Mike C
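P.S. For completeness, since "mds bal frag" came up earlier in the thread: we
enabled directory fragmentation with a ceph.conf stanza on the MDS nodes
roughly like the one below, followed by an MDS restart. We have since backed
it out; I'm including it only as a sketch of what we tried, not a
recommendation.

[mds]
    mds bal frag = true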
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com