Although, that said, I've just noticed this crash this morning, shortly after I set max_mds back to 3:

2018-10-31 14:26:00.522 7f0cf53f5700 -1 /build/ceph-13.2.1/src/mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, std::string_view, bool)' thread 7f0cf53f5700 time 2018-10-31 14:26:00.485647
/build/ceph-13.2.1/src/mds/CDir.cc: 1504: FAILED assert(is_auth())
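For reference, setting max_mds back is just the usual filesystem setting; roughly (a sketch, using the fido_fs filesystem name that appears later in this thread -- the exact commands on your cluster may differ):

$ ceph fs set fido_fs max_mds 3     # allow three active MDS ranks again
$ ceph fs status                    # check which ranks come up active / in replay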
> On 30 Oct 2018, at 18:50, Jon Morby <j...@fido.net> wrote:
>
> So a big thank you to @yanzheng for his help getting this back online.
>
> The quick answer to what we did was downgrade to 13.2.1, as 13.2.2 is broken for cephfs.
>
> We restored the backup of the journal I'd taken as part of following the disaster recovery process documents.
>
> We turned off mds standby replay and temporarily stopped all but 2 of the mds so we could monitor the logs more easily.
>
> We then did a wipe of the sessions and watched the mds repair: set mds_wipe_sessions to 1 and restart the mds.
>
> Finally there was a
>
> $ ceph daemon mds01 scrub_path / repair force recursive
>
> and then we set mds_wipe_sessions back to 0.
>
> Jon
>
> I can't say a big enough thank you to @yanzheng for their assistance though!
>
>> On 29 Oct 2018, at 11:13, Jon Morby (Fido) <j...@fido.net> wrote:
>>
>> I've experimented, and whilst the downgrade looks to be working, you end up with errors regarding the unsupported feature "mimic", amongst others:
>>
>> 2018-10-29 10:51:20.652047 7f6f1b9f5080 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={10=mimic ondisk layou
>>
>> so I gave up on that idea.
>>
>> In addition to the cephfs volume (which is basically just mirrors and some backups) we have a large rbd deployment using the same ceph cluster, and if we lose that we're screwed ... the cephfs volume was more an "experiment" to see how viable it would be as an NFS replacement.
>>
>> There's 26TB of data on there, so I'd rather not have to go off and redownload it all .. but losing it isn't the end of the world (but it will piss off a few friends).
>>
>> Jon
>>
>> ----- On 29 Oct, 2018, at 09:54, Zheng Yan <uker...@gmail.com> wrote:
>>
>> On Mon, Oct 29, 2018 at 5:25 PM Jon Morby (Fido) <j...@fido.net> wrote:
>>
>> Hi
>>
>> Ideally we'd like to undo the whole accidental upgrade to 13.x and ensure that ceph-deploy doesn't do another major release upgrade without a lot of warnings.
>>
>> Either way, I'm currently getting errors that 13.2.1 isn't available / shaman is offline / etc.
>>
>> What's the best / recommended way of doing this downgrade across our estate?
>>
>> You have already upgraded ceph-mon. I don't know if it can be safely downgraded (if I remember right, I corrupted a monitor's data when downgrading ceph-mon from mimic to luminous).
>>
>> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <uker...@gmail.com> wrote:
>>
>> We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run 'ceph mds repaired fido_fs:1'.
>> Sorry for the trouble.
>>
>> Yan, Zheng
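Putting Zheng's instruction together with the steps from the 30 Oct summary at the top of this thread, the recovery translates roughly into the commands below. This is a sketch rather than exactly what was run, and not necessarily in exactly this order: the daemon names, the ceph.conf approach to mds_wipe_sessions / standby replay, and the backup filename are assumptions or taken from this thread, so adapt them to your own cluster.

# 0. downgrade the ceph packages on the MDS hosts from 13.2.2 to 13.2.1 (distro-specific)

# 1. mark the damaged rank as repaired (Zheng's instruction above)
ceph mds repaired fido_fs:1

# 2. restore the journal backup taken earlier while following the disaster-recovery docs
cephfs-journal-tool journal import backup.bin

# 3. disable standby replay and stop all but two MDS daemons, e.g. in ceph.conf:
#      [mds]
#      mds_standby_replay = false
systemctl stop ceph-mds@mds03          # hypothetical name for the extra daemon

# 4. let the MDS wipe sessions during replay, then restart the remaining daemons:
#      [mds]
#      mds_wipe_sessions = true
systemctl restart ceph-mds@mds01 ceph-mds@mds02

# 5. scrub and repair the tree from the root via the admin socket
ceph daemon mds01 scrub_path / repair force recursive    # the daemon target may be mds.<id> on your setup

# 6. finally set mds_wipe_sessions back to false and restart the MDS once more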
>> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <j...@fido.net> wrote:
>>
>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9, not jump a major release without warning).
>>
>> Anyway .. as a result, we ended up with an mds journal error and 1 daemon reporting as damaged.
>>
>> Having got nowhere trying to ask for help on irc, we followed various forum posts and disaster recovery guides and ended up resetting the journal, which left the daemon no longer "damaged"; however, we're now seeing the mds segfault whilst trying to replay:
>>
>> https://pastebin.com/iSLdvu0b
>>
>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)
>>
>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>> 2: (()+0x3162b7) [0x7fad637f72b7]
>> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>> 4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>> 5: (MDLog::_replay_thread()+0x864) [0x752164]
>> 6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>> 7: (()+0x76ba) [0x7fad6305a6ba]
>> 8: (clone()+0x6d) [0x7fad6288341d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Full logs: https://pastebin.com/X5UG9vT2
>>
>> We've been unable to access the cephfs file system since all of this started; attempts to mount fail with reports that "mds probably not available":
>>
>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
>>
>> root@mds02:~# ceph -s
>>   cluster:
>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>     health: HEALTH_WARN
>>             1 filesystem is degraded
>>             insufficient standby MDS daemons available
>>             too many PGs per OSD (276 > max 250)
>>
>>   services:
>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>     mgr: mon01(active), standbys: mon02, mon03
>>     mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>>     osd: 27 osds: 27 up, 27 in
>>
>>   data:
>>     pools:   15 pools, 3168 pgs
>>     objects: 16.97 M objects, 30 TiB
>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>     pgs:     3168 active+clean
>>
>>   io:
>>     client: 680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>>
>> Before I just trash the entire fs and give up on ceph, does anyone have any suggestions as to how we can fix this?
>>
>> root@mds02:~# ceph versions
>> {
>>     "mon": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>>     },
>>     "mds": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>>     },
>>     "overall": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>>     }
>> }
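For context, the journal backup and reset referred to above follow the CephFS disaster-recovery documentation; very roughly, the documented sequence is (a sketch, not necessarily the exact commands used here -- on a multi-rank filesystem you may also need --rank=<fs>:<rank>):

# back up the journal before touching anything
cephfs-journal-tool journal export backup.bin

# write whatever metadata can be recovered from journal events back into the metadata pool
cephfs-journal-tool event recover_dentries summary

# then reset the journal (the step that left the rank no longer "damaged" here)
cephfs-journal-tool journal reset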
>> --
>> Jon Morby
>> FidoNet - the internet made simple!
>> 10 - 16 Tiller Road, London, E14 8PX
>> tel: 0345 004 3050 / fax: 0345 004 3051
>>
>> Need more rack space?
>> Check out our Co-Lo offerings at http://www.fido.net/services/colo/
>> 32 amp racks in London and Brighton
>> Linx ConneXions available at all Fido sites!
>> https://www.fido.net/services/backbone/connexions/
>> PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc

Jon Morby
where those in the know go
Tel: 0345 004 3050
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com