Please try again with debug_mds=10 and send the log to me.

Regards,
Yan, Zheng
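[A minimal sketch of one way to raise the debug level, assuming the daemon id mds04 seen in the log below; the config-file route is used here because the daemon dies during journal replay, possibly before a `ceph tell mds.* injectargs` could reach it:

    # /etc/ceph/ceph.conf on the MDS host
    [mds]
        debug mds = 10

    # restart the daemon so the setting takes effect
    systemctl restart ceph-mds@mds04

    # the verbose log accumulates in the file named by log_file, here:
    #   /var/log/ceph/ceph-mds.mds04.log

Once the crash reproduces, the log file is what should be sent along.]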
On Mon, Oct 29, 2018 at 6:30 PM Jon Morby (Fido) <j...@fido.net> wrote:
> fyi, downgrading to 13.2.1 doesn't seem to have fixed the issue either :(
>
> --- end dump of recent events ---
> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) **
>  in thread 7feb58b43700 thread_name:md_log_replay
>
>  ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
>  1: (()+0x3ebf40) [0x55deff8e0f40]
>  2: (()+0x11390) [0x7feb68246390]
>  3: (gsignal()+0x38) [0x7feb67993428]
>  4: (abort()+0x16a) [0x7feb6799502a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7feb689a5630]
>  6: (()+0x2e26a7) [0x7feb689a56a7]
>  7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x55deff8ccc8b]
>  8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9]
>  9: (MDLog::_replay_thread()+0x864) [0x55deff876974]
>  10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d]
>  11: (()+0x76ba) [0x7feb6823c6ba]
>  12: (clone()+0x6d) [0x7feb67a6541d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) **
>  in thread 7feb58b43700 thread_name:md_log_replay
>
>  ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
>  1: (()+0x3ebf40) [0x55deff8e0f40]
>  2: (()+0x11390) [0x7feb68246390]
>  3: (gsignal()+0x38) [0x7feb67993428]
>  4: (abort()+0x16a) [0x7feb6799502a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7feb689a5630]
>  6: (()+0x2e26a7) [0x7feb689a56a7]
>  7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x55deff8ccc8b]
>  8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9]
>  9: (MDLog::_replay_thread()+0x864) [0x55deff876974]
>  10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d]
>  11: (()+0x76ba) [0x7feb6823c6ba]
>  12: (clone()+0x6d) [0x7feb67a6541d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 0 lockdep
>    0/ 0 context
>    0/ 0 crush
>    3/ 3 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 0 buffer
>    0/ 0 timer
>    0/ 0 filer
>    0/ 1 striper
>    0/ 0 objecter
>    0/ 0 rados
>    0/ 0 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 0 journaler
>    0/ 5 objectcacher
>    0/ 0 client
>    0/ 0 osd
>    0/ 0 optracker
>    0/ 0 objclass
>    0/ 0 filestore
>    0/ 0 journal
>    0/ 0 ms
>    0/ 0 mon
>    0/ 0 monc
>    0/ 0 paxos
>    0/ 0 tp
>    0/ 0 auth
>    1/ 5 crypto
>    0/ 0 finisher
>    1/ 1 reserver
>    0/ 0 heartbeatmap
>    0/ 0 perfcounter
>    0/ 0 rgw
>    1/ 5 rgw_sync
>    1/10 civetweb
>    1/ 5 javaclient
>    0/ 0 asok
>    0/ 0 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   99/99 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent  10000
>   max_new      1000
>   log_file /var/log/ceph/ceph-mds.mds04.log
> --- end dump of recent events ---
>
>
> ----- On 29 Oct, 2018, at 09:25, Jon Morby <j...@fido.net> wrote:
>
> Hi
>
> Ideally, we'd like to undo the whole accidental upgrade to 13.x and ensure
> that ceph-deploy doesn't do another major release upgrade without a lot of
> warnings.
>
> Either way, I'm currently getting errors that 13.2.1 isn't available /
> shaman is offline / etc.
>
> What's the best / recommended way of doing this downgrade across our
> estate?
>
>
> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <uker...@gmail.com> wrote:
>
> We backported a wrong patch into 13.2.2. Downgrade ceph to 13.2.1, then
> run `ceph mds repaired fido_fs:1` (see the sketch after the quoted thread
> below).
> Sorry for the trouble.
> Yan, Zheng
>
> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <j...@fido.net> wrote:
>>
>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a
>> ceph-deploy install went awry (we were expecting it to upgrade to
>> 12.2.9, not jump a major release without warning).
>>
>> Anyway, as a result we ended up with an MDS journal error and one daemon
>> reporting as damaged.
>>
>> Having got nowhere asking for help on IRC, we followed various forum
>> posts and disaster recovery guides and ended up resetting the journal,
>> which left the daemon no longer "damaged"; however, we're now seeing the
>> MDS segfault while trying to replay.
>>
>> https://pastebin.com/iSLdvu0b
>>
>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)
>>
>>  ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>>  2: (()+0x3162b7) [0x7fad637f72b7]
>>  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>>  4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>>  5: (MDLog::_replay_thread()+0x864) [0x752164]
>>  6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>>  7: (()+0x76ba) [0x7fad6305a6ba]
>>  8: (clone()+0x6d) [0x7fad6288341d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Full logs:
>>
>> https://pastebin.com/X5UG9vT2
>>
>> We've been unable to access the CephFS file system since all of this
>> started;
>> attempts to mount fail with reports that "mds probably not available":
>>
>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
>>
>> root@mds02:~# ceph -s
>>   cluster:
>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>     health: HEALTH_WARN
>>             1 filesystem is degraded
>>             insufficient standby MDS daemons available
>>             too many PGs per OSD (276 > max 250)
>>
>>   services:
>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>     mgr: mon01(active), standbys: mon02, mon03
>>     mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>>     osd: 27 osds: 27 up, 27 in
>>
>>   data:
>>     pools:   15 pools, 3168 pgs
>>     objects: 16.97 M objects, 30 TiB
>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>     pgs:     3168 active+clean
>>
>>   io:
>>     client: 680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>>
>> Before I just trash the entire fs and give up on ceph, does anyone have
>> any suggestions as to how we can fix this?
>>
>> root@mds02:~# ceph versions
>> {
>>     "mon": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>>     },
>>     "mds": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>>     },
>>     "overall": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>>     }
>> }
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Jon Morby
> FidoNet - the internet made simple!
> 10 - 16 Tiller Road, London, E14 8PX
> tel: 0345 004 3050 / fax: 0345 004 3051
>
> Need more rack space?
> Check out our Co-Lo offerings at http://www.fido.net/services/colo/
> 32 amp racks in London and Brighton
> Linx ConneXions available at all Fido sites!
> https://www.fido.net/services/backbone/connexions/
> PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc
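[Following Yan's advice above, a minimal sketch of the downgrade-and-repair sequence. The apt repo layout, the exact package version string (13.2.1-1xenial), and the daemon id mds02 are assumptions for an Ubuntu host pointed at the upstream download.ceph.com packages; adjust to match your distribution and repo:

    # On each MDS host: downgrade the package to 13.2.1.
    # The version string is an assumption -- check `apt-cache madison ceph-mds`.
    apt-get update
    apt-get install --allow-downgrades ceph-mds=13.2.1-1xenial

    # Restart the daemon on the downgraded binary
    systemctl restart ceph-mds@mds02

    # Then tell the monitors to retry the damaged rank (rank 1 of fido_fs),
    # exactly as Yan suggests:
    ceph mds repaired fido_fs:1

To keep ceph-deploy from silently jumping a major release again, the release can be pinned explicitly on future installs, e.g.:

    ceph-deploy install --release luminous <host>
]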
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com