FYI, downgrading to 13.2.1 doesn't seem to have fixed the issue either :( 

--- end dump of recent events --- 
2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) ** 
in thread 7feb58b43700 thread_name:md_log_replay 

ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable) 
1: (()+0x3ebf40) [0x55deff8e0f40] 
2: (()+0x11390) [0x7feb68246390] 
3: (gsignal()+0x38) [0x7feb67993428] 
4: (abort()+0x16a) [0x7feb6799502a] 
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7feb689a5630] 
6: (()+0x2e26a7) [0x7feb689a56a7] 
7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x55deff8ccc8b] 
8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9] 
9: (MDLog::_replay_thread()+0x864) [0x55deff876974] 
10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d] 
11: (()+0x76ba) [0x7feb6823c6ba] 
12: (clone()+0x6d) [0x7feb67a6541d] 
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

--- begin dump of recent events --- 
0> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) ** 
in thread 7feb58b43700 thread_name:md_log_replay 

ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable) 
1: (()+0x3ebf40) [0x55deff8e0f40] 
2: (()+0x11390) [0x7feb68246390] 
3: (gsignal()+0x38) [0x7feb67993428] 
4: (abort()+0x16a) [0x7feb6799502a] 
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7feb689a5630] 
6: (()+0x2e26a7) [0x7feb689a56a7] 
7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x55deff8ccc8b] 
8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9] 
9: (MDLog::_replay_thread()+0x864) [0x55deff876974] 
10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d] 
11: (()+0x76ba) [0x7feb6823c6ba] 
12: (clone()+0x6d) [0x7feb67a6541d] 
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

--- logging levels --- 
0/ 5 none 
0/ 0 lockdep 
0/ 0 context 
0/ 0 crush 
3/ 3 mds 
1/ 5 mds_balancer 
1/ 5 mds_locker 
1/ 5 mds_log 
1/ 5 mds_log_expire 
1/ 5 mds_migrator 
0/ 0 buffer 
0/ 0 timer 
0/ 0 filer 
0/ 1 striper 
0/ 0 objecter 
0/ 0 rados 
0/ 0 rbd 
0/ 5 rbd_mirror 
0/ 5 rbd_replay 
0/ 0 journaler 
0/ 5 objectcacher 
0/ 0 client 
0/ 0 osd 
0/ 0 optracker 
0/ 0 objclass 
0/ 0 filestore 
0/ 0 journal 
0/ 0 ms 
0/ 0 mon 
0/ 0 monc 
0/ 0 paxos 
0/ 0 tp 
0/ 0 auth 
1/ 5 crypto 
0/ 0 finisher 
1/ 1 reserver 
0/ 0 heartbeatmap 
0/ 0 perfcounter 
0/ 0 rgw 
1/ 5 rgw_sync 
1/10 civetweb 
1/ 5 javaclient 
0/ 0 asok 
0/ 0 throttle 
0/ 0 refs 
1/ 5 xio 
1/ 5 compressor 
1/ 5 bluestore 
1/ 5 bluefs 
1/ 3 bdev 
1/ 5 kstore 
4/ 5 rocksdb 
4/ 5 leveldb 
4/ 5 memdb 
1/ 5 kinetic 
1/ 5 fuse 
1/ 5 mgr 
1/ 5 mgrc 
1/ 5 dpdk 
1/ 5 eventtrace 
99/99 (syslog threshold) 
-1/-1 (stderr threshold) 
max_recent 10000 
max_new 1000 
log_file /var/log/ceph/ceph-mds.mds04.log 
--- end dump of recent events --- 

----- On 29 Oct, 2018, at 09:25, Jon Morby <j...@fido.net> wrote: 

> Hi

> Ideally we'd like to undo the whole accidental upgrade to 13.x and ensure that
> ceph-deploy doesn't do another major release upgrade without a lot of warnings

> Either way, I'm currently getting errors that 13.2.1 isn't available / shaman
> is offline / etc.

> What's the best / recommended way of doing this downgrade across our estate?

> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <uker...@gmail.com> wrote:

>> We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run
>> 'ceph mds repaired fido_fs:1'.
>> Sorry for the trouble.
>> Yan, Zheng
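
(A concrete sketch of Yan's suggestion, assuming Ubuntu/Debian packages on the
MDS host; the pinned version string below is an assumption that will vary by
repo, and the unit name just follows the usual ceph-mds@<id> pattern, so adjust
both for your hosts:)

  apt-get install --allow-downgrades ceph-mds=13.2.1-1xenial  # version string is repo-specific; match yours
  systemctl restart ceph-mds@mds02                            # restart the downgraded daemon
  ceph mds repaired fido_fs:1                                 # then mark rank 1 as repaired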

>> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <j...@fido.net> wrote:

>>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a
>>> ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9 and
>>> not jump a major release without warning)

>>> Anyway ... as a result, we ended up with an mds journal error and 1 daemon
>>> reporting as damaged.

>>> Having got nowhere asking for help on IRC, we followed various forum posts
>>> and disaster recovery guides and ended up resetting the journal. That left
>>> the daemon no longer reporting as “damaged”; however, we’re now seeing the
>>> mds segfault whilst trying to replay:

>>> https://pastebin.com/iSLdvu0b

>>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)

>>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>>> 2: (()+0x3162b7) [0x7fad637f72b7]
>>> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>>> 4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>>> 5: (MDLog::_replay_thread()+0x864) [0x752164]
>>> 6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>>> 7: (()+0x76ba) [0x7fad6305a6ba]
>>> 8: (clone()+0x6d) [0x7fad6288341d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

>>> full logs

>>> https://pastebin.com/X5UG9vT2
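
(For reference, the journal reset mentioned above is the one described in the
standard CephFS disaster-recovery guides, roughly along these lines. This is an
illustrative sketch only: the export file name is a placeholder, and depending
on release you may need an explicit --rank=fido_fs:1 argument, so check the
upstream docs before running any of it.)

  cephfs-journal-tool journal export backup.bin        # back up the journal before touching it
  cephfs-journal-tool event recover_dentries summary   # salvage dentries from the journal into the metadata pool
  cephfs-journal-tool journal reset                    # the reset that cleared the "damaged" flag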

>>> We’ve been unable to access the cephfs file system since all of this
>>> started … attempts to mount fail with reports that “mds probably not
>>> available”:

>>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
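
(For completeness, the failing mounts were ordinary kernel-client mounts along
these lines; the mount point and secret file here are placeholders, not the
actual paths used:)

  mount -t ceph mon01:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

The kernel client logs "probably no mds server is up" when no MDS for the
filesystem reaches up:active, which matches rank 0 stuck in up:resolve and
rank 1 in up:replay in the status below.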

>>> root@mds02:~# ceph -s
>>>   cluster:
>>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>>     health: HEALTH_WARN
>>>             1 filesystem is degraded
>>>             insufficient standby MDS daemons available
>>>             too many PGs per OSD (276 > max 250)
>>>
>>>   services:
>>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>>     mgr: mon01(active), standbys: mon02, mon03
>>>     mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>>>     osd: 27 osds: 27 up, 27 in
>>>
>>>   data:
>>>     pools:   15 pools, 3168 pgs
>>>     objects: 16.97 M objects, 30 TiB
>>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>>     pgs:     3168 active+clean
>>>
>>>   io:
>>>     client: 680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
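
(An aside on the "too many PGs per OSD (276 > max 250)" warning, which is
unrelated to the MDS problem: the figure is just replica-weighted PG placements
divided by OSD count, i.e.

  PGs per OSD = sum over pools of (pg_num * pool size) / number of OSDs
  276 * 27 OSDs = 7452 PG instances; 7452 / 3168 PGs ≈ 2.35 average replication

across the 15 pools. It's purely a sizing warning, not a fault.)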

>>> Before I just trash the entire fs and give up on ceph, does anyone have any
>>> suggestions as to how we can fix this?

>>> root@mds02:~# ceph versions
>>> {
>>>     "mon": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>>     },
>>>     "mgr": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>>     },
>>>     "osd": {
>>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>>>     },
>>>     "mds": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>>>     },
>>>     "overall": {
>>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>>>     }
>>> }




-- 

Jon Morby
FidoNet - the internet made simple!
10 - 16 Tiller Road, London, E14 8PX
tel: 0345 004 3050 / fax: 0345 004 3051

Need more rack space?
Check out our Co-Lo offerings at http://www.fido.net/services/colo/ - 32 amp racks in London and Brighton
Linx ConneXions available at all Fido sites! https://www.fido.net/services/backbone/connexions/
PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
