CephFS is recoverable. Just set mds_wipe_sessions to 1. After the MDS recovers, set it back to 0 and flush the journal (ceph daemon mds.x flush journal).
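For reference, the sequence would look roughly like this (a sketch only, not tested against this cluster; mds.x stands for the affected daemon id, which from the status output below appears to be mds02 on rank 1, and the ceph.conf route may be needed since a daemon that keeps crashing in replay might not be reachable via injectargs):

    # on the host running the affected MDS, enable the option for replay,
    # e.g. in ceph.conf:
    #   [mds]
    #       mds_wipe_sessions = true
    # then restart the daemon so replay runs with it set
    systemctl restart ceph-mds@x

    # watch the rank go replay -> resolve/rejoin -> active
    ceph -s

    # once it has recovered, turn the option back off (remove the ceph.conf
    # line or inject 0 at runtime) and flush the journal via the admin socket
    ceph tell mds.x injectargs '--mds_wipe_sessions=0'
    ceph daemon mds.x flush journal

Flushing the journal afterwards should trim the offending events so they are not replayed again on a later restart.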
On Mon, Oct 29, 2018 at 7:13 PM Jon Morby (Fido) <j...@fido.net> wrote:
> I've experimented and whilst the downgrade looks to be working, you end up with errors regarding unsupported feature "mimic" amongst others
>
> 2018-10-29 10:51:20.652047 7f6f1b9f5080 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={10=mimic ondisk layou
>
> so I gave up on that idea
>
> In addition to the cephfs volume (which is basically just mirrors and some backups) we have a large rbd deployment using the same ceph cluster, and if we lose that we're screwed ... the cephfs volume was more an "experiment" to see how viable it would be as an NFS replacement
>
> There's 26TB of data on there, so I'd rather not have to go off and redownload it all .. but losing it isn't the end of the world (but it will piss off a few friends)
>
> Jon
>
> ----- On 29 Oct, 2018, at 09:54, Zheng Yan <uker...@gmail.com> wrote:
>
> On Mon, Oct 29, 2018 at 5:25 PM Jon Morby (Fido) <j...@fido.net> wrote:
>
>> Hi
>>
>> Ideally we'd like to undo the whole accidental upgrade to 13.x and ensure that ceph-deploy doesn't do another major release upgrade without a lot of warnings
>>
>> Either way, I'm currently getting errors that 13.2.1 isn't available / shaman is offline / etc
>>
>> What's the best / recommended way of doing this downgrade across our estate?
>
> You have already upgraded ceph-mon. I don't know if it can be safely downgraded (if I remember right, I corrupted the monitor's data when downgrading ceph-mon from mimic to luminous).
>
>> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <uker...@gmail.com> wrote:
>>
>> We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run 'ceph mds repaired fido_fs:1'.
>> Sorry for the trouble
>> Yan, Zheng
>>
>> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <j...@fido.net> wrote:
>>
>>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9 and not jump a major release without warning)
>>>
>>> Anyway .. as a result, we ended up with an mds journal error and 1 daemon reporting as damaged
>>>
>>> Having got nowhere trying to ask for help on irc, we've followed various forum posts and disaster recovery guides; we ended up resetting the journal, which left the daemon as no longer “damaged”, however we're now seeing the mds segfault whilst trying to replay
>>>
>>> https://pastebin.com/iSLdvu0b
>>>
>>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)
>>>
>>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>>> 2: (()+0x3162b7) [0x7fad637f72b7]
>>> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>>> 4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>>> 5: (MDLog::_replay_thread()+0x864) [0x752164]
>>> 6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>>> 7: (()+0x76ba) [0x7fad6305a6ba]
>>> 8: (clone()+0x6d) [0x7fad6288341d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> full logs
>>>
>>> https://pastebin.com/X5UG9vT2
>>>
>>> We've been unable to access the cephfs file system since all of this started ….
>>> attempts to mount fail with reports that “mds probably not available”
>>>
>>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
>>>
>>> root@mds02:~# ceph -s
>>>   cluster:
>>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>>     health: HEALTH_WARN
>>>             1 filesystem is degraded
>>>             insufficient standby MDS daemons available
>>>             too many PGs per OSD (276 > max 250)
>>>
>>>   services:
>>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>>     mgr: mon01(active), standbys: mon02, mon03
>>>     mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>>>     osd: 27 osds: 27 up, 27 in
>>>
>>>   data:
>>>     pools:   15 pools, 3168 pgs
>>>     objects: 16.97 M objects, 30 TiB
>>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>>     pgs:     3168 active+clean
>>>
>>>   io:
>>>     client:   680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>>>
>>> Before I just trash the entire fs and give up on ceph, does anyone have any suggestions as to how we can fix this?
>>>
>>> root@mds02:~# ceph versions
>>> {
>>>     "mon": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>>     },
>>>     "mgr": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>>     },
>>>     "osd": {
>>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>>>     },
>>>     "mds": {
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>>>     },
>>>     "overall": {
>>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>>>     }
>>> }
>
> --
> ------------------------------
> Jon Morby
> FidoNet - the internet made simple!
> 10 - 16 Tiller Road, London, E14 8PX
> tel: 0345 004 3050 / fax: 0345 004 3051
>
> Need more rack space?
> Check out our Co-Lo offerings at http://www.fido.net/services/colo/ - 32 amp racks in London and Brighton
> Linx ConneXions available at all Fido sites! https://www.fido.net/services/backbone/connexions/
> PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com