Although, that said, I've just noticed this crash this morning, shortly after I set max_mds back to 3:

2018-10-31 14:26:00.522 7f0cf53f5700 -1 /build/ceph-13.2.1/src/mds/CDir.cc: In function 'void CDir::fetch(MDSInternalContextBase*, std::string_view, bool)' thread 7f0cf53f5700 time 2018-10-31 14:26:00.485647
/build/ceph-13.2.1/src/mds/CDir.cc: 1504: FAILED assert(is_auth())
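For reference, setting max_mds back is just the usual filesystem setting; roughly (a sketch, using the fido_fs filesystem name that appears later in this thread -- the exact commands on your cluster may differ):

$ ceph fs set fido_fs max_mds 3     # allow three active MDS ranks again
$ ceph fs status                    # check which ranks come up active / in replay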
> On 30 Oct 2018, at 18:50, Jon Morby <j...@fido.net> wrote:
>
> So a big thank you to @yanzheng for his help getting this back online.
>
> The quick answer to what we did was downgrade to 13.2.1, as 13.2.2 is broken for cephfs.
>
> We restored the backup of the journal I'd taken as part of following the disaster recovery process documents.
>
> We turned off mds standby replay and temporarily stopped all but 2 of the mds so we could monitor the logs more easily.
>
> We then did a wipe of the sessions and watched the mds repair: set mds_wipe_sessions to 1 and restart the mds.
>
> Finally there was a
>
> $ ceph daemon mds01 scrub_path / repair force recursive
>
> and then we set mds_wipe_sessions back to 0.
>
> Jon
>
> I can't say a big enough thank you to @yanzheng for their assistance though!
>
>> On 29 Oct 2018, at 11:13, Jon Morby (Fido) <j...@fido.net> wrote:
>>
>> I've experimented, and whilst the downgrade looks to be working, you end up with errors regarding the unsupported feature "mimic", amongst others:
>>
>> 2018-10-29 10:51:20.652047 7f6f1b9f5080 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={10=mimic ondisk layou
>>
>> so I gave up on that idea.
>>
>> In addition to the cephfs volume (which is basically just mirrors and some backups) we have a large rbd deployment using the same ceph cluster, and if we lose that we're screwed ... the cephfs volume was more an "experiment" to see how viable it would be as an NFS replacement.
>>
>> There's 26TB of data on there, so I'd rather not have to go off and redownload it all .. but losing it isn't the end of the world (but it will piss off a few friends).
>>
>> Jon
>>
>> ----- On 29 Oct, 2018, at 09:54, Zheng Yan <uker...@gmail.com> wrote:
>>
>> On Mon, Oct 29, 2018 at 5:25 PM Jon Morby (Fido) <j...@fido.net> wrote:
>>
>> Hi
>>
>> Ideally we'd like to undo the whole accidental upgrade to 13.x and ensure that ceph-deploy doesn't do another major release upgrade without a lot of warnings.
>>
>> Either way, I'm currently getting errors that 13.2.1 isn't available / shaman is offline / etc.
>>
>> What's the best / recommended way of doing this downgrade across our estate?
>>
>> You have already upgraded ceph-mon. I don't know if it can be safely downgraded (if I remember right, I corrupted a monitor's data when downgrading ceph-mon from mimic to luminous).
>>
>> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <uker...@gmail.com> wrote:
>>
>> We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run 'ceph mds repaired fido_fs:1'.
>> Sorry for the trouble.
>>
>> Yan, Zheng
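Putting Zheng's instruction together with the steps from the 30 Oct summary at the top of this thread, the recovery translates roughly into the commands below. This is a sketch rather than exactly what was run, and not necessarily in exactly this order: the daemon names, the ceph.conf approach to mds_wipe_sessions / standby replay, and the backup filename are assumptions or taken from this thread, so adapt them to your own cluster.

# 0. downgrade the ceph packages on the MDS hosts from 13.2.2 to 13.2.1 (distro-specific)

# 1. mark the damaged rank as repaired (Zheng's instruction above)
ceph mds repaired fido_fs:1

# 2. restore the journal backup taken earlier while following the disaster-recovery docs
cephfs-journal-tool journal import backup.bin

# 3. disable standby replay and stop all but two MDS daemons, e.g. in ceph.conf:
#      [mds]
#      mds_standby_replay = false
systemctl stop ceph-mds@mds03          # hypothetical name for the extra daemon

# 4. let the MDS wipe sessions during replay, then restart the remaining daemons:
#      [mds]
#      mds_wipe_sessions = true
systemctl restart ceph-mds@mds01 ceph-mds@mds02

# 5. scrub and repair the tree from the root via the admin socket
ceph daemon mds01 scrub_path / repair force recursive    # the daemon target may be mds.<id> on your setup

# 6. finally set mds_wipe_sessions back to false and restart the MDS once more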
>> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <j...@fido.net> wrote:
>>
>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9, not jump a major release without warning).
>>
>> Anyway .. as a result, we ended up with an mds journal error and 1 daemon reporting as damaged.
>>
>> Having got nowhere trying to ask for help on irc, we followed various forum posts and disaster recovery guides and ended up resetting the journal, which left the daemon no longer "damaged"; however, we're now seeing the mds segfault whilst trying to replay:
>>
>> https://pastebin.com/iSLdvu0b
>>
>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)
>>
>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>> 2: (()+0x3162b7) [0x7fad637f72b7]
>> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>> 4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>> 5: (MDLog::_replay_thread()+0x864) [0x752164]
>> 6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>> 7: (()+0x76ba) [0x7fad6305a6ba]
>> 8: (clone()+0x6d) [0x7fad6288341d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Full logs: https://pastebin.com/X5UG9vT2
>>
>> We've been unable to access the cephfs file system since all of this started; attempts to mount fail with reports that "mds probably not available":
>>
>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
>>
>> root@mds02:~# ceph -s
>>   cluster:
>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>     health: HEALTH_WARN
>>             1 filesystem is degraded
>>             insufficient standby MDS daemons available
>>             too many PGs per OSD (276 > max 250)
>>
>>   services:
>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>     mgr: mon01(active), standbys: mon02, mon03
>>     mds: fido_fs-2/2/1 up {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>>     osd: 27 osds: 27 up, 27 in
>>
>>   data:
>>     pools:   15 pools, 3168 pgs
>>     objects: 16.97 M objects, 30 TiB
>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>     pgs:     3168 active+clean
>>
>>   io:
>>     client: 680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>>
>> Before I just trash the entire fs and give up on ceph, does anyone have any suggestions as to how we can fix this?
>>
>> root@mds02:~# ceph versions
>> {
>>     "mon": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>>     },
>>     "mds": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>>     },
>>     "overall": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>>     }
>> }
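For context, the journal backup and reset referred to above follow the CephFS disaster-recovery documentation; very roughly, the documented sequence is (a sketch, not necessarily the exact commands used here -- on a multi-rank filesystem you may also need --rank=<fs>:<rank>):

# back up the journal before touching anything
cephfs-journal-tool journal export backup.bin

# write whatever metadata can be recovered from journal events back into the metadata pool
cephfs-journal-tool event recover_dentries summary

# then reset the journal (the step that left the rank no longer "damaged" here)
cephfs-journal-tool journal reset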
>> --
>> Jon Morby
>> FidoNet - the internet made simple!
>> 10 - 16 Tiller Road, London, E14 8PX
>> tel: 0345 004 3050 / fax: 0345 004 3051
>>
>> Need more rack space?
>> Check out our Co-Lo offerings at http://www.fido.net/services/colo/
>> 32 amp racks in London and Brighton
>> Linx ConneXions available at all Fido sites!
>> https://www.fido.net/services/backbone/connexions/
>> PGP Key: 26DC B618 DE9E F9CB F8B7 1EFA 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc

Jon Morby
where those in the know go
Tel: 0345 004 3050
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com