You cannot force the MDS out of the "replay" state, for the obvious reason of keeping data consistent. You might raise mds_beacon_grace to a reasonably high value, which would allow the MDS to replay the journal without being marked laggy and eventually blacklisted.
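For example, something along these lines (a sketch only; the 600-second value is arbitrary and the exact commands may need adjusting for your setup, since the grace period is evaluated by the monitors it has to reach them, not just the MDS daemons):

    # temporarily raise the grace period on the running daemons
    ceph tell mon.* injectargs '--mds_beacon_grace 600'
    ceph tell mds.* injectargs '--mds_beacon_grace 600'

    # or persist it in ceph.conf (the default is 15 seconds)
    [global]
    mds_beacon_grace = 600

Remember to drop it back to the default once the MDS ranks are active again.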
________________________________
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Alessandro De Salvo <alessandro.desa...@roma1.infn.it>
Sent: Monday, January 8, 2018 7:40:59 PM
To: Lincoln Bryant; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

Thanks Lincoln,

indeed, as I said the cluster is recovering, so there are pending ops:

    pgs:     21.034% pgs not active
             1692310/24980804 objects degraded (6.774%)
             5612149/24980804 objects misplaced (22.466%)
             458 active+clean
             329 active+remapped+backfill_wait
             159 activating+remapped
             100 active+undersized+degraded+remapped+backfill_wait
              58 activating+undersized+degraded+remapped
              27 activating
              22 active+undersized+degraded+remapped+backfilling
               6 active+remapped+backfilling
               1 active+recovery_wait+degraded

If it's just a matter of waiting for the system to complete the recovery that's fine, I'll deal with that, but I was wondering if there is a more subtle problem here.

OK, I'll wait for the recovery to complete and see what happens, thanks.

Cheers,

    Alessandro

On 08/01/18 17:36, Lincoln Bryant wrote:
> Hi Alessandro,
>
> What is the state of your PGs? Inactive PGs have blocked CephFS
> recovery on our cluster before. I'd try to clear any blocked ops and
> see if the MDSes recover.
>
> --Lincoln
>
> On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:
>> Hi,
>>
>> I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded.
>>
>> I have 2 active mds instances and 1 standby. All the active instances
>> are now in replay state and show the same error in the logs:
>>
>> ---- mds1 ----
>>
>> 2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 164
>> starting mds.mds1 at -
>> 2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore empty --pid-file
>> 2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map standby
>> 2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map i am now mds.1.20635
>> 2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635 handle_mds_map state change up:boot --> up:replay
>> 2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
>> 2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635 recovery set is 0
>> 2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635 waiting for osdmap 21467 (which blacklists prior instance)
>>
>> ---- mds2 ----
>>
>> 2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 295
>> starting mds.mds2 at -
>> 2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore empty --pid-file
>> 2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map standby
>> 2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map i am now mds.0.20637
>> 2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637 handle_mds_map state change up:boot --> up:replay
>> 2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
>> 2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637 recovery set is 1
>> 2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637 waiting for osdmap 21467 (which blacklists prior instance)
>> 2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating system inode with ino:0x100
>> 2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating system inode with ino:0x1
>>
>> The cluster is recovering as we are changing some of the osds, and there
>> are a few slow/stuck requests, but I'm not sure if this is the cause, as
>> there is apparently no data loss (until now).
>>
>> How can I force the MDSes to quit the replay state?
>>
>> Thanks for any help,
>>
>>
>>     Alessandro
>>
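Regarding Lincoln's suggestion above about inactive PGs and blocked ops, a few illustrative commands for checking that (osd.<id> is a placeholder for whichever OSDs are reported with slow requests):

    ceph health detail                        # lists stuck requests and the OSDs involved
    ceph pg dump_stuck inactive               # PGs that are not yet active
    ceph osd blocked-by                       # OSDs blocking peering of others
    ceph daemon osd.<id> dump_ops_in_flight   # run on the OSD host: ops currently stuck in flight

The MDS journal lives in the metadata pool, so if any of that pool's PGs are among the inactive ones, replay can stall until they become active again.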
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com