[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-25 Thread Adrien Georget

Hi Xiubo,

Thanks for your analysis.
Is there anything I can do to put CephFS back into a healthy state, or 
should I wait for the patch that fixes this bug?


Cheers,
Adrien

On 25/11/2022 at 06:13, Xiubo Li wrote:

Hi Adrien,

Thank you for your logs.

From your logs I found a bug. I have raised a new tracker issue [1] 
to follow it and opened a Ceph PR [2] to fix it.


For more detail, please see my analysis in the tracker [1].

[1] https://tracker.ceph.com/issues/58082
[2] https://github.com/ceph/ceph/pull/49048

Thanks

- Xiubo


On 24/11/2022 16:33, Adrien Georget wrote:

Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few 
Kubernetes pods as clients accessing their PVCs on CephFS.


I can reproduce the problem every time I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here: 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd 

(The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)
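
For anyone trying to reproduce this, a minimal sketch of how such a debug log can be captured, assuming the levels are set through the central config database (debug_mds 25 / debug_ms 1 is very verbose, so the levels are reverted afterwards):

# ceph config set mds debug_mds 25
# ceph config set mds debug_ms 1
(restart the MDS to reproduce the issue, then collect its log)
# ceph config rm mds debug_mds
# ceph config rm mds debug_ms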


I couldn't find any errors in the OSD logs; is there anything specific 
I should be looking for?


Best,
Adrien 






[ceph-users] Is there any risk in adjusting the osd_heartbeat_grace & osd_heartbeat_interval

2022-11-25 Thread yite gu
Hi!

osd_heartbeat_interval is the interval (default 6 seconds) between peer pings.
If a peer does not reply within osd_heartbeat_grace (default 20 seconds), the
OSD reports the peer OSD as failed to the mon, and the mon then marks the
failed OSD down.

So client requests can be blocked for up to 20 seconds, which is too long.
If we adjust osd_heartbeat_grace and osd_heartbeat_interval as follows:
  osd_heartbeat_grace = 7
  osd_heartbeat_interval = 3
then when peer pings fail, client requests should only be stuck for about 7
seconds.
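
For reference, a minimal sketch of how the proposed values could be applied via the central config database (osd_heartbeat_grace is set globally here because, as far as I know, it is consulted by both the OSDs and the mons when evaluating failure reports; please verify the defaults and behaviour on your own version):

# ceph config set global osd_heartbeat_grace 7
# ceph config set osd osd_heartbeat_interval 3
# ceph config get osd osd_heartbeat_grace
# ceph config get osd osd_heartbeat_interval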

Is there any risk in adjusting osd_heartbeat_grace and
osd_heartbeat_interval, or are there better practices for this?

Best regards


[ceph-users] osd removal leaves 'stray daemon'

2022-11-25 Thread Holger Naundorf

Hello,
I have a question about osd removal/replacement:

I just removed an OSD whose disk was still running but had read 
errors, leading to failed deep scrubs. As the intent is to replace it 
as soon as we manage to get a spare, I removed it with the '--replace' flag:


# ceph orch osd rm 224 --replace

After all placement groups were evacuated, I now have 1 OSD down/out
and showing as 'destroyed':

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME  STATUS     REWEIGHT  PRI-AFF
(...)
214  hdd    14.55269  osd.214    up              1.0      1.0
224  hdd    14.55269  osd.224    destroyed         0      1.0
234  hdd    14.55269  osd.234    up              1.0      1.0
(...)

All as expected, but now the health check complains that the 
(destroyed) OSD is not managed:


# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon osd.224 on host ceph19 not managed by cephadm

Is this expected behaviour, so that I have to live with the yellow health 
check until we get a replacement disk and recreate the OSD, or did 
something not finish correctly?
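
In case it helps while waiting for the spare, a sketch of what I would check in this situation (the cephadm option name is quoted from the cephadm docs as I remember them, so please verify it on your version before relying on it):

# ceph orch osd rm status
# ceph orch ps ceph19
and, if you only want to silence the warning until the disk is replaced:
# ceph config set mgr mgr/cephadm/warn_on_stray_daemons false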


Regards,
Holger

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naund...@rz.uni-kiel.de


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-25 Thread Xiubo Li


On 25/11/2022 16:25, Adrien Georget wrote:

Hi Xiubo,

Thanks for your analysis.
Is there anything I can do to put CephFS back into a healthy state, or 
should I wait for the patch that fixes this bug?


Please try to trim the journals and unmount all the clients first, and 
then see whether you can bring the MDSs back up.
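
A minimal sketch of what that could look like, assuming a single-rank filesystem named 'cephfs', kernel clients mounted at /mnt/cephfs, and cephadm-managed daemons (no destructive journal operations here, only inspection; take the actual service and mount names from your own 'ceph orch ps' and client configuration):

On each client node:
# umount /mnt/cephfs

Check the journal state for rank 0:
# cephfs-journal-tool --rank=cephfs:0 journal inspect

Restart the MDS daemons and watch whether they come back healthy:
# ceph orch restart mds.cephfs
# ceph fs status cephfs
# ceph -s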


- Xiubo


Cheers,
Adrien

On 25/11/2022 at 06:13, Xiubo Li wrote:

Hi Adrien,

Thank you for your logs.

From your logs I found a bug. I have raised a new tracker issue [1] 
to follow it and opened a Ceph PR [2] to fix it.


For more detail, please see my analysis in the tracker [1].

[1] https://tracker.ceph.com/issues/58082
[2] https://github.com/ceph/ceph/pull/49048

Thanks

- Xiubo


On 24/11/2022 16:33, Adrien Georget wrote:

Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few 
Kubernetes pods as clients accessing their PVCs on CephFS.


I can reproduce the problem every time I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here: 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd 

(The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)


I couldn't find any errors in the OSD logs; is there anything specific 
I should be looking for?


Best,
Adrien 





