[ceph-users] Re: filesystem became read only after Quincy upgrade
Hi Xiubo,

Thanks for your analysis. Is there anything I can do to put CephFS back into a healthy state, or should I wait for the patch to fix that bug?

Cheers,
Adrien

On 25/11/2022 at 06:13, Xiubo Li wrote:
> Hi Adrien,
>
> Thank you for your logs. From your logs I found one bug; I have raised a new tracker [1] to follow it and a Ceph PR [2] to fix it. For more detail, please see my analysis in the tracker [1].
>
> [1] https://tracker.ceph.com/issues/58082
> [2] https://github.com/ceph/ceph/pull/49048
>
> Thanks
> - Xiubo
>
> On 24/11/2022 16:33, Adrien Georget wrote:
>> Hi Xiubo,
>>
>> We did the upgrade in rolling mode as always, with only a few Kubernetes pods as clients accessing their PVCs on CephFS. I can reproduce the problem every time I restart the MDS daemon.
>>
>> You can find the MDS log with debug_mds 25 and debug_ms 1 here:
>> https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd
>> (The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly)
>>
>> I couldn't find any errors in the OSD logs; is there anything specific I should be looking for?
>>
>> Best,
>> Adrien
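For anyone trying to reproduce this, a hedged sketch of how the debug levels mentioned above (debug_mds 25, debug_ms 1) can be raised via the config database and reverted afterwards; this is just the generic procedure, not confirmed as the exact steps used in this thread:

# ceph config set mds debug_mds 25
# ceph config set mds debug_ms 1
(reproduce the issue and collect the MDS log, then drop the overrides again)
# ceph config rm mds debug_mds
# ceph config rm mds debug_ms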
[ceph-users] Is there any risk in adjusting the osd_heartbeat_grace & osd_heartbeat_interval
Hi!

osd_heartbeat_interval is the interval (6 seconds by default) between peer pings. If a peer does not reply within osd_heartbeat_grace (20 seconds by default), the OSD reports the peer OSD as failed to the mon, and the mon then marks the failed OSD down. So client requests can be blocked for up to 20 seconds, which is too long for us. If we adjust osd_heartbeat_grace and osd_heartbeat_interval as follows:

osd_heartbeat_grace = 7
osd_heartbeat_interval = 3

then when peer pings fail, client requests should only be stuck for about 7 seconds.

Is there any risk in adjusting osd_heartbeat_grace and osd_heartbeat_interval, or is there a better best practice?

Best regards
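For reference, a hedged sketch of how the values proposed above could be applied and checked at runtime via the config database (osd_heartbeat_grace is also consulted by the monitors, so setting it at the global level is the conservative choice here; the two 'get' calls just confirm what the OSDs will actually use):

# ceph config set global osd_heartbeat_grace 7
# ceph config set global osd_heartbeat_interval 3
# ceph config get osd osd_heartbeat_grace
# ceph config get osd osd_heartbeat_interval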
[ceph-users] osd removal leaves 'stray daemon'
Hello,

I have a question about OSD removal/replacement. I just removed an OSD whose disk was still running but had read errors, leading to failed deep scrubs. As the intent is to replace it as soon as we manage to get a spare, I removed it with the '--replace' flag:

# ceph orch osd rm 224 --replace

After all placement groups were evacuated I now have 1 OSD down/out and showing as 'destroyed':

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME  STATUS     REWEIGHT  PRI-AFF
(...)
214  hdd    14.55269  osd.214    up         1.0       1.0
224  hdd    14.55269  osd.224    destroyed  0         1.0
234  hdd    14.55269  osd.234    up         1.0       1.0
(...)

All as expected - but now the health check complains that the (destroyed) OSD is not managed:

# ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon osd.224 on host ceph19 not managed by cephadm

Is this expected behaviour, meaning I have to live with the yellow health check until we get a replacement disk and recreate the OSD, or did something not finish correctly?

Regards,
Holger

-- 
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax: +49 431 880-1523
naund...@rz.uni-kiel.de
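Not a definitive answer, but a hedged sketch of the two things that seem relevant here, assuming a cephadm-managed cluster (the device path is a placeholder, not taken from the message above):

Once the spare disk is installed, recreating the OSD on the same host should reuse the preserved ID 224:
# ceph orch daemon add osd ceph19:/dev/sdX

If the warning is too noisy in the meantime, it can be muted rather than fixed (this only hides the check, it does not resolve it):
# ceph health mute CEPHADM_STRAY_DAEMON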
[ceph-users] Re: filesystem became read only after Quincy upgrade
On 25/11/2022 16:25, Adrien Georget wrote:
> Hi Xiubo,
>
> Thanks for your analysis. Is there anything I can do to put CephFS back into a healthy state, or should I wait for the patch to fix that bug?

Please try to trim the journals and unmount all the clients first, and then see whether you can pull up the MDSs.

- Xiubo

> Cheers,
> Adrien
>
> On 25/11/2022 at 06:13, Xiubo Li wrote:
>> Hi Adrien,
>>
>> Thank you for your logs. From your logs I found one bug; I have raised a new tracker [1] to follow it and a Ceph PR [2] to fix it. For more detail, please see my analysis in the tracker [1].
>>
>> [1] https://tracker.ceph.com/issues/58082
>> [2] https://github.com/ceph/ceph/pull/49048
>>
>> Thanks
>> - Xiubo
>>
>> On 24/11/2022 16:33, Adrien Georget wrote:
>>> Hi Xiubo,
>>>
>>> We did the upgrade in rolling mode as always, with only a few Kubernetes pods as clients accessing their PVCs on CephFS. I can reproduce the problem every time I restart the MDS daemon.
>>>
>>> You can find the MDS log with debug_mds 25 and debug_ms 1 here:
>>> https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd
>>> (The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly)
>>>
>>> I couldn't find any errors in the OSD logs; is there anything specific I should be looking for?
>>>
>>> Best,
>>> Adrien
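The suggestion above is not spelled out as commands; purely for reference, a hedged sketch of the tooling usually involved, following the general CephFS disaster-recovery workflow rather than anything confirmed in this thread (<fs_name>, <mds_name> and the mount point are placeholders, and journal manipulation can be destructive, so keep a backup and ideally get upstream guidance first):

Back up and inspect the MDS journal before touching it:
# cephfs-journal-tool --rank=<fs_name>:0 journal export /root/mds-journal-backup.bin
# cephfs-journal-tool --rank=<fs_name>:0 journal inspect

Unmount CephFS on every client, e.g.:
# umount /mnt/cephfs

Then try to bring the MDS back up and watch its state:
# ceph mds fail <mds_name>
# ceph fs status <fs_name>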