Sounds great that the upgrade went through!
Maybe it should be better documented that you should not zap a
device intended for permanent removal unless the
osd.all-available-devices service placement is set to unmanaged...
I'd rather vote for not considering osd.all-available-devices as
something for production. ;-) I only use that in small test clusters
to quickly set up OSDs without having to deal with specs. But in
production clusters, I like to have full control over OSD creation.
Just my opinion. :-)
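For reference, here is a minimal sketch of how to switch that service to unmanaged with the cephadm orchestrator CLI (double-check the flags against your Ceph release before running this on a production cluster):

```shell
# Tell cephadm to stop automatically consuming free (e.g. freshly
# zapped) devices for new OSDs:
ceph orch apply osd --all-available-devices --unmanaged=true

# Verify the service state; the osd.all-available-devices entry
# should now show as unmanaged:
ceph orch ls osd
```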
Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
Hi,
Thanks for all the feedback and suggestions. Summary of the summary:
after cancelling the removal of the OSD waiting to be zapped (its
disk was no longer available), the upgrade started immediately
and ran well. The cluster is now running 18.2.6! And as Eugen said
previously, I confirm that in 18.2.6, removed OSDs are no
longer considered stray daemons. I still have the feeling that Ceph
could give more useful information if:
- a cephadm message at INFO level (also visible with 'ceph orch
upgrade status') reported that the upgrade cannot proceed for the
described reason. This information could be given once, for example
a few minutes after entering the upgrade command if no daemon has
been upgraded yet.
- a message at INFO level reported that the zap operation
failed (suggesting to use DEBUG level for more information).
About Anthony's last question: yes, the 2 OSDs were destroyed, as shown by:
# ceph osd tree|grep destroyed
253 hdd 16.37108 osd.253 destroyed 0 1.00000
381 hdd 16.37108 osd.381 destroyed 0 1.00000
@Eugen, regarding what I said about osd.381's device being picked up
by Ceph to replace the failed osd.381: I think it was the conjunction
of the fact that the osd.all-available-devices service placement was
not set to unmanaged (something we normally do, but as we added a
few servers recently we changed it and forgot to set it back to
unmanaged) and the fact that, during the initial removal, I zapped
the device. Because of this, the device appeared to be free for
use... Maybe it should be better documented that you should not zap
a device intended for permanent removal unless the
osd.all-available-devices service placement is set to unmanaged...
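To make the sequence concrete, here is a hedged sketch of the safer removal flow implied above, using the cephadm orchestrator CLI (the OSD id 381 is just this thread's example; verify the flags against your release):

```shell
# 1. First make sure cephadm will not auto-redeploy OSDs onto
#    devices that become free:
ceph orch apply osd --all-available-devices --unmanaged=true

# 2. Remove the OSD. Use --replace if the disk will be swapped
#    (marks the OSD "destroyed" so its id can be reused); omit
#    --zap for a disk that is being permanently removed:
ceph orch osd rm 381 --replace

# 3. Track the draining/removal progress:
ceph orch osd rm status
```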
Thanks again. Best regards,
Michel
On 30/04/2025 at 15:41, Eugen Block wrote:
Hm, I thought there was an excerpt from the osd tree, but
apparently not? Could you then please confirm that the OSDs are in
fact marked as destroyed in the osd tree?
Quoting Anthony D'Atri <anthony.da...@gmail.com>:
I'm not entirely sure what the orchestrator will do except for
clearing the pending state, and since the OSDs are already marked
as destroyed in the crush tree,
Do we know that they are? The thread shows some log messages but,
unless I'm missing it, no evidence that they were marked. When I
ran into a similar issue recently, they were not marked destroyed
in the CRUSH tree.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io