Hi Holger,

In addition to Eugen's sound advice, I would try restarting the OSD in question.
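
If you want to try that first, it could look roughly like this (osd.406 is 
the OSD from your status output, and <fsid> is a placeholder for your 
cluster's fsid):

  # via the orchestrator:
  ceph orch daemon restart osd.406

  # or directly on acn07, bypassing the orchestrator:
  systemctl restart ceph-<fsid>@osd.406.service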

If that doesn't help, I would stop the current removal with 'ceph orch osd 
rm stop 406' and restart it with 'ceph orch osd rm 406 --force'. As for the 
orchestrator logs, you can use the command 'ceph log last 1000 debug cephadm' 
and also check the Ceph log files on host 'acn07' for any errors.
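
Roughly, the sequence could look like this (keeping the --replace flag that 
was used for the original removal; adjust the flags to what you actually 
want):

  ceph orch osd rm stop 406
  ceph orch osd rm 406 --replace --force

  # cephadm/orchestrator entries kept in the cluster log:
  ceph log last 1000 debug cephadm

  # and on acn07 itself, the daemon's journal can be pulled with:
  cephadm logs --name osd.406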

Regards,
Frédéric.

----- On 28 Jun 25, at 19:53, Eugen Block ebl...@nde.ag wrote:

> Can you show the overall cluster status (ceph -s)? If there's
> something else going on, it might block (some?) operations. And I'd
> scan the mgr logs, maybe in debug mode to see why it fails to operate
> properly.
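> 
> For the mgr/cephadm side, something like this might work (the config 
> setting below raises the cephadm module's cluster-log level; remember 
> to reset it afterwards):
> 
>   ceph -s
>   ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>   ceph -W cephadm --watch-debug
>   # reset when done:
>   ceph config rm mgr mgr/cephadm/log_to_cluster_level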
> 
> Quoting Holger Naundorf <naund...@rz.uni-kiel.de>:
> 
>> On 27.06.25 14:16, Eugen Block wrote:
>>>
>>>
>>> Quoting Holger Naundorf <naund...@rz.uni-kiel.de>:
>>>
>>>> Hello,
>>>> The title should of course be
>>>>  "orchestrator behaving strangely"
>>>>
>>>> I gave a mgr restart another try (for the last OSD removal, which
>>>> also did not work, I had already restarted the mgr, without effect).
>>>>
>>>> There is no (immediate, i.e. after ~10 min) effect now either - or
>>>> should I reissue the OSD rm command as well?
>>>
>>> Is there something in the queue (ceph orch osd rm status)?
>>> Sometimes the queue clears after a mgr restart, so it might be
>>> necessary to restart the rm command as well.
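>>>
>>> For example (assuming the same flags as the original removal):
>>>
>>>   ceph mgr fail
>>>   ceph orch osd rm status
>>>   ceph orch osd rm 406 --replace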
>>>
>> There is just the one 'waiting for purge' osd in the queue:
>>
>> root@aadm01:~# ceph orch osd rm status
>> OSD  HOST   STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
>> 406  acn07  done, waiting for purge    0  True     False  True  2025-06-25 09:18:07.650734+00:00
>>
>> One more datapoint:
>>
>> This is an OSD on a large, rotational disk. The orchestrator is still
>> working OK for a subset of OSDs on SSDs. We are just moving our SSD
>> pool around, and for that it was no problem using 'ceph orch osd rm
>> ...' - with the difference that there we used --zap and not
>> --replace (as we do not want to replace the disk, but to move the
>> OSD away from this host).
>>
>> Regards,
>> Holger
>>
>>
>>
>>
>>>> Regards,
>>>> Holger
>>>>
>>>>
>>>> On 27.06.25 12:26, Eugen Block wrote:
>>>>> Hi,
>>>>>
>>>>> have you retried it after restarting/failing the mgr?
>>>>>
>>>>> ceph mgr fail
>>>>>
>>>>> Quite often this (still) helps.
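>>>>>
>>>>> To confirm that the failover actually happened, one option is to
>>>>> compare the active mgr before and after:
>>>>>
>>>>>   ceph mgr stat
>>>>>   ceph mgr fail
>>>>>   ceph mgr stat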
>>>>>
>>>>> Quoting Holger Naundorf <naund...@rz.uni-kiel.de>:
>>>>>
>>>>>> Hello,
>>>>>> we are running a ceph cluster at version:
>>>>>>
>>>>>> ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8)
>>>>>> squid (stable)
>>>>>>
>>>>>> and a few weeks ago the orchestrator started to misbehave - so far
>>>>>> we have not been able to identify a root cause, so I am fishing in
>>>>>> the community to see if there are any hints.
>>>>>>
>>>>>> Problems:
>>>>>>
>>>>>> An OSD removal (for disk replacement) gets stuck in the 'purge' step:
>>>>>>
>>>>>> ceph orch osd rm 406 --replace
>>>>>>
>>>>>> root@aadm01:~# ceph orch osd rm status
>>>>>> OSD  HOST   STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
>>>>>> 406  acn07  done, waiting for purge    0  True     False  True  2025-06-25 09:18:07.650734+00:00
>>>>>>
>>>>>> (now for more than 24h in this state)
>>>>>>
>>>>>> At the same time the orchestrator is not restarting OSD daemons
>>>>>> - i.e. a 'ceph orch daemon restart osd.xxx' claims it is queuing
>>>>>> up the restart, but it never happens. Other services continue to
>>>>>> be controlled correctly via 'ceph orch ...'.
>>>>>>
>>>>>> If anyone has an idea where to poke around or can match this to
>>>>>> some known problem - I would appreciate any pointers.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Holger
>>>>>>
>>>>>> --
>>>>>> Dr. Holger Naundorf
>>>>>> Christian-Albrechts-Universität zu Kiel
>>>>>> Rechenzentrum / HPC / Server und Storage
>>>>>> Tel: +49 431 880-1990
>>>>>> Fax:  +49 431 880-1523
>>>>>> naund...@rz.uni-kiel.de
>>>>>
>>>>>
>>>>
>>>> --
>>>> Dr. Holger Naundorf
>>>> Christian-Albrechts-Universität zu Kiel
>>>> Rechenzentrum / HPC / Server und Storage
>>>> Tel: +49 431 880-1990
>>>> Fax:  +49 431 880-1523
>>>> naund...@rz.uni-kiel.de
>>>
>>>
>>
>> --
>> Dr. Holger Naundorf
>> Christian-Albrechts-Universität zu Kiel
>> Rechenzentrum / HPC / Server und Storage
>> Tel: +49 431 880-1990
>> Fax:  +49 431 880-1523
>> naund...@rz.uni-kiel.de
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
