Hi Tobi,

Thanks for your response. While I hadn’t tried restarting the active mgr, I did 
effectively accomplish the same result by failing it out with `ceph mgr fail`, 
thereby starting a new mgr process in another container. I’ve since tried 
restarting the active mgr, but it didn’t make any difference.
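
For the archives, what I ran was roughly the following (the mgr daemon name is
made up; `ceph orch ps --daemon-type mgr` lists the real ones):

  ceph mgr fail
  ceph mgr stat
  ceph orch daemon restart mgr.ceph-mon1.xxxxxx

The first fails the active mgr over to a standby, the second confirms which mgr
is now active, and the third restarts a specific mgr daemon.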

Eventually `ceph orch device ls` stops producing any output. After restarting 
or failing over the mgr it will report stale results from a few weeks ago until 
it attempts to refresh the device list. At that point it stops producing output 
again. I think this is the root cause of our problem. I followed your 
suggestion to check ceph-volume.log and the cephadm debug output. I 
didn’t see any obvious problems, but I didn’t have a lot of time to work on 
this yesterday. I hope to spend more time looking at these logs today.
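
In case it's useful to anyone following along, this is where I plan to look
(the cluster FSID is a placeholder, and I'm assuming cephadm simply passes the
inventory subcommand through to ceph-volume on the host):

  less /var/log/ceph/<cluster-fsid>/ceph-volume.log
  cephadm ceph-volume -- inventory --format json-pretty

The first is the per-host log of the orchestrator's device scans that Tobi
mentioned; the second runs the same inventory scan by hand on an OSD host.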

Cheers,
/rjg

On Oct 24, 2024, at 1:44 AM, Tobias Fischer <tobias.fisc...@clyso.com> wrote:


Hi Bob,
have you tried restarting the active mgr? (Sometimes the mgr gets stuck and 
prevents the orchestrator from working correctly.)
Regarding the orchestrator device scan: have a look at ceph-volume.log on the 
corresponding host. You will find it under 
/var/log/ceph/CLUSTER-ID/ceph-volume.log. This log is generated by the 
orchestrator's device scan. It may also help to look at the cephadm debug 
logs; see 
https://docs.ceph.com/en/latest/cephadm/operations/#watching-cephadm-log-messages
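
From memory the relevant commands are roughly the following, but please
double-check them against that page:

  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug
  ceph log last cephadm
  ceph config set mgr mgr/cephadm/log_to_cluster_level info

The first raises the cephadm log level, the second streams the messages live,
the third shows recent messages, and the last sets the level back when you are
done.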
Cheers,
tobi

On Wed, Oct 23, 2024 at 8:15 PM, Bob Gibson <r...@oicr.on.ca> wrote:
Sorry to resurrect this thread, but while I was able to get the cluster healthy 
again by manually creating the osd, I'm still unable to manage osds using the 
orchestrator.

The orchestrator is generally working, but it appears to be unable to scan 
devices. Immediately after failing out the mgr `ceph orch device ls` will 
display device status from >4 weeks ago, which was when we converted the 
cluster to be managed by cephadm. Eventually the orchestrator will attempt to 
refresh its device status. At this point `ceph orch device ls` stops displaying 
any output at all. I can reproduce this state almost immediately if I run `ceph 
orch device ls --refresh` to force an immediate refresh. The mgr log shows 
events like the following just before `ceph orch device ls` stops reporting 
output (one event for every osd node in the cluster):

"Detected new or changed devices on ceph-osd31”

Here are the osd services in play:

# ceph orch ls osd
NAME            PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd                         95  8m ago     -    <unmanaged>
osd.ceph-osd31               4  8m ago     6d   ceph-osd31

# ceph orch ls osd --export
service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: ceph-osd31
service_name: osd.ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
    rotational: 0
    size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

I tried deleting the default “osd” service in case it was somehow conflicting 
with my per-node spec, but it looks like that’s not allowed, so I assume any 
custom osd service specs override the unmanaged default.

# ceph orch rm osd
Invalid service 'osd'. Use 'ceph orch ls' to list available services.

My hunch is that some persistent state is corrupted, or there’s something else 
preventing the orchestrator from successfully refreshing its device status, but 
I don’t know how to troubleshoot this. Any ideas?

Cheers,
/rjg

P.S. @Eugen: When I first started this thread you said it was unnecessary to 
destroy an osd to convert it from unmanaged to managed. Can you explain how 
this is done? Although we want to recreate the osds to enable encryption, it 
would save time and unnecessary wear on the SSDs while troubleshooting.
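
My naive guess would be something like the following, but I'd rather hear how
it's actually meant to be done before touching live OSDs:

  ceph orch ls osd --export > osd-specs.yaml
  (edit osd-specs.yaml so the relevant spec has "unmanaged: false", or add a
  spec whose filters match the existing drives)
  ceph orch apply -i osd-specs.yaml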

On Oct 16, 2024, at 2:45 PM, Eugen Block <ebl...@nde.ag> wrote:


Glad to hear it worked out for you!

Quoting Bob Gibson <r...@oicr.on.ca>:

I’ve been away on vacation and just got back to this. I’m happy to
report that manually recreating the OSD with ceph-volume and then
adopting it with cephadm fixed the problem.

Thanks again for your help Eugen!

Cheers,
/rjg

On Sep 29, 2024, at 10:40 AM, Eugen Block <ebl...@nde.ag> wrote:


Okay, apparently this is not what I was facing. I see two other
options right now. The first would be to purge osd.88 from the crush
tree entirely.
The second approach would be to create an osd manually with plain ceph-volume
(not cephadm ceph-volume), i.e. a legacy osd (you'd get warnings about a stray
daemon). If that works, adopt the osd with cephadm, roughly as sketched below.
I don't have a better idea right now.
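
The sketch, with the device path only a placeholder (add --dmcrypt if you want
the osd encrypted):

  ceph-volume lvm create --bluestore --data /dev/sdX
  cephadm adopt --style legacy --name osd.88

The first command is plain ceph-volume on the osd host and creates a legacy
osd; the second adopts it into cephadm once it is up and in.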



--
Best Regards,

Tobias Fischer

Head of Ceph
Clyso GmbH
p: +49 89 2152527 41
a: Hohenzollernstraße 27 | 80801 München | Germany
w: https://clyso.com | e: tobias.fisc...@clyso.com

We are hiring: https://www.clyso.com/jobs/
---
Geschäftsführer: Dipl. Inf. (FH) Joachim Kraftmayer
Unternehmenssitz: Utting am Ammersee
Handelsregister beim Amtsgericht: Augsburg
Handelsregister-Nummer: HRB 25866
USt. ID-Nr.: DE275430677

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
