Stéphane,
As you use VMs for your deployment, would it make sense to stop (and
keep) the current VMs and restart with a new set of VMs, following
https://docs.ceph.com/en/latest/cephadm/install/? I personally don't
have experience expanding a cluster through the dashboard (it should
work! just that I am an old guy so used to command line tools!).
Using the command line probably makes it easier to identify when there
is a problem, without always digging in the logs. I remember following
this doc a couple of years ago, the last time I created a cluster, and
it worked as expected.
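For reference, a fresh install following that doc is essentially a
handful of commands; the hostnames and IPs below are just placeholders
borrowed from your earlier logs, adapt them to your setup:

```shell
# on the first node, bootstrap the cluster (mon IP is a placeholder)
cephadm bootstrap --mon-ip 10.32.100.22
# distribute cephadm's SSH key, then add the other hosts
ssh-copy-id -f -i /etc/ceph/ceph.pub root@srvr-ceph-02
ssh-copy-id -f -i /etc/ceph/ceph.pub root@srvr-ceph-03
ceph orch host add srvr-ceph-02 10.32.100.23
ceph orch host add srvr-ceph-03 10.32.100.24
```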
If you restart with something fresh, I'd start with the Squid version
rather than Reef. It should not make any difference for the installation
but will bring you the latest-and-greatest Ceph! Once you have something
running properly, you can compare what is different in the other
configuration. I don't have any opinion on Ubuntu 24 as we are using
RedHat (AlmaLinux, in fact), but for me the OS version should not make a
big difference when using cephadm, as Ceph in fact runs in containers.
But the devil is in the details; Malte may have a reason for their warning...
One thing we didn't mention/check with you in the previous exchanges is
the Ceph network configuration you used. If there is any discrepancy in
the network configuration, in particular if you have separate cluster
(network used only by OSDs) and public (network used by monitors and
Ceph clients) networks, you may have a situation where one of them is
not working as expected, which may lead to daemons not being able to see
each other. But I think your problem is much more basic: the daemons
were not able to restart after some corruption of their DBs. It is
pretty unusual and may also be the result of something weird that
happened in your VM infrastructure leading to storage corruption...
What are you using to manage your VMs?
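For the record, a separate cluster/public network split is declared in
ceph.conf roughly like this (the subnets below are only examples, not
your actual ones):

```ini
[global]
# network used by monitors and Ceph clients
public_network = 10.32.100.0/24
# optional second network carrying only OSD replication/heartbeat traffic
cluster_network = 10.32.200.0/24
```

Both subnets must actually be reachable from every host that needs them,
else daemons can appear to vanish from each other.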
Michel
On 22/07/2025 at 13:48, Stéphane Barthes wrote:
Michel,
Thank you very much for the help.
I will look into the documentation provided by Eugen to try to reduce to
1 mon, then remove and re-add the 2 mons on nodes ceph01 and ceph02.
Regarding the installation, I created a VM template with Ubuntu
installed, ready for Ceph.
I installed cephadm on ceph01, bootstrapped the cluster, and added the
other VMs from the dashboard. I guess this is all I did to set up the
cluster.
Regards,
S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 12:12, Michel Jouvin wrote:
Stéphane,
Basically you cannot do anything in your cluster until you reach
quorum, except managing it with cephadm to restore a functioning
cluster. If 'ceph -s' doesn't return, it means you lost quorum;
it is the only reason I'm aware of for this. As your cluster is quite
simple, it should be easy to see the state of the monitor daemon on
each host where one should run, using `cephadm ls` and/or
`podman/docker ps`. And you should be able to get access to the
logs of the monitor daemons.
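Concretely, something like this on each host (the mon name below is an
example; yours may differ):

```shell
cephadm ls                              # JSON list of deployed daemons and their state
podman ps --filter name=ceph            # or 'docker ps', depending on your container runtime
cephadm logs --name mon.srvr-ceph-01    # journal of that specific mon daemon
```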
In one of your messages yesterday you reported a log saying the
RocksDB of one of the mons was corrupted. I personally never saw
that, but the first thing to do is to fix this, as it will prevent the
mon from starting. Follow the doc mentioned by Eugen to reduce your
quorum to 1 mon (deleting the 2 broken ones from the monmap) if
necessary (if you don't find a way to start at least 2 mons). And as
said in another message, ensure you added the label _admin to the hosts
where you want to be able to use the ceph command, else the required
information to connect to the cluster will be missing. It is done
with the 'ceph orch host label add' command, which requires that you
have fixed the quorum issue. One possibility, if you have one healthy
mon and you manage to reduce the quorum to 1, is to delete the 2 other
mons and re-add them as new mons so that they are reinitialized. This
way you will not lose anything. Look at the cephadm documentation to
learn how to remove and add daemons.
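If you end up having to shrink the monmap to the single healthy mon, the
usual offline sequence looks roughly like this (an untested sketch; all
daemon/host names are placeholders to adapt, and the doc Eugen mentioned
remains the reference):

```shell
# stop the surviving mon, then edit its monmap offline from a shell
# that has its data directory mounted
cephadm unit --name mon.srvr-ceph-03 stop
cephadm shell --name mon.srvr-ceph-03
ceph-mon -i srvr-ceph-03 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm srvr-ceph-01 --rm srvr-ceph-02
ceph-mon -i srvr-ceph-03 --inject-monmap /tmp/monmap
exit
# restart the mon; once quorum (of 1) is back, re-add the others fresh
cephadm unit --name mon.srvr-ceph-03 start
ceph orch daemon rm mon.srvr-ceph-01 --force
ceph orch daemon add mon srvr-ceph-01
```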
One thing not fully clear to me is how you installed your different
hosts. It seems they are not configured exactly the same way, as on
one host the ceph command is not available while it is on the other
ones. Ceph doesn't need a lot from the OS when using cephadm, but it is
pretty important to ensure that all your Ceph hosts are deployed the
same way/with the same config, else you just add to the entropy...
I fully agree with you and Eugen that trying to fix things is a way
to learn a lot, but at the same time it is not very easy to help you
with the very limited information we have on what you did to end up in
such a strange situation... So if you don't manage to converge, maybe
it is better to restart from scratch, following the instructions
carefully: you will have plenty of other occasions to learn anyway!
Michel
On 22/07/2025 at 11:04, Stéphane Barthes wrote:
Hi Michel,
Does this mean I need to recover quorum before any fixing can happen?
Should I kick off a new VM and add a mon to the cluster via cephadm?
This would allow me to have 2 running mons?
S. Barthes
On 22/07/2025 at 10:39, Michel Jouvin wrote:
Hi Stéphane,
'ceph -s' requires the mon quorum to be reached, else the Ceph
cluster hangs. cephadm is not using the Ceph cluster internal
communication but builds a management cluster on top of it, so
it can manage the cluster even if the quorum is lost, but it cannot
provide any information that requires the quorum to be reached.
Michel
On 22/07/2025 at 10:33, Stéphane Barthes wrote:
Hi Malte,
Thanks for your reply. Here is some info:
ceph -s hangs and times out (mon hunting) after 300 s.
But I can run cephadm shell. Is there a similar command under
cephadm shell?
ceph health detail: same as above.
I would like to repair it, instead of wiping & restarting, as it is
(from my point of view) a good way to learn (and there is some
data I'd like to recover).
What is the problem with Ubuntu 24? I did not see warnings
regarding this specific version in
https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros
Regards,
S. Barthes
On 22/07/2025 at 10:02, Malte Stroem wrote:
Hello Stéphane,
I think you're mixing up a lot of things!
You always have to show us the output of:
ceph -s
And more! Logs and stuff, e.g.:
ceph health detail
It is clear you missed something here and there.
It is repairable but since it is a test cluster, just delete it
and start again.
And follow the documentation for cephadm. And do not use Ubuntu
24.04.
Best,
Malte
On 22.07.25 09:02, Stéphane Barthes wrote:
Hello,
Today, things have degraded a bit more. The ceph-03 mon has failed
and will not restart. It shows the same kind of checksum error
in the rocksdb compact operation during startup. As a consequence, I
lost quorum, and ceph commands hang.
Would it be wise to disable rocksdb compaction, to restart and get
quorum back? If yes, what is the exact syntax of the setting in
ceph.conf? I have seen one for OSDs, but I am not sure if it would apply:
[osd]
osd_compact_on_start = true
If I can restart, I will try to out the OSDs and recreate them.
Last time I looked, the OSDs seemed fine in the dashboard. Since I
have no dashboard, is there a command I can use to check their
status?
Regards,
S. Barthes
On 21/07/2025 at 14:27, Stéphane Barthes wrote:
Michel,
cephadm shell starts on all 3 nodes without error, and each
host has the same ceph public key entry in the
.ssh/authorized_keys file of the root user.
ceph-01 also has ceph.pub in /etc/ceph with the same key (this
is the node I started the install from).
ceph-02 has no /etc/ceph folder.
ceph-03 has a /etc/ceph folder, but no ceph.pub file there.
S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:
Hi Stéphane,
Sorry, I was busy and did not look at your previous answers...
It is a bit difficult for me to understand how you ended up in
this situation, but for me it is strange that ceph-02 complains
about a missing keyring, and the corrupted RocksDB on a freshly
created cluster is also a bit strange. I don't think it
makes sense to destroy and recreate the OSD; I am running
several clusters with hundreds of OSDs and I never saw a
mis-initialized one. The problem is hiding something else, I'm
afraid. Because of some misconfiguration, maybe one OSD is in
a bad state and may need to be reinitialized, but first we
should get the 3 mons running properly and `cephadm shell`
working properly on the 3 hosts. And the RocksDB compaction
issue, for me, is related to your mon, not to an OSD.
Have you checked that the SSH configuration for cephadm is working
well from any host to any other one in your cluster (with 3
hosts, it should be really straightforward to check)? The
ceph-02 problem may be the sign of an SSH misconfiguration, as
cephadm uses SSH connections to push the keyring, if I am
right.
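A quick way to verify it (cephadm connects as root by default; the
hostnames below are examples):

```shell
# from each host, every hop should print the remote hostname without a prompt
for h in srvr-ceph-01 srvr-ceph-02 srvr-ceph-03; do
  ssh -o BatchMode=yes root@$h hostname || echo "SSH to $h failed"
done
# the public key cephadm pushes around can be shown with (needs quorum):
ceph cephadm get-pub-key
```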
Michel
On 21/07/2025 at 12:17, Stéphane Barthes wrote:
Hi,
Should I just wipe the OSD and let Ceph rebuild it (as
suggested there:
https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?
Which would be the suggested way:
cephadm rm-daemon osd.ceph-01
then
cephadm deploy?
Regards,
S. Barthes
On 21/07/2025 at 10:33, Stéphane Barthes wrote:
Michel,
ceph-02 logs :
root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to
find a keyring on
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/
ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
2025-07-21T08:16:54.814+0000 7efe1a884640 -1
AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to
find a keyring on
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/
ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
2025-07-21T08:16:54.818+0000 7efe1a884640 -1
AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
2025-07-21T08:16:54.818+0000 7efe18e21640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe13fff640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe18e21640 -1
monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [1]
^CCluster connection aborted
root@srvr-ceph-02:/#
Regarding the ceph-01 log, there is a LOT. Looking from the
end, I see this:
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -19>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 auth:
KeyRing::load: loaded key file
/var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -18>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 mon.srvr-
ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -17>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc
handle_mgr_map Got map version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -16>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc
handle_mgr_map Active mgr is now
[v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -15>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc reconnect
Starting new session with
[v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -14>
2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-
ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't
like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -13>
2025-07-20T17:52:21.137+0000 7f359f42d8c0 0 mon.srvr-
ceph-01@-1(probing) e5 my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -12>
2025-07-20T17:52:21.161+0000 7f359d208640 3 rocksdb:
[db/db_impl/ db_impl_compaction_flush.cc:3026] Compaction
error: Corruption: block checksum mismatch: stored =
3368055299, computed = 2100551158 in
/var/lib/ceph/mon/ceph-srvr-ceph-01/ store.db/061999.sst
offset 10379525 size 91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -11>
2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb:
(Original Log Time 2025/07/20-17:52:21.164193)
[db/compaction/ compaction_job.cc:812] [default] compacted
to: base level 6 level multiplier 10.00 max bytes base
268435456 files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9
rd, 272.6 wr, level 6, files in(4, 1) out(0) MB in(4.0,
14.8) out(9.9), read-write-amplify(7.2) write- amplify(2.5)
Corruption: block checksum mismatch: stored = 3368055299,
computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-
ceph-01/store.db/061999.sst offset 10379525 size 91317,
records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -10>
2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb:
(Original Log Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1
{"time_micros": 1753033941164205, "job": 3, "event":
"compaction_finished", "compaction_time_micros": 38166,
"compaction_time_cpu_micros": 25133, "output_level": 6,
"num_output_files": 0, "total_output_size": 10404253,
"num_input_records": 25191, "num_output_records": 21216,
"num_subcompactions": 1, "output_compression":
"NoCompression", "num_single_delete_mismatches": 0,
"num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0,
0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -9>
2025-07-20T17:52:21.161+0000 7f359d208640 2 rocksdb:
[db/db_impl/ db_impl_compaction_flush.cc:2545] Waiting after
background compaction error: Corruption: block checksum
mismatch: stored = 3368055299, computed = 2100551158 in
/var/lib/ceph/mon/ceph-srvr- ceph-01/store.db/061999.sst
offset 10379525 size 91317, Accumulated background error
counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -8>
2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-
ceph-01@0(probing) e5 handle_auth_bad_method hmm, they
didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -7>
2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-
ceph-01@0(probing) e5 handle_auth_bad_method hmm, they
didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -6>
2025-07-20T17:52:21.741+0000 7f359c206640 1 mon.srvr-
ceph-01@0(probing) e5 handle_auth_request failed to assign
global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -5>
2025-07-20T17:52:21.749+0000 7f35981fe640 5 mon.srvr-
ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on
this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -4>
2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-
ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -3>
2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-
ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix =
mon_sync key = 'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix =
mon_sync key = 'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix =
mon_sync key = 'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -1>
2025-07-20T17:52:21.749+0000 7f35981fe640 -1
/home/jenkins-build/
build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
In function 'int
MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)'
thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
/home/jenkins-build/build/
workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version
17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy
(stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /usr/bin/ceph-
mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3:
(Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5)
[0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4:
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5:
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6:
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7:
(Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: /usr/bin/ceph-
mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9:
(DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10:
/usr/lib64/ceph/ libceph-common.so.2(+0x3bdea1)
[0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: /lib64/
libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /lib64/
libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug 0>
2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught
signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in thread
7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version
17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy
(stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: /lib64/
libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /lib64/
libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: /usr/bin/ceph-
mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7:
(Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5)
[0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8:
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9:
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10:
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11:
(Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /usr/bin/ceph-
mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 13:
(DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 14:
/usr/lib64/ceph/ libceph-common.so.2(+0x3bdea1)
[0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 15: /lib64/
libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 16: /lib64/
libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to
interpret this.
I do not know whether the logs are free of sensitive data,
which would otherwise prevent emailing them. Looking for "checksum
mismatch" in the logs, there are many of them (138).
How can I fix this checksum issue?
Regards,
S. Barthes
On 21/07/2025 at 09:59, Michel Jouvin wrote:
Stéphane,
On ceph-02, I am not sure why the ceph command is not
installed as on the other nodes, if you installed them the
same way. One way to get access to the ceph command on this
server should be to execute:
cephadm shell
This will start a container where you have the ceph
environment installed and configured for your cluster.
The situation is not as bad as I thought when reading your first
message. You have the mon quorum, so at least the ceph command
should be usable. The first thing to do is probably to log on
to your ceph-01 node and try to understand why the mon
daemon is crashing. You may want to run on this node:
cephadm ls ---> look for the exact daemon name
corresponding to the mon
cephadm logs --name $daemon_name
Apart from this, it is strange that ceph-03 reports a RADOS
error with 'ceph log last...'; this probably hides another
issue. Could you tell us what the same command says on ceph-02
(when run in cephadm shell)?
Michel
On 21/07/2025 at 09:44, Stéphane Barthes wrote:
Michel,
I ran "ceph log last debug cephadm" on my 3 nodes, and
"mileage varies"
ceph-01 :
some errors, and it ends with
2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe
(mgr.134360) 1368 : cephadm [INF] Deploying daemon
mon.srvr- ceph-03 on srvr-ceph-03
when I had to remove the mon daemon and redeploy on ceph-03.
ceph-02 :
root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph # version 18.2.4+snapc9f2b08f92, or
apt install ceph-common # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.
??? should I install ceph-common ???
ceph-03 :
root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS
object not found (error calling conf_read_file)')
root@srvr-ceph-03:~#
FWIW, ceph health is:
root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down,
quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently
crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum
srvr-ceph-03,srvr-ceph-02
mon.srvr-ceph-01 (rank 0) addr
[v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down
(out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:50:10.202091Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:49:47.712267Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:50:21.464475Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:49:36.609442Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:49:58.966663Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:51:36.947240Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:52:21.751711Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:51:48.490875Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:51:59.651129Z
mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
2025-07-20T17:52:10.552756Z
S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,
If you are using cephadm, the OS (distro and version)
you use should not matter. When using cephadm with
several servers (the general case!), it is important to
properly set up the SSH key used by cephadm for the
communication between nodes (cephadm is sort of an
SSH-based management cluster) and to check that you can
log in from one node to another using SSH. Can you
confirm that this is the case?
Also, cephadm has a specific log. I don't use the
dashboard much, and am not sure how you display it (it may be
part of the logs displayed by the dashboard), but you can
access it with the command:
ceph log last debug cephadm
Michel
On 21/07/2025 at 09:19, Stéphane Barthes wrote:
Hi,
Yes, I did use cephadm to bootstrap the 1st node in the
cluster, installed cephadm on the other nodes, and used
the dashboard to add the nodes to the cluster.
Regards,
S. Barthes
On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,
How did you configure your cluster? Have you been using
cephadm? If not, I really advise you to recreate your
cluster with cephadm, which includes a script to
bootstrap the cluster. In particular, if you don't have
detailed knowledge of the Ceph architecture and
management, it will ensure that your cluster is
properly configured and let you progressively learn
about Ceph details...
Best regards.
Michel
On 21/07/2025 at 09:02, Stéphane Barthes wrote:
Hello,
I am very new to Ceph and have started a small cluster
to get started with it.
But so far my experience is not very impressive,
probably due to a lack of knowledge and good practices.
I started with Ubuntu 24, installed 3 VMs for a Ceph
cluster, and somehow could not get it running. Adding
nodes would fail when adding OSDs with some weird error (I
found it on the web but could not solve the problem).
I then made a new cluster with 3 Ubuntu 22 VMs. Install
OK, start OK; I created 1 pool to test storing stuff
there and worked my way through crash testing. However,
the cluster dies during the weekly VM snapshot. It may
not be a good idea to run VM backups on a Ceph host, but
I find this a little surprising. (Crash testing
started earlier than expected.)
Bottom line is that, after the backup, the cluster is
in warning state with missing mons or logrotate failures,
and sometimes crashed machines. A systemctl restart of the
service or rebooting the node usually fixes it.
I am now stuck in a situation I cannot fix:
- 1 machine is a Ceph RBD client that cannot auth: auth
method 'x' error -13. I have tried quite a few things,
and none unlocked the situation. I am currently trying
to reboot the machine, but the busy/stuck RBD device
seems to block it. I am not looking forward to hard
resetting it.
- The node with the mgr service will not restart mon
or logrotate. I did reboot it again today, but I guess
this is not how a node is expected to behave.
So my questions:
- How can I unlock my stuck Ceph client when this
kind of error occurs?
- Is it expected behavior that the client loses
access to the cluster, which kind of kills the machine?
- Where should I look in the Ceph node logs to
figure out what is going wrong, and how do I fix it so
that it runs in a stable manner?
Regards,
--
S. Barthes
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io