Hi Malte, why "And do not use Ubuntu 24.04.", please? I just reinstalled my cluster and I am using 24.04 with 19.2.2. So, if need be, there is still time to redo / reconfigure.
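In case it helps, the versions actually deployed can be double-checked per host; a minimal sketch, assuming a cephadm-managed cluster:

```shell
# On each host: version of the cephadm binary itself
cephadm version

# Inside `cephadm shell`: versions reported by every running daemon,
# grouped by daemon type -- useful for spotting mixed-version clusters
ceph versions
```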
Steven

On Tue, 22 Jul 2025 at 04:05, Malte Stroem <malte.str...@gmail.com> wrote:
> Hello Stéphane,
>
> I think you're mixing up a lot!
>
> You always have to show us the output of:
>
> ceph -s
>
> And more! Logs and such, e.g.:
>
> ceph health detail
>
> It is clear you missed something here and there.
>
> It is repairable, but since it is a test cluster, just delete it and start again.
>
> And follow the documentation for cephadm. And do not use Ubuntu 24.04.
>
> Best,
> Malte
>
> On 22.07.25 09:02, Stéphane Barthes wrote:
> > Hello,
> >
> > Today, things have degraded a bit more. The ceph-03 mon has failed and will not restart. It shows the same kind of checksum error in the rocksdb compact operation during startup. As a consequence, I lost quorum, and ceph commands hang.
> >
> > Would it be wise to disable the rocksdb compaction, to restart and get quorum back? If yes, what is the exact syntax of the setting in ceph.conf? I have seen one for OSDs, but I am not sure whether it would apply:
> >
> > [osd]
> >
> > osd_compact_on_start = true
> >
> > If I can restart, I will try to out the OSDs and recreate them. Last time I looked, the OSDs seemed fine in the dashboard. Since I have no dashboard, is there a command I can use to check their status?
> >
> > Regards,
> >
> > S. Barthes
> > T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
> > InTest S.A.
> > 4 Allée du Levant
> > 69890 La Tour de Salvagny
> >
> > On 21/07/2025 14:27, Stéphane Barthes wrote:
> >> Michel,
> >>
> >> cephadm shell starts on all 3 nodes without error, and each host has the same ceph public key entry in the .ssh/authorized_keys file of the root user.
> >>
> >> ceph-01 also has ceph.pub in /etc/ceph with the same key (this is the node I started the install from).
> >>
> >> ceph-02 has no /etc/ceph folder.
> >>
> >> ceph-03 has a /etc/ceph folder, but no ceph.pub file there.
> >>
> >> S.
Barthes
> >> On 21/07/2025 12:36, Michel Jouvin wrote:
> >>> Hi Stéphane,
> >>>
> >>> Sorry, I was busy and did not look at your previous answers... It is a bit difficult for me to understand how you ended up in this situation, but it is strange to me that ceph-02 complains about a missing keyring, and the corrupted rocksdb on a freshly created cluster is also a bit strange to me. I don't think it makes sense to destroy and recreate the OSD; I am running several clusters with hundreds of OSDs and I never saw a mis-initialized one. The problem is hiding something else, I'm afraid. Because of some misconfiguration, maybe one OSD is in a bad state and may need to be reinitialized, but first we should get the 3 mons running properly and `cephadm shell` working properly on the 3 hosts. And the rocksdb compaction issue is, in my view, related to your mon, not to an OSD.
> >>>
> >>> Have you checked that the SSH configuration for cephadm is working from any host to any other one in your cluster (with 3 hosts, it should be really straightforward to check)? The ceph-02 problem may be the sign of SSH misconfiguration, as cephadm will use an SSH connection to push the keyring, if I am right.
> >>>
> >>> Michel
> >>>
> >>> On 21/07/2025 12:17, Stéphane Barthes wrote:
> >>>> Hi,
> >>>>
> >>>> Should I just wipe the OSD and let ceph rebuild it (as suggested there: https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?
> >>>>
> >>>> Which would be the suggested way:
> >>>>
> >>>> cephadm rm-daemon osd.ceph-01
> >>>>
> >>>> then
> >>>>
> >>>> cephadm deploy?
> >>>>
> >>>> Regards,
> >>>>
> >>>> S. Barthes
> >>>> T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
> >>>> InTest S.A.
> >>>> 4 Allée du Levant > >>>> 69890 La Tour de Salvagny > >>>> Le 21/07/2025 à 10:33, Stéphane Barthes a écrit : > >>>>> > >>>>> Michel, > >>>>> > >>>>> ceph-02 logs : > >>>>> > >>>>> root@srvr-ceph-02:/# ceph log last debug cephadm > >>>>> 2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a > >>>>> keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ > >>>>> ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such > >>>>> file or directory > >>>>> > >>>>> 2025-07-21T08:16:54.814+0000 7efe1a884640 -1 > >>>>> AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/ > >>>>> ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/ > >>>>> keyring,/etc/ceph/keyring.bin, disabling cephx > >>>>> > >>>>> 2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a > >>>>> keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ > >>>>> ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such > >>>>> file or directory > >>>>> > >>>>> 2025-07-21T08:16:54.818+0000 7efe1a884640 -1 > >>>>> AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/ > >>>>> ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/ > >>>>> keyring,/etc/ceph/keyring.bin, disabling cephx > >>>>> > >>>>> 2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): > >>>>> handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> 2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): > >>>>> handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> 2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): > >>>>> handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> 2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): > >>>>> handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> 2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): > >>>>> 
handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> 2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): > >>>>> handle_auth_bad_method server allowed_methods [2] but i only > >>>>> support [1] > >>>>> > >>>>> ^CCluster connection aborted > >>>>> root@srvr-ceph-02:/# > >>>>> > >>>>> > >>>>> Regarding the ceph-01 log, there is a LOT. looking from the end, I > >>>>> see this : > >>>>> > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -19> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 auth: KeyRing::load: > >>>>> loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -18> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 mon.srvr- > >>>>> ceph-01@-1(???) e5 init > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -17> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map > >>>>> Got map version 73 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -16> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map > >>>>> Active mgr is now > >>>>> [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -15> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc reconnect > >>>>> Starting new session with > >>>>> [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -14> > >>>>> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr- > >>>>> ceph-01@-1(???) 
e5 handle_auth_bad_method hmm, they didn't like 2 > >>>>> result (13) Permission denied > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -13> > >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 0 mon.srvr- > >>>>> ceph-01@-1(probing) e5 my rank is now 0 (was -1) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -12> > >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640 3 rocksdb: [db/db_impl/ > >>>>> db_impl_compaction_flush.cc:3026] Compaction error: Corruption: > >>>>> block checksum mismatch: stored = 3368055299, computed = > >>>>> 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/ > >>>>> store.db/061999.sst offset 10379525 size 91317 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -11> > >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log > >>>>> Time 2025/07/20-17:52:21.164193) [db/compaction/ > >>>>> compaction_job.cc:812] [default] compacted to: base level 6 level > >>>>> multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max > >>>>> score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1) > >>>>> out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write- > >>>>> amplify(2.5) Corruption: block checksum mismatch: stored = > >>>>> 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr- > >>>>> ceph-01/store.db/061999.sst offset 10379525 size 91317, records in: > >>>>> 25191, records dropped: 3 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -10> > >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log > >>>>> Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": > >>>>> 1753033941164205, "job": 3, "event": "compaction_finished", > >>>>> "compaction_time_micros": 38166, "compaction_time_cpu_micros": > >>>>> 25133, "output_level": 6, "num_output_files": 0, > >>>>> "total_output_size": 10404253, "num_input_records": 25191, > >>>>> "num_output_records": 21216, "num_subcompactions": 1, > >>>>> "output_compression": "NoCompression", > >>>>> 
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": > >>>>> 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]} > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -9> > >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640 2 rocksdb: [db/db_impl/ > >>>>> db_impl_compaction_flush.cc:2545] Waiting after background > >>>>> compaction error: Corruption: block checksum mismatch: stored = > >>>>> 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr- > >>>>> ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated > >>>>> background error counts: 1 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -8> > >>>>> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr- > >>>>> ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like > >>>>> 2 result (13) Permission denied > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -7> > >>>>> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr- > >>>>> ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like > >>>>> 2 result (13) Permission denied > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -6> > >>>>> 2025-07-20T17:52:21.741+0000 7f359c206640 1 mon.srvr- > >>>>> ceph-01@0(probing) e5 handle_auth_request failed to assign global_id > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -5> > >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 5 mon.srvr- > >>>>> ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this > >>>>> connection > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -4> > >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr- > >>>>> ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -3> > >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr- > >>>>> ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained > >>>>> monmap e5 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync > >>>>> 
key = 'latest_monmap' value size = 508) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync > >>>>> key = 'in_sync' value size = 8) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync > >>>>> key = 'last_committed_floor' value size = 8) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -1> > >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/ > >>>>> build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/ > >>>>> AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/ > >>>>> release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: > >>>>> In function 'int > >>>>> MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' > >>>>> thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000 > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/ > >>>>> workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/ > >>>>> AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/ > >>>>> release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: > >>>>> 355: ceph_abort_msg("failed to write to db") > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 > >>>>> (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: > >>>>> (ceph::__ceph_abort(char const*, int, char const*, > >>>>> std::__cxx11::basic_string<char, std::char_traits<char>, > >>>>> std::allocator<char> > const&)+0xd3) [0x7f35a03a5469] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /usr/bin/ceph- > >>>>> mon(+0x1e968e) [0x55e079c5768e] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: > >>>>> (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: > >>>>> > (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) > [0x55e079c90baf] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: > >>>>> 
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) > >>>>> [0x55e079c925dc] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: > >>>>> (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) > >>>>> [0x55e079cad20d] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: > >>>>> (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: /usr/bin/ceph- > >>>>> mon(+0x1f6d3e) [0x55e079c64d3e] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: > >>>>> (DispatchQueue::entry()+0x53a) [0x7f35a058d34a] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: /usr/lib64/ceph/ > >>>>> libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: /lib64/ > >>>>> libc.so.6(+0x89e92) [0x7f359fb6ce92] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /lib64/ > >>>>> libc.so.6(+0x10ef20) [0x7f359fbf1f20] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug 0> > >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal > >>>>> (Aborted) ** > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in thread 7f35981fe640 > >>>>> thread_name:ms_dispatch > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 > >>>>> (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable) > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: /lib64/ > >>>>> libc.so.6(+0x3e730) [0x7f359fb21730] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /lib64/ > >>>>> libc.so.6(+0x8bbdc) [0x7f359fb6ebdc] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: raise() > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: abort() > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: > >>>>> (ceph::__ceph_abort(char const*, int, char const*, > >>>>> std::__cxx11::basic_string<char, std::char_traits<char>, > >>>>> std::allocator<char> > const&)+0x190) [0x7f35a03a5526] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: /usr/bin/ceph- > >>>>> mon(+0x1e968e) [0x55e079c5768e] > >>>>> Jul 20 
17:52:21 srvr-ceph-01 bash[3424]: 7: > >>>>> (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: > >>>>> > (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) > [0x55e079c90baf] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: > >>>>> (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) > >>>>> [0x55e079c925dc] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: > >>>>> (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) > >>>>> [0x55e079cad20d] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: > >>>>> (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /usr/bin/ceph- > >>>>> mon(+0x1f6d3e) [0x55e079c64d3e] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 13: > >>>>> (DispatchQueue::entry()+0x53a) [0x7f35a058d34a] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 14: /usr/lib64/ceph/ > >>>>> libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 15: /lib64/ > >>>>> libc.so.6(+0x89e92) [0x7f359fb6ce92] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 16: /lib64/ > >>>>> libc.so.6(+0x10ef20) [0x7f359fbf1f20] > >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: NOTE: a copy of the > >>>>> executable, or `objdump -rdS <executable>` is needed to interpret > >>>>> this. > >>>>> > >>>>> I do not know if the logs are purged from sensitive data that would > >>>>> prevent emailing them. looking for "checksum mismatch", in the > >>>>> logs, there are many of them (138). > >>>>> > >>>>> How can I fix this checksum issue? > >>>>> > >>>>> > >>>>> Regards, > >>>>> > >>>>> > >>>>> S. Barthes > >>>>> > >>>>> Le 21/07/2025 à 09:59, Michel Jouvin a écrit : > >>>>>> Stéphane, > >>>>>> > >>>>>> On ceph-02, I am not sure why the ceph command is not installed as > >>>>>> on the other nodes, if you installed it the same way. 
One way to get access to the ceph command on this server should be to execute:
> >>>>>>
> >>>>>> cephadm shell
> >>>>>>
> >>>>>> This will start a container where you have the ceph environment installed and configured for your cluster.
> >>>>>>
> >>>>>> The situation is not as bad as I thought reading your first message. You have the mon quorum, so at least the ceph command should be usable. The first thing to do is probably to log on to your ceph-01 node and try to understand why the mon daemon is crashing. You may want to run on this node:
> >>>>>>
> >>>>>> cephadm ls ---> Look for the exact daemon name corresponding to the mon
> >>>>>>
> >>>>>> cephadm logs --daemon $daemon_name
> >>>>>>
> >>>>>> Apart from this, it is strange that ceph-03 reports a RADOS error with 'ceph log last...'; this probably hides another issue. Could you tell us what the same command says on ceph-02 (when run in cephadm shell)?
> >>>>>>
> >>>>>> Michel
> >>>>>>
> >>>>>> On 21/07/2025 09:44, Stéphane Barthes wrote:
> >>>>>>> Michel,
> >>>>>>>
> >>>>>>> I ran "ceph log last debug cephadm" on my 3 nodes, and "mileage varies":
> >>>>>>>
> >>>>>>> ceph-01:
> >>>>>>>
> >>>>>>> some errors, and it ends with
> >>>>>>>
> >>>>>>> 2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03
> >>>>>>>
> >>>>>>> when I had to remove the mon daemon and redeploy on ceph-03.
> >>>>>>>
> >>>>>>> ceph-02:
> >>>>>>>
> >>>>>>> root@srvr-ceph-02:~# ceph log last debug cephadm
> >>>>>>> Command 'ceph' not found, but can be installed with:
> >>>>>>> snap install microceph # version 18.2.4+snapc9f2b08f92, or
> >>>>>>> apt install ceph-common # version 17.2.7-0ubuntu0.22.04.2
> >>>>>>> See 'snap info microceph' for additional versions.
> >>>>>>>
> >>>>>>> ??? should I install ceph-common ???
> >>>>>>> > >>>>>>> ceph-03 : > >>>>>>> > >>>>>>> root@srvr-ceph-03:~# ceph log last debug cephadm > >>>>>>> Error initializing cluster client: ObjectNotFound('RADOS object > >>>>>>> not found (error calling conf_read_file)') > >>>>>>> root@srvr-ceph-03:~# > >>>>>>> > >>>>>>> FWIW : ceph health is : > >>>>>>> > >>>>>>> root@srvr-ceph-01:~# ceph health detail > >>>>>>> HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum > >>>>>>> srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed > >>>>>>> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s) > >>>>>>> daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state > >>>>>>> [WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02 > >>>>>>> mon.srvr-ceph-01 (rank 0) addr > >>>>>>> [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of > >>>>>>> quorum) > >>>>>>> [WRN] RECENT_CRASH: 10 daemons have recently crashed > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:50:10.202091Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:49:47.712267Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:50:21.464475Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:49:36.609442Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:49:58.966663Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:51:36.947240Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:52:21.751711Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:51:48.490875Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:51:59.651129Z > >>>>>>> mon.srvr-ceph-01 crashed on host srvr-ceph-01 at > >>>>>>> 2025-07-20T17:52:10.552756Z > >>>>>>> > >>>>>>> S. 
Barthes
> >>>>>>> On 21/07/2025 09:31, Michel Jouvin wrote:
> >>>>>>>> Stéphane,
> >>>>>>>>
> >>>>>>>> If you are using cephadm, the OS (distribution and version) you use should not matter. When using cephadm with several servers (the general case!), it is important to properly set up the SSH key used by cephadm for communication between nodes (cephadm is sort of an SSH-based management cluster) and to check that you can log in from one node to the other using SSH. Can you confirm that this is the case?
> >>>>>>>>
> >>>>>>>> Also, cephadm has a specific log file. I don't use the dashboard much, so I am not sure how you display it (it may be part of the logs displayed by the dashboard), but you can access it with the command:
> >>>>>>>>
> >>>>>>>> ceph log last debug cephadm
> >>>>>>>>
> >>>>>>>> Michel
> >>>>>>>>
> >>>>>>>> On 21/07/2025 09:19, Stéphane Barthes wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Yes, I did use cephadm to bootstrap the 1st node in the cluster, installed cephadm on the other nodes, and used the dashboard to add the nodes to the cluster.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> S. Barthes
> >>>>>>>>>
> >>>>>>>>> On 21/07/2025 09:12, Michel Jouvin wrote:
> >>>>>>>>>> Hi Stéphane,
> >>>>>>>>>>
> >>>>>>>>>> How did you configure your cluster? Have you been using cephadm? If not, I really advise you to recreate your cluster with cephadm, which includes a script to bootstrap the cluster. In particular, if you don't have detailed knowledge of the Ceph architecture and management, it will ensure that your cluster is properly configured and let you progressively learn about Ceph's details...
> >>>>>>>>>>
> >>>>>>>>>> Best regards.
> >>>>>>>>>>
> >>>>>>>>>> Michel
> >>>>>>>>>>
> >>>>>>>>>> On 21/07/2025 09:02, Stéphane Barthes wrote:
> >>>>>>>>>>> Hello,
> >>>>>>>>>>>
> >>>>>>>>>>> I am very new to ceph and have started a small cluster to get familiar with it.
> >>>>>>>>>>>
> >>>>>>>>>>> But so far my experience is not very impressive, probably through lack of knowledge and good practices.
> >>>>>>>>>>>
> >>>>>>>>>>> I started with Ubuntu 24.04, installed 3 VMs for a ceph cluster, and somehow could not get it running. Adding nodes would fail when adding OSDs, with some weird error (I found it on the web but could not solve the problem).
> >>>>>>>>>>>
> >>>>>>>>>>> I then made a new cluster with 3 Ubuntu 22.04 VMs. Install OK, start OK; I created 1 pool to test storing stuff there and worked my way through crash testing. However, the cluster dies during the weekly VM snapshot. It may not be a good idea to run VM backups on a ceph host, but I find this a little surprising. (Crash testing started earlier than expected.)
> >>>>>>>>>>>
> >>>>>>>>>>> The bottom line is that, after the backup, the cluster is in a warning state with missing mon or logrotate daemons, and sometimes crashed machines. Restarting the service with systemctl, or rebooting the node, usually fixes it.
> >>>>>>>>>>>
> >>>>>>>>>>> I am now stuck in a situation I cannot fix:
> >>>>>>>>>>>
> >>>>>>>>>>> - 1 machine is a ceph rbd client that cannot auth: auth method 'x' error -13. I have tried quite a few things, and none unlocked the situation. I am currently trying to reboot the machine, but the busy/stuck rbd device seems to block it. I am not looking forward to hard resetting it.
> >>>>>>>>>>>
> >>>>>>>>>>> - The node with the mgr service will not restart mon or logrotate.
I did reboot it again today, but I guess this is not how a node is expected to behave.
> >>>>>>>>>>>
> >>>>>>>>>>> So my questions:
> >>>>>>>>>>>
> >>>>>>>>>>> - How can I unlock my stuck ceph client when this kind of error occurs?
> >>>>>>>>>>>
> >>>>>>>>>>> - Is it expected behavior that a client loses access to the cluster, which kind of kills the machine?
> >>>>>>>>>>>
> >>>>>>>>>>> - Where should I look in the ceph nodes' logs to figure out what is going wrong, and how do I fix it so that the cluster runs in a stable manner?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> S. Barthes
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
> >>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
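Two questions in the thread were left open: how to check OSD status without the dashboard, and what the monitor analogue of osd_compact_on_start is. A rough sketch follows; command and option names should be verified against the documentation for your Ceph release:

```shell
# OSD status without the dashboard (run inside `cephadm shell`):
ceph osd stat    # one-line summary: N osds, how many up / in
ceph osd tree    # per-OSD up/down state and CRUSH weight

# The monitor analogue of osd_compact_on_start is mon_compact_on_start,
# which controls compaction of the mon store at daemon startup.
# With a working quorum it can be set centrally:
ceph config set mon mon_compact_on_start false
```

Note that without quorum, `ceph config set` will hang like any other ceph command; in that case the option would have to go into the mon's local configuration (a [mon] section) before restarting the daemon.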