Hi Malte

why " And do not use Ubuntu 24.04." please ?
I just reinstalled my cluster and use 24.04 and 19.2.2. so , if need be,
there is still time to redo / reconfigure

Steven

On Tue, 22 Jul 2025 at 04:05, Malte Stroem <malte.str...@gmail.com> wrote:

> Hello Stéphane,
>
> I think you're mixing up a lot of things!
>
> You always have to show us the output of:
>
> ceph -s
>
> And more! Logs and stuff, e. g.:
>
> ceph health detail
>
> It is clear you missed something here and there.
>
> It is repairable but since it is a test cluster, just delete it and
> start again.
>
> And follow the documentation for cephadm. And do not use Ubuntu 24.04.
>
> Best,
> Malte
>
> On 22.07.25 09:02, Stéphane Barthes wrote:
> > Hello,
> >
> >
> > Today, things have degraded a bit more. The ceph-03 mon has failed and
> > will not restart. It shows the same kind of checksum error in the
> > rocksdb compaction operation during startup. As a consequence, I lost
> > quorum, and ceph commands hang.
> >
> >
> > Would it be wise to disable rocksdb compaction, to restart and regain
> > quorum? If yes, what is the exact syntax of the setting in ceph.conf? I
> > have seen one for the OSD, but I am not sure whether it would apply:
> >
> > [osd]
> >
> > osd_compact_on_start = true
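For what it's worth, the option shown above is OSD-specific. The monitor has a similarly named option, `mon_compact_on_start`, but note that it *triggers* an extra compaction of the mon store at startup rather than disabling compaction; background compaction cannot, to my knowledge, be switched off. A sketch of what the mon-side setting would look like (hypothetical tuning, not a fix for the corruption):

```ini
[mon]
# Skip the explicit mon-store compaction normally requested at monitor
# startup. This does NOT disable RocksDB's background compaction.
mon_compact_on_start = false
```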
> >
> >
> > If I can restart, I will try to take the OSDs out and recreate them.
> > Last time I looked, the OSDs seemed fine in the dashboard. Since I have
> > no dashboard now, is there a command I can use to check their status?
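On checking OSD state without the dashboard: these are the standard status commands of the ceph CLI (run them inside `cephadm shell` if the ceph binary is not installed on the host). Note they need a mon quorum to answer, so they will hang while quorum is lost:

```shell
# Summarize OSD up/in state per host (requires a reachable mon quorum).
ceph osd tree

# Per-OSD usage, PG counts and state, laid out along the CRUSH tree.
ceph osd df tree

# Compact count of OSDs total / up / in.
ceph osd stat
```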
> >
> >
> > Regards,
> >
> > S. Barthes
> > T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
> > InTest S.A.
> > 4 Allée du Levant
> > 69890 La Tour de Salvagny
> >
> > Le 21/07/2025 à 14:27, Stéphane Barthes a écrit :
> >>
> >> Michel,
> >>
> >>
> >> cephadm shell starts on all 3 nodes without error, and each host has
> >> the same ceph public key entry in the .ssh/authorized_keys file of the
> >> root user.
> >>
> >> ceph-01 also has ceph.pub in /etc/ceph with the same key (this is the
> >> node I started the install from)
> >>
> >> ceph-02 has no /etc/ceph folder
> >>
> >> ceph-03 has a /etc/ceph folder, but no ceph.pub file there
> >>
> >>
> >> S. Barthes
> >> Le 21/07/2025 à 12:36, Michel Jouvin a écrit :
> >>> Hi Stéphane,
> >>>
> >>> Sorry, I was busy and did not look at your previous answers... It is
> >>> a bit difficult for me to understand how you ended up in this
> >>> situation, but it is strange that ceph-02 complains about a missing
> >>> keyring, and the corrupted RocksDB on a freshly created cluster is
> >>> also a bit strange to me. I don't think it makes sense to destroy
> >>> and recreate the OSD; I am running several clusters with hundreds of
> >>> OSDs and I never saw a mis-initialized one. I am afraid the problem
> >>> is hiding something else. Because of some misconfiguration, maybe
> >>> one OSD is in a bad state and may need to be reinitialized, but
> >>> first we should get the 3 mons running properly and `cephadm shell`
> >>> working properly on the 3 hosts. And the RocksDB compaction issue
> >>> is, in my view, related to your mon, not to an OSD.
> >>>
> >>> Have you checked that the SSH configuration for cephadm works from
> >>> any host to any other one in your cluster (with 3 hosts, it should
> >>> be really straightforward to check)? The ceph-02 problem may be the
> >>> sign of an SSH misconfiguration, as cephadm uses an SSH connection
> >>> to push the keyring, if I am right.
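The SSH check described here can be sketched as follows. This assumes the standard cephadm layout, where the orchestrator's SSH key must be authorized for root on every host; the host names are this cluster's and the loops are illustrative:

```shell
# Export the SSH public key cephadm uses (stored in the cluster).
ceph cephadm get-pub-key > ~/ceph.pub

# Ensure it is authorized for root on every host (idempotent).
for h in srvr-ceph-01 srvr-ceph-02 srvr-ceph-03; do
    ssh-copy-id -f -i ~/ceph.pub root@"$h"
done

# Ask the orchestrator to verify each host is reachable and suitable.
for h in srvr-ceph-01 srvr-ceph-02 srvr-ceph-03; do
    ceph cephadm check-host "$h"
done
```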
> >>>
> >>> Michel
> >>>
> >>> Le 21/07/2025 à 12:17, Stéphane Barthes a écrit :
> >>>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> Should I just wipe the OSD and let ceph rebuild it (as suggested
> >>>> there:
> >>>> https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb
> >>>> )?
> >>>>
> >>>> Which would be the suggested way:
> >>>>
> >>>> cephadm rm-daemon osd.ceph-01
> >>>>
> >>>> then
> >>>>
> >>>> cephadm deploy ?
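On a cephadm-managed cluster, the usual path for this goes through the orchestrator rather than `cephadm rm-daemon` directly (which is a low-level, per-host tool). A sketch, assuming the OSD to replace is osd.0 and the mons are healthy enough for `ceph orch` to respond:

```shell
# Drain and remove the OSD; --replace keeps the OSD id reserved and
# --zap wipes the backing device once the daemon is removed.
ceph orch osd rm 0 --replace --zap

# Watch removal progress (data is migrated off the OSD first).
ceph orch osd rm status

# With an OSD service spec in place (e.g. all-available-devices), the
# orchestrator redeploys an OSD on the zapped device automatically.
ceph orch apply osd --all-available-devices
```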
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>>
> >>>> S. Barthes
> >>>> T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
> >>>> InTest S.A.
> >>>> 4 Allée du Levant
> >>>> 69890 La Tour de Salvagny
> >>>> Le 21/07/2025 à 10:33, Stéphane Barthes a écrit :
> >>>>>
> >>>>> Michel,
> >>>>>
> >>>>> ceph-02 logs :
> >>>>>
> >>>>> root@srvr-ceph-02:/# ceph log last debug cephadm
> >>>>> 2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a
> >>>>> keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/
> >>>>> ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such
> >>>>> file or directory
> >>>>>
> >>>>> 2025-07-21T08:16:54.814+0000 7efe1a884640 -1
> >>>>> AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/
> >>>>> ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
> >>>>> keyring,/etc/ceph/keyring.bin, disabling cephx
> >>>>>
> >>>>> 2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a
> >>>>> keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/
> >>>>> ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such
> >>>>> file or directory
> >>>>>
> >>>>> 2025-07-21T08:16:54.818+0000 7efe1a884640 -1
> >>>>> AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/
> >>>>> ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
> >>>>> keyring,/etc/ceph/keyring.bin, disabling cephx
> >>>>>
> >>>>> 2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> 2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> 2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> 2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> 2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> 2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting):
> >>>>> handle_auth_bad_method server allowed_methods [2] but i only
> >>>>> support [1]
> >>>>>
> >>>>> ^CCluster connection aborted
> >>>>> root@srvr-ceph-02:/#
> >>>>>
> >>>>>
> >>>>> Regarding the ceph-01 log, there is a LOT. Looking from the end, I
> >>>>> see this:
> >>>>>
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -19>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 auth: KeyRing::load:
> >>>>> loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -18>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 mon.srvr-
> >>>>> ceph-01@-1(???) e5 init
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -17>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map
> >>>>> Got map version 73
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -16>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map
> >>>>> Active mgr is now
> >>>>> [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -15>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc reconnect
> >>>>> Starting new session with
> >>>>> [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -14>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-
> >>>>> ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't like 2
> >>>>> result (13) Permission denied
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -13>
> >>>>> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  0 mon.srvr-
> >>>>> ceph-01@-1(probing) e5  my rank is now 0 (was -1)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -12>
> >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640  3 rocksdb: [db/db_impl/
> >>>>> db_impl_compaction_flush.cc:3026] Compaction error: Corruption:
> >>>>> block checksum mismatch: stored = 3368055299, computed =
> >>>>> 2100551158  in /var/lib/ceph/mon/ceph-srvr-ceph-01/
> >>>>> store.db/061999.sst offset 10379525 size 91317
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -11>
> >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log
> >>>>> Time 2025/07/20-17:52:21.164193) [db/compaction/
> >>>>> compaction_job.cc:812] [default] compacted to: base level 6 level
> >>>>> multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max
> >>>>> score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1)
> >>>>> out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write-
> >>>>> amplify(2.5) Corruption: block checksum mismatch: stored =
> >>>>> 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-
> >>>>> ceph-01/store.db/061999.sst offset 10379525 size 91317, records in:
> >>>>> 25191, records dropped: 3
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -10>
> >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log
> >>>>> Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros":
> >>>>> 1753033941164205, "job": 3, "event": "compaction_finished",
> >>>>> "compaction_time_micros": 38166, "compaction_time_cpu_micros":
> >>>>> 25133, "output_level": 6, "num_output_files": 0,
> >>>>> "total_output_size": 10404253, "num_input_records": 25191,
> >>>>> "num_output_records": 21216, "num_subcompactions": 1,
> >>>>> "output_compression": "NoCompression",
> >>>>> "num_single_delete_mismatches": 0, "num_single_delete_fallthrough":
> >>>>> 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -9>
> >>>>> 2025-07-20T17:52:21.161+0000 7f359d208640  2 rocksdb: [db/db_impl/
> >>>>> db_impl_compaction_flush.cc:2545] Waiting after background
> >>>>> compaction error: Corruption: block checksum mismatch: stored =
> >>>>> 3368055299, computed = 2100551158  in /var/lib/ceph/mon/ceph-srvr-
> >>>>> ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated
> >>>>> background error counts: 1
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -8>
> >>>>> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-
> >>>>> ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like
> >>>>> 2 result (13) Permission denied
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -7>
> >>>>> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-
> >>>>> ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like
> >>>>> 2 result (13) Permission denied
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -6>
> >>>>> 2025-07-20T17:52:21.741+0000 7f359c206640  1 mon.srvr-
> >>>>> ceph-01@0(probing) e5 handle_auth_request failed to assign global_id
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -5>
> >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640  5 mon.srvr-
> >>>>> ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this
> >>>>> connection
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -4>
> >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-
> >>>>> ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -3>
> >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-
> >>>>> ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained
> >>>>> monmap e5
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync
> >>>>> key = 'latest_monmap' value size = 508)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync
> >>>>> key = 'in_sync' value size = 8)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync
> >>>>> key = 'last_committed_floor' value size = 8)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -1>
> >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/
> >>>>> build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
> >>>>> AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
> >>>>> release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
> >>>>> In function 'int
> >>>>> MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)'
> >>>>> thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/
> >>>>> workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
> >>>>> AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
> >>>>> release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
> >>>>> 355: ceph_abort_msg("failed to write to db")
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8
> >>>>> (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1:
> >>>>> (ceph::__ceph_abort(char const*, int, char const*,
> >>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
> >>>>> std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /usr/bin/ceph-
> >>>>> mon(+0x1e968e) [0x55e079c5768e]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3:
> >>>>> (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4:
> >>>>>
> (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
> [0x55e079c90baf]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5:
> >>>>> (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
> >>>>> [0x55e079c925dc]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6:
> >>>>> (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
> >>>>> [0x55e079cad20d]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7:
> >>>>> (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8: /usr/bin/ceph-
> >>>>> mon(+0x1f6d3e) [0x55e079c64d3e]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9:
> >>>>> (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10: /usr/lib64/ceph/
> >>>>> libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11: /lib64/
> >>>>> libc.so.6(+0x89e92) [0x7f359fb6ce92]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /lib64/
> >>>>> libc.so.6(+0x10ef20) [0x7f359fbf1f20]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug      0>
> >>>>> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal
> >>>>> (Aborted) **
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  in thread 7f35981fe640
> >>>>> thread_name:ms_dispatch
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8
> >>>>> (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1: /lib64/
> >>>>> libc.so.6(+0x3e730) [0x7f359fb21730]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /lib64/
> >>>>> libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3: raise()
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4: abort()
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5:
> >>>>> (ceph::__ceph_abort(char const*, int, char const*,
> >>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
> >>>>> std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6: /usr/bin/ceph-
> >>>>> mon(+0x1e968e) [0x55e079c5768e]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7:
> >>>>> (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8:
> >>>>>
> (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
> [0x55e079c90baf]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9:
> >>>>> (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
> >>>>> [0x55e079c925dc]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10:
> >>>>> (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
> >>>>> [0x55e079cad20d]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11:
> >>>>> (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /usr/bin/ceph-
> >>>>> mon(+0x1f6d3e) [0x55e079c64d3e]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  13:
> >>>>> (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  14: /usr/lib64/ceph/
> >>>>> libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  15: /lib64/
> >>>>> libc.so.6(+0x89e92) [0x7f359fb6ce92]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  16: /lib64/
> >>>>> libc.so.6(+0x10ef20) [0x7f359fbf1f20]
> >>>>> Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  NOTE: a copy of the
> >>>>> executable, or `objdump -rdS <executable>` is needed to interpret
> >>>>> this.
> >>>>>
> >>>>> I do not know whether the logs are purged of sensitive data that
> >>>>> would prevent emailing them. Looking for "checksum mismatch" in
> >>>>> the logs, there are many of them (138).
> >>>>>
> >>>>> How can I fix this checksum issue?
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>
> >>>>> S. Barthes
> >>>>>
> >>>>> Le 21/07/2025 à 09:59, Michel Jouvin a écrit :
> >>>>>> Stéphane,
> >>>>>>
> >>>>>> On ceph-02, I am not sure why the ceph command is not installed,
> >>>>>> as it is on the other nodes, if you installed them the same way.
> >>>>>> One way to get access to the ceph command on this server is to
> >>>>>> execute:
> >>>>>>
> >>>>>> cephadm shell
> >>>>>>
> >>>>>> This will start a container where you have the ceph environment
> >>>>>> installed and configured for your cluster.
> >>>>>>
> >>>>>> The situation is not as bad as I thought when reading your first
> >>>>>> message. You have mon quorum, so at least the ceph command should
> >>>>>> be usable. The first thing to do is probably to log on to your
> >>>>>> ceph-01 node and try to understand why the mon daemon is
> >>>>>> crashing. You may want to run on this node:
> >>>>>>
> >>>>>> cephadm ls  ---> look for the exact daemon name corresponding to
> >>>>>> the mon
> >>>>>>
> >>>>>> cephadm logs --name $daemon_name
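The two steps above can be combined; `cephadm ls` emits JSON, so with jq installed (an assumption on my part) the mon's daemon name can be extracted directly. Note that `cephadm logs` identifies the daemon via `--name`:

```shell
# Pull the mon daemon name(s) on this host from cephadm's JSON inventory.
mon_name=$(cephadm ls | jq -r '.[] | select(.name | startswith("mon.")) | .name')

# Show that daemon's journal; arguments after -- are passed through to
# journalctl, e.g. -n to limit the number of lines.
cephadm logs --name "$mon_name" -- -n 200
```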
> >>>>>>
> >>>>>> Apart from this, it is strange that ceph-03 reports a RADOS
> >>>>>> error with 'ceph log last...'; this probably hides another issue.
> >>>>>> Could you tell us what the same command says on ceph-02 (when run
> >>>>>> in a cephadm shell)?
> >>>>>>
> >>>>>> Michel
> >>>>>>
> >>>>>> Le 21/07/2025 à 09:44, Stéphane Barthes a écrit :
> >>>>>>>
> >>>>>>> Michel,
> >>>>>>>
> >>>>>>>
> >>>>>>> I ran "ceph log last debug cephadm" on my 3 nodes, and "mileage
> >>>>>>> varies"
> >>>>>>>
> >>>>>>> ceph-01 :
> >>>>>>>
> >>>>>>> some errors, and it ends with
> >>>>>>>
> >>>>>>> 2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe
> >>>>>>> (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-
> >>>>>>> ceph-03 on srvr-ceph-03
> >>>>>>>
> >>>>>>> when I had to remove the mon daemon and redeploy on ceph-03.
> >>>>>>>
> >>>>>>> ceph-02 :
> >>>>>>>
> >>>>>>> root@srvr-ceph-02:~# ceph log last debug cephadm
> >>>>>>> Command 'ceph' not found, but can be installed with:
> >>>>>>> snap install microceph    # version 18.2.4+snapc9f2b08f92, or
> >>>>>>> apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
> >>>>>>> See 'snap info microceph' for additional versions.
> >>>>>>>
> >>>>>>> ??? should I install ceph-common ???
> >>>>>>>
> >>>>>>> ceph-03 :
> >>>>>>>
> >>>>>>> root@srvr-ceph-03:~# ceph log last debug cephadm
> >>>>>>> Error initializing cluster client: ObjectNotFound('RADOS object
> >>>>>>> not found (error calling conf_read_file)')
> >>>>>>> root@srvr-ceph-03:~#
> >>>>>>>
> >>>>>>> FWIW : ceph health is :
> >>>>>>>
> >>>>>>> root@srvr-ceph-01:~# ceph health detail
> >>>>>>> HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum
> >>>>>>> srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
> >>>>>>> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
> >>>>>>>     daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
> >>>>>>> [WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
> >>>>>>>     mon.srvr-ceph-01 (rank 0) addr
> >>>>>>> [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of
> >>>>>>> quorum)
> >>>>>>> [WRN] RECENT_CRASH: 10 daemons have recently crashed
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:50:10.202091Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:49:47.712267Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:50:21.464475Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:49:36.609442Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:49:58.966663Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:51:36.947240Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:52:21.751711Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:51:48.490875Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:51:59.651129Z
> >>>>>>>     mon.srvr-ceph-01 crashed on host srvr-ceph-01 at
> >>>>>>> 2025-07-20T17:52:10.552756Z
> >>>>>>>
> >>>>>>> S. Barthes
> >>>>>>> Le 21/07/2025 à 09:31, Michel Jouvin a écrit :
> >>>>>>>> Stephane,
> >>>>>>>>
> >>>>>>>> If you are using cephadm, the OS (distribution and version) you
> >>>>>>>> use should not matter. When using cephadm with several servers
> >>>>>>>> (the general case!), it is important to properly set up the SSH
> >>>>>>>> key used by cephadm for communication between nodes (cephadm is
> >>>>>>>> a sort of SSH-based cluster management) and to check that you
> >>>>>>>> can log in from one node to the other using SSH. Can you
> >>>>>>>> confirm that this is the case?
> >>>>>>>>
> >>>>>>>> Also, cephadm has a specific log. I don't use the dashboard
> >>>>>>>> much, so I am not sure how you display it (it may be part of
> >>>>>>>> the logs displayed by the dashboard), but you can access it
> >>>>>>>> with the command:
> >>>>>>>>
> >>>>>>>> ceph log last debug cephadm
> >>>>>>>>
> >>>>>>>> Michel
> >>>>>>>>
> >>>>>>>> Le 21/07/2025 à 09:19, Stéphane Barthes a écrit :
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Yes, I did use cephadm, to bootstrap the 1st node in the
> >>>>>>>>> cluster, installed cephadm on the other nodes, and used the
> >>>>>>>>> dashboard to add the nodes to the cluster.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> S. Barthes
> >>>>>>>>>
> >>>>>>>>> Le 21/07/2025 à 09:12, Michel Jouvin a écrit :
> >>>>>>>>>> Hi Stephane,
> >>>>>>>>>>
> >>>>>>>>>> How did you configure your cluster? Have you been using
> >>>>>>>>>> cephadm? If not, I really advise you to recreate your cluster
> >>>>>>>>>> with cephadm, which includes a script to bootstrap the
> >>>>>>>>>> cluster. In particular, if you don't have detailed knowledge
> >>>>>>>>>> of the Ceph architecture and its management, it will ensure
> >>>>>>>>>> that your cluster is properly configured and let you
> >>>>>>>>>> progressively learn the details of Ceph...
> >>>>>>>>>>
> >>>>>>>>>> Best regards.
> >>>>>>>>>>
> >>>>>>>>>> Michel
> >>>>>>>>>>
> >>>>>>>>>> Le 21/07/2025 à 09:02, Stéphane Barthes a écrit :
> >>>>>>>>>>>
> >>>>>>>>>>> Hello,
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I am very new to Ceph and have started a small cluster to
> >>>>>>>>>>> get started with it.
> >>>>>>>>>>>
> >>>>>>>>>>> But so far my experience is not very impressive, probably
> >>>>>>>>>>> for lack of knowledge and good practices.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I started with Ubuntu 24, installed 3 VMs for a Ceph
> >>>>>>>>>>> cluster, and somehow could not get it running. Adding nodes
> >>>>>>>>>>> would fail when adding OSDs with some weird error (I found
> >>>>>>>>>>> it on the web but could not solve the problem).
> >>>>>>>>>>>
> >>>>>>>>>>> I then made a new cluster with 3 Ubuntu 22 VMs. The install
> >>>>>>>>>>> was OK, startup was OK, and I created 1 pool to test storing
> >>>>>>>>>>> stuff there and work my way through crash testing. However,
> >>>>>>>>>>> the cluster dies during the weekly VM snapshot. It may not
> >>>>>>>>>>> be a good idea to run VM backups on a Ceph host, but I find
> >>>>>>>>>>> this a little surprising. (Crash testing started earlier
> >>>>>>>>>>> than expected.)
> >>>>>>>>>>>
> >>>>>>>>>>> The bottom line is that, after the backup, the cluster is
> >>>>>>>>>>> in a warning state with missing mon or logrotate daemons,
> >>>>>>>>>>> and sometimes crashed machines. Restarting the service with
> >>>>>>>>>>> systemctl, or rebooting the node, usually fixes it.
> >>>>>>>>>>>
> >>>>>>>>>>> I am now stuck in a situation I cannot fix:
> >>>>>>>>>>>
> >>>>>>>>>>>     - One machine, a Ceph RBD client, cannot authenticate:
> >>>>>>>>>>> auth method 'x' error -13. I have tried quite a few things,
> >>>>>>>>>>> and none unlocked the situation. I am currently trying to
> >>>>>>>>>>> reboot the machine, but the busy/stuck RBD device seems to
> >>>>>>>>>>> block it. I am not looking forward to hard resetting it.
> >>>>>>>>>>>
> >>>>>>>>>>>     - The node with the mgr service will not restart the
> >>>>>>>>>>> mon or logrotate daemons. I did reboot it again today, but
> >>>>>>>>>>> I guess this is not how a node is expected to behave.
> >>>>>>>>>>>
> >>>>>>>>>>> So my questions:
> >>>>>>>>>>>
> >>>>>>>>>>>     - How can I unlock my stuck Ceph client when this kind
> >>>>>>>>>>> of error occurs?
> >>>>>>>>>>>
> >>>>>>>>>>>     - Is it expected behavior that a client loses access to
> >>>>>>>>>>> the cluster, which more or less kills the machine?
> >>>>>>>>>>>
> >>>>>>>>>>>     - Where should I look in the Ceph node logs to figure
> >>>>>>>>>>> out what is going wrong, and how do I fix it so that the
> >>>>>>>>>>> cluster runs in a stable manner?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> S. Barthes
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
> >>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io