Hi,

I agree, trying to fix a broken test cluster is absolutely helpful. I recommend reading the docs [0], especially [1] and [2]. For [2] you'll have to adapt the commands to the cephadm shell, since it's still written for non-cephadm clusters. But there are threads on this list that cover those steps for cephadm clusters as well.

But before you do that, you could also try to edit the monmap so that only one MON remains, which (hopefully) starts successfully; from there you could get the cluster back to a working state. There's an example procedure in the docs [3] describing how to change the monitor IPs (which you don't need here), but it can be used as a guide for modifying the monmap in order to reduce it to a single MON.
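A rough sketch of that monmap surgery, under the assumption that one mon's store is still intact (the mon/host names below are placeholders for your surviving mon; with cephadm you'd run this inside the mon's container or a cephadm shell, with all mon daemons stopped first):

```shell
# Sketch only -- stop all mon daemons before touching the store.
# Extract the current monmap from the surviving mon's store:
ceph-mon -i srvr-ceph-02 --extract-monmap /tmp/monmap

# Inspect it, then remove the two broken mons from it:
monmaptool --print /tmp/monmap
monmaptool --rm srvr-ceph-01 /tmp/monmap
monmaptool --rm srvr-ceph-03 /tmp/monmap

# Inject the reduced monmap back and start that mon again:
ceph-mon -i srvr-ceph-02 --inject-monmap /tmp/monmap
```

With a single-MON monmap injected, that mon can form quorum on its own, and you can redeploy the others afterwards.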

Regards,
Eugen

[0] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
[1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures
[2] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
[3] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#example-procedure

Quoting Stéphane Barthes <stephane.bart...@intest.info>:

Hi Malte,

Thanks for your reply. Here is some info:

ceph -s hangs and times out mon-hunting after 300s.

But I can run cephadm shell. Is there a similar command under the cephadm shell?

ceph health detail : same as above.

I would like to repair it, instead of wiping and restarting, as that is (from my point of view) a good way to learn (and there is some data I'd like to recover).

What is the problem with Ubuntu 24.04? I did not see warnings regarding this specific version in
https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros

Regards,
S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny

On 22/07/2025 at 10:02, Malte Stroem wrote:

Hello Stéphane,

I think you're mixing up a lot of things!

You always have to show us the output of:

ceph -s

And more! Logs and so on, e.g.:

ceph health detail

It is clear you missed something here and there.

It is repairable but since it is a test cluster, just delete it
and start again.

And follow the documentation for cephadm. And do not use Ubuntu
24.04.

Best,
Malte

On 22.07.25 09:02, Stéphane Barthes wrote:

Hello,

Today, things have degraded a bit more. The ceph-03 mon has failed and will not restart. It shows the same kind of checksum error in the rocksdb compaction operation during startup. As a consequence, I lost quorum, and ceph commands hang.

Would it be wise to disable the rocksdb compaction, to restart and regain quorum? If yes, what is the exact syntax of the setting in ceph.conf? I have seen one for OSDs, but I'm not sure if it would apply:

[osd]

osd_compact_on_start = true
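For what it's worth, the mon has an analogous option, `mon_compact_on_start` (default false). Note, though, that as far as I understand this only controls the compaction that Ceph triggers at daemon startup; RocksDB's own background compactions are not governed by it, so it may not avoid the crash seen in the logs. A hedged sketch of the ceph.conf fragment:

```ini
[mon]
mon_compact_on_start = false
```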

If I can restart, I will try to mark the OSDs out, and recreate them. Last time I looked, the OSDs seemed fine in the dashboard. Since I have no dashboard, is there a command I can use to check their status?
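For reference, a few standard commands that show OSD state without the dashboard (a sketch, run inside cephadm shell once a mon quorum is back):

```shell
ceph osd stat    # how many OSDs exist / are up / are in
ceph osd tree    # per-OSD up/down state and CRUSH placement
ceph osd df      # per-OSD utilization
```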

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny

On 21/07/2025 at 14:27, Stéphane Barthes wrote:

Michel,

cephadm shell starts on all 3 nodes without error, and each host has the same ceph public key entry in the .ssh/authorized_keys file of the root user.

ceph-01 also has ceph.pub in /etc/ceph with the same key (this is the node I started the install from)

ceph-02 has no /etc/ceph folder

ceph-03 has a /etc/ceph folder, but no ceph.pub file there

S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:

Hi Stéphane,

Sorry, I was busy and did not look at your previous answers... It is a bit difficult for me to understand how you ended up in this situation, but it is strange to me that ceph-02 complains about a missing keyring, and the corrupted RocksDB on a freshly created cluster is also a bit strange. I don't think it makes sense to destroy and recreate the OSD; I am running several clusters with hundreds of OSDs and I never saw a mis-initialized one. The problem is hiding something else, I'm afraid. Because of some misconfiguration, maybe one OSD is in a bad state and may need to be reinitialized, but first we should get the 3 mons running properly and `cephadm shell` working properly on the 3 hosts. And the RocksDB compaction issue is, for me, related to your mon, not to an OSD.

Have you checked that the SSH configuration for cephadm is working well from any host to any other one in your cluster (with 3 hosts, it should be really straightforward to check)? The ceph-02 problem may be the sign of an SSH misconfiguration, as cephadm will use SSH connections to push the keyring, if I am right.
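A quick way to verify that, assuming cephadm was bootstrapped with root SSH access (hostnames are the ones from this thread; `check-host` is the orchestrator's own host validation):

```shell
# From the bootstrap node (ceph-01), test plain SSH to each peer:
ssh root@srvr-ceph-02 true && echo "ssh to srvr-ceph-02 OK"
ssh root@srvr-ceph-03 true && echo "ssh to srvr-ceph-03 OK"

# And let cephadm validate the hosts itself (inside cephadm shell):
ceph cephadm check-host srvr-ceph-02
ceph cephadm check-host srvr-ceph-03
```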

Michel

On 21/07/2025 at 12:17, Stéphane Barthes wrote:

Hi,

Should I just wipe the OSD and let Ceph rebuild it (as suggested here: https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?

Which would be the suggested way:

cephadm rm-daemon osd.ceph-01

then

cephadm deploy ?
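For a cephadm-managed cluster, the orchestrator path is probably safer than raw `cephadm rm-daemon`/`deploy`. A sketch (the OSD id here is a placeholder; `--zap` availability depends on your release):

```shell
ceph orch osd rm 0 --zap    # drain the OSD, remove it, zap its device
ceph orch osd rm status     # watch removal progress
# With an OSD service spec in place, cephadm will normally recreate an
# OSD on the freed device automatically.
```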

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 21/07/2025 at 10:33, Stéphane Barthes wrote:

Michel,

ceph-02 logs :

root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-07-21T08:16:54.818+0000 7efe1a884640 -1 AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
^CCluster connection aborted
root@srvr-ceph-02:/#

Regarding the ceph-01 log, there is a LOT. Looking from the end, I see this:

Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -19> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -18> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 mon.srvr-ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -17> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map Got map version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -16> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map Active mgr is now [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -15> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc reconnect Starting new session with [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -14> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -13> 2025-07-20T17:52:21.137+0000 7f359f42d8c0  0 mon.srvr-ceph-01@-1(probing) e5  my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -12> 2025-07-20T17:52:21.161+0000 7f359d208640  3 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:3026] Compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158  in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -11> 2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164193) [db/compaction/compaction_job.cc:812] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1) out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write-amplify(2.5) Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -10> 2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": 1753033941164205, "job": 3, "event": "compaction_finished", "compaction_time_micros": 38166, "compaction_time_cpu_micros": 25133, "output_level": 6, "num_output_files": 0, "total_output_size": 10404253, "num_input_records": 25191, "num_output_records": 21216, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -9> 2025-07-20T17:52:21.161+0000 7f359d208640  2 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:2545] Waiting after background compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158  in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated background error counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -8> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -7> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -6> 2025-07-20T17:52:21.741+0000 7f359c206640  1 mon.srvr-ceph-01@0(probing) e5 handle_auth_request failed to assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -5> 2025-07-20T17:52:21.749+0000 7f35981fe640  5 mon.srvr-ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -4> 2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -3> 2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -1> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug      0> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  in thread 7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1: /lib64/libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /lib64/libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  13: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  14: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  15: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  16: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I do not know if the logs are purged of sensitive data in a way that would allow emailing them. Looking for "checksum mismatch" in the logs, there are many of them (138).

How can I fix this checksum issue?

Regards,

S. Barthes

On 21/07/2025 at 09:59, Michel Jouvin wrote:

Stéphane,

On ceph-02, I am not sure why the ceph command is not installed as on the other nodes, if you installed it the same way. One way to get access to the ceph command on this server should be to execute:

cephadm shell

This will start a container where you have the ceph environment installed and configured for your cluster.

The situation is not as bad as I thought when reading your first message. You have the mon quorum, so at least the ceph command should be usable. The first thing to do is probably to log on to your ceph-01 node and try to understand why the mon daemon is crashing. You may want to run on this node:

cephadm ls  ---> look for the exact daemon name corresponding to the mon

cephadm logs --daemon $daemon_name

Apart from this, it is strange that ceph-03 reports a RADOS error with 'ceph log last...'; this probably hides another issue. Could you tell me what the same command says on ceph-02 (when run in cephadm shell)?

Michel

On 21/07/2025 at 09:44, Stéphane Barthes wrote:

Michel,

I ran "ceph log last debug cephadm" on my 3 nodes, and "mileage varies":

ceph-01 :

some errors, and it ends with

2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03

when I had to remove the mon daemon and redeploy on ceph-03.

ceph-02 :

root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph    # version 18.2.4+snapc9f2b08f92, or
apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.

??? should I install ceph-common ???

ceph-03 :

root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client:
ObjectNotFound('RADOS object not found (error
calling conf_read_file)')
root@srvr-ceph-03:~#

FWIW, ceph health is:

root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z

S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:

Stephane,

If you are using cephadm, the OS (distribution and version) you use should not matter. When using cephadm with several servers (the general case!), it is important to set up properly the SSH key used by cephadm for the communication between nodes (cephadm is sort of an SSH-based management cluster) and to check that you can log in from one node to the other using SSH. Can you confirm that this is the case?

Also, cephadm has a specific log file. I don't use the dashboard much, so I'm not sure how you display it (it may be part of the logs displayed by the dashboard), but you can access it with the command:

ceph log last debug cephadm

Michel

On 21/07/2025 at 09:19, Stéphane Barthes wrote:

Hi,

Yes, I did use cephadm to bootstrap the 1st node in the cluster, installed cephadm on the other nodes, and used the dashboard to add the nodes to the cluster.

Regards,

S. Barthes

On 21/07/2025 at 09:12, Michel Jouvin wrote:

Hi Stephane,

How did you configure your cluster? Have you been using cephadm? If not, I really advise you to recreate your cluster with cephadm, which includes a script to bootstrap the cluster. In particular, if you don't have detailed knowledge about Ceph architecture and management, it will ensure that your cluster is properly configured and let you progressively learn about Ceph details...

Best regards.

Michel

On 21/07/2025 at 09:02, Stéphane Barthes wrote:

Hello,

I am very new to Ceph and have started a small cluster to get started with it.

But so far my experience is not very impressive, probably through lack of knowledge and good practices.

I started with Ubuntu 24, installed 3 VMs for a ceph cluster, and somehow could not get it running. Adding nodes would fail when adding OSDs with some weird error (I found it on the web but could not solve the problem).

I then made a new cluster with 3 Ubuntu 22 VMs. Install OK, start OK; I created 1 pool to test storing stuff there and worked my way towards crash testing. However, the cluster dies during the weekly VM snapshot. It may not be a good idea to run VM backups on a ceph host, but I find this a little surprising. (Crash testing started earlier than expected.)

Bottom line is that, after the backup, the cluster is in warning state with missing mons or logrotate, and sometimes crashed machines. systemctl restart of the service, or rebooting the node, usually fixes it.

I am now stuck in a situation I cannot fix:

    - 1 machine is a ceph rbd client that cannot auth: auth method 'x' error -13. I have tried quite a few things, and none unlocked the situation. I am currently trying to reboot the machine, but the busy/stuck rbd device seems to block it. I am not looking forward to hard resetting it.

    - The node with the mgr service will not restart mon, or logrotate. I did reboot it again today, but I guess this is not how a node is expected to behave.

So my questions:

    - How can I unlock my stuck ceph client when this kind of error occurs?

    - Is it expected behavior that the client loses access to the cluster, which kind of kills the machine?

    - Where should I look in the ceph node logs to figure out what is going wrong, and how to fix it, so that it runs in a stable manner?

Regards,

-- 
S. Barthes

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
