Hi,


Quoting Stéphane Barthes <stephane.bart...@intest.info>:

Hello,

Thank you very much to everyone for helping and giving advice; my cluster is back up and online with HEALTH_OK, and it looks like no data was lost.

I have not been able to convince the cluster to run on 1 mon, as all ceph/cephadm commands would either hang or fail.

That's an offline operation: you need to stop the MON process(es), manipulate the monmap of one MON offline, then inject it and start up the single MON.
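A minimal sketch of that offline procedure, assuming a cephadm deployment, a surviving mon named mon.srvr-ceph-03 and the default paths (adjust the names to your cluster, and back up the mon store first):

# with all mon daemons stopped, on the host of the surviving mon
cephadm shell --name mon.srvr-ceph-03
ceph-mon -i srvr-ceph-03 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm srvr-ceph-01 --rm srvr-ceph-02
ceph-mon -i srvr-ceph-03 --inject-monmap /tmp/monmap
# exit the shell, then start only that mon again

The authoritative steps are in the Ceph docs under "Removing monitors from an unhealthy cluster".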

I resorted to restoring a backup of ceph-03, which gave me quorum for long enough to:

    - Label ceph-02 & ceph-03 as _admin

    - Remove the ceph-01 mon daemon and redeploy it (example commands below)
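For reference, a sketch of the equivalent CLI for those two steps (hostnames assumed from the rest of the thread; ceph orch daemon add mon may also want host:IP rather than just the hostname):

ceph orch host label add srvr-ceph-02 _admin
ceph orch host label add srvr-ceph-03 _admin
ceph orch daemon rm mon.srvr-ceph-01 --force
ceph orch daemon add mon srvr-ceph-01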

During my investigations, I found that ceph-01 was not running ntpd, so I installed it. But I am under the impression that my problem is linked to the VM host hardware. We have had some VMs behaving strangely on this machine recently. I transferred all ceph VMs to another host, and will see if this remains stable over time (even during periodic backups).

To answer Michel's question, we are running a small Proxmox v8.4 cluster.

FWIW, the install document indicates you should label mon nodes as _admin when creating the cluster, and gives examples with the label in the CLI. When using the GUI, as I did, there is a label option which is empty by default, and I did not select the _admin label.

Best regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 14:32, Michel Jouvin wrote:

Stéphane,

As you use VMs for your deployment, would it make sense to stop and keep them, and restart with a new set of VMs, following https://docs.ceph.com/en/latest/cephadm/install/? I personally don't have experience expanding the cluster through the dashboard (it should work! just that I am an old guy so used to command line tools!). Probably using the command line makes it easier to identify when there is a problem, without always digging in the logs. I remember following this doc a couple of years ago, the last time I created a cluster, and it was working as expected.

If you restart something fresh, I'd start with the Squid version rather than Reef. It should not make any difference for the installation but will bring you the latest-and-greatest Ceph! Once you have something running properly, you may compare what is different in the other configuration. I don't have any opinion on Ubuntu 24 as we are using RedHat (AlmaLinux in fact), but for me the OS version should not make a big difference when using cephadm, as Ceph in fact runs in containers. But the devil is in the details; Malte may have a reason for that warning...

One thing we didn't mention/check with you in the previous exchanges is the Ceph network configuration you used. With any discrepancy in the network configuration, in particular if you have separate cluster (network used only by OSDs) and public (network used by monitors and Ceph clients) networks, you may have a situation where one of them is not working as expected, which may lead to daemons not being able to see each other. But I think your problem is much more basic: the daemons were not able to restart after some corruption of their DB. It is pretty unusual and may also be the result of something weird that happened in your VM infrastructure leading to storage corruption... What are you using to manage your VMs?

Michel

On 22/07/2025 at 13:48, Stéphane Barthes wrote:

Michel,

Thank you very much for the help.

I will look into the documentation provided by Eugen to try to reduce to 1 mon, then remove and re-add the 2 mons on nodes ceph01 and ceph02.

Regarding the installation, I have created a VM template with Ubuntu installed, ready for Ceph.

I installed cephadm on ceph01, bootstrapped the cluster, and added the other VMs from the dashboard. I guess this is all I did to set up the cluster.

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 12:12, Michel Jouvin wrote:

Stéphane,

Basically you cannot do anything in your cluster until you reach the quorum, except managing it with cephadm to restore a functioning cluster. If 'ceph -s' doesn't return, it means you lost the quorum; it is the only reason I'm aware of for this. As your cluster is quite simple, it should be easy to see the state of the monitor daemon on each host where one should run, using `cephadm ls` and/or `podman/docker ps`. And you should be able to get access to the daemon logs of the monitor daemons.

In one of your messages yesterday you reported a log saying the RocksDB of one of the mons was corrupted. I personally never saw that, but the first thing to do is to fix this, as it will prevent the mon from starting. Follow the doc mentioned by Eugen to reduce your quorum to 1 mon (deleting the 2 broken ones from the monmap) if necessary (if you don't find a way to start at least 2 mons). And as said in another message, ensure you added the label _admin to the hosts where you want to be able to use the ceph command, else the required information to connect to the cluster will be missing. It is done with the 'ceph orch host label add' command, which requires that you fixed the quorum issue. One possibility, if you have one healthy mon and you manage to reduce the quorum to 1, is to delete the 2 other mons and re-add them as new mons so that they are reinitialized. This way you will not lose anything. Look at the cephadm documentation to learn how to remove and add daemons.

One thing not fully clear for me is how you installed your different hosts. It seems they are not configured exactly the same way, as on one host the ceph command is not available where it is on the other ones. Ceph doesn't need a lot of things from the OS when using cephadm, but it is pretty important to ensure that all your Ceph hosts are deployed the same way/with the same config, else you just add to the entropy...

I fully agree with you and Eugen that trying to fix things is a way to learn a lot, but at the same time it is not very easy to help you with the very limited information we have on what you did to end up in such a strange situation... So if you don't manage to converge, maybe it is better to restart from scratch, following the instructions carefully: you will have plenty of other occasions to learn anyway!

Michel

On 22/07/2025 at 11:04, Stéphane Barthes wrote:

Hi Michel,

Does this mean I need to recover quorum before any fixing can happen?

Should I kick off a new VM and add a mon to the cluster via cephadm? Would this allow me to have 2 running mons?

S. Barthes
On 22/07/2025 at 10:39, Michel Jouvin wrote:

Hi Stéphane,

'ceph -s' requires the mon quorum to be reached, else the Ceph cluster hangs. cephadm is not using the Ceph cluster internal communication but is building a management cluster on top of it, so it can manage the cluster even if the quorum is lost, but it cannot provide any information that requires the quorum to be reached.
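One way to inspect a mon even without quorum is its admin socket inside the daemon container, for example (daemon name assumed):

cephadm enter --name mon.srvr-ceph-01
ceph daemon mon.srvr-ceph-01 mon_status

This reports the mon's own view (state, rank, monmap) without needing the cluster to respond.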

Michel

On 22/07/2025 at 10:33, Stéphane Barthes wrote:

Hi Malte,

Thanks for your reply. Here is some info:

ceph -s hangs and times out hunting for mons after 300s.

But I can run cephadm shell. Is there a similar command under cephadm shell?

ceph health detail: same as above.

I would like to repair it, instead of wiping & restarting, as it is (from my point of view) a good way to learn (and there is some data I'd like to recover).

What is the problem with Ubuntu 24? I did not see warnings regarding this specific version in

https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 22/07/2025 at 10:02, Malte Stroem wrote:

Hello Stéphane,

I think you're mixing and mismatching a lot!

You always have to show us the output of:

ceph -s

And more! Logs and stuff, e.g.:

ceph health detail

It is clear you missed something here and there.

It is repairable, but since it is a test cluster, just delete it and start again.

And follow the documentation for cephadm. And do not use Ubuntu 24.04.

Best,
Malte

On 22.07.25 09:02, Stéphane Barthes wrote:

Hello,

Today, things have degraded a bit more. The ceph-03 mon has failed and will not restart. It shows the same kind of checksum error in a rocksdb compact operation during startup. As a consequence, I lost quorum, and ceph commands hang.

Would it be wise to disable rocksdb compaction, to restart and get quorum back? If yes, what is the exact syntax of the setting in ceph.conf? I have seen one for OSDs, but I'm not sure if it would apply:

[osd]

osd_compact_on_start = true
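If it helps: the monitor has an analogous option, mon_compact_on_start, which would go under [mon]; note that (like the OSD one) it triggers a compaction at startup rather than disabling compaction:

[mon]
mon_compact_on_start = true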

If I can restart, I will try to out the OSDs and recreate them. Last time I looked, the OSDs seemed fine in the dashboard. Since I have no dashboard, is there a command I can use to check their status?
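For what it's worth, from a working cephadm shell the OSD state can be checked with, e.g.:

ceph osd stat
ceph osd tree

though these also require the mon quorum to answer.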

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny

On 21/07/2025 at 14:27, Stéphane Barthes wrote:

Michel,

cephadm shell starts on all 3 nodes without error, and each host has the same ceph public key entry in the .ssh/authorized_keys file of the root user.

ceph-01 also has ceph.pub in /etc/ceph with the same key (this is the node I started the install from).

ceph-02 has no /etc/ceph folder.

ceph-03 has a /etc/ceph folder, but no ceph.pub file there.

S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:

Hi Stéphane,

Sorry, I was busy and did not look at your previous answers... It is a bit difficult for me to understand how you ended up in this situation, but for me it is strange that ceph-02 complains about a missing keyring, and the corrupted RocksDB on a freshly created cluster is also a bit strange for me. I don't think it makes sense to destroy and recreate the OSD; I am running several clusters with hundreds of OSDs and I never saw a mis-initialized one. The problem is hiding something else, I'm afraid. Because of some misconfiguration, maybe one OSD is in a bad state and may need to be reinitialized, but first we should get the 3 mons running properly and `cephadm shell` working properly on the 3 hosts. And the RocksDB compaction issue for me is related to your mon, not to an OSD.

Have you checked that the SSH configuration for cephadm is working well from any host to any other one in your cluster (with 3 hosts, it should be really straightforward to check)? The ceph-02 problem may be the sign of an SSH misconfiguration, as cephadm will use an SSH connection to push the keyring, if I am right.
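A quick way to check, assuming cephadm's default root SSH access (hostnames as used elsewhere in the thread):

ssh root@srvr-ceph-02 true    # repeat from each host to each of the others
cephadm check-host            # on each host, validates local prerequisites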

Michel

On 21/07/2025 at 12:17, Stéphane Barthes wrote:

Hi,

Should I just wipe the OSD and let ceph rebuild it (as suggested there: https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?

Which would be the suggested way:

cephadm rm-daemon osd.ceph-01

then

cephadm deploy ?
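As a hedged aside: cephadm rm-daemon expects a daemon name such as osd.0 (the numeric OSD id, not the hostname) plus the cluster fsid, and once the orchestrator is reachable again the higher-level equivalents would be something like (device path assumed):

ceph orch daemon rm osd.0 --force
ceph orch daemon add osd srvr-ceph-01:/dev/sdb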

Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 21/07/2025 at 10:33, Stéphane Barthes wrote:

Michel,

ceph-02 logs:

root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory

2025-07-21T08:16:54.814+0000 7efe1a884640 -1 AuthRegistry(0x7efe14064de0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx

2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory

2025-07-21T08:16:54.818+0000 7efe1a884640 -1 AuthRegistry(0x7efe1a883000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx

2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]

^CCluster connection aborted
root@srvr-ceph-02:/#

Regarding the ceph-01 log, there is a LOT. Looking from the end, I see this:

Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -19> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -18> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 2 mon.srvr-ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -17> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Got map version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -16> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc handle_mgr_map Active mgr is now [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -15> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 4 mgrc reconnect Starting new session with [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -14> 2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-ceph-01@-1(???) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -13> 2025-07-20T17:52:21.137+0000 7f359f42d8c0 0 mon.srvr-ceph-01@-1(probing) e5 my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -12> 2025-07-20T17:52:21.161+0000 7f359d208640 3 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:3026] Compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -11> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164193) [db/compaction/compaction_job.cc:812] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files in(4, 1) out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) write-amplify(2.5) Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug -10> 2025-07-20T17:52:21.161+0000 7f359d208640 4 rocksdb: (Original Log Time 2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": 1753033941164205, "job": 3, "event": "compaction_finished", "compaction_time_micros": 38166, "compaction_time_cpu_micros": 25133, "output_level": 6, "num_output_files": 0, "total_output_size": 10404253, "num_input_records": 25191, "num_output_records": 21216, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -9> 2025-07-20T17:52:21.161+0000 7f359d208640 2 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:2545] Waiting after background compaction error: Corruption: block checksum mismatch: stored = 3368055299, computed = 2100551158 in /var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 91317, Accumulated background error counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -8> 2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -7> 2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -6> 2025-07-20T17:52:21.741+0000 7f359c206640 1 mon.srvr-ceph-01@0(probing) e5 handle_auth_request failed to assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -5> 2025-07-20T17:52:21.749+0000 7f35981fe640 5 mon.srvr-ceph-01@0(probing) e5 _ms_dispatch setting monitor caps on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -4> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -3> 2025-07-20T17:52:21.749+0000 7f35981fe640 1 mon.srvr-ceph-01@0(synchronizing) e5 sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug  -1> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug 0> 2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in thread 7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1: /lib64/libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2: /lib64/libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6: /usr/bin/ceph-mon(+0x1e968e) [0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7: (Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) [0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) [0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) [0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 11: (Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 12: /usr/bin/ceph-mon(+0x1f6d3e) [0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 13: (DispatchQueue::entry()+0x53a) [0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 14: /usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 15: /lib64/libc.so.6(+0x89e92) [0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 16: /lib64/libc.so.6(+0x10ef20) [0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I do not know whether the logs are purged of sensitive data that would otherwise prevent emailing them. Looking for "checksum mismatch" in the logs, there are many of them (138).

How can I fix this checksum issue?

Regards,

S. Barthes

On 21/07/2025 at 09:59, Michel Jouvin wrote:

Stéphane,

On ceph-02, I am not sure why the ceph command is not installed as on the other nodes, if you installed it the same way. One way to get access to the ceph command on this server should be to execute:

cephadm shell

This will start a container where you have the ceph environment installed and configured for your cluster.

The situation is not as bad as I thought reading your first message. You have the mon quorum, so at least the ceph command should be usable. The first thing to do is probably to log on to your ceph-01 node and try to understand why the mon daemon is crashing. You may want to run on this node:

cephadm ls  ---> look for the exact daemon name corresponding to the mon

cephadm logs --name $daemon_name

Apart from this, it is strange that ceph-03 reports a RADOS error with 'ceph log last...'; this probably hides another issue. Could you tell what the same command says on ceph-02 (when run in cephadm shell)?

Michel

On 21/07/2025 at 09:44, Stéphane Barthes wrote:

Michel,

I ran "ceph log last debug cephadm" on
           my 3 nodes, and "mileage varies"

ceph-01 :

some errors, and it ends with

2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03

when I had to remove the mon daemon and redeploy it on ceph-03.

ceph-02 :

root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph    # version 18.2.4+snapc9f2b08f92, or
apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.

??? should I install ceph-common ???

ceph-03 :

root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
root@srvr-ceph-03:~#

FWIW, ceph health is:

root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z

S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:

Stephane,

If you are using cephadm, the OS (distribution and version) you use should not matter. When using cephadm with several servers (the general case!), it is important to properly set up the SSH key used by cephadm for the communication between nodes (cephadm is sort of an SSH-based management cluster) and to check that you can log in from one node to the other using SSH. Can you confirm that this is the case?

Also, cephadm has a specific log file. I don't use the dashboard much, so I'm not sure how you display it (it may be part of the logs displayed by the dashboard), but you can access it with the command:

ceph log last debug cephadm

Michel

On 21/07/2025 at 09:19, Stéphane Barthes wrote:

Hi,

Yes, I did use cephadm to bootstrap the 1st node in the cluster, installed cephadm on the other nodes, and used the dashboard to add the nodes to the cluster.

Regards,

S. Barthes

On 21/07/2025 at 09:12, Michel Jouvin wrote:

Hi Stephane,

How did you configure your cluster? Have you been using cephadm? If not, I really advise you to recreate your cluster with cephadm, which includes a script to bootstrap the cluster. In particular, if you don't have detailed knowledge about Ceph architecture and management, it will ensure that your cluster is properly configured and let you progressively learn about Ceph details...

Best regards.

Michel

On 21/07/2025 at 09:02, Stéphane Barthes wrote:

Hello,

I am very new to Ceph and have started a small cluster to get started with it.

But so far my experience is not very impressive, probably due to lack of knowledge and good practices.

I started with Ubuntu 24, installed 3 VMs for a ceph cluster, and somehow could not get it running. Adding nodes would fail adding OSDs with some weird error (I found it on the web but could not solve the problem).

I then made a new cluster with 3 Ubuntu 22 VMs. Install OK, start OK; I created 1 pool to test storing stuff there and to work my way through crash testing. However, the cluster dies during the weekly VM snapshot. It may not be a good idea to run VM backups on a ceph host, but I find this a little surprising. (Crash testing started earlier than expected.)

Bottom line is that, after the backup, the cluster is in warning state with missing mons or logrotate issues, and sometimes crashed machines. A systemctl restart of the service, or rebooting the node, usually fixes it.

I am now stuck in a situation I cannot fix:

    - 1 machine is a ceph rbd client and cannot auth: auth method 'x' error -13. I have tried quite a few things, and none unlocked the situation. I am currently trying to reboot the machine, but the busy/stuck rbd device seems to block it. I am not looking forward to hard resetting it.

    - The node with the mgr service will not restart mon, or logrotate. I did reboot it again today, but I guess this is not how a node is expected to behave.

So my questions:

    - How can I unlock my stuck ceph client when this kind of error occurs?

    - Is it expected behavior that the client loses access to the cluster, which kind of kills the machine?

    - Where should I look in the ceph node logs to figure out what is going wrong, and how to fix it so that it runs in a stable manner?

Regards,

-- 
S. Barthes


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
