Hi,
Quoting Stéphane Barthes <stephane.bart...@intest.info>:
Hello,
Thank you very much to everyone for helping and giving advice. My
cluster is back online with HEALTH_OK, and it looks like no data
was lost.
I have not been able to convince the cluster to run on 1 mon, as all
ceph/cephadm commands would either hang or fail.
That's an offline operation: you need to stop the MON process(es),
manipulate the monmap of one MON offline, then inject it and
start up the single MON.
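For reference, the offline procedure described above roughly looks like
this; the mon IDs, hostnames and `<fsid>` below are placeholders, so treat
it as a sketch to adapt, not a recipe:

```shell
# Stop every MON first (cephadm runs them as systemd units named
# ceph-<fsid>@mon.<host>; substitute your real fsid/hostnames).
systemctl stop ceph-<fsid>@mon.srvr-ceph-03.service

# Open a shell with access to the mon's data directory:
#   cephadm shell --name mon.srvr-ceph-03

# Extract the current monmap from the surviving MON...
ceph-mon -i srvr-ceph-03 --extract-monmap /tmp/monmap

# ...remove the broken MONs from the map...
monmaptool /tmp/monmap --rm srvr-ceph-01
monmaptool /tmp/monmap --rm srvr-ceph-02

# ...then inject the trimmed map and start the single MON.
ceph-mon -i srvr-ceph-03 --inject-monmap /tmp/monmap
systemctl start ceph-<fsid>@mon.srvr-ceph-03.service
```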
I resorted to restoring a backup of ceph-03, which gave me quorum
long enough to:
- Label ceph-02 & ceph-03 as _admin
- Remove the ceph-01 daemon and redeploy it
During my investigations, I found that ceph-01 was not running ntpd,
so I installed it. But I am under the impression that my problem is
linked to the VM host hardware. We have had some VMs behaving
strangely on this machine recently. I transferred all the Ceph VMs to
another host, and will see if this remains stable over time (even
during the periodic backups).
To answer Michel's question, we are running a small Proxmox v8.4
cluster.
FWIW, the install document indicates you should label mon nodes as
_admin when creating the cluster, and gives examples with the label
in the CLI. When using the GUI, as I did, there is a label option
which is empty by default, and I did not select the _admin label.
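For the record, the CLI forms from the docs look roughly like this
(hostname and IP are examples):

```shell
# Add a new host with the _admin label at join time:
ceph orch host add srvr-ceph-02 10.32.100.23 --labels _admin

# Or label an already-joined host afterwards:
ceph orch host label add srvr-ceph-02 _admin
```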
Best regards,
S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny

On 22/07/2025 at 14:32, Michel Jouvin wrote:
Stéphane,
As you use VMs for your deployment, would it make sense to stop
and keep them, and restart with a new set of VMs, following
https://docs.ceph.com/en/latest/cephadm/install/? I personally
don't have experience expanding the cluster through the dashboard
(it should work! just that I am an old guy, so used to command-line
tools!). Using the command line probably makes it easier to identify
when there is a problem, without always digging in the logs. I
remember following this doc a couple of years ago, the last time I
created a cluster, and it was working as expected.
If you restart something fresh, I'd start with the Squid version
rather than Reef. It should not make any difference for the
installation but will bring you the latest-and-greatest Ceph! Once
you have something running properly, you may compare what is
different in the other configuration. I don't have any opinion on
Ubuntu 24.04 as we are using RedHat (AlmaLinux in fact), but for me
the OS version should not make a big difference when using
cephadm, as Ceph in fact runs in containers. But the devil is in
the details, and Malte may have a reason for the warning...
One thing we didn't mention/check with you in the previous
exchanges is the Ceph network configuration you used. Any
discrepancy in the network configuration, in particular if you
have separate cluster (used only by OSDs) and public (used by
monitors and Ceph clients) networks, may leave one of them not
working as expected and lead to daemons not being able to see
each other. But I think your problem is much more basic: the
daemons were not able to restart after some corruption of their
DBs. That is pretty unusual and may also be the result of
something weird that happened in your VM infrastructure leading
to storage corruption... What are you using to manage your VMs?
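A quick way to spot such a discrepancy, assuming a working quorum, is to
compare the networks Ceph is configured with against what the VMs
actually have (a sketch):

```shell
# Show any public/cluster network settings Ceph knows about...
ceph config dump | grep -i network

# ...and on each host, check that the MON/OSD IPs fall inside them:
ip -brief addr
```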
Michel
On 22/07/2025 at 13:48, Stéphane Barthes wrote:
Michel,
Thank you very much for the help.
I will look into the documentation provided by Eugen to try to
reduce to 1 mon, then remove and re-add the 2 mons on nodes ceph01
and ceph02.
Regarding the installation, I created a VM template with
Ubuntu installed, ready for Ceph, installed cephadm on ceph01,
bootstrapped the cluster, and added the other VMs from the
dashboard. I guess this is all I did to set up the cluster.
Regards,
S. Barthes
On 22/07/2025 at 12:12, Michel Jouvin wrote:
Stéphane,
Basically you cannot do anything in your cluster until you
reach the quorum, except managing it with cephadm to restore a
functioning cluster. If 'ceph -s' doesn't return, it means
you lost the quorum; it is the only reason I'm aware of for this.
As your cluster is quite simple, it should be easy to see the
state of the monitor daemon on each host where one should run,
using `cephadm ls` and/or `podman/docker ps`. And you should
be able to get access to the daemon logs of the monitor
daemons.
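Concretely, something like this on each host (the daemon name below is
an example; take the real one from the `cephadm ls` output):

```shell
# List the daemons cephadm deployed on this host, with their state:
cephadm ls --no-detail

# Show the running Ceph containers (use docker ps with Docker):
podman ps --filter name=ceph

# Fetch the logs of one daemon, e.g. the local MON:
cephadm logs --name mon.srvr-ceph-01
```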
In one of your messages yesterday you reported a log saying the
RocksDB of one of the mons was corrupted. I personally never
saw that, but the first thing to do is to fix it, as it will
prevent the mon from starting. Follow the doc mentioned by Eugen
to reduce your quorum to 1 mon (deleting the 2 broken ones from
the monmap) if necessary (if you don't find a way to start at
least 2 mons). And as said in another message, ensure you added
the label _admin to the hosts where you want to be able to use
the ceph command, else the information required to connect to
the cluster will be missing. It is done with the 'ceph orch host
label add' command, which requires that you have fixed the quorum
issue. One possibility, if you have one healthy mon and manage
to reduce the quorum to 1, is to delete the 2 other mons and
re-add them as new mons so that they are reinitialized. This
way you will not lose anything. Look at the cephadm
documentation to learn how to remove and add daemons.
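Once a single healthy mon has quorum again, that remove/re-add cycle
could look like this (hostnames are examples; double-check against the
cephadm docs before running anything destructive):

```shell
# Drop the two broken MONs so their corrupt stores are discarded:
ceph orch daemon rm mon.srvr-ceph-01 --force
ceph orch daemon rm mon.srvr-ceph-02 --force

# Ask the orchestrator to redeploy fresh MONs on all three hosts:
ceph orch apply mon --placement="srvr-ceph-03,srvr-ceph-01,srvr-ceph-02"
```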
One thing not fully clear to me is how you installed your
different hosts. It seems they are not configured exactly the
same way, as on one host the ceph command is not available
while it is on the other ones. Ceph doesn't need a lot of
things from the OS when using cephadm, but it is pretty
important to ensure that all your Ceph hosts are deployed the
same way/with the same config, else you just add to the
entropy...
I fully agree with you and Eugen that trying to fix things is
a way to learn a lot, but at the same time it is not very easy
to help you with the very limited information we have on what
you did to end up in such a strange situation... So if you don't
manage to converge, maybe it is better to restart from
scratch, following the instructions carefully: you will have
plenty of other occasions to learn anyway!
Michel
On 22/07/2025 at 11:04, Stéphane Barthes wrote:
Hi Michel,
Does this mean I need to recover quorum before any fixing
can happen?
Should I spin up a new VM and add a mon to the cluster via
cephadm? This would give me 2 running mons?
S. Barthes
On 22/07/2025 at 10:39, Michel Jouvin wrote:
Hi Stéphane,
'ceph -s' requires the mon quorum to be reached, else the
Ceph cluster hangs. cephadm is not using the Ceph cluster's
internal communication but builds a management cluster on
top of it, so it can manage the cluster even if the quorum
is lost, but it cannot provide any information that
requires the quorum to be reached.
Michel
On 22/07/2025 at 10:33, Stéphane Barthes wrote:
Hi Malte,
Thanks for your reply. Here is some info:
ceph -s hangs and times out mon hunting after 300s.
But I can run cephadm shell. Is there a similar command
under cephadm shell?
ceph health detail: same as above.
I would like to repair it instead of wiping &
restarting, as it is (from my point of view) a good way to
learn (and there is some data I'd like to recover).
What is the problem with Ubuntu 24? I did not see
warnings regarding this specific version in
https://docs.ceph.com/en/latest/cephadm/install/#cephadm-install-distros
Regards,
S. Barthes
On 22/07/2025 at 10:02, Malte Stroem wrote:
Hello Stéphane,
I think you're mixing up and mismatching a lot!
You always have to show us the output of:
ceph -s
And more! Logs and stuff, e.g.:
ceph health detail
It is clear you missed something here and there.
It is repairable, but since it is a test cluster, just
delete it and start again.
And follow the documentation for cephadm. And do not
use Ubuntu 24.04.
Best,
Malte
On 22.07.25 09:02, Stéphane Barthes wrote:
Hello,
Today, things have degraded a bit more. The ceph-03 mon
has failed and will not restart. It shows the same
kind of checksum error in the rocksdb compact operation
during startup. As a consequence, I lost quorum,
and ceph commands hang.
Would it be wise to disable rocksdb compaction, to
restart and regain quorum? If yes, what is the
exact syntax of the setting in ceph.conf? I have
seen one for OSDs, but am not sure if it would apply:
[osd]
osd_compact_on_start = true
If I can restart, I will try to out the OSDs and
recreate them. Last time I looked, the OSDs seemed fine
in the dashboard. Since I have no dashboard, is
there a command I can use to check their status?
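Without quorum, most cluster-wide commands (`ceph -s`, `ceph osd tree`)
hang, but the per-host view still works; for instance (a sketch):

```shell
# Local daemon inventory and state on this host, no quorum needed:
cephadm ls --no-detail

# Any crashed ceph systemd units on this host:
systemctl --failed
```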
Regards,
S. Barthes
On 21/07/2025 at 14:27, Stéphane Barthes wrote:
Michel,
cephadm shell starts on all 3 nodes without error,
and each host has the same ceph public key entry in
the .ssh/authorized_keys file of the root user.
ceph-01 also has ceph.pub in /etc/ceph with the
same key (this is the node I started the install
from).
ceph-02 has no /etc/ceph folder.
ceph-03 has a /etc/ceph folder, but no ceph.pub
file there.
S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:
Hi Stéphane,
Sorry, I was busy and did not look at your
previous answers... It is a bit difficult for me
to understand how you ended up in this situation,
but for me it is strange that ceph-02 complains
about a missing keyring, and the corrupted RocksDB
on a freshly created cluster is also a bit strange
for me. I don't think it makes sense to destroy
and recreate the OSD; I am running several clusters
with hundreds of OSDs and I never saw a
mis-initialized one. The problem is hiding something
else, I'm afraid. Because of some misconfiguration,
maybe one OSD is in a bad state and may need to be
reinitialized, but first we should get the 3 mons
running properly and `cephadm shell` working
properly on the 3 hosts. And the RocksDB compaction
issue, for me, is related to your mon, not to an OSD.
Have you checked that the SSH configuration for
cephadm is working well from any host to any
other one in your cluster (with 3 hosts, it
should be really straightforward to check)? The
ceph-02 problem may be the sign of an SSH
misconfiguration, as cephadm uses an SSH
connection to push the keyring, if I am
right.
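cephadm's troubleshooting docs suggest testing with the exact key and
SSH config the orchestrator itself uses, roughly like this (hostname is
an example):

```shell
# Dump the SSH config and identity key cephadm uses:
ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 0600 /tmp/cephadm_key

# Try the same connection cephadm would make:
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_key root@srvr-ceph-02 true

# cephadm can also sanity-check a host for you:
ceph cephadm check-host srvr-ceph-02
```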
Michel
On 21/07/2025 at 12:17, Stéphane Barthes wrote:
Hi,
Should I just wipe the OSD and let Ceph
rebuild it (as suggested here:
https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb)?
Would the suggested way be:
cephadm rm-daemon osd.ceph-01
then
cephadm deploy ?
Regards,
S. Barthes
On 21/07/2025 at 10:33, Stéphane Barthes wrote:
Michel,
ceph-02 logs :
root@srvr-ceph-02:/# ceph log last debug
cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1
auth: unable to find a keyring on
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/
ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
2025-07-21T08:16:54.814+0000 7efe1a884640 -1
AuthRegistry(0x7efe14064de0) no keyring
found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
keyring,/etc/ceph/keyring.bin,
disabling cephx
2025-07-21T08:16:54.818+0000 7efe1a884640 -1
auth: unable to find a keyring on
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/
ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
2025-07-21T08:16:54.818+0000 7efe1a884640 -1
AuthRegistry(0x7efe1a883000) no keyring
found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/
keyring,/etc/ceph/keyring.bin,
disabling cephx
2025-07-21T08:16:54.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
2025-07-21T08:16:54.818+0000 7efe18e21640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
2025-07-21T08:16:57.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
2025-07-21T08:16:57.818+0000 7efe13fff640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
2025-07-21T08:17:00.818+0000 7efe137fe640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
2025-07-21T08:17:00.818+0000 7efe18e21640 -1
monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only
support [1]
^CCluster connection aborted
root@srvr-ceph-02:/#
Regarding the ceph-01 log, there is a LOT.
Looking from the end, I see this:
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -19> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 2 auth: KeyRing::load: loaded
key file
/var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -18> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 2 mon.srvr- ceph-01@-1(???) e5
init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -17> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 4 mgrc handle_mgr_map Got map
version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -16> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 4 mgrc handle_mgr_map Active
mgr is now
[v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -15> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 4 mgrc reconnect Starting new
session with
[v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -14> 2025-07-20T17:52:21.137+0000
7f359c206640 -1 mon.srvr- ceph-01@-1(???) e5
handle_auth_bad_method hmm, they
didn't like 2 result (13)
Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -13> 2025-07-20T17:52:21.137+0000
7f359f42d8c0 0 mon.srvr- ceph-01@-1(probing)
e5 my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -12> 2025-07-20T17:52:21.161+0000
7f359d208640 3 rocksdb: [db/db_impl/
db_impl_compaction_flush.cc:3026] Compaction
error: Corruption: block checksum
mismatch: stored = 3368055299,
computed = 2100551158 in
/var/lib/ceph/mon/ceph-srvr-ceph-01/
store.db/061999.sst offset 10379525 size
91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -11> 2025-07-20T17:52:21.161+0000
7f359d208640 4 rocksdb: (Original Log Time
2025/07/20-17:52:21.164193)
[db/compaction/
compaction_job.cc:812] [default] compacted
to: base level 6 level multiplier 10.00 max
bytes base 268435456 files[4 0 0 0 0 0 1]
max score 0.00, MB/sec: 514.9 rd,
272.6 wr, level 6, files in(4, 1)
out(0) MB in(4.0, 14.8) out(9.9),
read-write-amplify(7.2) write-
amplify(2.5) Corruption: block
checksum mismatch: stored = 3368055299,
computed = 2100551158 in
/var/lib/ceph/mon/ceph-srvr-
ceph-01/store.db/061999.sst offset 10379525
size 91317, records in: 25191, records
dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -10> 2025-07-20T17:52:21.161+0000
7f359d208640 4 rocksdb: (Original Log Time
2025/07/20-17:52:21.164212)
EVENT_LOG_v1 {"time_micros":
1753033941164205, "job": 3,
"event": "compaction_finished",
"compaction_time_micros": 38166,
"compaction_time_cpu_micros": 25133,
"output_level": 6, "num_output_files": 0,
"total_output_size": 10404253,
"num_input_records": 25191,
"num_output_records": 21216,
"num_subcompactions": 1,
"output_compression": "NoCompression",
"num_single_delete_mismatches": 0,
"num_single_delete_fallthrough": 0,
"lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -9> 2025-07-20T17:52:21.161+0000
7f359d208640 2 rocksdb: [db/db_impl/
db_impl_compaction_flush.cc:2545] Waiting
after background compaction error:
Corruption: block checksum mismatch:
stored = 3368055299, computed =
2100551158 in
/var/lib/ceph/mon/ceph-srvr-
ceph-01/store.db/061999.sst offset 10379525
size 91317, Accumulated background error
counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -8> 2025-07-20T17:52:21.341+0000
7f359c206640 -1 mon.srvr- ceph-01@0(probing)
e5 handle_auth_bad_method hmm, they
didn't like 2 result (13)
Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -7> 2025-07-20T17:52:21.741+0000
7f359c206640 -1 mon.srvr- ceph-01@0(probing)
e5 handle_auth_bad_method hmm, they
didn't like 2 result (13)
Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -6> 2025-07-20T17:52:21.741+0000
7f359c206640 1 mon.srvr- ceph-01@0(probing)
e5 handle_auth_request failed to
assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -5> 2025-07-20T17:52:21.749+0000
7f35981fe640 5 mon.srvr- ceph-01@0(probing)
e5 _ms_dispatch setting monitor caps
on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -4> 2025-07-20T17:52:21.749+0000
7f35981fe640 1 mon.srvr-
ceph-01@0(synchronizing) e5
sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -3> 2025-07-20T17:52:21.749+0000
7f35981fe640 1 mon.srvr-
ceph-01@0(synchronizing) e5
sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
[285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
PutCF( prefix = mon_sync key =
'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
PutCF( prefix = mon_sync key = 'in_sync'
value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
PutCF( prefix = mon_sync key =
'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug -1> 2025-07-20T17:52:21.749+0000
7f35981fe640 -1 /home/jenkins-build/
build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
In function 'int
MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)'
thread 7f35981fe640 time
2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
/home/jenkins-build/build/
workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/
release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
355: ceph_abort_msg("failed to write to
db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
ceph version 17.2.8
(f817ceb7f187defb1d021d6328fa833eb8e943b3)
quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1:
(ceph::__ceph_abort(char const*, int, char
const*, std::__cxx11::basic_string<char,
std::char_traits<char>,
std::allocator<char> >
const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2:
/usr/bin/ceph- mon(+0x1e968e)
[0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3:
(Monitor::sync_start(entity_addrvec_t&,
bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4:
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5:
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6:
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7:
(Monitor::_ms_dispatch(Message*)+0x7c9)
[0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8:
/usr/bin/ceph- mon(+0x1f6d3e)
[0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9:
(DispatchQueue::entry()+0x53a)
[0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
10: /usr/lib64/ceph/
libceph-common.so.2(+0x3bdea1)
[0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
11: /lib64/ libc.so.6(+0x89e92)
[0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
12: /lib64/ libc.so.6(+0x10ef20)
[0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
debug 0> 2025-07-20T17:52:21.749+0000
7f35981fe640 -1 *** Caught signal (Aborted)
**
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: in
thread 7f35981fe640 thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
ceph version 17.2.8
(f817ceb7f187defb1d021d6328fa833eb8e943b3)
quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 1:
/lib64/ libc.so.6(+0x3e730) [0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 2:
/lib64/ libc.so.6(+0x8bbdc) [0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 3:
raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 4:
abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 5:
(ceph::__ceph_abort(char const*, int, char
const*, std::__cxx11::basic_string<char,
std::char_traits<char>,
std::allocator<char> >
const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 6:
/usr/bin/ceph- mon(+0x1e968e)
[0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 7:
(Monitor::sync_start(entity_addrvec_t&,
bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 8:
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f)
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 9:
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c)
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
10:
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d)
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
11: (Monitor::_ms_dispatch(Message*)+0x7c9)
[0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
12: /usr/bin/ceph- mon(+0x1f6d3e)
[0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
13: (DispatchQueue::entry()+0x53a)
[0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
14: /usr/lib64/ceph/
libceph-common.so.2(+0x3bdea1)
[0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
15: /lib64/ libc.so.6(+0x89e92)
[0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
16: /lib64/ libc.so.6(+0x10ef20)
[0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:
NOTE: a copy of the executable, or `objdump
-rdS <executable>` is needed to
interpret this.
I do not know if the logs are purged of
sensitive data in a way that would prevent
emailing them. Looking for "checksum mismatch"
in the logs, there are many of them
(138).
How can I fix this checksum issue?
Regards,
S. Barthes
On 21/07/2025 at 09:59, Michel Jouvin wrote:
Stéphane,
On ceph-02, I am not sure why the ceph
command is not installed as on the other
nodes, if you installed it the same way.
One way to get access to the ceph command
on this server should be to execute:
cephadm shell
This will start a container where you have
the ceph environment installed and
configured for your cluster.
The situation is not as bad as I thought
reading your first message. You have the
mon quorum, so at least the ceph command
should be usable. The first thing to do is
probably to log on to your ceph-01 node and
try to understand why the mon daemon is
crashing. You may want to run on this
node:
cephadm ls ---> Look for the exact
daemon name corresponding to the mon
cephadm logs --name $daemon_name
Apart from this, it is strange that
ceph-03 reports a RADOS error with 'ceph
log last...'; this probably hides another
issue. Could you tell us what the same
command says on ceph-02 (when run in
cephadm shell)?
Michel
On 21/07/2025 at 09:44, Stéphane Barthes wrote:
Michel,
I ran "ceph log last debug cephadm" on
my 3 nodes, and "mileage varies".
ceph-01:
some errors, and it ends with
2025-07-20T03:24:18.887889+0000
mgr.srvr-ceph-03.dhzbpe (mgr.134360)
1368 : cephadm [INF] Deploying daemon
mon.srvr-ceph-03 on srvr-ceph-03
from when I had to remove the mon daemon and
redeploy on ceph-03.
ceph-02 :
root@srvr-ceph-02:~# ceph log last debug
cephadm
Command 'ceph' not found, but can be
installed with:
snap install microceph # version
18.2.4+snapc9f2b08f92, or
apt install ceph-common # version
17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional
versions.
??? should I install ceph-common ???
ceph-03 :
root@srvr-ceph-03:~# ceph log last debug
cephadm
Error initializing cluster client:
ObjectNotFound('RADOS object not found
(error calling conf_read_file)')
root@srvr-ceph-03:~#
FWIW, ceph health is:
root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s);
1/3 mons down, quorum
srvr-ceph-03,srvr-ceph-02; 10 daemons
have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed
cephadm daemon(s)
daemon mon.srvr-ceph-01 on
srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum
srvr-ceph-03,srvr-ceph-02
mon.srvr-ceph-01 (rank 0) addr
[v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0]
is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have
recently crashed
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:50:10.202091Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:49:47.712267Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:50:21.464475Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:49:36.609442Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:49:58.966663Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:51:36.947240Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:52:21.751711Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:51:48.490875Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:51:59.651129Z
mon.srvr-ceph-01 crashed on host
srvr-ceph-01 at
2025-07-20T17:52:10.552756Z
S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,
If you are using cephadm, the OS
(distribution and version) you use should
not matter. When using cephadm with
several servers (the general case!), it is
important to properly set up the SSH key
used by cephadm for the communication
between nodes (cephadm is sort of an
SSH-based management cluster) and to check
that you can log in from one node to the
other using SSH. Can you confirm that it
is the case?
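If a host turns out to be missing the key, it can be re-pushed with
cephadm's stored public key (hostname is an example):

```shell
# Fetch the public key cephadm uses and install it on the host:
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@srvr-ceph-02
```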
Also, cephadm has a specific log file.
I don't use the dashboard much, so I am
not sure how you display it (it may be
part of the logs displayed by the
dashboard), but you can access it with
the command:
ceph log last debug cephadm
Michel
On 21/07/2025 at 09:19, Stéphane Barthes wrote:
Hi,
Yes, I did use cephadm to bootstrap
the 1st node in the cluster,
installed cephadm on the other
nodes, and used the dashboard to
add the nodes to the cluster.
Regards,
S. Barthes
On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,
How did you configure your cluster?
Have you been using cephadm? If not, I
really advise you to recreate your
cluster with cephadm, which includes a
script to bootstrap the cluster. In
particular, if you don't have detailed
knowledge of Ceph architecture and
management, it will ensure that your
cluster is properly configured and let
you progressively learn about Ceph
details...
Best regards.
Michel
On 21/07/2025 at 09:02, Stéphane Barthes wrote:
Hello,
I am very new to Ceph and have started a small
cluster to get started with it. But so far my
experience is not very impressive, probably for
lack of knowledge and good practices.
I started with Ubuntu 24, installed 3 VMs for
a Ceph cluster, and somehow could not get it
running. Adding nodes would fail adding OSDs
with some weird error (I found it on the web
but could not solve the problem).
I then made a new cluster with 3 Ubuntu 22
VMs. Install OK, start OK; I created 1 pool
to test storing stuff there and work my way
across crash testing. However, the cluster
dies during the weekly VM snapshot. It may
not be a good idea to run VM backups on a
Ceph host, but I find this a little
surprising. (Crash testing started earlier
than expected.)
Bottom line is that, after the backup, the
cluster is in a warning state with missing
mons or logrotate, and sometimes crashed
machines. A systemctl restart of the service
or rebooting the node usually fixes it.
I am now stuck in a situation I cannot fix:
- 1 machine is a ceph rbd client that cannot
auth: auth method 'x' error -13. I have
tried quite a few things, and none unlocked
the situation. I am currently trying to
reboot the machine, but the busy/stuck rbd
device seems to block it. I am not looking
forward to hard resetting it.
- The node with the mgr service will not
restart mon, or logrotate. I did reboot it
again today, but I guess this is not how a
node is expected to behave.
So my questions:
- How can I unlock my stuck ceph client
when this kind of error occurs?
- Is it expected behavior that a client
loses access to the cluster, which kind of
kills the machine?
- Where should I look in the ceph node logs
to figure out what is going wrong, and how
to fix it, so that it runs in a stable
manner?
Regards,
--
S. Barthes
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io