Hi,


> Last time I saw the OSD seemed fine in the dashboard. Since I have no
> dashboard, is there a command I can use to check their status?


ceph osd status
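
A few related commands may also help (note that all of them need mon quorum, so they will hang while your mons are down):

```shell
# Other ways to inspect OSD state from the CLI:
ceph osd tree    # CRUSH tree with up/down and in/out state per OSD
ceph osd df      # per-OSD utilization and PG counts
ceph -s          # overall cluster status, including OSD up/in counts
```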


>root@srvr-ceph-02:~# ceph log last debug cephadm
>Command 'ceph' not found, but can be installed with:
>snap install microceph    # version 18.2.4+snapc9f2b08f92, or
>apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
>See 'snap info microceph' for additional versions.


If you want to run ceph commands outside of the cephadm shell, you'll need to 
install ceph-common
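
On Ubuntu 22.04 that would look something like the sketch below. Note that the bare binary is not enough: the client also needs the cluster config and a keyring in /etc/ceph, e.g. copied from an admin node:

```shell
apt install ceph-common

# ceph also needs these files before commands will work outside the
# cephadm container (these are the default lookup paths):
#   /etc/ceph/ceph.conf
#   /etc/ceph/ceph.client.admin.keyring
ceph -s
```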


> ceph-02 and ceph-03


To use ceph/cephadm commands on those nodes, you have to label them as "_admin"
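
Assuming the orchestrator still answers from a working admin node (ceph-01 in your case), the labeling step would look like this; hostnames are taken from your `ceph health detail` output:

```shell
# The "_admin" label makes cephadm distribute ceph.conf and the admin
# keyring to /etc/ceph on the labeled hosts.
ceph orch host label add srvr-ceph-02 _admin
ceph orch host label add srvr-ceph-03 _admin

# Verify the labels and that the hosts are managed:
ceph orch host ls
```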


Vivien



________________________________
From: Stéphane Barthes <stephane.bart...@intest.info>
Sent: Tuesday, 22 July 2025 08:02:40
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Newby woes with ceph


Hello,


Today, things have degraded a bit more. The ceph-03 mon has failed and will not 
restart; it shows the same kind of checksum error during the RocksDB compaction 
at startup. As a consequence, I have lost quorum, and ceph commands hang.


Would it be wise to disable the RocksDB compaction, restart, and regain quorum? 
If yes, what is the exact syntax of the setting in ceph.conf? I have seen one 
for OSDs, but am not sure whether it would apply:

[osd]

osd_compact_on_start = true


If I can restart, I will try to mark the OSDs out and recreate them. Last time 
I looked, the OSDs seemed fine in the dashboard. Since I have no dashboard, is 
there a command I can use to check their status?


Regards,

S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny

On 21/07/2025 at 14:27, Stéphane Barthes wrote:

Michel,


cephadm shell starts on all 3 nodes without error, and each host has the same 
ceph public key entry in the root user's .ssh/authorized_keys file.

ceph-01 also has ceph.pub in /etc/ceph with the same key (this is the node I 
started the install from)

ceph-02 has no /etc/ceph folder.

ceph-03 has a /etc/ceph folder, but no ceph.pub file there.


S. Barthes
On 21/07/2025 at 12:36, Michel Jouvin wrote:

Hi Stéphane,

Sorry, I was busy and did not look at your previous answers... It is a bit 
difficult for me to understand how you ended up in this situation, but it is 
strange that ceph-02 complains about a missing keyring, and a corrupted RocksDB 
on a freshly created cluster is also a bit strange to me. I don't think it 
makes sense to destroy and recreate the OSD; I am running several clusters with 
hundreds of OSDs and I have never seen a mis-initialized one. I'm afraid the 
problem is hiding something else. Because of some misconfiguration, maybe one 
OSD is in a bad state and may need to be reinitialized, but first we should get 
the 3 mons running properly and `cephadm shell` working properly on the 3 
hosts. And the RocksDB compaction issue, as far as I can tell, is related to 
your mon, not to an OSD.

Have you checked that the SSH configuration for cephadm works from any host to 
any other one in your cluster (with 3 hosts, it should be really 
straightforward to check)? The ceph-02 problem may be the sign of an SSH 
misconfiguration, as cephadm uses an SSH connection to push the keyring, if I 
am right.

Michel

On 21/07/2025 at 12:17, Stéphane Barthes wrote:

Hi,


Should I just wipe the OSD and let ceph rebuild it, as suggested here: 
https://ceph-users.ceph.narkive.com/LO6ebu9r/how-to-recover-from-corrupted-rocksdb
 ?

What would the suggested way be:

cephadm rm-daemon osd.ceph-01

then

cephadm deploy ?


Regards,


S. Barthes
T: +33 4 72 52 35 40 M: +33 6 14 73 18 34
InTest S.A.
4 Allée du Levant
69890 La Tour de Salvagny
On 21/07/2025 at 10:33, Stéphane Barthes wrote:

Michel,

ceph-02 logs :

root@srvr-ceph-02:/# ceph log last debug cephadm
2025-07-21T08:16:54.814+0000 7efe1a884640 -1 auth: unable to find a keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
 (2) No such file or directory

2025-07-21T08:16:54.814+0000 7efe1a884640 -1 AuthRegistry(0x7efe14064de0) no 
keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
 disabling cephx

2025-07-21T08:16:54.818+0000 7efe1a884640 -1 auth: unable to find a keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
 (2) No such file or directory

2025-07-21T08:16:54.818+0000 7efe1a884640 -1 AuthRegistry(0x7efe1a883000) no 
keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
 disabling cephx

2025-07-21T08:16:54.818+0000 7efe137fe640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:54.818+0000 7efe18e21640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:57.818+0000 7efe137fe640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:16:57.818+0000 7efe13fff640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:17:00.818+0000 7efe137fe640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

2025-07-21T08:17:00.818+0000 7efe18e21640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]

^CCluster connection aborted
root@srvr-ceph-02:/#


Regarding the ceph-01 log, there is a LOT. Looking from the end, I see this:

Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -19> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 auth: KeyRing::load: loaded key 
file /var/lib/ceph/mon/ceph-srvr-ceph-01/keyring
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -18> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  2 mon.srvr-ceph-01@-1(???) e5 init
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -17> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map Got map 
version 73
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -16> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc handle_mgr_map Active mgr is 
now [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -15> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  4 mgrc reconnect Starting new 
session with [v2:10.32.100.24:6800/4015208663,v1:10.32.100.24:6801/4015208663]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -14> 
2025-07-20T17:52:21.137+0000 7f359c206640 -1 mon.srvr-ceph-01@-1(???) e5 
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -13> 
2025-07-20T17:52:21.137+0000 7f359f42d8c0  0 mon.srvr-ceph-01@-1(probing) e5  
my rank is now 0 (was -1)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -12> 
2025-07-20T17:52:21.161+0000 7f359d208640  3 rocksdb: 
[db/db_impl/db_impl_compaction_flush.cc:3026] Compaction error: Corruption: 
block checksum mismatch: stored = 3368055299, computed = 2100551158  in 
/var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 
91317
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -11> 
2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log Time 
2025/07/20-17:52:21.164193) [db/compaction/compaction_job.cc:812] [default] 
compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 
files[4 0 0 0 0 0 1] max score 0.00, MB/sec: 514.9 rd, 272.6 wr, level 6, files 
in(4, 1) out(0) MB in(4.0, 14.8) out(9.9), read-write-amplify(7.2) 
write-amplify(2.5) Corruption: block checksum mismatch: stored = 3368055299, 
computed = 2100551158 in 
/var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 
91317, records in: 25191, records dropped: 3
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug    -10> 
2025-07-20T17:52:21.161+0000 7f359d208640  4 rocksdb: (Original Log Time 
2025/07/20-17:52:21.164212) EVENT_LOG_v1 {"time_micros": 1753033941164205, 
"job": 3, "event": "compaction_finished", "compaction_time_micros": 38166, 
"compaction_time_cpu_micros": 25133, "output_level": 6, "num_output_files": 0, 
"total_output_size": 10404253, "num_input_records": 25191, 
"num_output_records": 21216, "num_subcompactions": 1, "output_compression": 
"NoCompression", "num_single_delete_mismatches": 0, 
"num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 1]}
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -9> 
2025-07-20T17:52:21.161+0000 7f359d208640  2 rocksdb: 
[db/db_impl/db_impl_compaction_flush.cc:2545] Waiting after background 
compaction error: Corruption: block checksum mismatch: stored = 3368055299, 
computed = 2100551158  in 
/var/lib/ceph/mon/ceph-srvr-ceph-01/store.db/061999.sst offset 10379525 size 
91317, Accumulated background error counts: 1
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -8> 
2025-07-20T17:52:21.341+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -7> 
2025-07-20T17:52:21.741+0000 7f359c206640 -1 mon.srvr-ceph-01@0(probing) e5 
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -6> 
2025-07-20T17:52:21.741+0000 7f359c206640  1 mon.srvr-ceph-01@0(probing) e5 
handle_auth_request failed to assign global_id
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -5> 
2025-07-20T17:52:21.749+0000 7f35981fe640  5 mon.srvr-ceph-01@0(probing) e5 
_ms_dispatch setting monitor caps on this connection
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -4> 
2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-ceph-01@0(synchronizing) 
e5 sync_obtain_latest_monmap
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -3> 
2025-07-20T17:52:21.749+0000 7f35981fe640  1 mon.srvr-ceph-01@0(synchronizing) 
e5 sync_obtain_latest_monmap obtained monmap e5
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: [285B blob data]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 
'latest_monmap' value size = 508)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 
'in_sync' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: PutCF( prefix = mon_sync key = 
'last_committed_floor' value size = 8)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug     -1> 
2025-07-20T17:52:21.749+0000 7f35981fe640 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
 In function 'int 
MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 
7f35981fe640 time 2025-07-20T17:52:21.750611+0000
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/mon/MonitorDBStore.h:
 355: ceph_abort_msg("failed to write to db")
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8 
(f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1: (ceph::__ceph_abort(char const*, 
int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&)+0xd3) [0x7f35a03a5469]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /usr/bin/ceph-mon(+0x1e968e) 
[0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3: 
(Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4: 
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) 
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5: 
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) 
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6: 
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) 
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7: 
(Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8: /usr/bin/ceph-mon(+0x1f6d3e) 
[0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9: (DispatchQueue::entry()+0x53a) 
[0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10: 
/usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11: /lib64/libc.so.6(+0x89e92) 
[0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /lib64/libc.so.6(+0x10ef20) 
[0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]: debug      0> 
2025-07-20T17:52:21.749+0000 7f35981fe640 -1 *** Caught signal (Aborted) **
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  in thread 7f35981fe640 
thread_name:ms_dispatch
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  ceph version 17.2.8 
(f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  1: /lib64/libc.so.6(+0x3e730) 
[0x7f359fb21730]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  2: /lib64/libc.so.6(+0x8bbdc) 
[0x7f359fb6ebdc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  3: raise()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  4: abort()
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  5: (ceph::__ceph_abort(char const*, 
int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&)+0x190) [0x7f35a03a5526]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  6: /usr/bin/ceph-mon(+0x1e968e) 
[0x55e079c5768e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  7: 
(Monitor::sync_start(entity_addrvec_t&, bool)+0x3b5) [0x55e079c8a145]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  8: 
(Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0x83f) 
[0x55e079c90baf]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  9: 
(Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x36c) 
[0x55e079c925dc]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  10: 
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x112d) 
[0x55e079cad20d]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  11: 
(Monitor::_ms_dispatch(Message*)+0x7c9) [0x55e079cadd89]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  12: /usr/bin/ceph-mon(+0x1f6d3e) 
[0x55e079c64d3e]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  13: (DispatchQueue::entry()+0x53a) 
[0x7f35a058d34a]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  14: 
/usr/lib64/ceph/libceph-common.so.2(+0x3bdea1) [0x7f35a061eea1]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  15: /lib64/libc.so.6(+0x89e92) 
[0x7f359fb6ce92]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  16: /lib64/libc.so.6(+0x10ef20) 
[0x7f359fbf1f20]
Jul 20 17:52:21 srvr-ceph-01 bash[3424]:  NOTE: a copy of the executable, or 
`objdump -rdS <executable>` is needed to interpret this.

I do not know whether the logs contain sensitive data that would prevent 
emailing them. Looking for "checksum mismatch" in the logs, there are many 
occurrences (138).

How can I fix this checksum issue?


Regards,


S. Barthes

On 21/07/2025 at 09:59, Michel Jouvin wrote:
Stéphane,

On ceph-02, I am not sure why the ceph command is not installed as it is on the 
other nodes, if you installed them the same way. One way to get access to the 
ceph command on this server is to execute:

cephadm shell

This will start a container where you have the ceph environment installed and 
configured for your cluster.

The situation is not as bad as I thought from reading your first message. You 
have mon quorum, so at least the ceph command should be usable. The first thing 
to do is probably to log on to your ceph-01 node and try to understand why the 
mon daemon is crashing. You may want to run, on this node:

cephadm ls  ---> Look for the exact daemon name corresponding to the mon

cephadm logs --daemon $daemon_name

Apart from this, it is strange that ceph-03 reports a RADOS error with 'ceph 
log last...'; this probably hides another issue. Could you tell me what the 
same command says on ceph-02 (when run in cephadm shell)?

Michel

On 21/07/2025 at 09:44, Stéphane Barthes wrote:

Michel,


I ran "ceph log last debug cephadm" on my 3 nodes, and "mileage varies"

ceph-01 :

Some errors, and it ends with:

2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : 
cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03

from when I had to remove the mon daemon and redeploy it on ceph-03.

ceph-02 :

root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph    # version 18.2.4+snapc9f2b08f92, or
apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.

??? should I install ceph-common ???

ceph-03 :

root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS object not found 
(error calling conf_read_file)')
root@srvr-ceph-03:~#

FWIW : ceph health is :

root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum 
srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr 
[v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z

S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,

If you are using cephadm, the OS (distribution and version) you use should not 
matter. When using cephadm with several servers (the general case!), it is 
important to properly set up the SSH key used by cephadm for communication 
between nodes (cephadm is sort of an SSH-based management cluster) and to check 
that you can log in from one node to the other using SSH. Can you confirm that 
this is the case?

Also, cephadm has a specific log. I don't use the dashboard much and am not 
sure how you would display it there (it may be part of the logs shown by the 
dashboard), but you can access it with the command:

ceph log last debug cephadm

Michel

On 21/07/2025 at 09:19, Stéphane Barthes wrote:

Hi,

Yes, I did use cephadm to bootstrap the 1st node of the cluster, installed 
cephadm on the other nodes, and used the dashboard to add the nodes to the 
cluster.


Regards,

S. Barthes

On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,

How did you configure your cluster? Have you been using cephadm? If not, I 
really advise you to recreate your cluster with cephadm, which includes a 
script to bootstrap the cluster. In particular, if you don't have detailed 
knowledge of Ceph architecture and management, it will ensure that your cluster 
is properly configured and let you learn about Ceph details progressively...

Best regards.

Michel

On 21/07/2025 at 09:02, Stéphane Barthes wrote:

Hello,


I am very new to ceph and have started a small cluster to get started with ceph.

But so far my experience is not very impressive, probably for lack of knowledge 
and good practices.


I started with Ubuntu 24, installed 3 VMs for a ceph cluster, and somehow could 
not get it running. Adding nodes would fail to add OSDs with a weird error (I 
found it on the web but could not solve the problem).

I then made a new cluster with 3 Ubuntu 22 VMs. Install OK, start OK; I created 
1 pool to test storing stuff there and work my way through crash testing. 
However, the cluster dies during the weekly VM snapshot. It may not be a good 
idea to run VM backups on a ceph host, but I find this a little surprising. 
(Crash testing started earlier than expected.)

The bottom line is that after the backup the cluster is in a warning state with 
missing mons or logrotate failures, and sometimes crashed machines. A systemctl 
restart of the service, or rebooting the node, usually fixes it.

I am now stuck in a situation I cannot fix :

    - 1 machine is a ceph rbd client that cannot authenticate: auth method 'x' 
error -13. I have tried quite a few things, and none unlocked the situation. I 
am currently trying to reboot the machine, but the busy/stuck rbd device seems 
to block it. I am not looking forward to hard resetting it.

    - The node with the mgr service will not restart the mon or logrotate. I 
did reboot it again today, but I guess this is not how a node is expected to 
behave.

So my questions :

    - How can I unlock my stuck ceph client, when this kind of error occurs?

    - Is it expected behavior that a client loses access to the cluster, which 
kind of kills the machine?

    - Where should I look in the ceph nodes' logs to figure out what is going 
wrong, and how to fix it, so that it runs in a stable manner?


Regards,

--
S. Barthes

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io