[ceph-users] Re: Newby woes with ceph

Michel Jouvin Mon, 21 Jul 2025 01:07:32 -0700

Stéphane,

On ceph-02, I am not sure why the ceph command is not installed as on the other nodes, if you installed it the same way. One way to get access to the ceph command on this server is to run:


cephadm shell

This will start a container where you have the ceph environment installed and configured for your cluster.
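
As a rough sketch of the same idea: a small shell wrapper can call the host's ceph binary when one exists and otherwise go through `cephadm shell -- ceph ...`, which runs a single command inside the container and then exits. The helper name `run_ceph` is made up for this illustration and is not part of Ceph. It only checks that a ceph command is present, not that the command can actually reach the cluster.

```shell
# Hypothetical helper (not part of Ceph): run a ceph command natively if the
# binary is on this host, otherwise run it inside the cephadm container.
run_ceph() {
    if command -v ceph >/dev/null 2>&1; then
        ceph "$@"
    else
        # "cephadm shell -- <cmd>" executes one command in the container.
        cephadm shell -- ceph "$@"
    fi
}

# Example (as root on a cluster node):
#   run_ceph log last debug cephadm
```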

The situation is not as bad as I thought when reading your first message. You have the mon quorum, so at least the ceph command should be usable. The first thing to do is probably to log in to your ceph-01 node and try to understand why the mon daemon is crashing. You may want to run on this node:


cephadm ls    # look for the exact daemon name corresponding to the mon

cephadm logs --name $daemon_name
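
A minimal sketch of picking the mon name out of the `cephadm ls` output, assuming that output is the usual JSON list in which each daemon entry carries a `"name": "mon.<host>"` field. The function name `mon_daemon_names` is invented for this example, and the grep-based parsing is a shortcut, not a real JSON parser. `cephadm logs` takes the daemon name through its `--name` option.

```shell
# Hypothetical helper: print the names of mon daemons found in `cephadm ls`
# JSON read from stdin. Assumes entries contain "name": "mon.<something>".
mon_daemon_names() {
    grep -o '"name": *"mon\.[^"]*"' | cut -d'"' -f4
}

# Example (as root on the node whose mon is failing):
#   cephadm ls | mon_daemon_names
#   cephadm logs --name mon.srvr-ceph-01
```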

Apart from this, it is strange that ceph-03 reports a RADOS error with 'ceph log last...'; this probably hides another issue. Could you tell me what the same command says on ceph-02 (when run in cephadm shell)?


Michel

On 21/07/2025 at 09:44, Stéphane Barthes wrote:

Michel,


I ran "ceph log last debug cephadm" on my 3 nodes, and "mileage varies"

ceph-01 :

some errors, and it ends with
2025-07-20T03:24:18.887889+0000 mgr.srvr-ceph-03.dhzbpe (mgr.134360) 1368 : cephadm [INF] Deploying daemon mon.srvr-ceph-03 on srvr-ceph-03
when I had to remove the mon daemon and redeploy on ceph-03.

ceph-02 :

root@srvr-ceph-02:~# ceph log last debug cephadm
Command 'ceph' not found, but can be installed with:
snap install microceph    # version 18.2.4+snapc9f2b08f92, or
apt  install ceph-common  # version 17.2.7-0ubuntu0.22.04.2
See 'snap info microceph' for additional versions.

??? should I install ceph-common ???

ceph-03 :

root@srvr-ceph-03:~# ceph log last debug cephadm
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
root@srvr-ceph-03:~#

FWIW : ceph health is :

root@srvr-ceph-01:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.srvr-ceph-01 on srvr-ceph-01 is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum srvr-ceph-03,srvr-ceph-02
    mon.srvr-ceph-01 (rank 0) addr [v2:10.32.100.22:3300/0,v1:10.32.100.22:6789/0] is down (out of quorum)
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:10.202091Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:47.712267Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:50:21.464475Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:36.609442Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:49:58.966663Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:36.947240Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:21.751711Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:48.490875Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:51:59.651129Z
    mon.srvr-ceph-01 crashed on host srvr-ceph-01 at 2025-07-20T17:52:10.552756Z
S. Barthes
On 21/07/2025 at 09:31, Michel Jouvin wrote:
Stephane,
If you are using cephadm, the OS (distribution and version) you use should not matter. When using cephadm with several servers (the general case!), it is important to properly set up the SSH key used by cephadm for the communication between nodes (cephadm is a sort of SSH-based management cluster) and to check that you can log in from one node to the other using SSH. Can you confirm that this is the case?
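
One way to sketch the SSH check in shell is below. It is an illustration under assumptions: the function name `check_ssh_hosts` is invented, it logs in as root, and it uses whatever key the local root account offers, which may not be the key cephadm itself stores in the cluster. The orchestrator's own view of a host can be queried separately with `ceph cephadm check-host <hostname>`.

```shell
# Hypothetical helper: report which hosts accept a non-interactive SSH login
# as root. BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh_hosts() {
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true; then
            echo "${host}: ok"
        else
            echo "${host}: FAILED"
        fi
    done
}

# Example, run from each node in turn:
#   check_ssh_hosts srvr-ceph-01 srvr-ceph-02 srvr-ceph-03
```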
Also, cephadm has a specific log. I don't use the dashboard much, so I am not sure how you display it there (it may be part of the logs displayed by the dashboard), but you can access it with the command:
ceph log last debug cephadm

Michel

On 21/07/2025 at 09:19, Stéphane Barthes wrote:
Hi,
Yes, I did use cephadm to bootstrap the first node in the cluster, installed cephadm on the other nodes, and used the dashboard to add the nodes to the cluster.
Regards,

S. Barthes

On 21/07/2025 at 09:12, Michel Jouvin wrote:
Hi Stephane,
How did you configure your cluster? Have you been using cephadm? If not, I really advise you to recreate your cluster with cephadm, which includes a script to bootstrap the cluster. In particular, if you don't have detailed knowledge of Ceph architecture and management, it will ensure that your cluster is properly configured and let you progressively learn about Ceph details...
Best regards.

Michel

On 21/07/2025 at 09:02, Stéphane Barthes wrote:
Hello,
I am very new to ceph and have set up a small cluster to get started with it.
But so far my experience is not very impressive, probably for lack of knowledge and good practices.
I started with Ubuntu 24, installed 3 VMs for a ceph cluster, and somehow could not get it running. Adding nodes would fail when adding OSDs, with some weird error (I found it on the web but could not solve the problem).
I then made a new cluster with 3 Ubuntu 22 VMs. Install OK, start OK. I created 1 pool to test storing stuff there and work my way through crash testing. However, the cluster dies during the weekly VM snapshot. It may not be a good idea to run VM backups on a ceph host, but I find this a little surprising. (Crash testing started earlier than expected.)
The bottom line is that after the backup the cluster is in a warning state, with missing mons or logrotate and sometimes crashed machines. Restarting the service with systemctl or rebooting the node usually fixes it.
I am now stuck in a situation I cannot fix :
- One machine, a ceph rbd client, cannot authenticate: auth method 'x' error -13. I have tried quite a few things, and none unlocked the situation. I am currently trying to reboot the machine, but the busy/stuck rbd device seems to block it. I am not looking forward to hard resetting it.
- The node with the mgr service will not restart mon or logrotate. I rebooted it again today, but I guess this is not how a node is expected to behave.
So my questions :
- How can I unlock my stuck ceph client when this kind of error occurs?
- Is it expected behavior that a client loses access to the cluster, which kind of kills the machine?
- Where should I look in the ceph node logs to figure out what is going wrong and how to fix it, so that it runs in a stable manner?
Regards,

--
S. Barthes

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io