Hi Amudhan,

Sounds like the dependency doesn't have a timeout. It would help if systemd 
logged a message every minute or so about a pending dependency (like it does on 
the boot screen); I'm not sure whether that can be configured. Otherwise, you 
could add a timeout and make the units fail after 2-5 minutes. NTP shouldn't 
take much time to come up under normal circumstances.
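
As a rough sketch (untested; the unit name and timeout values are placeholders, 
adjust them to your setup), a drop-in created with "systemctl edit 
ceph-mon@<mon-id>.service" could contain:

[Unit]
# fail the queued start job if it is still waiting on dependencies after 5 min
JobTimeoutSec=300

[Service]
# fail the unit if the daemon itself doesn't finish starting within 5 minutes
TimeoutStartSec=300

If you create the drop-in file by hand instead of via "systemctl edit", run 
"systemctl daemon-reload" afterwards.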

I'm not a systemd wizard. If you do something like this, please post it here as 
a reply so others can find it.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Amudhan P <amudha...@gmail.com>
Sent: Tuesday, September 17, 2024 8:15 AM
To: Frank Schilder
Cc: Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Ceph octopus version cluster not starting

No, there wasn't any error message from systemd; it was just silent, even for an hour.

On Mon, Sep 16, 2024 at 10:02 PM Frank Schilder 
<fr...@dtu.dk> wrote:
Hi Amudhan,

Great that you figured that out. Doesn't systemd output an error in that case? 
I would expect an error message; on our systems systemd is quite chatty when a 
unit fails.

You probably still need to figure out why your new OSD took everything down 
over time. Maybe create a new case if this happens again.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Amudhan P <amudha...@gmail.com>
Sent: Monday, September 16, 2024 6:19 PM
To: Frank Schilder
Cc: Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Ceph octopus version cluster not starting

Thanks Frank.

Figured out the issue: it was NTP. The nodes were not able to reach the NTP 
server, which caused the NTP service to fail.

It looks like the Ceph systemd units have a dependency on the NTP service.
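
If you want to confirm which dependency is blocking a unit, something along 
these lines should show it (a sketch; the exact unit names depend on your 
setup):

systemctl show ceph-mon@<mon-id>.service -p After -p Wants -p Requires
systemctl list-dependencies ceph-mon@<mon-id>.service
systemctl list-jobs    # start jobs stuck in "waiting" point at the blocker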

On Mon, Sep 16, 2024 at 4:12 PM Frank Schilder 
<fr...@dtu.dk> wrote:
I think this output is normal and I guess the MON is up? If so, I would start 
another mon in the same way on another host. If the monmap is correct with 
network etc. they should start talking to each other. If you have 3 mons in the 
cluster, you should get quorum.

On the host where the mon is running, you can also ask for the cluster status 
via the mon-admin socket. You should get a response that includes "out of 
quorum" or the like. Once you have the second mon up, you can start checking 
that they form quorum.
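
For reference, that admin-socket query could look roughly like this (the MON 
name and socket path are placeholders and may differ on your system):

ceph daemon mon.<mon-id> mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.<mon-id>.asok quorum_status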

If this works, then I would conclude that your cluster is probably OK on disk 
and the issue is somewhere with systemd.

You shouldn't run too much manually. I usually use this only to confirm that 
the daemon can start and that its data store on disk is healthy. After that, I 
start looking for what prevents startup. In your case it doesn't seem to be the 
Ceph daemons crashing, and that's what this check is mainly for. You could 
maybe try one MGR and then one OSD. If these come up and join the cluster, it's 
something outside Ceph.

For your systemd debugging, add at least the option "-f" to the daemon's 
command lines to force traditional log files to be written.
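
As a sketch of how that could look (copy the exact ExecStart line from your own 
unit file first, e.g. via "systemctl cat ceph-mon@<mon-id>.service"; the line 
below is only an illustration), a drop-in created with "systemctl edit 
ceph-mon@.service" could contain:

[Service]
# clear the original ExecStart, then redefine it with -f added
ExecStart=
ExecStart=/usr/bin/ceph-mon -f --cluster ceph --id %i --setuser ceph --setgroup ceph

then reload systemd and restart the unit.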

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Amudhan P <amudha...@gmail.com>
Sent: Monday, September 16, 2024 12:18 PM
To: Frank Schilder
Cc: Eugen Block; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Ceph octopus version cluster not starting

Frank,

With the manual command I was able to start the MON and see logs in the log 
file, and I don't find any issue in the logs except the lines below.
Should I stop the manual command and try to start the MON service from systemd, 
or follow the same approach on all MON nodes?

2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3757] Recovered from manifest file:/var/lib/ceph/mon/node/store.db/MANIFEST-4328236 succeeded,manifest_file_number is 4328236, next_file_number is 4328238, last_sequence is 1782572963, log_number is 4328223,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3766] Column family [default] (ID 0), log number is 4328223

2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1726481214623513, "job": 1, "event": "recovery_started", "log_files": [4328237]}
2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/db_impl_open.cc:583] Recovering log #4328237 mode 2
2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: [db/version_set.cc:3036] Creating manifest 4328239

2024-09-16T15:36:54.620+0530 7f5783d1e5c0  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1726481214625473, "job": 1, "event": "recovery_finished"}
2024-09-16T15:36:54.628+0530 7f5783d1e5c0  4 rocksdb: DB pointer 0x561bb7e90000



On Mon, Sep 16, 2024 at 2:22 PM Frank Schilder <fr...@dtu.dk> wrote:
Hi. When I have issues like this, what sometimes helps is to start a daemon 
manually (not systemctl or anything like that). Make sure no ceph-mon is 
running on the host:

ps -eo cmd | grep ceph-mon

and start a ceph-mon manually with a command like this (make sure the binary is 
the correct version):

/usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i 
MON-NAME --mon-data /var/lib/ceph/mon/STORE --public-addr MON-IP

Depending on your debug settings, this command does output a bit on startup. If 
your settings in ceph.conf are 0/0, I think you can override this on the 
command line. It might be useful to set the option "-d" (debug mode with "log 
to stderr") on the command line as well. With defaults it will talk at least 
about opening the store and then just wait or complain that there are no peers.

This is a good sign.

Once you've got one MON running, start another one on another host, and so on 
until you have enough up for quorum. Then you can start querying the MONs about 
what their problem is.

If none of this works, the output of the manual command, maybe with higher 
debug settings on the command line, should be helpful.
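
Higher debug settings could look like this, for example (same placeholders as 
in the command above; the debug values are just a starting point):

/usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph -d -i MON-NAME \
    --mon-data /var/lib/ceph/mon/STORE --public-addr MON-IP \
    --debug_mon 20 --debug_ms 1 --debug_paxos 20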

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Amudhan P <amudha...@gmail.com>
Sent: Monday, September 16, 2024 10:36 AM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph octopus version cluster not starting

No, I don't use cephadm, and I have enough space for log storage.

When I try to start the MON service on any of the nodes, it just keeps waiting 
to complete, without any error message in stdout or in the log file.

On Mon, Sep 16, 2024 at 1:21 PM Eugen Block <ebl...@nde.ag> wrote:

> Hi,
>
> I would focus on the MONs first. If they don't start, your cluster is
> not usable. It doesn't look like you use cephadm, but please confirm.
> Check if the nodes are running out of disk space, maybe that's why
> they don't log anything and fail to start.
>
>
> Zitat von Amudhan P <amudha...@gmail.com>:
>
> > Hi,
> >
> > Recently added one disk to the Ceph cluster using "ceph-volume lvm create
> > --data /dev/sdX", but the new OSD didn't start. After some time, the OSD
> > services on the other nodes also stopped, so I restarted all nodes in the
> > cluster. Now, after the restart, the MON, MDS, MGR and OSD services are
> > not starting. Couldn't find any new logs either; after the restart it is
> > totally silent on all nodes.
> > Could find some logs in the ceph-volume service.
> >
> >
> > Error in Ceph-volume logs:
> > [2024-09-15 23:38:15,080][ceph_volume.process][INFO  ] stderr Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-5
> > --> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-5
> > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e --path /var/lib/ceph/osd/ceph-5 --no-mon-config
> >  stderr: failed to read label for /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e: (2) No such file or directory
> > 2024-09-15T23:38:15.059+0530 7fe7767c8100 -1 bluestore(/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e) _read_bdev_label failed to open /dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e: (2) No such file or directory
> > -->  RuntimeError: command returned non-zero exit status: 1
> > [2024-09-15 23:38:15,084][ceph_volume.process][INFO  ] stderr Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
> > --> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> > Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> > Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
> >  stderr: failed to read label for /dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988: (2) No such file or directory
> >
> > But the path "/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988" is valid, and I can list the folder.
> >
> > Not sure how to proceed or where to start. Any idea or suggestion?
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
