Aren't both ways valid, unless I'm misinterpreting it? Whichever is used, bare hostname or FQDN, the only thing that matters is that it matches the output of `hostname`, no?

https://docs.ceph.com/en/reef/cephadm/host-management/#fully-qualified-domain-names-vs-bare-host-names
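
If you want to double-check, something like this on the node in question
(assuming cephadm shell is available there) should show whether the two
agree:

# hostname
# cephadm shell -- ceph orch host ls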

-Robert


On 4/16/25 16:14, Can Özyurt wrote:
Did you add the host in cephadm with the full name including domain? You
can check with ceph orch host ls.

On Wed, Apr 16, 2025, 5:04 PM Robert Hish <robert.h...@mpcdf.mpg.de> wrote:


I ran into this same puzzling behavior and never resolved it. I *think*
it is a benign bug that can be ignored.

Here's what I found.

The crash service first attempts to ping the cluster to exercise the
key. "pinging cluster" is accomplished with a `ceph -s`.

# v18.2.4:src/ceph-crash.in
109 def main():
110     global auth_names
111
112     # run as unprivileged ceph user
113     drop_privs()
114
115     # exit code 0 on SIGINT, SIGTERM
116     signal.signal(signal.SIGINT, handler)
117     signal.signal(signal.SIGTERM, handler)
118
119     args = parse_args()
120     if args.log_level == 'DEBUG':
121         log.setLevel(logging.DEBUG)
122
123     postdir = os.path.join(args.path, 'posted')
124     if args.name:
125         auth_names = [args.name]
126
127     while not os.path.isdir(postdir):
128         log.error("directory %s does not exist; please create" % postdir)
129         time.sleep(30)
130
131     log.info("pinging cluster to exercise our key")
132     pr = subprocess.Popen(args=['timeout', '30', 'ceph', '-s'])
133     pr.wait()

That part seems to be broken. Notice it drops privileges to ceph:ceph,
with the eventual intent of processing crash reports into crash/posted
(see the global auth_names below). I think what is happening is that
when the attempt to process a crash report fails because of directory
ownership, it then tries to use the client.admin key to see if the
cluster is even up. That key can't be found, and so you get the errors.
(My best guess.)
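
A quick way to check whether ownership is the culprit (<fsid> is a
placeholder for your cluster fsid):

# ls -ln /var/lib/ceph/<fsid>/crash /var/lib/ceph/<fsid>/crash/posted

Everything under there should be owned by 167:167, which is ceph:ceph
inside the containers.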

Note the global auth_names.

# v18.2.4:src/ceph-crash.in
   19 auth_names = ['client.crash.%s' % socket.gethostname(),
   20               'client.crash',
   21               'client.admin']
   22
   23
   24 def parse_args():
   25     parser = argparse.ArgumentParser()
   26     parser.add_argument(
   27         '-p', '--path', default='/var/lib/ceph/crash',
   28         help='base path to monitor for crash dumps')
   29     parser.add_argument(
   30         '-d', '--delay', default=10.0, type=float,
   31         help='minutes to delay between scans (0 to exit after one)',
   32     )
   33     parser.add_argument(
   34         '--name', '-n',
   35         help='ceph name to authenticate as '
   36              '(default: try client.crash, client.admin)')
   37     parser.add_argument(
   38         '--log-level', '-l',
   39         help='log level output (default: INFO), support INFO or DEBUG')
   40
   41     return parser.parse_args()
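
To rule out the per-host crash key itself, you can compare what the
cluster has with the keyring cephadm wrote on the host (paths and the
hostname form are placeholders; use whatever `ceph orch host ls` shows):

# cephadm shell -- ceph auth get client.crash.$(hostname)
# cat /var/lib/ceph/<fsid>/crash.$(hostname)/keyring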

I came across this when we had a soft crash of a ceph node which
resulted in an incomplete crash directory being created.

Once we removed that incomplete directory
(/var/lib/ceph/<fsid>/crash/<incomplete crash directory here>)
the errors went away. In our situation, there were other complete crash
directories which needed to be processed, but when ceph-crash ran into
the incomplete directory it just looped on the error you're seeing. Once
we deleted the incomplete crash directory, the complete directories were
processed. Not sure if the same problem exists for you.
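
If memory serves, a complete crash directory contains at least a meta
file, so something like this (GNU find) should flag suspicious ones;
treat it as a rough sketch and sanity-check before deleting anything:

# find /var/lib/ceph/<fsid>/crash -mindepth 1 -maxdepth 1 -type d ! -name posted ! -exec test -e '{}/meta' \; -print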

It looks like there was discussion about it in the past[1], and a
tracker report was filed[2], but nothing followed.

You can also use top to see if there is a zombie process running as
user ceph with a PPID belonging to ceph-crash. If so, I think you might
be in the situation below.
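
A quick way to spot that without top (the stat column shows Z for
zombie processes; check whether the PPID is the ceph-crash process):

# ps -eo user,pid,ppid,stat,cmd | awk '$4 ~ /^Z/'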


[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/5JYWVOK3NFGXUOBNJFL6EED7YW32DXLY/
[2] https://tracker.ceph.com/issues/64102


There is a related problem (I believe fixed in v18.2.5) concerning
directory ownership for the crash directories. For whatever reason,
whenever a non-ceph daemon is deployed (e.g., alertmanager, prometheus,
et al.), cephadm is hard-coded to first set the crash directory
ownership according to the daemon being deployed. So, if the
alertmanager daemon runs as nobody:nobody, then when cephadm deploys
alertmanager it first goes in and changes the crash directory ownership
to nobody:nobody.

It does this every time a non-ceph daemon is deployed (i.e., all the
monitoring-type daemons).

This can be very frustrating. But the simple fix is to manually set
those directories back to ceph:ceph and remove any incomplete crash
reports (commands sketched below). If you don't, the crash service is
effectively broken after deploying any non-ceph daemon. To test,
node-exporter is a good one, because it touches every host.

Or you could try just grafana or prometheus, to limit the work in
fixing it again.
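
The manual fix amounts to something like this; <fsid> is a placeholder
for your cluster fsid, and 167:167 is the ceph uid/gid used by the
containers:

# chown -R 167:167 /var/lib/ceph/<fsid>/crash

To reproduce the problem afterwards, redeploy one of the monitoring
services, e.g.:

# ceph orch redeploy node-exporter

and the crash directory ownership should flip back to that daemon's user.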

# ceph-18.2.4/src/cephadm/cephadm.py
   2775 def make_data_dir_base(fsid, data_dir, uid, gid):
   2776     # type: (str, str, int, int) -> str
   2777     data_dir_base = os.path.join(data_dir, fsid)
   2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
   2779     makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
   2780     makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
   2781              DATA_DIR_MODE)
   2782     return data_dir_base

We confirmed this theory by hardcoding the uid and gid to 167, 167
(ceph:ceph) in a modified cephadm and then redeploying node-exporter.
The crash directories didn't get changed.

The change below was a slightly more elegant solution, which also
proved our theory, but then for whatever reason our ganesha daemons
wouldn't start.

   2775 def make_data_dir_base(fsid, data_dir, daemon_type, uid, gid):
   2776     # type: (str, str, str, int, int) -> str
   2777     data_dir_base = os.path.join(data_dir, fsid)
   2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
   2779     if daemon_type not in Monitoring.components.keys():
   2780         makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
   2781         makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
   2782                  DATA_DIR_MODE)
   2783     return data_dir_base
   2784
   2785
   2786 def make_data_dir(ctx, fsid, daemon_type, daemon_id, uid=None, gid=None):
   2787     # type: (CephadmContext, str, str, Union[int, str], Optional[int], Optional[int]) -> str
   2788     if uid is None or gid is None:
   2789         uid, gid = extract_uid_gid(ctx)
   2790     make_data_dir_base(fsid, ctx.data_dir, daemon_type, uid, gid)
   2791     data_dir = get_data_dir(fsid, ctx.data_dir, daemon_type, daemon_id)
   2792     makedirs(data_dir, uid, gid, DATA_DIR_MODE)
   2793     return data_dir

We gave up at that point, and noticed there might be a fix in the latest
v18.2.5 released a few days ago.

https://github.com/ceph/ceph/pull/58458

Oddly, it is listed as a "debian" fix; I hope it is actually a general
fix for the crash directory ownership issue.
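
If you want to pick that fix up on a reef cluster, the usual cephadm
upgrade path would be something like (version shown only as an example):

# ceph orch upgrade start --ceph-version 18.2.5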

-Robert




On 4/13/25 12:16, Daniel Vogelbacher wrote:
Hello,

Recently I checked my server logs for error events and got some hits for
the ceph crash service, deployed with cephadm.

When restarted, the crash service logs the following to journald:

auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/
etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin, disabling cephx
auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/
etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No
such file or directory
AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin, disabling cephx
monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [1]
monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [1]
monclient: authenticate NOTE: no keyring found; disabled cephx
authentication
[errno 13] RADOS permission denied (error connecting to the cluster)

The container is started with this keyring mount:

-v /var/lib/ceph/7644057a-00f6-11f0-9a0c-eac00fed9338/crash.virt-
master3/keyring:/etc/ceph/ceph.client.crash.virt-master3.keyring

so I assume there should be a key available, but the crash daemon
searches for an admin keyring instead of "client.crash.virt-master3".

Any advice on how to fix this error? I see these error logs on all
machines where the crash service is deployed. I've tried a redeployment
without any effect.

# ceph -v
ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid
(stable)

# systemctl restart ceph-7644057a-00f6-11f0-9a0c-
eac00fed9...@crash.virt-master3.service

# journalctl -u ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-
master3.service -ef

Output:

Apr 12 18:36:24 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1402272]: *** Interrupted with signal 15 ***
Apr 12 18:36:24 virt-master3 podman[1408122]: 2025-04-12
18:36:24.992684396 +0200 CEST m=+0.043782773 container died
6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b
(image=quay.io/ceph/
ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de,
name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3,
org.label-schema.name=CentOS Stream 9 Base Image,
org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>, org.label-schema.vendor=CentOS,
org.label-schema.license=GPLv2, CEPH_REF=squid,
org.label-schema.schema-version=1.0, CEPH_GIT_REPO=
https://github.com/ceph/ceph.git, io.buildah.version=1.33.7,
org.opencontainers.image.documentation=https://docs.ceph.com/,
CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
FROM_IMAGE=quay.io/centos/centos:stream9,
org.label-schema.build-date=20250124, OSD_FLAVOR=default)
Apr 12 18:36:25 virt-master3 podman[1408122]: 2025-04-12
18:36:25.016797163 +0200 CEST m=+0.067895534 container remove
6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b
(image=quay.io/ceph/
ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de,
name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3,
org.opencontainers.image.documentation=https://docs.ceph.com/,
org.label-schema.build-date=20250124, FROM_IMAGE=
quay.io/centos/centos:stream9, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
io.buildah.version=1.33.7, org.label-schema.name=CentOS Stream 9 Base
Image, org.label-schema.vendor=CentOS, org.label-schema.schema-version=1.0,
CEPH_GIT_REPO=https://github.com/ceph/ceph.git,
CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default,
org.label-schema.license=GPLv2, CEPH_REF=squid,
org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>)
Apr 12 18:36:25 virt-master3 bash[1408104]:
ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3
Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-
eac00fed9...@crash.virt-master3.service: Deactivated successfully.
Apr 12 18:36:25 virt-master3 systemd[1]: Stopped
ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service -
Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-
eac00fed9...@crash.virt-master3.service: Consumed 1.540s CPU time.
Apr 12 18:36:25 virt-master3 systemd[1]: Starting
ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service -
Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338...
Apr 12 18:36:25 virt-master3 podman[1408259]:
Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12
18:36:25.498991532 +0200 CEST m=+0.071935136 container create
bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
(image=quay.io/ceph/
ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de,
name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3,
org.label-schema.build-date=20250124, org.label-schema.name=CentOS Stream
9 Base Image, OSD_FLAVOR=default, org.label-schema.schema-version=1.0,
CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb,
io.buildah.version=1.33.7, CEPH_GIT_REPO=https://github.com/ceph/ceph.git,
org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, FROM_IMAGE=
quay.io/centos/centos:stream9, org.opencontainers.image.documentation=
https://docs.ceph.com/, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>)
Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12
18:36:25.462816456 +0200 CEST m=+0.035760071 image pull quay.io/ceph/
ceph@sha256
:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de
Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12
18:36:25.589479015 +0200 CEST m=+0.162422619 container init
bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
(image=quay.io/ceph/
ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de,
name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3,
org.label-schema.license=GPLv2, org.label-schema.schema-version=1.0,
CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, org.label-schema.name=CentOS
Stream 9 Base Image, OSD_FLAVOR=default,
org.label-schema.build-date=20250124, org.label-schema.vendor=CentOS,
org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>, FROM_IMAGE=quay.io/centos/centos:stream9,
org.opencontainers.image.documentation=https://docs.ceph.com/,
GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
CEPH_GIT_REPO=https://github.com/ceph/ceph.git, CEPH_REF=squid,
io.buildah.version=1.33.7)
Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12
18:36:25.595018205 +0200 CEST m=+0.167961840 container start
bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
(image=quay.io/ceph/
ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de,
name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3,
org.label-schema.license=GPLv2, FROM_IMAGE=quay.io/centos/centos:stream9,
org.label-schema.build-date=20250124, io.buildah.version=1.33.7,
CEPH_GIT_REPO=https://github.com/ceph/ceph.git,
org.label-schema.vendor=CentOS,
CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default,
org.opencontainers.image.documentation=https://docs.ceph.com/,
CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
org.label-schema.name=CentOS Stream 9 Base Image,
org.label-schema.schema-version=1.0)
Apr 12 18:36:25 virt-master3 bash[1408259]:
bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
Apr 12 18:36:25 virt-master3 systemd[1]: Started
ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service -
Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
Apr 12 18:36:25 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: INFO:ceph-crash:pinging cluster to exercise
our key
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640
-1 auth: unable to find a keyring on /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin: (2) No such file or directory
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640
-1 AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin, disabling cephx
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640
-1 auth: unable to find a keyring on /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin: (2) No such file or directory
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640
-1 AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/
ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/
ceph/keyring.bin, disabling cephx
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a5b91640
-1 monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [1]
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a6392640
-1 monclient(hunting): handle_auth_bad_method server allowed_methods [2]
but i only support [1]
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640
-1 monclient: authenticate NOTE: no keyring found; disabled cephx
authentication
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: [errno 13] RADOS permission denied (error
connecting to the cluster)
Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-
crash-virt-master3[1408301]: INFO:ceph-crash:monitoring path /var/lib/
ceph/crash, delay 600s

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
