Did you add the host in cephadm with the full name including domain? You can check with ceph orch host ls.
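Since ceph-crash derives its default key name from the node's hostname (see the auth_names snippet Robert quotes below), a quick way to compare the two is something like the rough sketch below; it is only my own check, run on the host with an admin keyring available, and the container may see a slightly different hostname, so treat the output as a hint:

    # Sketch only: compare the key name ceph-crash will derive on this node
    # with the host names cephadm knows and the keys that actually exist.
    import socket
    import subprocess

    # same expression ceph-crash uses for its first auth_names entry
    expected = 'client.crash.%s' % socket.gethostname()
    print('ceph-crash will first try:', expected)

    # host names as the orchestrator recorded them (short name vs. FQDN)
    subprocess.run(['ceph', 'orch', 'host', 'ls'], check=False)
    # does a key exist under exactly that name?
    subprocess.run(['ceph', 'auth', 'get', expected], check=False)

If the names don't line up (short vs. fully qualified), the per-host crash key won't match and the daemon falls through to client.crash and then client.admin, per the auth_names order quoted below.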
On Wed, Apr 16, 2025, 5:04 PM Robert Hish <robert.h...@mpcdf.mpg.de> wrote:
> I ran into this same puzzling behavior and never resolved it. I *think*
> it is a benign bug that can be ignored.
>
> Here's what I found.
>
> The crash service first attempts to ping the cluster to exercise the
> key. "Pinging cluster" is accomplished with a `ceph -s`.
>
> # v18.2.4:src/ceph-crash.in
> 109 def main():
> 110     global auth_names
> 111
> 112     # run as unprivileged ceph user
> 113     drop_privs()
> 114
> 115     # exit code 0 on SIGINT, SIGTERM
> 116     signal.signal(signal.SIGINT, handler)
> 117     signal.signal(signal.SIGTERM, handler)
> 118
> 119     args = parse_args()
> 120     if args.log_level == 'DEBUG':
> 121         log.setLevel(logging.DEBUG)
> 122
> 123     postdir = os.path.join(args.path, 'posted')
> 124     if args.name:
> 125         auth_names = [args.name]
> 126
> 127     while not os.path.isdir(postdir):
> 128         log.error("directory %s does not exist; please create" % postdir)
> 129         time.sleep(30)
> 130
> 131     log.info("pinging cluster to exercise our key")
> 132     pr = subprocess.Popen(args=['timeout', '30', 'ceph', '-s'])
> 133     pr.wait()
>
> That part seems to be broken. Notice that it drops privileges to ceph:ceph,
> with the eventual intent of processing crash reports into crash/posted (see
> the global auth_names below). I think what is happening is that when the
> attempt to process a crash report fails because of directory ownership, it
> then tries to use the client.admin key to see whether the cluster is even
> up. That key can't be found, and so you get the errors. (My best guess.)
>
> Note the global auth_names.
>
> # v18.2.4:src/ceph-crash.in
> 19 auth_names = ['client.crash.%s' % socket.gethostname(),
> 20               'client.crash',
> 21               'client.admin']
> 22
> 23
> 24 def parse_args():
> 25     parser = argparse.ArgumentParser()
> 26     parser.add_argument(
> 27         '-p', '--path', default='/var/lib/ceph/crash',
> 28         help='base path to monitor for crash dumps')
> 29     parser.add_argument(
> 30         '-d', '--delay', default=10.0, type=float,
> 31         help='minutes to delay between scans (0 to exit after one)',
> 32     )
> 33     parser.add_argument(
> 34         '--name', '-n',
> 35         help='ceph name to authenticate as '
> 36              '(default: try client.crash, client.admin)')
> 37     parser.add_argument(
> 38         '--log-level', '-l',
> 39         help='log level output (default: INFO), support INFO or DEBUG')
> 40
> 41     return parser.parse_args()
>
> I came across this when we had a soft crash of a Ceph node, which
> resulted in an incomplete crash directory being created.
>
> Once we removed that incomplete directory,
> /var/lib/ceph/<fsid>/crash/<incomplete crash directory here>,
> the errors went away. In our situation there were other, complete crash
> directories which needed to be processed, but when the service hit the
> incomplete directory it just looped on the error you're seeing. Once we
> deleted the incomplete crash directory, the complete directories were
> processed. Not sure if the same problem exists for you.
>
> It looks like there was discussion about it in the past [1], and a
> tracker report was filed [2], but nothing follows.
>
> You can also use top to see if there is a zombie process running as
> USER ceph with a PPID pointing at ceph-crash. If so, I think you might be
> in the situation below.
>
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/5JYWVOK3NFGXUOBNJFL6EED7YW32DXLY/
> [2] https://tracker.ceph.com/issues/64102
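If it helps anyone check for the incomplete crash directories Robert describes, here is a rough sketch of my own; I'm assuming a finished dump contains a 'meta' file (the file ceph-crash posts), and the host-side path is /var/lib/ceph/<fsid>/crash rather than the in-container default:

    # Sketch only: flag crash directories that have no 'meta' file and are
    # therefore likely the half-written ones that make ceph-crash loop.
    import os
    import sys

    # default is the in-container path; pass the host path as an argument
    crash_root = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/ceph/crash'

    for entry in sorted(os.listdir(crash_root)):
        path = os.path.join(crash_root, entry)
        if entry == 'posted' or not os.path.isdir(path):
            continue
        if not os.path.isfile(os.path.join(path, 'meta')):
            print('possibly incomplete crash dir:', path)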
> There is a related problem (I believe fixed in v18.2.5) concerning
> directory ownership for the crash directories. For whatever reason,
> whenever a non-ceph daemon is deployed (e.g., alertmanager, prometheus,
> et al.), cephadm is hard-coded to first set the crash directory
> ownership according to the daemon being deployed. So, if the
> alertmanager daemon runs as nobody:nobody, then when cephadm deploys
> alertmanager it first goes in and changes the crash directory ownership
> to nobody:nobody. It does this every time a non-ceph daemon is deployed
> (e.g., all the monitoring-type daemons).
>
> This can be very frustrating, but the simple fix is to manually set
> those directories back to ceph:ceph and remove any incomplete crash
> reports. If you don't, the crash service is effectively broken after
> deploying any non-ceph daemon. To test, node-exporter is a good one,
> because it touches every host. Or you could try just grafana or
> prometheus, to limit the work in fixing it again.
>
> # ceph-18.2.4/src/cephadm/cephadm.py
> 2775 def make_data_dir_base(fsid, data_dir, uid, gid):
> 2776     # type: (str, str, int, int) -> str
> 2777     data_dir_base = os.path.join(data_dir, fsid)
> 2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
> 2779     makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
> 2780     makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
> 2781              DATA_DIR_MODE)
> 2782     return data_dir_base
>
> We confirmed this theory by hard-coding the uid and gid to 167, 167
> (ceph:ceph) in a modified cephadm and then redeploying node-exporter.
> The crash directories didn't get changed.
>
> The following was a slightly more elegant solution, which also proved our
> theory, but then for whatever reason our ganesha daemons wouldn't start.
>
> 2775 def make_data_dir_base(fsid, data_dir, daemon_type, uid, gid):
> 2776     # type: (str, str, str, int, int) -> str
> 2777     data_dir_base = os.path.join(data_dir, fsid)
> 2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
> 2779     if daemon_type not in Monitoring.components.keys():
> 2780         makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
> 2781         makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
> 2782                  DATA_DIR_MODE)
> 2783     return data_dir_base
> 2784
> 2785
> 2786 def make_data_dir(ctx, fsid, daemon_type, daemon_id, uid=None, gid=None):
> 2787     # type: (CephadmContext, str, str, Union[int, str], Optional[int], Optional[int]) -> str
> 2788     if uid is None or gid is None:
> 2789         uid, gid = extract_uid_gid(ctx)
> 2790     make_data_dir_base(fsid, ctx.data_dir, daemon_type, uid, gid)
> 2791     data_dir = get_data_dir(fsid, ctx.data_dir, daemon_type, daemon_id)
> 2792     makedirs(data_dir, uid, gid, DATA_DIR_MODE)
> 2793     return data_dir
>
> We gave up at that point, and noticed there might be a fix in v18.2.5,
> released a few days ago:
>
> https://github.com/ceph/ceph/pull/58458
>
> Oddly, it is listed as a "debian" fix; I hope it is a general fix for the
> crash directory ownership issue.
>
> -Robert
>
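For the ownership problem Robert describes, putting things back is just a recursive chown of the crash tree. A minimal sketch of my own, assuming the usual ceph uid/gid of 167:167 inside the containers and the standard cephadm layout; the fsid below is only the one from Daniel's logs, substitute your own:

    # Sketch only: restore crash directory ownership to ceph:ceph (167:167)
    # after a monitoring-daemon redeploy has flipped it. Run as root.
    import os

    fsid = '7644057a-00f6-11f0-9a0c-eac00fed9338'   # example fsid from this thread
    crash_dir = os.path.join('/var/lib/ceph', fsid, 'crash')
    uid = gid = 167

    for root, dirs, files in os.walk(crash_dir):
        os.chown(root, uid, gid)                    # the crash dir itself, then each subdir
        for name in files:
            os.chown(os.path.join(root, name), uid, gid)
    print('re-owned %s to %d:%d' % (crash_dir, uid, gid))

A plain `chown -R 167:167 /var/lib/ceph/<fsid>/crash` does the same thing; the point is simply that ownership has to be ceph:ceph again before the crash service can process anything.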
> On 4/13/25 12:16, Daniel Vogelbacher wrote:
> > Hello,
> >
> > recently I checked my server logs for error events and got some hits for
> > the ceph crash service, deployed with cephadm.
> >
> > When restarted, the crash service logs this to journald:
> >
> > auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > monclient: authenticate NOTE: no keyring found; disabled cephx authentication
> > [errno 13] RADOS permission denied (error connecting to the cluster)
> >
> > The image is mounted with:
> >
> > -v /var/lib/ceph/7644057a-00f6-11f0-9a0c-eac00fed9338/crash.virt-master3/keyring:/etc/ceph/ceph.client.crash.virt-master3.keyring
> >
> > so I assume there should be a key available, but the crash daemon searches
> > for an admin keyring instead of "client.crash.virt-master3".
> >
> > Any advice on how to fix this error? I see these error logs on all machines
> > where the crash service is deployed. I've tried a redeployment without any
> > effect.
> >
> > # ceph -v
> > ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid (stable)
> >
> > # systemctl restart ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service
> >
> > # journalctl -u ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service -ef
> >
> > Output:
> >
> > Apr 12 18:36:24 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1402272]: *** Interrupted with signal 15 ***
> > Apr 12 18:36:24 virt-master3 podman[1408122]: 2025-04-12 18:36:24.992684396 +0200 CEST m=+0.043782773 container died 6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.name=CentOS Stream 9 Base Image, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, CEPH_REF=squid, org.label-schema.schema-version=1.0, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, io.buildah.version=1.33.7, org.opencontainers.image.documentation=https://docs.ceph.com/, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, FROM_IMAGE=quay.io/centos/centos:stream9, org.label-schema.build-date=20250124, OSD_FLAVOR=default)
> > Apr 12 18:36:25 virt-master3 podman[1408122]: 2025-04-12 18:36:25.016797163 +0200 CEST m=+0.067895534 container remove 6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.opencontainers.image.documentation=https://docs.ceph.com/, org.label-schema.build-date=20250124, FROM_IMAGE=quay.io/centos/centos:stream9, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, io.buildah.version=1.33.7, org.label-schema.name=CentOS Stream 9 Base Image, org.label-schema.vendor=CentOS, org.label-schema.schema-version=1.0, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default, org.label-schema.license=GPLv2, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>)
> > Apr 12 18:36:25 virt-master3 bash[1408104]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3
> > Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service: Deactivated successfully.
> > Apr 12 18:36:25 virt-master3 systemd[1]: Stopped ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
> > Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service: Consumed 1.540s CPU time.
> > Apr 12 18:36:25 virt-master3 systemd[1]: Starting ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338...
> > Apr 12 18:36:25 virt-master3 podman[1408259]:
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.498991532 +0200 CEST m=+0.071935136 container create bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.build-date=20250124, org.label-schema.name=CentOS Stream 9 Base Image, OSD_FLAVOR=default, org.label-schema.schema-version=1.0, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, io.buildah.version=1.33.7, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, FROM_IMAGE=quay.io/centos/centos:stream9, org.opencontainers.image.documentation=https://docs.ceph.com/, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>)
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.462816456 +0200 CEST m=+0.035760071 image pull quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.589479015 +0200 CEST m=+0.162422619 container init bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.license=GPLv2, org.label-schema.schema-version=1.0, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, org.label-schema.name=CentOS Stream 9 Base Image, OSD_FLAVOR=default, org.label-schema.build-date=20250124, org.label-schema.vendor=CentOS, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, FROM_IMAGE=quay.io/centos/centos:stream9, org.opencontainers.image.documentation=https://docs.ceph.com/, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, CEPH_REF=squid, io.buildah.version=1.33.7)
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.595018205 +0200 CEST m=+0.167961840 container start bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.license=GPLv2, FROM_IMAGE=quay.io/centos/centos:stream9, org.label-schema.build-date=20250124, io.buildah.version=1.33.7, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, org.label-schema.vendor=CentOS, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default, org.opencontainers.image.documentation=https://docs.ceph.com/, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, org.label-schema.name=CentOS Stream 9 Base Image, org.label-schema.schema-version=1.0)
> > Apr 12 18:36:25 virt-master3 bash[1408259]: bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
> > Apr 12 18:36:25 virt-master3 systemd[1]: Started ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
> > Apr 12 18:36:25 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: INFO:ceph-crash:pinging cluster to exercise our key
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a5b91640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a6392640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: [errno 13] RADOS permission denied (error connecting to the cluster)
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io