Did you add the host in cephadm with the full name including domain? You can check with ceph orch host ls.
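Since ceph-crash derives its default key name from the node's hostname (see the auth_names snippet Robert quotes below), a quick way to compare the two is something like the rough sketch below; it is only my own check, run on the host with an admin keyring available, and the container may see a slightly different hostname, so treat the output as a hint:

    # Sketch only: compare the key name ceph-crash will derive on this node
    # with the host names cephadm knows and the keys that actually exist.
    import socket
    import subprocess

    # same expression ceph-crash uses for its first auth_names entry
    expected = 'client.crash.%s' % socket.gethostname()
    print('ceph-crash will first try:', expected)

    # host names as the orchestrator recorded them (short name vs. FQDN)
    subprocess.run(['ceph', 'orch', 'host', 'ls'], check=False)
    # does a key exist under exactly that name?
    subprocess.run(['ceph', 'auth', 'get', expected], check=False)

If the names don't line up (short vs. fully qualified), the per-host crash key won't match and the daemon falls through to client.crash and then client.admin, per the auth_names order quoted below.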
On Wed, Apr 16, 2025, 5:04 PM Robert Hish <robert.h...@mpcdf.mpg.de> wrote:
> I ran into this same puzzling behavior and never resolved it. I *think*
> it is a benign bug that can be ignored.
>
> Here's what I found.
>
> The crash service first attempts to ping the cluster to exercise the
> key. "Pinging cluster" is accomplished with a `ceph -s`.
>
> # v18.2.4:src/ceph-crash.in
> 109 def main():
> 110     global auth_names
> 111
> 112     # run as unprivileged ceph user
> 113     drop_privs()
> 114
> 115     # exit code 0 on SIGINT, SIGTERM
> 116     signal.signal(signal.SIGINT, handler)
> 117     signal.signal(signal.SIGTERM, handler)
> 118
> 119     args = parse_args()
> 120     if args.log_level == 'DEBUG':
> 121         log.setLevel(logging.DEBUG)
> 122
> 123     postdir = os.path.join(args.path, 'posted')
> 124     if args.name:
> 125         auth_names = [args.name]
> 126
> 127     while not os.path.isdir(postdir):
> 128         log.error("directory %s does not exist; please create" % postdir)
> 129         time.sleep(30)
> 130
> 131     log.info("pinging cluster to exercise our key")
> 132     pr = subprocess.Popen(args=['timeout', '30', 'ceph', '-s'])
> 133     pr.wait()
>
> That part seems to be broken. Notice that it drops privileges to ceph:ceph,
> with the eventual intent of processing crash reports into crash/posted (see
> the global auth_names below). I think what is happening is that when the
> attempt to process a crash report fails because of directory ownership, it
> then tries to use the client.admin key to see whether the cluster is even
> up. That key can't be found, and so you get the errors. (My best guess.)
>
> Note the global auth_names.
>
> # v18.2.4:src/ceph-crash.in
> 19 auth_names = ['client.crash.%s' % socket.gethostname(),
> 20               'client.crash',
> 21               'client.admin']
> 22
> 23
> 24 def parse_args():
> 25     parser = argparse.ArgumentParser()
> 26     parser.add_argument(
> 27         '-p', '--path', default='/var/lib/ceph/crash',
> 28         help='base path to monitor for crash dumps')
> 29     parser.add_argument(
> 30         '-d', '--delay', default=10.0, type=float,
> 31         help='minutes to delay between scans (0 to exit after one)',
> 32     )
> 33     parser.add_argument(
> 34         '--name', '-n',
> 35         help='ceph name to authenticate as '
> 36              '(default: try client.crash, client.admin)')
> 37     parser.add_argument(
> 38         '--log-level', '-l',
> 39         help='log level output (default: INFO), support INFO or DEBUG')
> 40
> 41     return parser.parse_args()
>
> I came across this when we had a soft crash of a Ceph node, which
> resulted in an incomplete crash directory being created.
>
> Once we removed that incomplete directory,
> /var/lib/ceph/<fsid>/crash/<incomplete crash directory here>,
> the errors went away. In our situation there were other, complete crash
> directories which needed to be processed, but when the service hit the
> incomplete directory it just looped on the error you're seeing. Once we
> deleted the incomplete crash directory, the complete directories were
> processed. Not sure if the same problem exists for you.
>
> It looks like there was discussion about it in the past [1], and a
> tracker report was filed [2], but nothing follows.
>
> You can also use top to see if there is a zombie process running as
> USER ceph with a PPID pointing at ceph-crash. If so, I think you might be
> in the situation below.
>
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/5JYWVOK3NFGXUOBNJFL6EED7YW32DXLY/
> [2] https://tracker.ceph.com/issues/64102
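If it helps anyone check for the incomplete crash directories Robert describes, here is a rough sketch of my own; I'm assuming a finished dump contains a 'meta' file (the file ceph-crash posts), and the host-side path is /var/lib/ceph/<fsid>/crash rather than the in-container default:

    # Sketch only: flag crash directories that have no 'meta' file and are
    # therefore likely the half-written ones that make ceph-crash loop.
    import os
    import sys

    # default is the in-container path; pass the host path as an argument
    crash_root = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/ceph/crash'

    for entry in sorted(os.listdir(crash_root)):
        path = os.path.join(crash_root, entry)
        if entry == 'posted' or not os.path.isdir(path):
            continue
        if not os.path.isfile(os.path.join(path, 'meta')):
            print('possibly incomplete crash dir:', path)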
> There is a related problem (I believe fixed in v18.2.5) concerning
> directory ownership for the crash directories. For whatever reason,
> whenever a non-ceph daemon is deployed (e.g., alertmanager, prometheus,
> et al.), cephadm is hard-coded to first set the crash directory
> ownership according to the daemon being deployed. So, if the
> alertmanager daemon runs as nobody:nobody, then when cephadm deploys
> alertmanager it first goes in and changes the crash directory ownership
> to nobody:nobody. It does this every time a non-ceph daemon is deployed
> (e.g., all the monitoring-type daemons).
>
> This can be very frustrating, but the simple fix is to manually set
> those directories back to ceph:ceph and remove any incomplete crash
> reports. If you don't, the crash service is effectively broken after
> deploying any non-ceph daemon. To test, node-exporter is a good one,
> because it touches every host. Or you could try just grafana or
> prometheus, to limit the work in fixing it again.
>
> # ceph-18.2.4/src/cephadm/cephadm.py
> 2775 def make_data_dir_base(fsid, data_dir, uid, gid):
> 2776     # type: (str, str, int, int) -> str
> 2777     data_dir_base = os.path.join(data_dir, fsid)
> 2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
> 2779     makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
> 2780     makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
> 2781              DATA_DIR_MODE)
> 2782     return data_dir_base
>
> We confirmed this theory by hard-coding the uid and gid to 167, 167
> (ceph:ceph) in a modified cephadm and then redeploying node-exporter.
> The crash directories didn't get changed.
>
> The following was a slightly more elegant solution, which also proved our
> theory, but then for whatever reason our ganesha daemons wouldn't start.
>
> 2775 def make_data_dir_base(fsid, data_dir, daemon_type, uid, gid):
> 2776     # type: (str, str, str, int, int) -> str
> 2777     data_dir_base = os.path.join(data_dir, fsid)
> 2778     makedirs(data_dir_base, uid, gid, DATA_DIR_MODE)
> 2779     if daemon_type not in Monitoring.components.keys():
> 2780         makedirs(os.path.join(data_dir_base, 'crash'), uid, gid, DATA_DIR_MODE)
> 2781         makedirs(os.path.join(data_dir_base, 'crash', 'posted'), uid, gid,
> 2782                  DATA_DIR_MODE)
> 2783     return data_dir_base
> 2784
> 2785
> 2786 def make_data_dir(ctx, fsid, daemon_type, daemon_id, uid=None, gid=None):
> 2787     # type: (CephadmContext, str, str, Union[int, str], Optional[int], Optional[int]) -> str
> 2788     if uid is None or gid is None:
> 2789         uid, gid = extract_uid_gid(ctx)
> 2790     make_data_dir_base(fsid, ctx.data_dir, daemon_type, uid, gid)
> 2791     data_dir = get_data_dir(fsid, ctx.data_dir, daemon_type, daemon_id)
> 2792     makedirs(data_dir, uid, gid, DATA_DIR_MODE)
> 2793     return data_dir
>
> We gave up at that point, and noticed there might be a fix in v18.2.5,
> released a few days ago:
>
> https://github.com/ceph/ceph/pull/58458
>
> Oddly, it is listed as a "debian" fix; I hope it is a general fix for the
> crash directory ownership issue.
>
> -Robert
>
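For the ownership problem Robert describes, putting things back is just a recursive chown of the crash tree. A minimal sketch of my own, assuming the usual ceph uid/gid of 167:167 inside the containers and the standard cephadm layout; the fsid below is only the one from Daniel's logs, substitute your own:

    # Sketch only: restore crash directory ownership to ceph:ceph (167:167)
    # after a monitoring-daemon redeploy has flipped it. Run as root.
    import os

    fsid = '7644057a-00f6-11f0-9a0c-eac00fed9338'   # example fsid from this thread
    crash_dir = os.path.join('/var/lib/ceph', fsid, 'crash')
    uid = gid = 167

    for root, dirs, files in os.walk(crash_dir):
        os.chown(root, uid, gid)                    # the crash dir itself, then each subdir
        for name in files:
            os.chown(os.path.join(root, name), uid, gid)
    print('re-owned %s to %d:%d' % (crash_dir, uid, gid))

A plain `chown -R 167:167 /var/lib/ceph/<fsid>/crash` does the same thing; the point is simply that ownership has to be ceph:ceph again before the crash service can process anything.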
> On 4/13/25 12:16, Daniel Vogelbacher wrote:
> > Hello,
> >
> > recently I checked my server logs for error events and got some hits for
> > the ceph crash service, deployed with cephadm.
> >
> > When restarted, the crash service logs this to journald:
> >
> > auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > monclient: authenticate NOTE: no keyring found; disabled cephx authentication
> > [errno 13] RADOS permission denied (error connecting to the cluster)
> >
> > The image is mounted with:
> >
> > -v /var/lib/ceph/7644057a-00f6-11f0-9a0c-eac00fed9338/crash.virt-master3/keyring:/etc/ceph/ceph.client.crash.virt-master3.keyring
> >
> > so I assume there should be a key available, but the crash daemon searches
> > for an admin keyring instead of "client.crash.virt-master3".
> >
> > Any advice on how to fix this error? I see these error logs on all machines
> > where the crash service is deployed. I've tried a redeployment without any
> > effect.
> >
> > # ceph -v
> > ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid (stable)
> >
> > # systemctl restart ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service
> >
> > # journalctl -u ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service -ef
> >
> > Output:
> >
> > Apr 12 18:36:24 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1402272]: *** Interrupted with signal 15 ***
> > Apr 12 18:36:24 virt-master3 podman[1408122]: 2025-04-12 18:36:24.992684396 +0200 CEST m=+0.043782773 container died 6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.name=CentOS Stream 9 Base Image, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, CEPH_REF=squid, org.label-schema.schema-version=1.0, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, io.buildah.version=1.33.7, org.opencontainers.image.documentation=https://docs.ceph.com/, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, FROM_IMAGE=quay.io/centos/centos:stream9, org.label-schema.build-date=20250124, OSD_FLAVOR=default)
> > Apr 12 18:36:25 virt-master3 podman[1408122]: 2025-04-12 18:36:25.016797163 +0200 CEST m=+0.067895534 container remove 6c538b6216744326c00d9b3989cfe8d498629fa2f008a62fe184f367b01cbf6b (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.opencontainers.image.documentation=https://docs.ceph.com/, org.label-schema.build-date=20250124, FROM_IMAGE=quay.io/centos/centos:stream9, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, io.buildah.version=1.33.7, org.label-schema.name=CentOS Stream 9 Base Image, org.label-schema.vendor=CentOS, org.label-schema.schema-version=1.0, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default, org.label-schema.license=GPLv2, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>)
> > Apr 12 18:36:25 virt-master3 bash[1408104]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3
> > Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service: Deactivated successfully.
> > Apr 12 18:36:25 virt-master3 systemd[1]: Stopped ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
> > Apr 12 18:36:25 virt-master3 systemd[1]: ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service: Consumed 1.540s CPU time.
> > Apr 12 18:36:25 virt-master3 systemd[1]: Starting ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338...
> > Apr 12 18:36:25 virt-master3 podman[1408259]:
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.498991532 +0200 CEST m=+0.071935136 container create bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.build-date=20250124, org.label-schema.name=CentOS Stream 9 Base Image, OSD_FLAVOR=default, org.label-schema.schema-version=1.0, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, io.buildah.version=1.33.7, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, org.label-schema.vendor=CentOS, org.label-schema.license=GPLv2, FROM_IMAGE=quay.io/centos/centos:stream9, org.opencontainers.image.documentation=https://docs.ceph.com/, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>)
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.462816456 +0200 CEST m=+0.035760071 image pull quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.589479015 +0200 CEST m=+0.162422619 container init bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.license=GPLv2, org.label-schema.schema-version=1.0, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, org.label-schema.name=CentOS Stream 9 Base Image, OSD_FLAVOR=default, org.label-schema.build-date=20250124, org.label-schema.vendor=CentOS, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, FROM_IMAGE=quay.io/centos/centos:stream9, org.opencontainers.image.documentation=https://docs.ceph.com/, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, CEPH_REF=squid, io.buildah.version=1.33.7)
> > Apr 12 18:36:25 virt-master3 podman[1408259]: 2025-04-12 18:36:25.595018205 +0200 CEST m=+0.167961840 container start bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63 (image=quay.io/ceph/ceph@sha256:41d3f5e46ff7de28544cc8869fdea13fca824dcef83936cb3288ed9de935e4de, name=ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3, org.label-schema.license=GPLv2, FROM_IMAGE=quay.io/centos/centos:stream9, org.label-schema.build-date=20250124, io.buildah.version=1.33.7, CEPH_GIT_REPO=https://github.com/ceph/ceph.git, org.label-schema.vendor=CentOS, CEPH_SHA1=58a7fab8be0a062d730ad7da874972fd3fba59fb, OSD_FLAVOR=default, org.opencontainers.image.documentation=https://docs.ceph.com/, CEPH_REF=squid, org.opencontainers.image.authors=Ceph Release Team <ceph-maintain...@ceph.io>, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, org.label-schema.name=CentOS Stream 9 Base Image, org.label-schema.schema-version=1.0)
> > Apr 12 18:36:25 virt-master3 bash[1408259]: bb21751667c7c4dac8f5624fb2e84f79facd5e809ff08c5168216cd591470b63
> > Apr 12 18:36:25 virt-master3 systemd[1]: Started ceph-7644057a-00f6-11f0-9a0c-eac00fed9...@crash.virt-master3.service - Ceph crash.virt-master3 for 7644057a-00f6-11f0-9a0c-eac00fed9338.
> > Apr 12 18:36:25 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: INFO:ceph-crash:pinging cluster to exercise our key
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 AuthRegistry(0x7f21a0068da0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 AuthRegistry(0x7f21a861bff0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a5b91640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a6392640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: 2025-04-12T16:36:26.131+0000 7f21a861d640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: [errno 13] RADOS permission denied (error connecting to the cluster)
> > Apr 12 18:36:26 virt-master3 ceph-7644057a-00f6-11f0-9a0c-eac00fed9338-crash-virt-master3[1408301]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io