[ceph-users] OSD apply failing, how to stop
Hi,

I have tried to create OSDs with this config:

  service_type: osd
  service_id: osd_nnn1
  placement:
    hosts:
      - nakidra
  data_devices:
    paths:
      - /dev/sdc
      - /dev/sdd
  db_devices:
    paths:
      - ceph-nvme-04/block
  wal_devices:
    paths:
      - ceph-nvme-14/block

applied with the command:

  ceph orch apply osd -i osd1.yml

but unfortunately the system is now stuck in a retry cycle:

8/30/21 2:27:00 AM [ERR] Failed to apply osd.osd_nakidra1 spec DriveGroupSpec(name=osd_nakidra1->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='nnn', network='', name='')]), service_id='osd_nakidra1', service_type='osd', data_devices=DeviceSelection(paths=[, ], all=False), db_devices=DeviceSelection(paths=[], all=False), wal_devices=DeviceSelection(paths=[], all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): cephadm exited with an error code: 1, stderr:
Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb -e NODE_NAME=nnn -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_OSDSPEC_AFFINITY=osd_nnn1 -v /var/run/ceph/03d0b03e-085b-11ec-8e4b-814a39073967:/var/run/ceph:z -v /var/log/ceph/03d0b03e-085b-11ec-8e4b-814a39073967:/var/log/ceph:z -v /var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /tmp/ceph-tmp26b6lukq:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp9unbqyia:/var/lib/ceph/bootstrap-osd/ceph.keyring:z docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb lvm batch --no-auto /dev/sdc /dev/sdd --db-devices ceph-nvme-04/block --wal-devices ceph-nvme-14/block --yes --no-systemd
/usr/bin/docker: stderr --> passed data devices: 2 physical, 0 LVM
/usr/bin/docker: stderr --> relative data size: 1.0
/usr/bin/docker: stderr --> passed block_db devices: 0 physical, 1 LVM
/usr/bin/docker: stderr --> ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8230, in <module>
    main()
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8218, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1653, in _infer_fsid
    return func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1737, in _infer_image
    return func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 4599, in command_ceph_volume
    out, err, code = call_throws(ctx, c.run_cmd())
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1453, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
Runt

The error I see is:

  stderr --> ZeroDivisionError: integer division or modulo by zero

What could be wrong? According to the docs you can pass an LVM volume as the db and wal device. And how can I stop this retry cycle, i.e. cancel the apply command?
What is the correct way to set up an OSD with a rotational disk as the data device and NVMe as the db and wal device?
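A hedged sketch covering both halves of the question, assuming a cephadm/Pacific cluster like the one above (the service_id osd_hdd_nvme and the rotational-filter spec are illustrative, not taken from the thread). The retry cycle happens because the orchestrator keeps re-applying the stored service spec, so stopping it means removing the spec or marking it unmanaged. The "lvm batch" path in ceph-volume sizes DB/WAL slots by dividing across *physical* db/wal devices, so handing it only pre-created LVs (0 physical devices) is what appears to trigger the division by zero; pointing db_devices/wal_devices at whole NVMe devices, or preparing the OSDs one at a time with the pre-created LVs, may avoid it:

  # Stop the retry cycle: remove (or un-manage) the stored OSD service spec.
  # Removing the spec does not touch OSD daemons that already exist.
  ceph orch ls osd                  # find the exact service name (the log above shows osd.osd_nakidra1)
  ceph orch rm osd.osd_nnn1         # or add "unmanaged: true" to the YAML and re-apply it

  # Option 1: let ceph-volume carve the DB/WAL LVs itself from whole NVMe devices
  service_type: osd
  service_id: osd_hdd_nvme          # illustrative name
  placement:
    hosts:
      - nakidra
  data_devices:
    rotational: 1                   # the spinning disks (/dev/sdc, /dev/sdd)
  db_devices:
    rotational: 0                   # whole NVMe device(s), not pre-made LVs
  # preview what would be created before committing:
  #   ceph orch apply osd -i osd1.yml --dry-run

  # Option 2: keep the pre-created LVs and create each OSD explicitly with the
  # non-batch ceph-volume call, which accepts an existing vg/lv for db and wal
  # (run from the cephadm shell; bringing the result under cephadm management is a separate step)
  ceph-volume lvm prepare --data /dev/sdc --block.db ceph-nvme-04/block --block.wal ceph-nvme-14/block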
[ceph-users] Re: tcmu-runner crashing on 16.2.5
Thanks Xiubo,

I actually had the same idea on Friday: I reduced the number of iSCSI gateways to 1, and the problem appears to have disappeared for now. I'm guessing there is still some chance it could happen, but it should be much rarer. I did notice the blacklist was growing very large (over 14,000 entries), and I found 1503692, which appears to explain why those entries grow so high, but as you said, that doesn't appear to be a problem in and of itself.

The initiators accessing the iSCSI volumes are all VMware ESXi hosts. Do you think it's expected to see so much path switching in this kind of environment, or do I perhaps need to look at some parameters on the ESXi side to make it not switch so often? Now we don't have redundancy, but at least things are stable while we wait for a fix. Any chance this fix will make it into the 16.2.6 release?

-Paul

On Aug 29, 2021, at 8:48 PM, Xiubo Li <xiu...@redhat.com> wrote:

On 8/27/21 11:10 PM, Paul Giralt (pgiralt) wrote:

Ok - thanks Xiubo. I'm not sure I feel comfortable doing that without breaking something else, so I will wait for a new release that incorporates the fix. In the meantime I'm trying to figure out what might be triggering the issue, since this has been running fine for months and only recently started happening. Now it happens fairly regularly. I noticed the following in the tcmu logs:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 rbd/iscsi-pool-0002.iscsi-p0002-img-02: Async lock drop. Old state 5
2021-08-27 15:06:41.147 8:cmdproc-uio9 [INFO] alua_implicit_transition:592 rbd/iscsi-pool-0002.iscsi-p0002-img-02: Starting write lock acquisition operation.
2021-08-27 15:06:42.132 8:ework-thread [ERROR] tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not update service status. (Err -107)
2021-08-27 15:06:42.132 8:ework-thread [ERROR] __tcmu_report_event:173 rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not report events. Error -107.
2021-08-27 15:06:42.216 8:ework-thread [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:340 rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: {10.122.242.197:0/2251669337}
2021-08-27 15:06:42.217 8:ework-thread [ERROR] tcmu_rbd_rm_stale_entry_from_blacklist:322 rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry '�(~'. (Err -13)
2021-08-27 15:06:42.217 8:ework-thread [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:340 rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: {10.122.242.197:0/3276725458}
2021-08-27 15:06:42.218 8:ework-thread [ERROR] tcmu_rbd_rm_stale_entry_from_blacklist:322 rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry ''. (Err -13)
2021-08-27 15:06:42.443 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 rbd/iscsi-pool-0005.iscsi-p0005-img-01: Async lock drop. Old state 5
2021-08-27 15:06:42.459 8:cmdproc-uio0 [INFO] alua_implicit_transition:592 rbd/iscsi-pool-0005.iscsi-p0005-img-01: Starting write lock acquisition operation.
2021-08-27 15:06:42.488 8:ework-thread [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:340 rbd/iscsi-pool-0005.iscsi-p0005-img-01: removing addrs: {10.122.242.197:0/2189482708}
2021-08-27 15:06:42.489 8:ework-thread [ERROR] tcmu_rbd_rm_stale_entry_from_blacklist:322 rbd/iscsi-pool-0005.iscsi-p0005-img-01: Could not rm blacklist entry '`"�'. (Err -13)

tcmu_rbd_service_status_update is showing up in there, which is the code affected by this bug. Any idea what error -107 means? Maybe if I fix whatever is causing some of these errors, it might work around the problem. Also, if you have thoughts on the other blacklist entry errors and what might be causing them, that would be greatly appreciated as well.

There is one way to improve this, which is to make HA=1, but it won't avoid it 100%. I found your case is triggered when the active path is switched between different gateways, which breaks and re-acquires the exclusive lock frequently. The Error -107 means the image has been closed by tcmu-runner, but another thread is trying to use the freed connection to report the status. The blocklist error should be okay; it won't affect anything, it's just a warning.

- Xiubo

-Paul

On Aug 26, 2021, at 8:37 PM, Xiubo Li <xiu...@redhat.com> wrote:

On 8/27/21 12:06 AM, Paul Giralt (pgiralt) wrote:

This is great. Is there a way to test the fix in my environment?

It seems you could restart the tcmu-runner service from the container. Since this change is not only in handler_rbd.so but also in libtcmu.so and the tcmu-runner binary, the whole tcmu-runner needs to
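For reference on the raw numbers in those logs (my reading, not stated in the thread): -107 is errno ENOTCONN ("Transport endpoint is not connected"), which matches the closed-image explanation above, and -13 is EACCES. A minimal sketch for keeping an eye on the blocklist growth mentioned above, assuming a Pacific (16.x) cluster; the address shown is simply copied from the log as an example:

  # Pacific renamed "blacklist" to "blocklist"; the older spelling remains an alias on many releases
  ceph osd blocklist ls | wc -l     # how many client entries are currently blocklisted
  ceph osd blocklist ls | head      # sample entries with their expiry times

  # A single stale entry can be removed by addr:nonce if ever needed, e.g.:
  ceph osd blocklist rm 10.122.242.197:0/2251669337

Entries expire on their own, so clearing them manually is rarely necessary; counting them over time is mainly a way to confirm whether the path switching (and the lock thrashing behind it) really stopped after going to a single gateway.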
[ceph-users] mon startup problem on upgrade octopus to pacific
Hi,

I'm stuck mid-upgrade from octopus to pacific using cephadm, at the point of upgrading the mons. I have 3 mons still on octopus and in quorum. When I try to bring up a new pacific mon it stays permanently in "probing" state.

The pacific mon is running off:
docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb

The lead octopus mon is running off:
quay.io/ceph/ceph:v15

The other 2 octopus mons are 15.2.14-1~bpo10+1. These are manually started because the cephadm upgrade failed at the point of upgrading the mons and left me with only one cephadm mon running.

I've confirmed all mons (current and new) can contact each other on ports 3300 and 6789, and max-MTU packets (9000) get through in all directions. On the box where I'm trying to start the pacific mon, if I start up an octopus mon it happily joins the mon set.

With debug_mon=20 on the pacific mon I see *constant* repeated mon_probe reply processing. The first mon_probe reply produces:

  e0 got newer/committed monmap epoch 35, mine was 0

Subsequent mon_probe replies produce:

  e35 got newer/committed monmap epoch 35, mine was 35

...but this just keeps repeating and it never gets any further - see below.

Where to from here?

Cheers,

Chris

-- debug_mon=20 from pacific mon --

Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e0 handle_probe mon_probe(reply c6618970-0ce0-4cb2-bc9a-dd5f29b62e24 name b4 quorum 0,1,2 leader 0 paxos( fc 364908695 lc 364909318 ) mon_release octopus) v7
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e0 handle_probe_reply mon.2 v2:10.200.63.132:3300/0 mon_probe(reply c6618970-0ce0-4cb2-bc9a-dd5f29b62e24 name b4 quorum 0,1,2 leader 0 paxos( fc 364908695 lc 364909318 ) mon_release octopus) v7
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e0 monmap is e0: 3 mons at {noname-a=[v2:10.200.63.130:3300/0,v1:10.200.63.130:6789/0],noname-b=[v2:10.200.63.132:3300/0,v1:10.200.63.132:6789/0],noname-c=[v2:192.168.254.251:3300/0,v1:192.168.254.251:6789/0]}
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e0 got newer/committed monmap epoch 35, mine was 0
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 bootstrap
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 sync_reset_requester
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 unregister_cluster_logger - not registered
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout 0x5564a433c900
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 monmap e35: 3 mons at {b2=[v2:10.200.63.130:3300/0,v1:10.200.63.130:6789/0],b4=[v2:10.200.63.132:3300/0,v1:10.200.63.132:6789/0],k2=[v2:192.168.254.251:3300/0,v1:192.168.254.251:6789/0]}
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 _reset
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing).auth v0 _set_mon_num_rank num 0 rank 0
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout (none scheduled)
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 timecheck_finish
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 15 mon.b5@-1(probing) e35 health_tick_stop
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 15 mon.b5@-1(probing) e35 health_interval_stop
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 scrub_event_cancel
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 scrub_reset
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout (none scheduled)
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 reset_probe_timeout 0x5564a433c900 after 2 seconds
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 10 mon.b5@-1(probing) e35 probing other monitors
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 7f74f223a700 20 mon.b5@-1(probing) e35 _ms_dispatch existing session
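A hedged troubleshooting sketch (standard Ceph tooling, not steps reported in the thread): since the probing mon keeps learning monmap e35 but never gets any further, a first step is to compare what the quorum is actually publishing with what the new mon (mon.b5 in the log) has been given:

  # From a node with a working admin keyring:
  ceph mon dump                      # should show epoch 35 with mons b2, b4 and k2
  ceph mon getmap -o /tmp/monmap     # grab the binary monmap the quorum hands out
  monmaptool --print /tmp/monmap     # offline view of the same map

  ceph config get mon public_network # check which network(s) mons are expected to bind to

One observation from the log itself: monmap e35 contains b2, b4 and k2 but not b5, and k2 sits on a different network (192.168.254.251 vs 10.200.63.x), so it is worth confirming that the public_network setting, if set, actually covers the address the new pacific mon will advertise.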