[ceph-users] OSD apply failing, how to stop

2021-08-29 Thread Vardas Pavardė arba Įmonė
Hi,

I have tried to create OSDs with this config:
service_type: osd
service_id: osd_nnn1
placement:
  hosts:
    - nakidra
data_devices:
  paths:
    - /dev/sdc
    - /dev/sdd
db_devices:
  paths:
    - ceph-nvme-04/block
wal_devices:
  paths:
    - ceph-nvme-14/block

with command:
ceph orch apply osd -i osd1.yml
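
As an aside, a minimal sketch of previewing what the orchestrator would do with
such a spec before applying it (assuming the file name osd1.yml used above):

  # show what OSDs cephadm would create from the spec, without creating them
  ceph orch apply -i osd1.yml --dry-run

  # show the OSD specs the orchestrator currently has stored
  ceph orch ls osd --export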

but unfortunately the system is stuck in a retry cycle:

8/30/21 2:27:00 AM
[ERR]
Failed to apply osd.osd_nakidra1 spec
DriveGroupSpec(name=osd_nakidra1->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='nnn',
network='', name='')]), service_id='osd_nakidra1', service_type='osd',
data_devices=DeviceSelection(paths=[, ], all=False),
db_devices=DeviceSelection(paths=[], all=False),
wal_devices=DeviceSelection(paths=[], all=False), osd_id_claims={}, unmanaged=False,
filter_logic='AND', preview_only=False): cephadm exited with an error code:
1, stderr:Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
--privileged --group-add=disk --init -e CONTAINER_IMAGE=
docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb
-e NODE_NAME=nnn -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_OSDSPEC_AFFINITY=osd_nnn1 -v
/var/run/ceph/03d0b03e-085b-11ec-8e4b-814a39073967:/var/run/ceph:z -v
/var/log/ceph/03d0b03e-085b-11ec-8e4b-814a39073967:/var/log/ceph:z -v
/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /tmp/ceph-tmp26b6lukq:/etc/ceph/ceph.conf:z
-v /tmp/ceph-tmp9unbqyia:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb
lvm batch --no-auto /dev/sdc /dev/sdd --db-devices ceph-nvme-04/block
--wal-devices ceph-nvme-14/block --yes --no-systemd
/usr/bin/docker: stderr --> passed data devices: 2 physical, 0 LVM
/usr/bin/docker: stderr --> relative data size: 1.0
/usr/bin/docker: stderr --> passed block_db devices: 0 physical, 1 LVM
/usr/bin/docker: stderr --> ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8230, in 
    main()
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8218, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1653, in _infer_fsid
    return func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1737, in _infer_image
    return func(ctx)
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 4599, in command_ceph_volume
    out, err, code = call_throws(ctx, c.run_cmd())
  File "/var/lib/ceph/03d0b03e-085b-11ec-8e4b-814a39073967/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1453, in call_throws
    raise RuntimeError('Failed command: %s' % ' '.join(command))
Runt

The error I see:
stderr --> ZeroDivisionError: integer division or modulo by zero

What could be wrong? According to the docs, you can pass an LVM volume as the
db and wal device.

How can I stop this cycle, e.g. cancel the apply command?
What is the correct way to set up an OSD with a rotational disk as the data
device and NVMe as the db and wal device?
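
A hedged sketch of two ways that should stop the retry loop; the service name
below is taken from the spec above (check ceph orch ls osd for the exact name),
and neither removes OSDs that already exist:

  # remove the stored OSD spec so cephadm stops trying to apply it
  ceph orch ls osd
  ceph orch rm osd.osd_nnn1

  # or keep the spec but pause it: add "unmanaged: true" at the top level
  # of osd1.yml and re-apply it
  ceph orch apply -i osd1.yml

For the second question, one thing that may be worth trying (unverified here)
is pointing db_devices/wal_devices at the whole NVMe block device and letting
ceph-volume create the LVs itself; the ZeroDivisionError above appears right
after "passed block_db devices: 0 physical, 1 LVM", so the pre-created LVM
paths seem to be involved.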



[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-08-29 Thread Paul Giralt (pgiralt)
Thanks Xiubo,

I actually had the same idea on Friday and reduced the number of iSCSI
gateways to 1, and the problem appears to have disappeared for now. I'm guessing
there is still some chance it could happen, but it should be much rarer.

I did notice the blacklist was growing very large (over 14,000 entries) and I
found 1503692, which appears to explain why those entries grow so high, but
like you said, that doesn't appear to be a problem in and of itself.
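
(As an aside, a hedged one-liner for keeping an eye on the size of that list on
16.2.5, where it is called the blocklist; this roughly counts the entries:

  ceph osd blocklist ls 2>/dev/null | wc -l
)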

The initiators accessing the iSCSI volumes are all VMware ESXi hosts. Do you
think it's expected to see so much path switching in this kind of environment,
or perhaps I need to look at some parameters on the ESXi side to make it not
switch so often?

Now we don’t have redundancy, but at least things are stable while we wait for 
a fix. Any chance this fix will make it into the 16.2.6 release?

-Paul


On Aug 29, 2021, at 8:48 PM, Xiubo Li <xiu...@redhat.com> wrote:



On 8/27/21 11:10 PM, Paul Giralt (pgiralt) wrote:
Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking 
something else, so will wait for a new release that incorporates the fix. In 
the meantime I’m trying to figure out what might be triggering the issue, since 
this has been running fine for months and just recently started happening. Now 
it happens fairly regularly.

I noticed that in the tcmu logs, I see the following:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: 
Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Async lock drop. Old state 5
2021-08-27 15:06:41.147 8:cmdproc-uio9 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.132 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0002.iscsi-p0002-img-02: 
Could not update service status. (Err -107)
2021-08-27 15:06:42.132 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not report events. Error -107.
2021-08-27 15:06:42.216 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/2251669337}
2021-08-27 15:06:42.217 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry '�(~'. 
(Err -13)
2021-08-27 15:06:42.217 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/3276725458}
2021-08-27 15:06:42.218 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry ''. (Err 
-13)
2021-08-27 15:06:42.443 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Async lock drop. Old state 5
2021-08-27 15:06:42.459 8:cmdproc-uio0 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.488 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: removing addrs: 
{10.122.242.197:0/2189482708}
2021-08-27 15:06:42.489 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Could not rm blacklist entry '`"�'. 
(Err -13)

The tcmu_rbd_service_status_update is showing up in there which is the code 
that is affected by this bug. Any idea what the error -107 means? Maybe if I 
fix what is causing some of these errors, it might work around the problem. 
Also if you have thoughts on the other blacklist entry errors and what might be 
causing them, that would be greatly appreciated as well.
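
(For reference: negative return codes like -107 in these logs are Linux errno
values, and 107 is ENOTCONN, "Transport endpoint is not connected". A quick way
to decode one, assuming python3 is available on the host:

  python3 -c "import errno, os; print(errno.errorcode[107], '-', os.strerror(107))"
  # prints: ENOTCONN - Transport endpoint is not connected
)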


There is one way to improve this, which is to make HA=1, but it won't avoid it
100%. I found your case is triggered when active path switching is happening
between different gateways, which causes the exclusive lock to be broken and
re-acquired frequently. The Error -107 means the image has been closed by
tcmu-runner but another thread is trying to use the freed connection to report
the status. The blocklist error should be okay; it won't affect anything, it's
just a warning.


- Xiubo

-Paul


On Aug 26, 2021, at 8:37 PM, Xiubo Li <xiu...@redhat.com> wrote:

On 8/27/21 12:06 AM, Paul Giralt (pgiralt) wrote:
This is great. Is there a way to test the fix in my environment?


It seems you could restart the tcmu-runner service from the container.

Since this change is not only in handler_rbd.so but also in libtcmu.so and the
tcmu-runner binary, the whole tcmu-runner needs to
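
(A hedged sketch of restarting a cephadm-deployed iSCSI gateway container,
which is where tcmu-runner runs; the daemon name is a placeholder to be copied
from the first command's output:

  # list the iscsi gateway daemons managed by cephadm
  ceph orch ps --daemon-type iscsi

  # restart one of them, replacing the placeholder with a real daemon name
  ceph orch daemon restart iscsi.<pool>.<host>.<suffix>
)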

[ceph-users] mon startup problem on upgrade octopus to pacific

2021-08-29 Thread Chris Dunlop

Hi,

I'm stuck mid-upgrade from octopus to pacific using cephadm, at the point
of upgrading the mons.


I have 3 mons still on octopus and in quorum. When I try to bring up a 
new pacific mon it stays permanently in "probing" state.


The pacific mon is running off:

docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb

The lead octopus mon is running off:

quay.io/ceph/ceph:v15

The other 2 octopus mons are 15.2.14-1~bpo10+1. These are manually started 
due to the cephadm upgrade failing at the point of upgrading the mons and 
leaving me with only one cephadm mon running.


I've confirmed all mons (current and new) can contact each other on 
ports 3300 and 6789, and max mtu packets (9000) get through in all 
directions.


On the box where I'm trying to start the pacific mon, if I start up an 
octopus mon it happily joins the mon set.
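
(A hedged sketch of comparing what the quorum believes with what the probing
mon reports; mon.b5 is the new pacific mon's name as it appears in the log
below, adjust as needed:

  # from a host with a working octopus mon: the cluster's current monmap
  ceph mon dump

  # on the host running the new pacific mon: ask the probing daemon for its
  # own view via the admin socket inside its cephadm container
  cephadm enter --name mon.b5
  ceph daemon mon.b5 mon_status
)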


With debug_mon=20 on the pacific mon I see *constant* repeated mon_probe 
reply processing. The first mon_probe reply produces:


e0  got newer/committed monmap epoch 35, mine was 0

Subsequent mon_probe replies produce:

e35 got newer/committed monmap epoch 35, mine was 35

...but this just keeps repeating and it never gets any further - see 
below.


Where to from here?

Cheers,

Chris

--
debug_mon=20 from pacific mon
--
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e0 handle_probe mon_probe(reply 
c6618970-0ce0-4cb2-bc9a-dd5f29b62e24 name b4 quorum 0,1,2 leader 0 paxos( fc 
364908695 lc 364909318 ) mon_release octopus) v7
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e0 handle_probe_reply mon.2 
v2:10.200.63.132:3300/0 mon_probe(reply c6618970-0ce0-4cb2-bc9a-dd5f29b62e24 
name b4 quorum 0,1,2 leader 0 paxos( fc 364908695 lc 364909318 ) mon_release 
octopus) v7
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e0  monmap is e0: 3 mons at 
{noname-a=[v2:10.200.63.130:3300/0,v1:10.200.63.130:6789/0],noname-b=[v2:10.200.63.132:3300/0,v1:10.200.63.132:6789/0],noname-c=[v2:192.168.254.251:3300/0,v1:192.168.254.251:6789/0]}
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e0  got newer/committed monmap epoch 35, 
mine was 0
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 bootstrap
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 sync_reset_requester
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 unregister_cluster_logger - not 
registered
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout 0x5564a433c900
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 monmap e35: 3 mons at 
{b2=[v2:10.200.63.130:3300/0,v1:10.200.63.130:6789/0],b4=[v2:10.200.63.132:3300/0,v1:10.200.63.132:6789/0],k2=[v2:192.168.254.251:3300/0,v1:192.168.254.251:6789/0]}
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 _reset
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing).auth v0 _set_mon_num_rank num 0 rank 0
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout (none scheduled)
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 timecheck_finish
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 15 mon.b5@-1(probing) e35 health_tick_stop
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 15 mon.b5@-1(probing) e35 health_interval_stop
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 scrub_event_cancel
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 scrub_reset
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 cancel_probe_timeout (none scheduled)
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 reset_probe_timeout 0x5564a433c900 after 
2 seconds
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 10 mon.b5@-1(probing) e35 probing other monitors
Aug 29 08:25:34 b5 conmon[2648666]: debug 2021-08-28T22:25:34.792+ 
7f74f223a700 20 mon.b5@-1(probing) e35 _ms_dispatch existing session