[ceph-users] Re: tcmu-runner crashing on 16.2.5

Paul Giralt (pgiralt) Fri, 27 Aug 2021 08:13:43 -0700

Ok - thanks Xiubo. Not sure I feel comfortable doing that without breaking 
something else, so will wait for a new release that incorporates the fix. In 
the meantime I’m trying to figure out what might be triggering the issue, since 
this has been running fine for months and just recently started happening. Now 
it happens fairly regularly.


I noticed that in the tcmu logs, I see the following:

2021-08-27 15:06:40.158 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0001.iscsi-p0001-img-01: 
Could not update service status. (Err -107)
2021-08-27 15:06:40.158 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not report events. Error -107.
2021-08-27 15:06:41.131 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Async lock drop. Old state 5
2021-08-27 15:06:41.147 8:cmdproc-uio9 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.132 8:ework-thread [ERROR] 
tcmu_rbd_service_status_update:140 rbd/iscsi-pool-0002.iscsi-p0002-img-02: 
Could not update service status. (Err -107)
2021-08-27 15:06:42.132 8:ework-thread [ERROR] __tcmu_report_event:173 
rbd/iscsi-pool-0002.iscsi-p0002-img-02: Could not report events. Error -107.
2021-08-27 15:06:42.216 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/2251669337}
2021-08-27 15:06:42.217 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry '�(~'. 
(Err -13)
2021-08-27 15:06:42.217 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: removing addrs: 
{10.122.242.197:0/3276725458}
2021-08-27 15:06:42.218 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0001.iscsi-p0001-img-01: Could not rm blacklist entry ''. (Err 
-13)
2021-08-27 15:06:42.443 8:io_context_pool [WARN] tcmu_notify_lock_lost:271 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Async lock drop. Old state 5
2021-08-27 15:06:42.459 8:cmdproc-uio0 [INFO] alua_implicit_transition:592 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Starting write lock acquisition 
operation.
2021-08-27 15:06:42.488 8:ework-thread [INFO] 
tcmu_rbd_rm_stale_entries_from_blacklist:340 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: removing addrs: 
{10.122.242.197:0/2189482708}
2021-08-27 15:06:42.489 8:ework-thread [ERROR] 
tcmu_rbd_rm_stale_entry_from_blacklist:322 
rbd/iscsi-pool-0005.iscsi-p0005-img-01: Could not rm blacklist entry '`"�'. 
(Err -13)

The tcmu_rbd_service_status_update is showing up in there which is the code 
that is affected by this bug. Any idea what the error -107 means? Maybe if I 
fix what is causing some of these errors, it might work around the problem. 
Also if you have thoughts on the other blacklist entry errors and what might be 
causing them, that would be greatly appreciated as well.

-Paul


On Aug 26, 2021, at 8:37 PM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:

On 8/27/21 12:06 AM, Paul Giralt (pgiralt) wrote:
This is great. Is there a way to test the fix in my environment?


It seem you could restart the tcmu-runner service from the container.

Since this change not only in the handler_rbd.so but also the libtcmu.so and 
tcmu-runner binary, the whole tcmu-runner need to be built.

That means I am afraid you have to build and install it from source on the host 
and then restart the tcmu container.



-Paul


On Aug 26, 2021, at 11:05 AM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:


Hi Paul, Ilya,

I have fixed it in [1], please help review.

Thanks

[1] https://github.com/open-iscsi/tcmu-runner/pull/667


On 8/26/21 7:34 PM, Paul Giralt (pgiralt) wrote:
Thank you for the analysis. Can you think of a workaround for the issue?

-Paul

Sent from my iPhone

On Aug 26, 2021, at 5:17 AM, Xiubo Li 
<xiu...@redhat.com><mailto:xiu...@redhat.com> wrote:



Hi Paul,

There has one racy case when updating the state to ceph cluster and while 
reopening the image, which will close and open the image, the crash should 
happen just after the image was closed and the resources were released and then 
if work queue was trying to update the state to ceph cluster it will trigger 
use-after-free bug.

I will try to fix it.

Thanks


On 8/26/21 10:40 AM, Paul Giralt (pgiralt) wrote:
I will send a unicast email with the link and details.

-Paul


On Aug 25, 2021, at 10:37 PM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:


Hi Paul,

Please send me the detail versions of the tcmu-runner and ceph-iscsi packages 
you are using.

Thanks


On 8/26/21 10:21 AM, Paul Giralt (pgiralt) wrote:
Thank you. I did find some coredump files. Is there a way I can send these to 
you to analyze?

[root@cxcto-c240-j27-02 coredump]# ls -asl
total 71292
    0 drwxr-xr-x. 2 root root      176 Aug 25 18:31 .
    0 drwxr-xr-x. 5 root root       70 Aug 10 11:31 ..
34496 -rw-r-----. 1 root root 35316215 Aug 25 18:31 
core.tcmu-runner.0.3083bbc32b6a43acb768b85818414867.4523.1629930681000000.lz4
36796 -rw-r-----. 1 root root 37671322 Aug 24 09:17 
core.tcmu-runner.0.baf25867590c40da87305e67d5b97751.4521.1629811022000000.lz4

[root@cxcto-c240-j27-03 coredump]# ls -asl
total 161188
    4 drwxr-xr-x. 2 root root     4096 Aug 25 19:29 .
    0 drwxr-xr-x. 5 root root       70 Aug 10 11:31 ..
45084 -rw-r-----. 1 root root 46159860 Aug 25 19:29 
core.tcmu-runner.0.a276a2f5ee5a4d279917fd8c335c9b93.5281.1629934190000000.lz4
33468 -rw-r-----. 1 root root 34263834 Aug 24 16:08 
core.tcmu-runner.0.a9df4a27b1ea43d09c6c254bb1e3447a.4209.1629835730000000.lz4
34212 -rw-r-----. 1 root root 35027795 Aug 25 03:43 
core.tcmu-runner.0.cce93af5693444108993f0d48371197d.5564.1629877416000000.lz4
48420 -rw-r-----. 1 root root 49574566 Aug 24 10:03 
core.tcmu-runner.0.e4f4ed6e35154c95b43f87b069380fbe.4091.1629813832000000.lz4

[root@cxcto-c240-j27-04 coredump]# ls -asl
total 359240
     4 drwxr-xr-x. 2 root root      4096 Aug 25 19:20 .
     0 drwxr-xr-x. 5 root root        70 Aug 10 11:31 ..
 31960 -rw-r-----. 1 root root  32720639 Aug 25 00:36 
core.tcmu-runner.0.115ba6ee7acb42b8acfe2a1a958b5367.34161.1629866182000000.lz4
 38516 -rw-r-----. 1 root root  39435484 Aug 25 19:20 
core.tcmu-runner.0.4d43dd5cde9c4d44a96b2c744a9b43f4.4295.1629933615000000.lz4
 81012 -rw-r-----. 1 root root  82951773 Aug 25 14:38 
core.tcmu-runner.0.6998ff9717cf4e96932349eacd1d81bc.4274.1629916720000000.lz4
 95872 -rw-r-----. 1 root root  98165539 Aug 23 17:02 
core.tcmu-runner.0.9a28e301d6604d1a8eafbe12ae896c2f.4269.1629752547000000.lz4
111876 -rw-r-----. 1 root root 114554583 Aug 24 11:41 
core.tcmu-runner.0.f9ea1331105b44f2b2f28dc0c1a7e653.5059.1629819705000000.lz4

[root@cxcto-c240-j27-05 coredump]# ls -asl
total 115720
    0 drwxr-xr-x. 2 root root      261 Aug 25 16:47 .
    0 drwxr-xr-x. 5 root root       70 Aug 10 11:31 ..
44720 -rw-r-----. 1 root root 45786023 Aug 24 09:46 
core.tcmu-runner.0.530b308c30534b9aa4e7619ff1ab869c.4145.1629812787000000.lz4
35032 -rw-r-----. 1 root root 35867165 Aug 24 17:52 
core.tcmu-runner.0.5afb87334bd741699c6fd44ceb031128.5672.1629841939000000.lz4
35968 -rw-r-----. 1 root root 36826770 Aug 25 16:47 
core.tcmu-runner.0.da66f3f24a624426a75cbe20758be879.5339.1629924435000000.lz4



-Paul


On Aug 25, 2021, at 10:14 PM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:



On 8/26/21 10:08 AM, Paul Giralt (pgiralt) wrote:
Thanks Xiubo. I will try this. How do I set the log level to 4?


It's in the /etc/tcmu/tcmu.cfg in the tcmu container. No need to restart the 
tcmu-runner service, the changes will be loaded by tcmu-runner daemon after the 
tcmu.cfg closed.



-Paul


On Aug 25, 2021, at 9:30 PM, Xiubo Li 
<xiu...@redhat.com<mailto:xiu...@redhat.com>> wrote:

It's buggy, we need one way to export the tcmu-runner log to the host.

Could you see any crash coredump from the host ?

Without that could you keep running some commands like '$ tail -f 
XYZ/tcmu-runner.log' in a console from the tcmu containers, let's see could we 
get any useful logs ? At the same time please set the log_level to 4. If it's 
an experimental setup then you can just set the log_level to 5.

I am not confident we can get any coredump from tcmu-runner.log, but at least 
we can get something else which may give us a clue.

- Xiubo







_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: tcmu-runner crashing on 16.2.5

Reply via email to