Problem Analysis
================

The dmesg log 'crash/201809061748/dmesg.201809061748' shows the CPU soft lockup
occurs about 25 seconds after the 'sdb' virtio-scsi drive is removed.
This suggests the two events are related (the watchdog reports the CPU stuck for
22s or 23s, and there is usually an extra 2s-3s between the triggering event and
the report, for some reason, which roughly accounts for the gap).

        [ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
        [ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
        [ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [id:2887]


CPU 1 is waiting on another CPU (to finish something)
-----------------------------------------------------

The locked-up 'id' process is in the page-fault handling path (which is
OK/normal), calling a function on many (SMP) CPUs (see
smp_call_function_many() at the top of the stack), and is specifically
in the call to one CPU (see smp_call_function_single+0xae/0x110 in the
RIP register).

        [ 3028.301755] CPU: 1 PID: 2887 Comm: id Not tainted 4.4.0-133-generic #159~14.04.1-Ubuntu
        ...
        [ 3028.301760] RIP: 0010:[<ffffffff81102b1e>]  [<ffffffff81102b1e>] smp_call_function_single+0xae/0x110
        ...
        [ 3028.301797] Call Trace:
        [ 3028.301803]  [<ffffffff81071f60>] ? do_kernel_range_flush+0x40/0x40
        [ 3028.301805]  [<ffffffff81102e9e>] smp_call_function_many+0x22e/0x270
        [ 3028.301808]  [<ffffffff810723f8>] native_flush_tlb_others+0x48/0x120
        [ 3028.301810]  [<ffffffff8107256d>] flush_tlb_mm_range+0x9d/0x180
        [ 3028.301815]  [<ffffffff811cb1e3>] ptep_clear_flush+0x53/0x60
        [ 3028.301819]  [<ffffffff811b75ed>] wp_page_copy.isra.58+0x29d/0x530
        [ 3028.301822]  [<ffffffff811b955d>] do_wp_page+0x8d/0x590
        [ 3028.301824]  [<ffffffff811bb826>] handle_mm_fault+0xd86/0x1ac0
        [ 3028.301829]  [<ffffffff81192c53>] ? free_pages+0x13/0x20
        [ 3028.301835]  [<ffffffff810aa754>] ? finish_task_switch+0x244/0x2a0
        [ 3028.301840]  [<ffffffff8106b0fb>] __do_page_fault+0x19b/0x430
        [ 3028.301843]  [<ffffffff8106b3b2>] do_page_fault+0x22/0x30
        [ 3028.301847]  [<ffffffff81822928>] page_fault+0x28/0x30

The smp_call_function_many() address in the stack trace
(smp_call_function_many+0x22e / ffffffff81102e9e) is the return address
of a call to smp_call_function_single():

        ffffffff81102c70 <smp_call_function_many>:
        ...
        ffffffff81102e99:       e8 d2 fb ff ff          callq  ffffffff81102a70 <smp_call_function_single>
        ffffffff81102e9e:       48 83 c4 10             add    $0x10,%rsp

... which, per the address in the RIP register (smp_call_function_single+0xae /
ffffffff81102b1e), is in the (inlined) call to csd_lock_wait() right after
generic_exec_single().

csd_lock_wait() spins on the value of csd->flags with cpu_relax() (the 'pause'
instruction), waiting for it to be unlocked, i.e., to no longer have the
CSD_FLAG_LOCK flag (value 0x1) set in the 'flags' field, at offset 0x18:

        ffffffff81102a70 <smp_call_function_single>:
        ...
        ffffffff81102b0f:       e8 0c fe ff ff          callq  ffffffff81102920 <generic_exec_single>
        ffffffff81102b14:       8b 55 e8                mov    -0x18(%rbp),%edx
        ffffffff81102b17:       83 e2 01                and    $0x1,%edx
        ffffffff81102b1a:       74 de                   je     ffffffff81102afa <smp_call_function_single+0x8a>
        ffffffff81102b1c:       f3 90                   pause
        ffffffff81102b1e:       8b 55 e8                mov    -0x18(%rbp),%edx
        ffffffff81102b21:       83 e2 01                and    $0x1,%edx
        ffffffff81102b24:       75 f6                   jne    ffffffff81102b1c <smp_call_function_single+0xac>


        """
        smp_call_function_single()
        ...
                err = generic_exec_single(cpu, csd, func, info);

                if (wait)
                        csd_lock_wait(csd);
        ...
        """


        """
        static void csd_lock_wait(struct call_single_data *csd)
        {
                while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
                        cpu_relax();
        }
        """

        kernel/smp.c:   CSD_FLAG_LOCK           = 0x01,

        include/linux/smp.h:

        struct call_single_data {
                struct llist_node llist;
                smp_call_func_t func;
                void *info;
                unsigned int flags;
        };

        $ pahole -C call_single_data vmlinux-4.4.0-133-generic
        ...
        struct call_single_data {
                struct llist_node          llist;                /*     0     8 */
                smp_call_func_t            func;                 /*     8     8 */
                void *                     info;                 /*    16     8 */
                unsigned int               flags;                /*    24     4 */
                        ### ^ 24 == 0x18
        ...
        };

So this shows that CPU 1 is spinning/waiting on another CPU to finish
something.
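
For reference, the flag that CPU 1 is waiting on is only cleared by the CPU
that handles the IPI, after it has run the requested function (the TLB flush);
in this kernel series that happens in flush_smp_call_function_queue() ->
csd_unlock(), which looks roughly like this:

        """
        static void csd_unlock(struct call_single_data *csd)
        {
                WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

                /*
                 * ensure we're all done before releasing data:
                 */
                smp_store_release(&csd->flags, 0);
        }
        """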


What the (an)other CPU is doing
-------------------------------

This guest has only 2 CPUs (confirmed with the crashdump file 
'dump.201809061748'),
so CPU 0 is the only other CPU that could be doing something.

      KERNEL: /home/mfo/<...>/vmlinux-4.4.0-133-generic
    DUMPFILE: dump.201809061748  [PARTIAL DUMP]
        CPUS: 2

The NMI backtrace for the other CPU in the dmesg log reveals it's still
(20-odd seconds later) in the SCSI drive removal path, specifically in the
virtio-scsi target removal/destroy handler (see virtscsi_target_destroy()
in the RIP register).

        [ 3028.301874] Sending NMI to other CPUs:
        [ 3028.302876] NMI backtrace for cpu 0
        [ 3028.302882] CPU: 0 PID: 2246 Comm: kworker/0:0 Not tainted 4.4.0-133-generic #159~14.04.1-Ubuntu
        ...
        [ 3028.302887] RIP: 0010:[<ffffffffc0002039>]  [<ffffffffc0002039>] virtscsi_target_destroy+0x19/0x30 [virtio_scsi]
        ...
        [ 3028.302907] Call Trace:
        [ 3028.302908]  [<ffffffff815ad4f2>] scsi_target_destroy+0x82/0xd0
        [ 3028.302909]  [<ffffffff815ad575>] scsi_target_reap_ref_release+0x35/0x40
        [ 3028.302910]  [<ffffffff815aefac>] scsi_target_reap+0x2c/0x30
        [ 3028.302911]  [<ffffffff815b14f9>] __scsi_remove_device+0x89/0xe0
        [ 3028.302912]  [<ffffffff815b1576>] scsi_remove_device+0x26/0x40
        [ 3028.302914]  [<ffffffffc0002bc7>] virtscsi_handle_event+0x127/0x1e0 [virtio_scsi]
        [ 3028.302915]  [<ffffffff8109b0b6>] process_one_work+0x156/0x400
        [ 3028.302916]  [<ffffffff8109ba9a>] worker_thread+0x11a/0x480
        [ 3028.302918]  [<ffffffff8109b980>] ? rescuer_thread+0x310/0x310
        [ 3028.302919]  [<ffffffff810a0f78>] kthread+0xd8/0xf0
        [ 3028.302920]  [<ffffffff8181c092>] ? __schedule+0x2a2/0x820
        [ 3028.302921]  [<ffffffff810a0ea0>] ? kthread_park+0x60/0x60
        [ 3028.302922]  [<ffffffff818203b5>] ret_from_fork+0x55/0x80
        [ 3028.302924]  [<ffffffff810a0ea0>] ? kthread_park+0x60/0x60

Looking at the region pointed to by the RIP register, virtscsi_target_destroy+0x19
(that is offset 0x39 in the kernel module's objdump, since the function starts at
offset 0x20), it is similarly looping/spinning with pause (cpu_relax()), checking
the 'tgt->reqs' counter until it becomes zero (i.e., until all requests in the
virtio SCSI target have completed).

        0000000000000020 <virtscsi_target_destroy>:
              20:       e8 00 00 00 00          callq  25 <virtscsi_target_destroy+0x5>
              25:       55                      push   %rbp
              26:       48 8b bf 28 03 00 00    mov    0x328(%rdi),%rdi
              2d:       48 89 e5                mov    %rsp,%rbp
              30:       8b 47 04                mov    0x4(%rdi),%eax
              33:       85 c0                   test   %eax,%eax
              35:       74 09                   je     40 <virtscsi_target_destroy+0x20>
              37:       f3 90                   pause
              39:       8b 47 04                mov    0x4(%rdi),%eax
              3c:       85 c0                   test   %eax,%eax
              3e:       75 f7                   jne    37 <virtscsi_target_destroy+0x17>
              40:       e8 00 00 00 00          callq  45 <virtscsi_target_destroy+0x25>
              45:       5d                      pop    %rbp
              46:       c3                      retq
              47:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
              4e:       00 00
 
        """
        static void virtscsi_target_destroy(struct scsi_target *starget)
        {
                struct virtio_scsi_target_state *tgt = starget->hostdata;

                /* we can race with concurrent virtscsi_complete_cmd */
                while (atomic_read(&tgt->reqs))
                        cpu_relax();
                kfree(tgt);
        }
        """

Let's check what that 'tgt->reqs' counter value is with the crashdump.

        $ crash ~/ddebs/linux-image-4.4.0-133-generic-dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/boot/vmlinux-4.4.0-133-generic crash/201809061748/dump.201809061748

        crash> set -c 0
            PID: 2246
        COMMAND: "kworker/0:0"
           TASK: ffff880035960000  [THREAD_INFO: ffff880212a90000]
            CPU: 0
          STATE: TASK_RUNNING (ACTIVE)

        crash> bt
        PID: 2246   TASK: ffff880035960000  CPU: 0   COMMAND: "kworker/0:0"
         #0 [ffff88021fc08e38] crash_nmi_callback at ffffffff8104f727
         #1 [ffff88021fc08e48] nmi_handle at ffffffff81031407
         #2 [ffff88021fc08ea0] default_do_nmi at ffffffff810319e0
         #3 [ffff88021fc08ec0] do_nmi at ffffffff81031b19
         #4 [ffff88021fc08ee8] end_repeat_nmi at ffffffff81822e17
            [exception RIP: virtscsi_target_destroy+25]
            RIP: ffffffffc0002039  RSP: ffff880212a93d38  RFLAGS: 00000006
            RAX: 0000000000001eb2  RBX: ffff880216be7000  RCX: ffff8802179e0a20
            RDX: ffffffffc0002020  RSI: ffff880035bddc28  RDI: ffff880211ea3420
            RBP: ffff880212a93d38   R8: 0000000000000000   R9: ffffffff813eb3e5
            R10: ffff88021fc1a6c0  R11: 0000000000000000  R12: ffff880035bddc00
            R13: ffff880035bddc28  R14: 0000000000000282  R15: 0000000000000002
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
        --- <NMI exception stack> ---
         #5 [ffff880212a93d38] virtscsi_target_destroy at ffffffffc0002039 [virtio_scsi]
         #6 [ffff880212a93d40] scsi_target_destroy at ffffffff815ad4f2
         #7 [ffff880212a93d70] scsi_target_reap_ref_release at ffffffff815ad575
         #8 [ffff880212a93d90] scsi_target_reap at ffffffff815aefac
         #9 [ffff880212a93da0] __scsi_remove_device at ffffffff815b14f9
        #10 [ffff880212a93dc0] scsi_remove_device at ffffffff815b1576
        #11 [ffff880212a93de0] virtscsi_handle_event at ffffffffc0002bc7 [virtio_scsi]
        #12 [ffff880212a93e20] process_one_work at ffffffff8109b0b6
        #13 [ffff880212a93e68] worker_thread at ffffffff8109ba9a
        #14 [ffff880212a93ec8] kthread at ffffffff810a0f78
        #15 [ffff880212a93f50] ret_from_fork at ffffffff818203b5

The stack trace in crash gives us the register values at the time
virtscsi_target_destroy() was interrupted while running, and RIP (the
instruction pointer / next instruction to execute) shows it is about to
reload the EAX register with the value it checks in the loop.

EAX is tgt->reqs, which is loaded as follows: the scsi_target is the first
argument, thus in the RDI register; its 'hostdata' field is at offset 0x328
(and is cast to virtio_scsi_target_state); and its 'reqs' field is at
offset 0x4:

        $ pahole -C scsi_target vmlinux-4.4.0-133-generic | grep hostdata
        ...
                void *                     hostdata;             /*   808     8 */
                        ### ^ 808 = 0x328
        ...

        $ pahole -C virtio_scsi_target_state lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko | grep reqs
        ...
                atomic_t                   reqs;                 /*     4     4 */
        ...

So, let's get the value from memory, based on the RDI register (which at that
point holds the tgt pointer):

        crash> mod -s virtio_scsi /home/mfo/ddebs/linux-image-4.4.0-133-generic-dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko

        crash> struct virtio_scsi_target_state.reqs ffff880211ea3420
          reqs = {
            counter = 0x1eb2
          }

        crash> eval 0x1eb2
        hexadecimal: 1eb2  
            decimal: 7858 

And it's indeed a non-zero value, which actually seems quite high:
around 8k requests.
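
As a cross-check of the offset math above, the counter can also be read as a
raw 32-bit word at RDI + 0x4 (hypothetical command; it should report the same
0x1eb2 value already visible in RAX in the backtrace):

        crash> rd -32 ffff880211ea3424 1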

This is why it's looping.

The value does not seem to decrease, though.


Bug in virtio-scsi driver
-------------------------

It's unlikely that all these requests were still pending and did not finish
after all this time (~25 seconds since the SCSI removal started, and at least
12 seconds since 'sdb' was unmounted -- sdb1 and sdc1 are mounted, data is
copied, both are unmounted, then sdc1 is mounted again).

        [ 2215.675484] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
        [ 2222.563236] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
        [ 2990.023442] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
        [ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
        [ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
        [ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [id:2887]

So this looks like a problem in the driver.

The wait/spin loop is introduced by a downstream/SAUCE commit to avoid a race
with virtscsi_complete_cmd(). It is not present upstream, which dropped the
tgt->reqs counter altogether.

However, per the commit description, we still need to synchronize with that
function in Xenial. So, instead of finding an alternative way to synchronize
for now, let's check why the tgt->reqs counter misbehaves.

Studying the driver code for references to tgt->reqs, we see it is only
changed in these sites:

1) decremented in one site, virtscsi_complete_cmd(),
   the callback fired after a SCSI command is completed
   by the virtio-scsi adapter;

2) incremented in the .queuecommand() path, right before calling
   virtscsi_queuecommand(), in its only two callers (the two
   .queuecommand() implementations):
2.1)  - virtscsi_queuecommand_single()
2.2)  - virtscsi_queuecommand_multi(), via virtscsi_pick_vq_mq() or
        virtscsi_pick_vq()

Complete command path:

        virtscsi_complete_cmd()
        ...
                sc->scsi_done(sc);

                atomic_dec(&tgt->reqs);
        }

In the 1-queue path:

        virtscsi_queuecommand_single():
        ...
                atomic_inc(&tgt->reqs);
                return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
        }

In the >1-queues path (blk-mq case):

        virtscsi_pick_vq_mq()
        ...
                atomic_inc(&tgt->reqs);
                return &vscsi->req_vqs[hwq];


In the >1-queues path (non-blk-mq case):

        virtscsi_pick_vq()
        ...
                if (atomic_inc_return(&tgt->reqs) > 1) {
        ...

... with their only caller being:

        virtscsi_queuecommand_multi(struct Scsi_Host *sh,
                                               struct scsi_cmnd *sc)
        {
        ...
                if (shost_use_blk_mq(sh))
                        req_vq = virtscsi_pick_vq_mq(vscsi, tgt, sc);
                else
                        req_vq = virtscsi_pick_vq(vscsi, tgt);

                return virtscsi_queuecommand(vscsi, req_vq, sc);


These are the SCSI host templates registering those 2 functions:

        static struct scsi_host_template virtscsi_host_template_single = {
        ...
                .queuecommand = virtscsi_queuecommand_single,
        ...

        static struct scsi_host_template virtscsi_host_template_multi = {
        ...
                .queuecommand = virtscsi_queuecommand_multi,
        ...


Now, observing the implementation of virtscsi_queuecommand(), the problem
happens when the command fails to be kicked to the virtio-scsi adapter.

The function is called with the tgt->reqs counter already incremented.

But if virtscsi_kick_cmd() fails with anything other than -EIO,
the SCSI command will be requeued, and will go through the .queuecommand()
path again, which increments the tgt->reqs counter _again_ for the same
SCSI command (note this may repeat multiple times for the same command,
depending on the virtio-scsi adapter/queue condition/processing/etc.).

(The -EIO case is not a problem because it completes the SCSI command,
 which decrements the tgt->reqs counter.)

        virtscsi_queuecommand()
        ...
                ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
                if (ret == -EIO) {
                        cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
                        spin_lock_irqsave(&req_vq->vq_lock, flags);
                        virtscsi_complete_cmd(vscsi, cmd);
                        spin_unlock_irqrestore(&req_vq->vq_lock, flags);
                } else if (ret != 0) {
                        return SCSI_MLQUEUE_HOST_BUSY;
                }
                return 0;
        }

So, the solution to keep the tgt->reqs counter correct in this case
is to decrement it in that error condition, when we know the command
will be requeued (and thus tgt->reqs will be incremented _again_).
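
A minimal sketch of such a change, based on the excerpt above (an illustration,
not necessarily the exact patch as submitted; the 'tgt' lookup via
scsi_target(sc->device)->hostdata is an assumption, mirroring how the
.queuecommand() callers obtain it):

        """
        virtscsi_queuecommand()
        {
                struct virtio_scsi_target_state *tgt =
                                scsi_target(sc->device)->hostdata;  /* assumed lookup */
        ...
                ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
                if (ret == -EIO) {
                        cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
                        spin_lock_irqsave(&req_vq->vq_lock, flags);
                        virtscsi_complete_cmd(vscsi, cmd);
                        spin_unlock_irqrestore(&req_vq->vq_lock, flags);
                } else if (ret != 0) {
                        /* The kick failed: SCSI_MLQUEUE_HOST_BUSY requeues the
                         * command, and the .queuecommand() path will increment
                         * tgt->reqs _again_ -- so undo this increment here. */
                        atomic_dec(&tgt->reqs);
                        return SCSI_MLQUEUE_HOST_BUSY;
                }
                return 0;
        }
        """

This keeps the counter balanced across requeues, while the -EIO path remains
untouched (it already decrements via virtscsi_complete_cmd()).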


Synthetic test-case
-------------------

I could successfully reproduce this problem by attaching GDB to the
QEMU GDB server to debug/play with the Linux kernel, and simulating
that the virtio ring/queue had no available elements, so that this
error case happened.  I'll detail the steps too.
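
For illustration only, one possible way to do that simulation looks roughly
like this (a hypothetical session: the breakpoint condition on the 'request'
virtqueue name and the forced -ENOSPC return value are assumptions, meant to
make virtscsi_kick_cmd() fail with something other than -EIO):

        # guest started with the QEMU gdbstub enabled (e.g. '-s' / tcp::1234)
        $ gdb vmlinux-4.4.0-133-generic
        (gdb) target remote :1234

        # pretend the virtio ring has no available elements: make the add of
        # the SCSI command's buffers fail with -ENOSPC (-28), so that
        # virtscsi_queuecommand() hits the 'ret != 0 && ret != -EIO' path
        (gdb) break virtqueue_add_sgs if $_streq(_vq->name, "request")
        (gdb) continue
        ...
        (gdb) return (int)-28
        (gdb) continue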

This was also used to verify the proposed fix.

The fix has also been tested with the test-case reported by the
customer, which is mounting the virtio-scsi disk, copying data out of
it, unmounting it, and detaching it from the guest instance.

That test-case is a loop doing that procedure; it usually fails
in the first or second iteration, and with the fix it has been
running for 35 iterations without problems.

https://bugs.launchpad.net/bugs/1798110

Title:
  xenial: virtio-scsi: CPU soft lockup due to loop in
  virtscsi_target_destroy()

Status in linux package in Ubuntu:
  New

Bug description:
  (I'll add the SRU template + testing steps and post to ML shortly.)

  A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial
  when detaching a virtio-scsi drive, and provided a crashdump that shows
  2 things:

  1) The soft locked up CPU is waiting for another CPU to finish something,
  and that does not happen because the other CPU is infinitely looping in
  virtscsi_target_destroy().

  2) The loop happens because the 'tgt->reqs' counter is non-zero, and that
  probably happened due to a missing decrement in SCSI command requeue path,
  exercised when the virtio ring is full.

  The reported problem itself happens because of a downstream/SAUCE patch,
  coupled with the problem of the missing decrement for the reqs counter.

  Introducing a decrement in the SCSI command requeue path resolves the
  problem, verified synthetically with QEMU+GDB and with the test-case/loop
  provided by the customer as a problem reproducer.
