On 6/14/19 3:10 PM, Himanshu Madhani wrote:
From: Quinn Tran <qut...@marvell.com>

This patch uses kref to protect access between fcp_abort
path and nvme command and LS command completion path.
Stack trace below shows the abort path is accessing stale
memory (nvme_private->sp).

When command kref reaches 0, nvme_private & srb resource will
be disconnected from each other.  Any subsequence nvme abort
request will not be able to reference the original srb.

[ 5631.003998] BUG: unable to handle kernel paging request at 00000010000005d8
[ 5631.004016] IP: [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx]
[ 5631.004086] Workqueue: events qla_nvme_abort_work [qla2xxx]
[ 5631.004097] RIP: 0010:[<ffffffffc087df92>]  [<ffffffffc087df92>] 
qla_nvme_abort_work+0x22/0x100 [qla2xxx]
[ 5631.004109] Call Trace:
[ 5631.004115]  [<ffffffffaa4b8174>] ? pwq_dec_nr_in_flight+0x64/0xb0
[ 5631.004117]  [<ffffffffaa4b9d4f>] process_one_work+0x17f/0x440
[ 5631.004120]  [<ffffffffaa4bade6>] worker_thread+0x126/0x3c0

Signed-off-by: Quinn Tran <qut...@marvell.com>
Signed-off-by: Himanshu Madhani <hmadh...@marvell.com>
---
  drivers/scsi/qla2xxx/qla_def.h  |   2 +
  drivers/scsi/qla2xxx/qla_nvme.c | 164 ++++++++++++++++++++++++++++------------
  drivers/scsi/qla2xxx/qla_nvme.h |   1 +
  3 files changed, 117 insertions(+), 50 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 602ed24bb806..85a27ee5d647 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -532,6 +532,8 @@ typedef struct srb {
        uint8_t cmd_type;
        uint8_t pad[3];
        atomic_t ref_count;
+       struct kref cmd_kref;   /* need to migrate ref_count over to this */
+       void *priv;
        wait_queue_head_t nvme_ls_waitq;
        struct fc_port *fcport;
        struct scsi_qla_host *vha;

My feedback about this patch is as follows:
- I think that having two atomic counters in struct srb with a very
  similar purpose is overkill and also that a single atomic counter
  should be sufficient in this structure.
- The problem fixed by this patch not only can happen with NVMe-FC but
  also with FC. I would prefer to see all code paths fixed for which a
  race between completion and abort can occur.

Thanks,

Bart.

Reply via email to