Re: [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list

2017-04-10 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared

2017-04-10 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()

2017-04-10 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()

2017-04-10 Thread Christoph Hellwig
> + if (msecs == 0)
> + kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
> +  &hctx->run_work);
> + else
> + kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +  &hctx->delayed_run_work,
> +  msecs_to_jiffies(msecs));
> +}

I'd rather make run_work a delayed_work (again) and use
kblockd_schedule_delayed_work_on with a timeout of zero for the immediate
run case instead of having two competing work structs.


Re: [PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck

2017-04-10 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/2] scsi: sd: Separate zeroout and discard command choices

2017-04-10 Thread h...@lst.de
On Fri, Apr 07, 2017 at 07:59:08PM +, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 07:41 -0400, Martin K. Petersen wrote:
> > +static ssize_t
> > +zeroing_mode_store(struct device *dev, struct device_attribute *attr,
> > +  const char *buf, size_t count)
> > +{
> > +   struct scsi_disk *sdkp = to_scsi_disk(dev);
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EACCES;
> > +
> > +   if (!strncmp(buf, zeroing_mode[SD_ZERO_WRITE], 20))
> > +   sdkp->zeroing_mode = SD_ZERO_WRITE;
> > +   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS], 20))
> > +   sdkp->zeroing_mode = SD_ZERO_WS;
> > +   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS16_UNMAP], 20))
> > +   sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
> > +   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS10_UNMAP], 20))
> > +   sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;
> > +   else
> > +   return -EINVAL;
> > +
> > +   return count;
> > +}
> 
> An additional question about this function: if the shell command "echo" is
> used without command-line option -n to modify the "zeroing_mode" sysfs
> attribute then a newline character will be present in buf. Does the above
> code handle newline characters correctly?

It ignores the newlines.  But we have a helper called sysfs_streq
to properly ignore it.  It might be a good idea to move the various
sysfs files in scsi over to it.


Re: [PATCHv6 8/8] scsi: inline command aborts

2017-04-10 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-10 Thread Sagi Grimberg



> Sagi
>
> As long as legA, legB and the RC are all connected to the same switch then
> ordering will be preserved (I think many other topologies also work). Here
> is how it would work for the problem case you are concerned about (which is
> a read from the NVMe drive).
>
> 1. Disk device DMAs out the data to the p2pmem device via a string of PCIe
> MemWr TLPs.
> 2. Disk device writes to the completion queue (in system memory) via a
> MemWr TLP.
> 3. The last of the MemWrs from step 1 might have got stalled in the PCIe
> switch due to congestion, but if so they are stalled in the egress path of
> the switch for the p2pmem port.
> 4. The RC determines the IO is complete when the TLP associated with step 2
> updates the memory associated with the CQ. It issues some operation to read
> the p2pmem.
> 5. Regardless of whether the MemRd TLP comes from the RC or another device
> connected to the switch, it is queued in the egress queue for the p2pmem
> port behind the last DMA TLP (from step 1).
> PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can
> never pass writes). Therefore the MemRd can never get to the p2pmem device
> until after the last DMA MemWr has.


What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports ??

You are suggesting that there is a *switch-wide* state that tracks
MemRds never pass MemWrs across all the switch ports? That is a very
non-trivial statement...


[PATCH 5/6] scsi: hisi_sas: fix NULL deference when TMF timeouts

2017-04-10 Thread John Garry
If a TMF times out (maybe due to the unlikely scenario of an expander
being unplugged while a TMF for a remote device is active), then when
we eventually try to free the slot, we crash as we dereference the
slot's task, which has already been released.

As a fix, add checks for a NULL task in the slot release code.

Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas_main.c | 60 +++
 1 file changed, 33 insertions(+), 27 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c 
b/drivers/scsi/hisi_sas/hisi_sas_main.c
index a5c6d06..7e6e882 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -77,17 +77,22 @@ static void hisi_sas_slot_index_init(struct hisi_hba 
*hisi_hba)
 void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task,
 struct hisi_sas_slot *slot)
 {
-   struct device *dev = &hisi_hba->pdev->dev;
-   struct domain_device *device = task->dev;
-   struct hisi_sas_device *sas_dev = device->lldd_dev;
 
-   if (!slot->task)
-   return;
+   if (task) {
+   struct device *dev = &hisi_hba->pdev->dev;
+   struct domain_device *device = task->dev;
+   struct hisi_sas_device *sas_dev = device->lldd_dev;
 
-   if (!sas_protocol_ata(task->task_proto))
-   if (slot->n_elem)
-   dma_unmap_sg(dev, task->scatter, slot->n_elem,
-task->data_dir);
+   if (!sas_protocol_ata(task->task_proto))
+   if (slot->n_elem)
+   dma_unmap_sg(dev, task->scatter, slot->n_elem,
+task->data_dir);
+
+   task->lldd_task = NULL;
+
+   if (sas_dev)
+   atomic64_dec(&sas_dev->running_req);
+   }
 
if (slot->command_table)
dma_pool_free(hisi_hba->command_table_pool,
@@ -102,12 +107,10 @@ void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, 
struct sas_task *task,
  slot->sge_page_dma);
 
list_del_init(&slot->entry);
-   task->lldd_task = NULL;
slot->task = NULL;
slot->port = NULL;
hisi_sas_slot_index_free(hisi_hba, slot->idx);
-   if (sas_dev)
-   atomic64_dec(&sas_dev->running_req);
+
/* slot memory is fully zeroed when it is reused */
 }
 EXPORT_SYMBOL_GPL(hisi_sas_slot_task_free);
@@ -569,25 +572,23 @@ static void hisi_sas_port_notify_formed(struct 
asd_sas_phy *sas_phy)
spin_unlock_irqrestore(&hisi_hba->lock, flags);
 }
 
-static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba,
-struct sas_task *task,
+static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba, struct 
sas_task *task,
 struct hisi_sas_slot *slot)
 {
-   struct task_status_struct *ts;
-   unsigned long flags;
-
-   if (!task)
-   return;
+   if (task) {
+   unsigned long flags;
+   struct task_status_struct *ts;
 
-   ts = &task->task_status;
+   ts = &task->task_status;
 
-   ts->resp = SAS_TASK_COMPLETE;
-   ts->stat = SAS_ABORTED_TASK;
-   spin_lock_irqsave(&task->task_state_lock, flags);
-   task->task_state_flags &=
-   ~(SAS_TASK_STATE_PENDING | SAS_TASK_AT_INITIATOR);
-   task->task_state_flags |= SAS_TASK_STATE_DONE;
-   spin_unlock_irqrestore(&task->task_state_lock, flags);
+   ts->resp = SAS_TASK_COMPLETE;
+   ts->stat = SAS_ABORTED_TASK;
+   spin_lock_irqsave(&task->task_state_lock, flags);
+   task->task_state_flags &=
+   ~(SAS_TASK_STATE_PENDING | SAS_TASK_AT_INITIATOR);
+   task->task_state_flags |= SAS_TASK_STATE_DONE;
+   spin_unlock_irqrestore(&task->task_state_lock, flags);
+   }
 
hisi_sas_slot_task_free(hisi_hba, task, slot);
 }
@@ -742,7 +743,12 @@ static int hisi_sas_exec_internal_tmf_task(struct 
domain_device *device,
/* Even TMF timed out, return direct. */
if ((task->task_state_flags & SAS_TASK_STATE_ABORTED)) {
if (!(task->task_state_flags & SAS_TASK_STATE_DONE)) {
+   struct hisi_sas_slot *slot = task->lldd_task;
+
dev_err(dev, "abort tmf: TMF task timeout\n");
+   if (slot)
+   slot->task = NULL;
+
goto ex_err;
}
}
-- 
1.9.1



[PATCH 2/6] scsi: hisi_sas: workaround a SoC SATA IO processing bug

2017-04-10 Thread John Garry
From: Xiaofei Tan 

This patch provides a workaround for a SoC bug where SATA IPTTs for
different devices may conflict.

The workaround solution requires the following:
1. A SATA device id must be even and not equal to any SAS IPTT.
2. A SATA device cannot share the same IPTT with any other SAS or
SATA device.

Besides, we shall consider that IPTT value 0 is reserved for another
SoC bug (STP device opens the link first after SAS controller reset).

To sum up, the solution is:
Each SATA device uses an independent, contiguous range of 32 even
IPTTs between 64 and 4094, so v2 hw can only support 63 SATA devices.
All SAS devices (SSP/SMP devices) share the odd IPTT values from 1 to
4095.

Signed-off-by: Xiaofei Tan 
Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas.h   |  2 +
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 83 --
 2 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 72347b6..c80ca83 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -115,6 +115,7 @@ struct hisi_sas_device {
atomic64_t running_req;
struct list_headlist;
u8 dev_status;
+   int sata_idx;
 };
 
 struct hisi_sas_slot {
@@ -238,6 +239,7 @@ struct hisi_hba {
struct hisi_sas_slot*slot_info;
unsigned long flags;
const struct hisi_sas_hw *hw;   /* Low level hw interface */
+   unsigned long sata_dev_bitmap[BITS_TO_LONGS(HISI_SAS_MAX_DEVICES)];
struct work_struct rst_work;
 };
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index c550cc4..fc8d8296 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -537,6 +537,7 @@ enum {
 };
 
 #define HISI_SAS_COMMAND_ENTRIES_V2_HW 4096
+#define HISI_MAX_SATA_SUPPORT_V2_HW	(HISI_SAS_COMMAND_ENTRIES_V2_HW/64 - 1)
 
 #define DIR_NO_DATA 0
 #define DIR_TO_INI 1
@@ -597,39 +598,86 @@ static u32 hisi_sas_phy_read32(struct hisi_hba *hisi_hba,
 /* This function needs to be protected from pre-emption. */
 static int
 slot_index_alloc_quirk_v2_hw(struct hisi_hba *hisi_hba, int *slot_idx,
-  struct domain_device *device)
+struct domain_device *device)
 {
-   /* STP link chip bug workaround:index start from 1 */
-   unsigned int index = 1;
-   void *bitmap = hisi_hba->slot_index_tags;
int sata_dev = dev_is_sata(device);
+   void *bitmap = hisi_hba->slot_index_tags;
+   struct hisi_sas_device *sas_dev = device->lldd_dev;
+   int sata_idx = sas_dev->sata_idx;
+   int start, end;
+
+   if (!sata_dev) {
+   /*
+* STP link SoC bug workaround: index starts from 1.
+* additionally, we can only allocate odd IPTT(1~4095)
+* for SAS/SMP device.
+*/
+   start = 1;
+   end = hisi_hba->slot_index_count;
+   } else {
+   if (sata_idx >= HISI_MAX_SATA_SUPPORT_V2_HW)
+   return -EINVAL;
+
+   /*
+* For SATA device: allocate even IPTT in this interval
+* [64*(sata_idx+1), 64*(sata_idx+2)], then each SATA device
+* owns 32 IPTTs. IPTT 0 shall not be used due to the STP link
+* SoC bug workaround. So we ignore the first 32 even IPTTs.
+*/
+   start = 64 * (sata_idx + 1);
+   end = 64 * (sata_idx + 2);
+   }
 
while (1) {
-   index = find_next_zero_bit(bitmap, hisi_hba->slot_index_count,
-  index);
-   if (index >= hisi_hba->slot_index_count)
+   start = find_next_zero_bit(bitmap,
+   hisi_hba->slot_index_count, start);
+   if (start >= end)
return -SAS_QUEUE_FULL;
/*
-* SAS IPTT bit0 should be 1
-*/
-   if (sata_dev || (index & 1))
+ * SAS IPTT bit0 should be 1, and SATA IPTT bit0 should be 0.
+ */
+   if (sata_dev ^ (start & 1))
break;
-   index++;
+   start++;
}
 
-   set_bit(index, bitmap);
-   *slot_idx = index;
+   set_bit(start, bitmap);
+   *slot_idx = start;
return 0;
 }
 
+static bool sata_index_alloc_v2_hw(struct hisi_hba *hisi_hba, int *idx)
+{
+   unsigned int index;
+   struct device *dev = &hisi_hba->pdev->dev;
+   void *bitmap = hisi_hba->sata_dev_bitmap;
+
+   index = find_first_zero_bit(bitmap, HISI_MAX_SATA_SUPPORT_V2_HW);
+   if (index >= HISI_MAX_SATA_SUPPORT_V2_HW) {
+   dev_warn(dev, "alloc sata index failed, index=%d\n", index);
+   return false;
+   }
+
+   set_bit(index, bitmap);
+   *idx =

[PATCH 1/6] scsi: hisi_sas: workaround STP link SoC bug

2017-04-10 Thread John Garry
From: Xiaofei Tan 

After resetting the controller, the process of scanning SATA disks
attached to an expander may fail occasionally. The issue is that
the controller can't close the STP link created by the target if the
max link time is 0.

To work around this issue, we reject STP links after resetting the
controller, and change the corresponding PHY to accept STP links
only after it has received data.

We do this check in the cq interrupt handler. In order not to reduce
efficiency, we use a variable to control whether we should check
and change the PHY to accept STP links.

The function phys_reject_stp_links_v2_hw() should be called after
resetting the controller.

The workaround for another SoC bug, "SATA IO timeout", which also
uses the same register to control the STP link, is not effective
before the PHY accepts STP links.

Signed-off-by: Xiaofei Tan 
Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas.h   |  1 +
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 60 +-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 6aa0b62..72347b6 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -203,6 +203,7 @@ struct hisi_hba {
 
int slot_index_count;
unsigned long *slot_index_tags;
+   unsigned long reject_stp_links_msk;
 
/* SCSI/SAS glue */
struct sas_ha_struct sha;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index 66c2de8..c550cc4 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -218,6 +218,9 @@
 #define RX_IDAF_DWORD6 (PORT_BASE + 0xdc)
 #define RXOP_CHECK_CFG_H   (PORT_BASE + 0xfc)
 #define CON_CONTROL(PORT_BASE + 0x118)
+#define CON_CONTROL_CFG_OPEN_ACC_STP_OFF   0
+#define CON_CONTROL_CFG_OPEN_ACC_STP_MSK   \
+   (0x01 << CON_CONTROL_CFG_OPEN_ACC_STP_OFF)
 #define DONE_RECEIVED_TIME (PORT_BASE + 0x11c)
 #define CHL_INT0   (PORT_BASE + 0x1b4)
 #define CHL_INT0_HOTPLUG_TOUT_OFF  0
@@ -240,6 +243,9 @@
 #define CHL_INT1_MSK   (PORT_BASE + 0x1c4)
 #define CHL_INT2_MSK   (PORT_BASE + 0x1c8)
 #define CHL_INT_COAL_EN(PORT_BASE + 0x1d0)
+#define DMA_TX_DFX1(PORT_BASE + 0x204)
+#define DMA_TX_DFX1_IPTT_OFF   0
+#define DMA_TX_DFX1_IPTT_MSK   (0xffff << DMA_TX_DFX1_IPTT_OFF)
 #define PHY_CTRL_RDY_MSK   (PORT_BASE + 0x2b0)
 #define PHYCTRL_NOT_RDY_MSK(PORT_BASE + 0x2b4)
 #define PHYCTRL_DWS_RESET_MSK  (PORT_BASE + 0x2b8)
@@ -593,7 +599,8 @@ static u32 hisi_sas_phy_read32(struct hisi_hba *hisi_hba,
 slot_index_alloc_quirk_v2_hw(struct hisi_hba *hisi_hba, int *slot_idx,
   struct domain_device *device)
 {
-   unsigned int index = 0;
+   /* STP link chip bug workaround:index start from 1 */
+   unsigned int index = 1;
void *bitmap = hisi_hba->slot_index_tags;
int sata_dev = dev_is_sata(device);
 
@@ -875,6 +882,46 @@ static int reset_hw_v2_hw(struct hisi_hba *hisi_hba)
return 0;
 }
 
+/* This function needs to be called after resetting SAS controller. */
+static void phys_reject_stp_links_v2_hw(struct hisi_hba *hisi_hba)
+{
+   u32 cfg;
+   int phy_no;
+
+   hisi_hba->reject_stp_links_msk = (1 << hisi_hba->n_phy) - 1;
+   for (phy_no = 0; phy_no < hisi_hba->n_phy; phy_no++) {
+   cfg = hisi_sas_phy_read32(hisi_hba, phy_no, CON_CONTROL);
+   if (!(cfg & CON_CONTROL_CFG_OPEN_ACC_STP_MSK))
+   continue;
+
+   cfg &= ~CON_CONTROL_CFG_OPEN_ACC_STP_MSK;
+   hisi_sas_phy_write32(hisi_hba, phy_no, CON_CONTROL, cfg);
+   }
+}
+
+static void phys_try_accept_stp_links_v2_hw(struct hisi_hba *hisi_hba)
+{
+   int phy_no;
+   u32 dma_tx_dfx1;
+
+   for (phy_no = 0; phy_no < hisi_hba->n_phy; phy_no++) {
+   if (!(hisi_hba->reject_stp_links_msk & BIT(phy_no)))
+   continue;
+
+   dma_tx_dfx1 = hisi_sas_phy_read32(hisi_hba, phy_no,
+   DMA_TX_DFX1);
+   if (dma_tx_dfx1 & DMA_TX_DFX1_IPTT_MSK) {
+   u32 cfg = hisi_sas_phy_read32(hisi_hba,
+   phy_no, CON_CONTROL);
+
+   cfg |= CON_CONTROL_CFG_OPEN_ACC_STP_MSK;
+   hisi_sas_phy_write32(hisi_hba, phy_no,
+   CON_CONTROL, cfg);
+   clear_bit(phy_no, &hisi_hba->reject_stp_links_msk);
+   }
+   }
+}
+
 static void init_reg_v2_hw(struct hisi_hba *hisi_hba)
 {
struct device *dev = &hisi_hba->pdev->dev;
@@ -1012,6 +1059,9 @@ static void link_timeout_enable_link(unsigned long data)
int i, 

[PATCH 4/6] scsi: hisi_sas: add v2 hw internal abort timeout workaround

2017-04-10 Thread John Garry
This patch is a workaround for a SoC bug where an internal abort
command may time out. In v2 hw, the channel must become idle in
order to finish the abort process. If the target side has been sending
HOLD, the host side channel cannot complete the frame it is sending,
and so cannot enter the idle state. The internal abort command will
then time out.

As this issue exists only in v2 hw, we deal with it in the hw layer.
Our workaround solution is:
if the abort is not finished within a certain period of time, we check
the HOLD status. If HOLD is being sent, we send a break command.

Signed-off-by: John Garry 
Signed-off-by: Xiaofei Tan 
---
 drivers/scsi/hisi_sas/hisi_sas.h   |  1 +
 drivers/scsi/hisi_sas/hisi_sas_main.c  |  2 +-
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 62 +-
 3 files changed, 55 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index c80ca83..4e28f32 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -138,6 +138,7 @@ struct hisi_sas_slot {
struct hisi_sas_sge_page *sge_page;
dma_addr_t sge_page_dma;
struct work_struct abort_slot;
+   struct timer_list internal_abort_timer;
 };
 
 struct hisi_sas_tmf_task {
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c 
b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 9890dfd..a5c6d06 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1224,7 +1224,7 @@ static int hisi_sas_query_task(struct sas_task *task)
task->task_done = hisi_sas_task_done;
task->slow_task->timer.data = (unsigned long)task;
task->slow_task->timer.function = hisi_sas_tmf_timedout;
-   task->slow_task->timer.expires = jiffies + 20*HZ;
+   task->slow_task->timer.expires = jiffies + msecs_to_jiffies(110);
add_timer(&task->slow_task->timer);
 
/* Lock as we are alloc'ing a slot, which cannot be interrupted */
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index a1eb88d..cc5675b 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -194,9 +194,9 @@
 #define SL_CONTROL_NOTIFY_EN_MSK   (0x1 << SL_CONTROL_NOTIFY_EN_OFF)
 #define SL_CONTROL_CTA_OFF 17
 #define SL_CONTROL_CTA_MSK (0x1 << SL_CONTROL_CTA_OFF)
-#define RX_PRIMS_STATUS (PORT_BASE + 0x98)
-#define RX_BCAST_CHG_OFF1
-#define RX_BCAST_CHG_MSK(0x1 << RX_BCAST_CHG_OFF)
+#define RX_PRIMS_STATUS(PORT_BASE + 0x98)
+#define RX_BCAST_CHG_OFF   1
+#define RX_BCAST_CHG_MSK   (0x1 << RX_BCAST_CHG_OFF)
 #define TX_ID_DWORD0   (PORT_BASE + 0x9c)
 #define TX_ID_DWORD1   (PORT_BASE + 0xa0)
 #define TX_ID_DWORD2   (PORT_BASE + 0xa4)
@@ -209,8 +209,8 @@
 #define TXID_AUTO_CT3_MSK  (0x1 << TXID_AUTO_CT3_OFF)
 #define TXID_AUTO_CTB_OFF  11
 #define TXID_AUTO_CTB_MSK  (0x1 << TXID_AUTO_CTB_OFF)
-#define TX_HARDRST_OFF  2
-#define TX_HARDRST_MSK  (0x1 << TX_HARDRST_OFF)
+#define TX_HARDRST_OFF 2
+#define TX_HARDRST_MSK (0x1 << TX_HARDRST_OFF)
 #define RX_IDAF_DWORD0 (PORT_BASE + 0xc4)
 #define RX_IDAF_DWORD1 (PORT_BASE + 0xc8)
 #define RX_IDAF_DWORD2 (PORT_BASE + 0xcc)
@@ -245,13 +245,13 @@
 #define CHL_INT1_MSK   (PORT_BASE + 0x1c4)
 #define CHL_INT2_MSK   (PORT_BASE + 0x1c8)
 #define CHL_INT_COAL_EN(PORT_BASE + 0x1d0)
-#define DMA_TX_DFX0(PORT_BASE + 0x200)
-#define DMA_TX_DFX1(PORT_BASE + 0x204)
+#define DMA_TX_DFX0(PORT_BASE + 0x200)
+#define DMA_TX_DFX1(PORT_BASE + 0x204)
 #define DMA_TX_DFX1_IPTT_OFF   0
 #define DMA_TX_DFX1_IPTT_MSK   (0xffff << DMA_TX_DFX1_IPTT_OFF)
 #define DMA_TX_FIFO_DFX0   (PORT_BASE + 0x240)
-#define PORT_DFX0  (PORT_BASE + 0x258)
-#define LINK_DFX2  (PORT_BASE + 0X264)
+#define PORT_DFX0  (PORT_BASE + 0x258)
+#define LINK_DFX2  (PORT_BASE + 0X264)
 #define LINK_DFX2_RCVR_HOLD_STS_OFF9
 #define LINK_DFX2_RCVR_HOLD_STS_MSK(0x1 << LINK_DFX2_RCVR_HOLD_STS_OFF)
 #define LINK_DFX2_SEND_HOLD_STS_OFF10
@@ -2260,15 +2260,18 @@ static void slot_err_v2_hw(struct hisi_hba *hisi_hba,
case STAT_IO_COMPLETE:
/* internal abort command complete */
ts->stat = TMF_RESP_FUNC_SUCC;
+   del_timer(&slot->internal_abort_timer);
goto out;
case STAT_IO_NO_DEVICE:
ts->stat = TMF_RESP_FUNC_COMPLETE;
+   del_timer(&slot->internal_abort_timer);
g

[PATCH 3/6] scsi: hisi_sas: workaround SoC about abort timeout bug

2017-04-10 Thread John Garry
From: Xiaofei Tan 

This patch adds a workaround for a SoC bug which may cause a fatal
SoC logic error when disabling a PHY. When this happens, internal
abort IO may time out, and the controller IO breakpoint may be
corrupted.

We work around this SoC bug by optimizing the flow of disabling
a PHY.

Signed-off-by: Xiaofei Tan 
Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 127 -
 1 file changed, 126 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index fc8d8296..a1eb88d 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -207,6 +207,8 @@
 #define TXID_AUTO  (PORT_BASE + 0xb8)
 #define TXID_AUTO_CT3_OFF  1
 #define TXID_AUTO_CT3_MSK  (0x1 << TXID_AUTO_CT3_OFF)
+#define TXID_AUTO_CTB_OFF  11
+#define TXID_AUTO_CTB_MSK  (0x1 << TXID_AUTO_CTB_OFF)
 #define TX_HARDRST_OFF  2
 #define TX_HARDRST_MSK  (0x1 << TX_HARDRST_OFF)
 #define RX_IDAF_DWORD0 (PORT_BASE + 0xc4)
@@ -243,9 +245,17 @@
 #define CHL_INT1_MSK   (PORT_BASE + 0x1c4)
 #define CHL_INT2_MSK   (PORT_BASE + 0x1c8)
 #define CHL_INT_COAL_EN(PORT_BASE + 0x1d0)
+#define DMA_TX_DFX0(PORT_BASE + 0x200)
 #define DMA_TX_DFX1(PORT_BASE + 0x204)
 #define DMA_TX_DFX1_IPTT_OFF   0
 #define DMA_TX_DFX1_IPTT_MSK   (0xffff << DMA_TX_DFX1_IPTT_OFF)
+#define DMA_TX_FIFO_DFX0   (PORT_BASE + 0x240)
+#define PORT_DFX0  (PORT_BASE + 0x258)
+#define LINK_DFX2  (PORT_BASE + 0X264)
+#define LINK_DFX2_RCVR_HOLD_STS_OFF9
+#define LINK_DFX2_RCVR_HOLD_STS_MSK(0x1 << LINK_DFX2_RCVR_HOLD_STS_OFF)
+#define LINK_DFX2_SEND_HOLD_STS_OFF10
+#define LINK_DFX2_SEND_HOLD_STS_MSK(0x1 << LINK_DFX2_SEND_HOLD_STS_OFF)
 #define PHY_CTRL_RDY_MSK   (PORT_BASE + 0x2b0)
 #define PHYCTRL_NOT_RDY_MSK(PORT_BASE + 0x2b4)
 #define PHYCTRL_DWS_RESET_MSK  (PORT_BASE + 0x2b8)
@@ -1194,12 +1204,127 @@ static bool is_sata_phy_v2_hw(struct hisi_hba 
*hisi_hba, int phy_no)
return false;
 }
 
+static bool tx_fifo_is_empty_v2_hw(struct hisi_hba *hisi_hba, int phy_no)
+{
+   u32 dfx_val;
+
+   dfx_val = hisi_sas_phy_read32(hisi_hba, phy_no, DMA_TX_DFX1);
+
+   if (dfx_val & BIT(16))
+   return false;
+
+   return true;
+}
+
+static bool axi_bus_is_idle_v2_hw(struct hisi_hba *hisi_hba, int phy_no)
+{
+   int i, max_loop = 1000;
+   struct device *dev = &hisi_hba->pdev->dev;
+   u32 status, axi_status, dfx_val, dfx_tx_val;
+
+   for (i = 0; i < max_loop; i++) {
+   status = hisi_sas_read32_relaxed(hisi_hba,
+   AXI_MASTER_CFG_BASE + AM_CURR_TRANS_RETURN);
+
+   axi_status = hisi_sas_read32(hisi_hba, AXI_CFG);
+   dfx_val = hisi_sas_phy_read32(hisi_hba, phy_no, DMA_TX_DFX1);
+   dfx_tx_val = hisi_sas_phy_read32(hisi_hba,
+   phy_no, DMA_TX_FIFO_DFX0);
+
+   if ((status == 0x3) && (axi_status == 0x0) &&
+   (dfx_val & BIT(20)) && (dfx_tx_val & BIT(10)))
+   return true;
+   udelay(10);
+   }
+   dev_err(dev, "bus is not idle phy%d, axi150:0x%x axi100:0x%x 
port204:0x%x port240:0x%x\n",
+   phy_no, status, axi_status,
+   dfx_val, dfx_tx_val);
+   return false;
+}
+
+static bool wait_io_done_v2_hw(struct hisi_hba *hisi_hba, int phy_no)
+{
+   int i, max_loop = 1000;
+   struct device *dev = &hisi_hba->pdev->dev;
+   u32 status, tx_dfx0;
+
+   for (i = 0; i < max_loop; i++) {
+   status = hisi_sas_phy_read32(hisi_hba, phy_no, LINK_DFX2);
+   status = (status & 0x3fc0) >> 6;
+
+   if (status != 0x1)
+   return true;
+
+   tx_dfx0 = hisi_sas_phy_read32(hisi_hba, phy_no, DMA_TX_DFX0);
+   if ((tx_dfx0 & 0x1ff) == 0x2)
+   return true;
+   udelay(10);
+   }
+   dev_err(dev, "IO not done phy%d, port264:0x%x port200:0x%x\n",
+   phy_no, status, tx_dfx0);
+   return false;
+}
+
+static bool allowed_disable_phy_v2_hw(struct hisi_hba *hisi_hba, int phy_no)
+{
+   if (tx_fifo_is_empty_v2_hw(hisi_hba, phy_no))
+   return true;
+
+   if (!axi_bus_is_idle_v2_hw(hisi_hba, phy_no))
+   return false;
+
+   if (!wait_io_done_v2_hw(hisi_hba, phy_no))
+   return false;
+
+   return true;
+}
+
+
 static void disable_phy_v2_hw(struct hisi_hba *hisi_hba, int phy_no)
 {
-   u32 cfg = hisi_sas_phy_read32(hisi_hba, phy_no, PHY_CFG);
+   u32 cfg, axi_val, dfx0_val, txid_auto;
+   struct de

[PATCH 0/6] hisi_sas: v2 hw SoC bug workarounds

2017-04-10 Thread John Garry
This patchset introduces some v2 hw bug workarounds. Mostly
they are related to SATA devices, but there is also a
workaround for a scenario where an internal abort command may
time out.

The general rule for implementing workarounds was to do it
in the hw layer, as the next hw revision should not
include these bugs, so it is better not to pollute the driver
main layer with workarounds.

This patchset also includes 2 other patches, one for a
possible NULL pointer dereference and another for resetting
the controller on fatal AXI/ECC errors.

John Garry (2):
  scsi: hisi_sas: add v2 hw internal abort timeout workaround
  scsi: hisi_sas: fix NULL deference when TMF timeouts

Xiang Chen (1):
  scsi: hisi_sas: controller reset for multi-bits ECC and AXI fatal
errors

Xiaofei Tan (3):
  scsi: hisi_sas: workaround STP link SoC bug
  scsi: hisi_sas: workaround a SoC SATA IO processing bug
  scsi: hisi_sas: workaround SoC about abort timeout bug

 drivers/scsi/hisi_sas/hisi_sas.h   |   4 +
 drivers/scsi/hisi_sas/hisi_sas_main.c  |  62 ++---
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 413 -
 3 files changed, 396 insertions(+), 83 deletions(-)

-- 
1.9.1



[PATCH 6/6] scsi: hisi_sas: controller reset for multi-bits ECC and AXI fatal errors

2017-04-10 Thread John Garry
From: Xiang Chen 

1-bit ECC errors can be recovered by hw. But for multi-bit
ECC and AXI errors, there is something wrong with the whole
module or system, so try resetting the controller to recover from
those errors instead of calling panic().

Signed-off-by: Xiang Chen 
Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 93 --
 1 file changed, 56 insertions(+), 37 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index cc5675b..89d0959 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -2918,94 +2918,105 @@ static void multi_bit_ecc_error_process_v2_hw(struct 
hisi_hba *hisi_hba,
 
if (irq_value & BIT(SAS_ECC_INTR_DQE_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_DQE_ECC_ADDR);
-   panic("%s: hgc_dqe_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_dqe_accbad_intr (0x%x) found: \
   Ram address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_DQE_ECC_MB_ADDR_MSK) >>
  HGC_DQE_ECC_MB_ADDR_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_IOST_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_IOST_ECC_ADDR);
-   panic("%s: hgc_iost_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_iost_accbad_intr (0x%x) found: \
   Ram address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_IOST_ECC_MB_ADDR_MSK) >>
  HGC_IOST_ECC_MB_ADDR_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_ITCT_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_ITCT_ECC_ADDR);
-   panic("%s: hgc_itct_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_itct_accbad_intr (0x%x) found: \
   Ram address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_ITCT_ECC_MB_ADDR_MSK) >>
  HGC_ITCT_ECC_MB_ADDR_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_IOSTLIST_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_LM_DFX_STATUS2);
-   panic("%s: hgc_iostl_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_iostl_accbad_intr (0x%x) found: \
   memory address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_LM_DFX_STATUS2_IOSTLIST_MSK) >>
  HGC_LM_DFX_STATUS2_IOSTLIST_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_ITCTLIST_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_LM_DFX_STATUS2);
-   panic("%s: hgc_itctl_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_itctl_accbad_intr (0x%x) found: \
   memory address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_LM_DFX_STATUS2_ITCTLIST_MSK) >>
  HGC_LM_DFX_STATUS2_ITCTLIST_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_CQE_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_CQE_ECC_ADDR);
-   panic("%s: hgc_cqe_accbad_intr (0x%x) found: \
+   dev_warn(dev, "hgc_cqe_accbad_intr (0x%x) found: \
   Ram address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_CQE_ECC_MB_ADDR_MSK) >>
  HGC_CQE_ECC_MB_ADDR_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_NCQ_MEM0_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HGC_RXM_DFX_STATUS14);
-   panic("%s: rxm_mem0_accbad_intr (0x%x) found: \
+   dev_warn(dev, "rxm_mem0_accbad_intr (0x%x) found: \
   memory address is 0x%08X\n",
- dev_name(dev), irq_value,
+ irq_value,
  (reg_val & HGC_RXM_DFX_STATUS14_MEM0_MSK) >>
  HGC_RXM_DFX_STATUS14_MEM0_OFF);
+   queue_work(hisi_hba->wq, &hisi_hba->rst_work);
}
 
if (irq_value & BIT(SAS_ECC_INTR_NCQ_MEM1_ECC_MB_OFF)) {
reg_val = hisi_sas_read32(hisi_hba, HG

Re: [PATCH 1/6] scsi: hisi_sas: workaround STP link SoC bug

2017-04-10 Thread Johannes Thumshirn
On Mon, Apr 10, 2017 at 09:21:56PM +0800, John Garry wrote:
> From: Xiaofei Tan 
> 
> After resetting the controller, the process of scanning SATA disks
> attached to an expander may fail occasionally. The issue is that
> the controller can't close the STP link created by target if the
> max link time is 0.
> 
> To workaround this issue, we reject STP link after resetting the
> controller, and change the corresponding PHY to accept STP link
> only after receiving data.
> 
> We do this check in cq interrupt handler. In order not to reduce
> efficiency, we use an variable to control whether we should check
> and change PHY to accept STP link.
> 
> The function phys_reject_stp_links_v2_hw() should be called after
> resetting the controller.
> 
> The solution of another SoC bug "SATA IO timeout", that also uses
> the same register to control STP link, is not effective before
> the PHY accepts STP link.
> 
> Signed-off-by: Xiaofei Tan 
> Signed-off-by: John Garry 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn                                  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 2/6] scsi: hisi_sas: workaround a SoC SATA IO processing bug

2017-04-10 Thread Johannes Thumshirn
On Mon, Apr 10, 2017 at 09:21:57PM +0800, John Garry wrote:
> From: Xiaofei Tan 
> 
> This patch provides a workaround for a SoC bug where SATA IPTTs for
> different devices may conflict.
> 
> The workaround requires the following:
> 1. The SATA device id must be even and not equal to any SAS IPTT.
> 2. A SATA device cannot share the same IPTT with any other SAS or
> SATA device.
> 
> Besides, we shall consider that IPTT value 0 is reserved for another SoC
> bug (STP devices open the link first after a SAS controller reset).
> 
> To sum up, the solution is:
> Each SATA device uses an independent, contiguous range of 32 even IPTTs
> between 64 and 4094, so v2 hw can only support 63 SATA devices.
> All SAS devices (SSP/SMP devices) share the odd IPTT values from 1 to
> 4095.
> 
> Signed-off-by: Xiaofei Tan 
> Signed-off-by: John Garry 
> ---
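The partitioning described in the commit message can be modeled with a few lines of arithmetic. This is only an illustration; none of these names appear in the driver:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the IPTT split described above: each SATA device owns
 * 32 contiguous even IPTTs starting at 64, SAS (SSP/SMP) devices share
 * the odd IPTTs 1..4095, and IPTT 0 stays reserved. All identifiers
 * here are hypothetical. */
enum { SATA_IPTT_BASE = 64, SATA_IPTT_PER_DEV = 32, IPTT_MAX = 4095 };

/* first (lowest) IPTT owned by SATA device dev_idx (0-based) */
static int sata_dev_first_iptt(int dev_idx)
{
	return SATA_IPTT_BASE + dev_idx * 2 * SATA_IPTT_PER_DEV;
}

/* odd, non-zero IPTTs belong to the shared SAS pool */
static bool iptt_is_sas(int iptt)
{
	return iptt >= 1 && iptt <= IPTT_MAX && (iptt & 1);
}

/* how many SATA devices fit before the last even IPTT (4094) is exceeded */
static int max_sata_devices(void)
{
	int n = 0;

	while (sata_dev_first_iptt(n) + 2 * (SATA_IPTT_PER_DEV - 1) <= IPTT_MAX - 1)
		n++;
	return n;
}
```

With these numbers the model yields the 63-device limit stated in the commit message.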

Reviewed-by: Johannes Thumshirn 


Re: [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()

2017-04-10 Thread Jens Axboe
On 04/10/2017 01:12 AM, Christoph Hellwig wrote:
>> +if (msecs == 0)
>> +kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
>> + &hctx->run_work);
>> +else
>> +kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>> + &hctx->delayed_run_work,
>> + msecs_to_jiffies(msecs));
>> +}
> 
> I'd rather make run_work a delayed_work (again) and use
> kblockd_schedule_delayed_work_on with a timeout of zero for the immediate
> run case instead of having two competing work structs.

Yeah that's a good point, it'd have to be an incremental patch at this
point though. Also note that blk_mq_stop_hw_queue() isn't currently
canceling the new ->delayed_run_work; that looks like a bug.

And looking at it, right now we have 3 (three!) work items in the
hardware queue. The two delayed items differ in that one of them only
runs the queue if it was previously stopped, that's it. The non-delayed one
is identical to the delayed variant that doesn't do the stopped check.

I'll send out a patch.
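As a rough sketch of the consolidation being discussed (a toy model, not the kernel patch), a single delayed work item can cover both cases, because an immediate run is just a delayed run with a zero timeout:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; the real kernel code would use
 * struct delayed_work and kblockd_schedule_delayed_work_on(). */
struct toy_delayed_work {
	void (*fn)(void *arg);
	void *arg;
	unsigned long delay;		/* in "jiffies"; 0 means immediate */
};

static int toy_runs;

static void toy_count(void *unused)
{
	(void)unused;
	toy_runs++;
}

/* a real implementation would arm a timer for delay > 0; here we only
 * record the delay and run the zero-delay case inline */
static void toy_schedule_delayed_work(struct toy_delayed_work *dw,
				      unsigned long delay)
{
	dw->delay = delay;
	if (delay == 0)
		dw->fn(dw->arg);
}

/* one entry point serves both cases: an immediate run is simply a
 * delayed run with a zero timeout, so no second work struct is needed */
static void toy_run_hw_queue(struct toy_delayed_work *run_work,
			     unsigned long msecs)
{
	toy_schedule_delayed_work(run_work, msecs);
}
```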

-- 
Jens Axboe



Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-10 Thread Logan Gunthorpe


On 10/04/17 02:29 AM, Sagi Grimberg wrote:
> What you are saying is surprising to me. The switch needs to preserve
> ordering across different switch ports ??

> You are suggesting that there is a *switch-wide* state that tracks
> MemRds never pass MemWrs across all the switch ports? That is a very
> non-trivial statement...

Yes, it is a requirement of the PCIe spec for transactions to be
strongly ordered throughout the fabric so switches must have an internal
state across all ports. Without that, it would be impossible to have PCI
cards work together even if they are using system memory to do so. Also,
I believe, it was done this way to maintain maximum compatibility with
the legacy PCI bus. There is also a relaxed ordering bit that allows
specific transactions to ignore ordering which can help performance.

Obviously this becomes impossible if you have some kind of complex
multi-path fabric.

Logan


[PATCH 3/8] sd: remove write same support

2017-04-10 Thread Christoph Hellwig
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 70 ---
 drivers/scsi/sd_zbc.c |  1 -
 2 files changed, 71 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8cf34a8e3eea..a905802e927e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -878,77 +878,10 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
sdkp->zeroing_mode = SD_ZERO_WRITE;
 
 out:
-   blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
-(logical_block_size >> 9));
blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
 }
 
-/**
- * sd_setup_write_same_cmnd - write the same data to multiple blocks
- * @cmd: command to prepare
- *
- * Will issue either WRITE SAME(10) or WRITE SAME(16) depending on
- * preference indicated by target device.
- **/
-static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
-{
-   struct request *rq = cmd->request;
-   struct scsi_device *sdp = cmd->device;
-   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-   struct bio *bio = rq->bio;
-   sector_t sector = blk_rq_pos(rq);
-   unsigned int nr_sectors = blk_rq_sectors(rq);
-   unsigned int nr_bytes = blk_rq_bytes(rq);
-   int ret;
-
-   if (sdkp->device->no_write_same)
-   return BLKPREP_INVALID;
-
-   BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
-
-   if (sd_is_zoned(sdkp)) {
-   ret = sd_zbc_setup_write_cmnd(cmd);
-   if (ret != BLKPREP_OK)
-   return ret;
-   }
-
-   sector >>= ilog2(sdp->sector_size) - 9;
-   nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-   rq->timeout = SD_WRITE_SAME_TIMEOUT;
-
-   if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = WRITE_SAME_16;
-   put_unaligned_be64(sector, &cmd->cmnd[2]);
-   put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
-   } else {
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = WRITE_SAME;
-   put_unaligned_be32(sector, &cmd->cmnd[2]);
-   put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
-   }
-
-   cmd->transfersize = sdp->sector_size;
-   cmd->allowed = SD_MAX_RETRIES;
-
-   /*
-* For WRITE SAME the data transferred via the DATA OUT buffer is
-* different from the amount of data actually written to the target.
-*
-* We set up __data_len to the amount of data transferred via the
-* DATA OUT buffer so that blk_rq_map_sg sets up the proper S/G list
-* to transfer a single sector of data first, but then reset it to
-* the amount of data to be written right after so that the I/O path
-* knows how much to actually write.
-*/
-   rq->__data_len = sdp->sector_size;
-   ret = scsi_init_io(cmd);
-   rq->__data_len = nr_bytes;
-   return ret;
-}
-
 static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 {
struct request *rq = cmd->request;
@@ -1232,8 +1165,6 @@ static int sd_init_command(struct scsi_cmnd *cmd)
}
case REQ_OP_WRITE_ZEROES:
return sd_setup_write_zeroes_cmnd(cmd);
-   case REQ_OP_WRITE_SAME:
-   return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
return sd_setup_flush_cmnd(cmd);
case REQ_OP_READ:
@@ -1872,7 +1803,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
switch (req_op(req)) {
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
-   case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
good_bytes = blk_rq_bytes(req);
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 1994f7799fce..8af6c9cd30ca 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -330,7 +330,6 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
switch (req_op(rq)) {
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
-   case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
 
/* Unlock the zone */
-- 
2.11.0



[PATCH 6/8] block: remove REQ_OP_WRITE_SAME support

2017-04-10 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 block/bio.c |  3 --
 block/blk-core.c| 11 +-
 block/blk-lib.c | 90 -
 block/blk-merge.c   | 32 
 block/blk-settings.c| 16 
 block/blk-sysfs.c   | 12 --
 include/linux/bio.h |  3 --
 include/linux/blk_types.h   |  4 +-
 include/linux/blkdev.h  | 26 -
 include/trace/events/f2fs.h |  1 -
 kernel/trace/blktrace.c |  1 -
 11 files changed, 2 insertions(+), 197 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f4d207180266..b310e7ef3fbf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -684,9 +684,6 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, 
gfp_t gfp_mask,
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
break;
-   case REQ_OP_WRITE_SAME:
-   bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
-   break;
default:
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
diff --git a/block/blk-core.c b/block/blk-core.c
index 8654aa0cef6d..92336bc8495c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1929,10 +1929,6 @@ generic_make_request_checks(struct bio *bio)
if (!blk_queue_secure_erase(q))
goto not_supported;
break;
-   case REQ_OP_WRITE_SAME:
-   if (!bdev_write_same(bio->bi_bdev))
-   goto not_supported;
-   break;
case REQ_OP_ZONE_REPORT:
case REQ_OP_ZONE_RESET:
if (!bdev_is_zoned(bio->bi_bdev))
@@ -2100,12 +2096,7 @@ blk_qc_t submit_bio(struct bio *bio)
 * go through the normal accounting stuff before submission.
 */
if (bio_has_data(bio)) {
-   unsigned int count;
-
-   if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
-   count = bdev_logical_block_size(bio->bi_bdev) >> 9;
-   else
-   count = bio_sectors(bio);
+   unsigned int count = bio_sectors(bio);
 
if (op_is_write(bio_op(bio))) {
count_vm_events(PGPGOUT, count);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e8caecd71688..57c99b9b3b78 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -131,96 +131,6 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
-/**
- * __blkdev_issue_write_same - generate number of bios with same page
- * @bdev:  target blockdev
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @page:  page containing data to write
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_SAME) with same page.
- */
-static int __blkdev_issue_write_same(struct block_device *bdev, sector_t 
sector,
-   sector_t nr_sects, gfp_t gfp_mask, struct page *page,
-   struct bio **biop)
-{
-   struct request_queue *q = bdev_get_queue(bdev);
-   unsigned int max_write_same_sectors;
-   struct bio *bio = *biop;
-   sector_t bs_mask;
-
-   if (!q)
-   return -ENXIO;
-
-   bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-   if ((sector | nr_sects) & bs_mask)
-   return -EINVAL;
-
-   if (!bdev_write_same(bdev))
-   return -EOPNOTSUPP;
-
-   /* Ensure that max_write_same_sectors doesn't overflow bi_size */
-   max_write_same_sectors = UINT_MAX >> 9;
-
-   while (nr_sects) {
-   bio = next_bio(bio, 1, gfp_mask);
-   bio->bi_iter.bi_sector = sector;
-   bio->bi_bdev = bdev;
-   bio->bi_vcnt = 1;
-   bio->bi_io_vec->bv_page = page;
-   bio->bi_io_vec->bv_offset = 0;
-   bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);
-   bio_set_op_attrs(bio, REQ_OP_WRITE_SAME, 0);
-
-   if (nr_sects > max_write_same_sectors) {
-   bio->bi_iter.bi_size = max_write_same_sectors << 9;
-   nr_sects -= max_write_same_sectors;
-   sector += max_write_same_sectors;
-   } else {
-   bio->bi_iter.bi_size = nr_sects << 9;
-   nr_sects = 0;
-   }
-   cond_resched();
-   }
-
-   *biop = bio;
-   return 0;
-}
-
-/**
- * blkdev_issue_write_same - queue a write same operation
- * @bdev:  target blockdev
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @page:  page containing data
- *
- * Description:
- *Issue a write same request for the sectors in question.

[PATCH 2/8] target: remove iblock WRITE_SAME passthrough support

2017-04-10 Thread Christoph Hellwig
Use the pscsi driver to support arbitrary command passthrough
instead.

Signed-off-by: Christoph Hellwig 
---
 drivers/target/target_core_iblock.c | 34 --
 1 file changed, 34 deletions(-)

diff --git a/drivers/target/target_core_iblock.c 
b/drivers/target/target_core_iblock.c
index d316ed537d59..9da31970a004 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -415,39 +415,8 @@ iblock_execute_unmap(struct se_cmd *cmd, sector_t lba, 
sector_t nolb)
 }
 
 static sense_reason_t
-iblock_execute_write_same_direct(struct block_device *bdev, struct se_cmd *cmd)
-{
-   struct se_device *dev = cmd->se_dev;
-   struct scatterlist *sg = &cmd->t_data_sg[0];
-   struct page *page = NULL;
-   int ret;
-
-   if (sg->offset) {
-   page = alloc_page(GFP_KERNEL);
-   if (!page)
-   return TCM_OUT_OF_RESOURCES;
-   sg_copy_to_buffer(sg, cmd->t_data_nents, page_address(page),
- dev->dev_attrib.block_size);
-   }
-
-   ret = blkdev_issue_write_same(bdev,
-   target_to_linux_sector(dev, cmd->t_task_lba),
-   target_to_linux_sector(dev,
-   sbc_get_write_same_sectors(cmd)),
-   GFP_KERNEL, page ? page : sg_page(sg));
-   if (page)
-   __free_page(page);
-   if (ret)
-   return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
-
-   target_complete_cmd(cmd, GOOD);
-   return 0;
-}
-
-static sense_reason_t
 iblock_execute_write_same(struct se_cmd *cmd)
 {
-   struct block_device *bdev = IBLOCK_DEV(cmd->se_dev)->ibd_bd;
struct iblock_req *ibr;
struct scatterlist *sg;
struct bio *bio;
@@ -472,9 +441,6 @@ iblock_execute_write_same(struct se_cmd *cmd)
return TCM_INVALID_CDB_FIELD;
}
 
-   if (bdev_write_same(bdev))
-   return iblock_execute_write_same_direct(bdev, cmd);
-
ibr = kzalloc(sizeof(struct iblock_req), GFP_KERNEL);
if (!ibr)
goto fail;
-- 
2.11.0



[PATCH 4/8] md: drop WRITE_SAME support

2017-04-10 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/md/linear.c| 1 -
 drivers/md/md.h| 7 ---
 drivers/md/multipath.c | 1 -
 drivers/md/raid0.c | 2 --
 drivers/md/raid1.c | 4 +---
 drivers/md/raid10.c| 1 -
 drivers/md/raid5.c | 1 -
 7 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 377a8a3672e3..da363f5d54b0 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -292,7 +292,6 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)

trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
-   mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1e76d64ce180..d82b11b5ae5a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -703,13 +703,6 @@ static inline void mddev_clear_unsupported_flags(struct 
mddev *mddev,
mddev->flags &= ~unsupported_flags;
 }
 
-static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
-{
-   if (bio_op(bio) == REQ_OP_WRITE_SAME &&
-   !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
-   mddev->queue->limits.max_write_same_sectors = 0;
-}
-
 static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
 {
if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index e95d521d93e9..68d67a404aab 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -138,7 +138,6 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_opf |= REQ_FAILFAST_TRANSPORT;
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
-   mddev_check_writesame(mddev, &mp_bh->bio);
mddev_check_write_zeroes(mddev, &mp_bh->bio);
generic_make_request(&mp_bh->bio);
return;
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ce7a6a56cf73..c094749c11e5 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -382,7 +382,6 @@ static int raid0_run(struct mddev *mddev)
bool discard_supported = false;
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
-   blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
@@ -504,7 +503,6 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)

trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
-   mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b59cc100320a..ac9ef686e625 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,10 +3177,8 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue) {
-   blk_queue_max_write_same_sectors(mddev->queue, 0);
+   if (mddev->queue)
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
-   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 28ec3a93acee..79988908f862 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3748,7 +3748,6 @@ static int raid10_run(struct mddev *mddev)
if (mddev->queue) {
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
-   blk_queue_max_write_same_sectors(mddev->queue, 0);
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2efdb0d67460..04fd6a946825 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7262,7 +7262,6 @@ static int raid5_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  0xfffe * STRIPE_SECTORS);
 
-   blk_queue_max_write_same_sectors(mddev->queue, 0);
blk_queue_max_write_zeroes_sectors(m

[PATCH 5/8] dm: remove write same support

2017-04-10 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-core.h  |  1 -
 drivers/md/dm-io.c| 21 +
 drivers/md/dm-linear.c|  1 -
 drivers/md/dm-mpath.c |  1 -
 drivers/md/dm-rq.c|  3 ---
 drivers/md/dm-stripe.c|  4 +---
 drivers/md/dm-table.c | 29 -
 drivers/md/dm.c   | 23 ---
 include/linux/device-mapper.h |  6 --
 9 files changed, 2 insertions(+), 87 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index fea5bd52ada8..d661801d72e7 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -131,7 +131,6 @@ struct mapped_device {
 void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
-void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 3702e502466d..105e68dabd3e 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -303,7 +303,6 @@ static void do_region(int op, int op_flags, unsigned region,
unsigned num_bvecs;
sector_t remaining = where->count;
struct request_queue *q = bdev_get_queue(where->bdev);
-   unsigned short logical_block_size = queue_logical_block_size(q);
sector_t num_sectors;
unsigned int uninitialized_var(special_cmd_max_sectors);
 
@@ -314,10 +313,7 @@ static void do_region(int op, int op_flags, unsigned 
region,
special_cmd_max_sectors = q->limits.max_discard_sectors;
else if (op == REQ_OP_WRITE_ZEROES)
special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
-   else if (op == REQ_OP_WRITE_SAME)
-   special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
-op == REQ_OP_WRITE_SAME)  &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -336,9 +332,6 @@ static void do_region(int op, int op_flags, unsigned region,
case REQ_OP_WRITE_ZEROES:
num_bvecs = 0;
break;
-   case REQ_OP_WRITE_SAME:
-   num_bvecs = 1;
-   break;
default:
num_bvecs = min_t(int, BIO_MAX_PAGES,
  dm_sector_div_up(remaining, 
(PAGE_SIZE >> SECTOR_SHIFT)));
@@ -355,18 +348,6 @@ static void do_region(int op, int op_flags, unsigned 
region,
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
-   } else if (op == REQ_OP_WRITE_SAME) {
-   /*
-* WRITE SAME only uses a single page.
-*/
-   dp->get_page(dp, &page, &len, &offset);
-   bio_add_page(bio, page, logical_block_size, offset);
-   num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
-   bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
-
-   offset = 0;
-   remaining -= num_sectors;
-   dp->next_page(dp);
} else while (remaining) {
/*
 * Try and add as many pages as possible.
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index e17fd44ceef5..f928f7e9ee4a 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -58,7 +58,6 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
 
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
-   ti->num_write_same_bios = 1;
ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ab55955ed704..ece53947b99d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1102,7 +1102,6 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
 
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
-   ti->num_write_same_bios = 1;
ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index e60f1b6845be..6f8dc99685f2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -299,9 +299,6 @@ static void dm_done(struct request *clone, int error, bool 
mappe

[PATCH 8/8] block: use bio_has_data to check if a bio has bvecs

2017-04-10 Thread Christoph Hellwig
Now that Write Same is gone and discard bios never have a payload we
can simply use bio_has_data as an indicator that the bio has bvecs
that need to be handled.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c |  8 +---
 block/blk-merge.c   |  9 +
 include/linux/bio.h | 21 +
 3 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b310e7ef3fbf..1c9f04c30ba9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,15 +679,9 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, 
gfp_t gfp_mask,
bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   break;
-   default:
+   if (bio_has_data(bio)) {
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
-   break;
}
 
if (bio_integrity(bio_src)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d6c86bfc5722..549d060097f1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -232,16 +232,9 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
struct bio *fbio, *bbio;
struct bvec_iter iter;
 
-   if (!bio)
+   if (!bio || !bio_has_data(bio))
return 0;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   return 0;
-   }
-
fbio = bio;
cluster = blk_queue_cluster(q);
seg_size = 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7a24a1a24967..86bf531f97aa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -178,26 +178,15 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
 {
unsigned segs = 0;
-   struct bio_vec bv;
-   struct bvec_iter iter;
 
-   /*
-* We special case discard/write same/write zeroes, because they
-* interpret bi_size differently:
-*/
+   if (bio_has_data(bio)) {
+   struct bio_vec bv;
+   struct bvec_iter iter;
 
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   return 0;
-   default:
-   break;
+   __bio_for_each_segment(bv, bio, iter, *bvec)
+   segs++;
}
 
-   __bio_for_each_segment(bv, bio, iter, *bvec)
-   segs++;
-
return segs;
 }
 
-- 
2.11.0



[PATCH 7/8] block: remove bio_no_advance_iter

2017-04-10 Thread Christoph Hellwig
Now that we don't have to support the odd Write Same special case
we can simply increment the iter if the bio has data, else just
manipulate bi_size directly.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bio.h | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 96a20afb8575..7a24a1a24967 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -83,13 +83,6 @@ static inline bool bio_has_data(struct bio *bio)
return false;
 }
 
-static inline bool bio_no_advance_iter(struct bio *bio)
-{
-   return bio_op(bio) == REQ_OP_DISCARD ||
-  bio_op(bio) == REQ_OP_SECURE_ERASE ||
-  bio_op(bio) == REQ_OP_WRITE_ZEROES;
-}
-
 static inline bool bio_mergeable(struct bio *bio)
 {
if (bio->bi_opf & REQ_NOMERGE_FLAGS)
@@ -165,10 +158,10 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 {
iter->bi_sector += bytes >> 9;
 
-   if (bio_no_advance_iter(bio))
-   iter->bi_size -= bytes;
-   else
+   if (bio_has_data(bio))
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else
+   iter->bi_size -= bytes;
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
-- 
2.11.0



[PATCH 1/8] drbd: drop REQ_OP_WRITE_SAME support

2017-04-10 Thread Christoph Hellwig
Linux only used it for zeroing, for which we have better methods now.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 28 ++
 drivers/block/drbd/drbd_nl.c   | 60 --
 drivers/block/drbd/drbd_receiver.c | 38 +++-
 drivers/block/drbd/drbd_req.c  |  1 -
 drivers/block/drbd/drbd_worker.c   |  4 ---
 5 files changed, 7 insertions(+), 124 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 84455c365f57..183468e0b959 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -931,7 +931,7 @@ void assign_p_sizes_qlim(struct drbd_device *device, struct 
p_sizes *p, struct r
p->qlim->io_min = cpu_to_be32(queue_io_min(q));
p->qlim->io_opt = cpu_to_be32(queue_io_opt(q));
p->qlim->discard_enabled = blk_queue_discard(q);
-   p->qlim->write_same_capable = 
!!q->limits.max_write_same_sectors;
+   p->qlim->write_same_capable = 0;
} else {
q = device->rq_queue;
p->qlim->physical_block_size = 
cpu_to_be32(queue_physical_block_size(q));
@@ -1610,9 +1610,6 @@ static int _drbd_send_bio(struct drbd_peer_device 
*peer_device, struct bio *bio)
 ? 0 : MSG_MORE);
if (err)
return err;
-   /* REQ_OP_WRITE_SAME has only one segment */
-   if (bio_op(bio) == REQ_OP_WRITE_SAME)
-   break;
}
return 0;
 }
@@ -1631,9 +1628,6 @@ static int _drbd_send_zc_bio(struct drbd_peer_device 
*peer_device, struct bio *b
  bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
if (err)
return err;
-   /* REQ_OP_WRITE_SAME has only one segment */
-   if (bio_op(bio) == REQ_OP_WRITE_SAME)
-   break;
}
return 0;
 }
@@ -1665,7 +1659,6 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
return  (bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0) |
(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
-   (bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
(bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
(bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
else
@@ -1680,7 +1673,6 @@ int drbd_send_dblock(struct drbd_peer_device 
*peer_device, struct drbd_request *
struct drbd_device *device = peer_device->device;
struct drbd_socket *sock;
struct p_data *p;
-   struct p_wsame *wsame = NULL;
void *digest_out;
unsigned int dp_flags = 0;
int digest_size;
@@ -1717,27 +1709,13 @@ int drbd_send_dblock(struct drbd_peer_device 
*peer_device, struct drbd_request *
err = __send_command(peer_device->connection, device->vnr, 
sock, P_TRIM, sizeof(*t), NULL, 0);
goto out;
}
-   if (dp_flags & DP_WSAME) {
-   /* this will only work if DRBD_FF_WSAME is set AND the
-* handshake agreed that all nodes and backend devices are
-* WRITE_SAME capable and agree on logical_block_size */
-   wsame = (struct p_wsame*)p;
-   digest_out = wsame + 1;
-   wsame->size = cpu_to_be32(req->i.size);
-   } else
-   digest_out = p + 1;
+   digest_out = p + 1;
 
/* our digest is still only over the payload.
 * TRIM does not carry any payload. */
if (digest_size)
drbd_csum_bio(peer_device->connection->integrity_tfm, 
req->master_bio, digest_out);
-   if (wsame) {
-   err =
-   __send_command(peer_device->connection, device->vnr, sock, 
P_WSAME,
-  sizeof(*wsame) + digest_size, NULL,
-  bio_iovec(req->master_bio).bv_len);
-   } else
-   err =
+   err =
__send_command(peer_device->connection, device->vnr, sock, 
P_DATA,
   sizeof(*p) + digest_size, NULL, req->i.size);
if (!err) {
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 02255a0d68b9..53aeed040eb4 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1234,65 +1234,6 @@ static void fixup_discard_if_not_supported(struct 
request_queue *q)
}
 }
 
-static void decide_on_write_same_support(struct drbd_device *device,
-   struct request_queue *q,
-   struct request_queue *b, struct o_qlim *o)
-{
-   struct drbd_peer_device *peer_device = first_peer_device(device);
-   struct drbd_connection *connection = peer_devic

RFC: remove REQ_OP_WRITE_SAME

2017-04-10 Thread Christoph Hellwig
Now that we are using REQ_OP_WRITE_ZEROES for all zeroing needs in the
kernel there is very little use left for REQ_OP_WRITE_SAME.  We only
have two callers left, and both just export optional protocol features
to remote systems: DRBD and the target code.

Do we have any major users of those?  If not, removing it will clean up
a few warts in the block layer.

git://git.infradead.org/users/hch/block.git delete-write-same

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/delete-write-same


[PATCH 1/5] scsi: bnx2i: convert to workqueue

2017-04-10 Thread Sebastian Andrzej Siewior
The driver creates its own per-CPU threads which are updated based on CPU
hotplug events. It is also possible to use kworkers and remove some of the
infrastructure to get the same job done while saving a few lines of code.

The DECLARE_PER_CPU() definition is moved into the header file where it
belongs. bnx2i_percpu_io_thread() becomes bnx2i_percpu_io_work() which is
mostly the same code. The outer loop (kthread_should_stop()) gets removed and
the remaining code is shifted to the left.
bnx2i_queue_scsi_cmd_resp() is mostly the same. The code checked ->iothread to
decide if there is an active per-CPU thread. With the kworkers this is no
longer possible nor required.
The allocation of struct bnx2i_work no longer happens with ->p_work_lock
held, which is not required. I am unsure about the call stack, so I can't say
whether this qualifies the allocation for GFP_KERNEL instead of
GFP_ATOMIC (it is not a _bh lock, but as I said, I don't know the context).
The allocation check has been reversed, so the inner if branch is taken on
!bnx2i_work and is just the invocation of one function, since the lock is not
held during allocation. The init of the new bnx2i_work struct is also
done without ->p_work_lock held: it is a new object, nobody
knows about it yet. It should be enough to hold the lock while adding
this item to the list. I am unsure about that atomic_inc() so I keep
things as they were.

The remaining part is the removal of the CPU hotplug notifier, since it is
taken care of by the workqueue code.

This patch was only compile-tested due to -ENODEV.
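The locking discipline described above (allocate and initialize the new object without the lock held, take the lock only to link it into the shared list) can be sketched in plain C; all names here are illustrative, not the driver's:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy version of the bnx2i_work queuing pattern from the commit message.
 * Allocation and initialization happen without the lock held; the lock
 * protects only the list insertion. */
struct toy_work {
	struct toy_work *next;
	int cqe_idx;
};

static pthread_mutex_t toy_work_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_work *toy_work_list;

static int toy_queue_work(int cqe_idx)
{
	/* new object: nobody else can see it yet, so no lock needed here */
	struct toy_work *w = malloc(sizeof(*w));

	if (!w)
		return -1;
	w->cqe_idx = cqe_idx;

	/* hold the lock only around the shared-list update */
	pthread_mutex_lock(&toy_work_lock);
	w->next = toy_work_list;
	toy_work_list = w;
	pthread_mutex_unlock(&toy_work_lock);
	return 0;
}
```

Narrowing the critical section this way keeps lock hold times short and lets the allocation sleep or fail without the lock held.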

Cc: qlogic-storage-upstr...@qlogic.com
Cc: Christoph Hellwig 
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/scsi/bnx2i/bnx2i.h  |  11 ++---
 drivers/scsi/bnx2i/bnx2i_hwi.c  | 101 +-
 drivers/scsi/bnx2i/bnx2i_init.c | 104 +++-
 3 files changed, 53 insertions(+), 163 deletions(-)

diff --git a/drivers/scsi/bnx2i/bnx2i.h b/drivers/scsi/bnx2i/bnx2i.h
index 89ef1a1678d1..78f67542cbd3 100644
--- a/drivers/scsi/bnx2i/bnx2i.h
+++ b/drivers/scsi/bnx2i/bnx2i.h
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -775,12 +774,11 @@ struct bnx2i_work {
 };
 
 struct bnx2i_percpu_s {
-   struct task_struct *iothread;
+   struct work_struct work;
struct list_head work_list;
spinlock_t p_work_lock;
 };
 
-
 /* Global variables */
 extern unsigned int error_mask1, error_mask2;
 extern u64 iscsi_error_mask;
@@ -797,7 +795,7 @@ extern unsigned int rq_size;
 
 extern struct device_attribute *bnx2i_dev_attributes[];
 
-
+DECLARE_PER_CPU(struct bnx2i_percpu_s, bnx2i_percpu);
 
 /*
  * Function Prototypes
@@ -875,8 +873,5 @@ extern void bnx2i_print_active_cmd_queue(struct bnx2i_conn *conn);
 extern void bnx2i_print_xmit_pdu_queue(struct bnx2i_conn *conn);
 extern void bnx2i_print_recv_state(struct bnx2i_conn *conn);
 
-extern int bnx2i_percpu_io_thread(void *arg);
-extern int bnx2i_process_scsi_cmd_resp(struct iscsi_session *session,
-  struct bnx2i_conn *bnx2i_conn,
-  struct cqe *cqe);
+extern void bnx2i_percpu_io_work(struct work_struct *work);
 #endif
diff --git a/drivers/scsi/bnx2i/bnx2i_hwi.c b/drivers/scsi/bnx2i/bnx2i_hwi.c
index 42921dbba927..9be58f6523b3 100644
--- a/drivers/scsi/bnx2i/bnx2i_hwi.c
+++ b/drivers/scsi/bnx2i/bnx2i_hwi.c
@@ -19,8 +19,6 @@
 #include 
 #include "bnx2i.h"
 
-DECLARE_PER_CPU(struct bnx2i_percpu_s, bnx2i_percpu);
-
 /**
  * bnx2i_get_cid_num - get cid from ep
  * @ep:endpoint pointer
@@ -1350,9 +1348,9 @@ int bnx2i_send_fw_iscsi_init_msg(struct bnx2i_hba *hba)
  *
  * process SCSI CMD Response CQE & complete the request to SCSI-ML
  */
-int bnx2i_process_scsi_cmd_resp(struct iscsi_session *session,
-   struct bnx2i_conn *bnx2i_conn,
-   struct cqe *cqe)
+static int bnx2i_process_scsi_cmd_resp(struct iscsi_session *session,
+  struct bnx2i_conn *bnx2i_conn,
+  struct cqe *cqe)
 {
struct iscsi_conn *conn = bnx2i_conn->cls_conn->dd_data;
struct bnx2i_hba *hba = bnx2i_conn->hba;
@@ -1862,45 +1860,37 @@ static void bnx2i_process_cmd_cleanup_resp(struct iscsi_session *session,
 
 
 /**
- * bnx2i_percpu_io_thread - thread per cpu for ios
+ * bnx2i_percpu_io_work - thread per cpu for ios
  *
- * @arg:   ptr to bnx2i_percpu_info structure
+ * @work_s:The work struct
  */
-int bnx2i_percpu_io_thread(void *arg)
+void bnx2i_percpu_io_work(struct work_struct *work_s)
 {
-   struct bnx2i_percpu_s *p = arg;
+   struct bnx2i_percpu_s *p;
struct bnx2i_work *work, *tmp;
LIST_HEAD(work_list);
 
-   set_user_nice(current, MIN_NICE);
+   p = container_of(work_s, struct bnx2i_percpu_s, work);
 
-   while (!kthread_should_stop()) {
-   spin_lock_bh(&p->p_work_lock);
-

[PATCH 5/5] scsi: bnx2fc: convert bnx2fc_l2_rcv_thread() to workqueue

2017-04-10 Thread Sebastian Andrzej Siewior
This is not driven by the hotplug conversation, but while I am at it, this
looks like a good candidate. Converting the thread to a workqueue user
also removes the kthread member from struct fcoe_percpu_s.

This driver uses struct fcoe_percpu_s but it does not need the
crc_eof_page member, only the work item and the fcoe_rx_list, so that
member is removed there.

We had one thread, so we only use the workqueue on the current CPU. If
someone knows how to spread this nicely, it would only require using
schedule_work_on() instead of schedule_work() :)

This patch was only compile-tested due to -ENODEV.

Cc: qlogic-storage-upstr...@qlogic.com
Cc: Christoph Hellwig 
Cc: fcoe-de...@open-fcoe.org
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 59 ---
 include/scsi/libfcoe.h|  1 -
 2 files changed, 12 insertions(+), 48 deletions(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 2f66c2ea093c..ca1b5908114d 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -479,7 +479,7 @@ static int bnx2fc_rcv(struct sk_buff *skb, struct net_device *dev,
 
__skb_queue_tail(&bg->fcoe_rx_list, skb);
if (bg->fcoe_rx_list.qlen == 1)
-   wake_up_process(bg->kthread);
+   schedule_work(&bg->work);
 
spin_unlock(&bg->fcoe_rx_list.lock);
 
@@ -489,26 +489,20 @@ static int bnx2fc_rcv(struct sk_buff *skb, struct net_device *dev,
return -1;
 }
 
-static int bnx2fc_l2_rcv_thread(void *arg)
+static void bnx2fc_l2_rcv_work(struct work_struct *work_s)
 {
-   struct fcoe_percpu_s *bg = arg;
+   struct fcoe_percpu_s *bg;
struct sk_buff *skb;
 
-   set_user_nice(current, MIN_NICE);
-   set_current_state(TASK_INTERRUPTIBLE);
-   while (!kthread_should_stop()) {
-   schedule();
-   spin_lock_bh(&bg->fcoe_rx_list.lock);
-   while ((skb = __skb_dequeue(&bg->fcoe_rx_list)) != NULL) {
-   spin_unlock_bh(&bg->fcoe_rx_list.lock);
-   bnx2fc_recv_frame(skb);
-   spin_lock_bh(&bg->fcoe_rx_list.lock);
-   }
-   __set_current_state(TASK_INTERRUPTIBLE);
+   bg = container_of(work_s, struct fcoe_percpu_s, work);
+
+   spin_lock_bh(&bg->fcoe_rx_list.lock);
+   while ((skb = __skb_dequeue(&bg->fcoe_rx_list)) != NULL) {
spin_unlock_bh(&bg->fcoe_rx_list.lock);
+   bnx2fc_recv_frame(skb);
+   spin_lock_bh(&bg->fcoe_rx_list.lock);
}
-   __set_current_state(TASK_RUNNING);
-   return 0;
+   spin_unlock_bh(&bg->fcoe_rx_list.lock);
 }
 
 
@@ -2564,8 +2558,6 @@ static int bnx2fc_slave_configure(struct scsi_device *sdev)
return 0;
 }
 
-static enum cpuhp_state bnx2fc_online_state;
-
 /**
  * bnx2fc_mod_init - module init entry point
  *
@@ -2575,7 +2567,6 @@ static enum cpuhp_state bnx2fc_online_state;
 static int __init bnx2fc_mod_init(void)
 {
struct fcoe_percpu_s *bg;
-   struct task_struct *l2_thread;
int rc = 0;
unsigned int cpu = 0;
struct bnx2fc_percpu_s *p;
@@ -2608,17 +2599,7 @@ static int __init bnx2fc_mod_init(void)
 
bg = &bnx2fc_global;
skb_queue_head_init(&bg->fcoe_rx_list);
-   l2_thread = kthread_create(bnx2fc_l2_rcv_thread,
-  (void *)bg,
-  "bnx2fc_l2_thread");
-   if (IS_ERR(l2_thread)) {
-   rc = PTR_ERR(l2_thread);
-   goto free_wq;
-   }
-   wake_up_process(l2_thread);
-   spin_lock_bh(&bg->fcoe_rx_list.lock);
-   bg->kthread = l2_thread;
-   spin_unlock_bh(&bg->fcoe_rx_list.lock);
+   INIT_WORK(&bg->work, bnx2fc_l2_rcv_work);
 
for_each_possible_cpu(cpu) {
p = &per_cpu(bnx2fc_percpu, cpu);
@@ -2631,8 +2612,6 @@ static int __init bnx2fc_mod_init(void)
 
return 0;
 
-free_wq:
-   destroy_workqueue(bnx2fc_wq);
 release_bt:
bnx2fc_release_transport();
 detach_ft:
@@ -2645,9 +2624,6 @@ static void __exit bnx2fc_mod_exit(void)
 {
LIST_HEAD(to_be_deleted);
struct bnx2fc_hba *hba, *next;
-   struct fcoe_percpu_s *bg;
-   struct task_struct *l2_thread;
-   struct sk_buff *skb;
unsigned int cpu = 0;
 
/*
@@ -2676,18 +2652,7 @@ static void __exit bnx2fc_mod_exit(void)
}
cnic_unregister_driver(CNIC_ULP_FCOE);
 
-   /* Destroy global thread */
-   bg = &bnx2fc_global;
-   spin_lock_bh(&bg->fcoe_rx_list.lock);
-   l2_thread = bg->kthread;
-   bg->kthread = NULL;
-   while ((skb = __skb_dequeue(&bg->fcoe_rx_list)) != NULL)
-   kfree_skb(skb);
-
-   spin_unlock_bh(&bg->fcoe_rx_list.lock);
-
-   if (l2_thread)
-   kthread_stop(l2_thread);
+   flush_work(&bnx2fc_global.work);
 
for_each_possible_cpu(cpu) 

[PATCH 4/5] scsi: bnx2fc: annotate unlock + release for sparse

2017-04-10 Thread Sebastian Andrzej Siewior
The caller of bnx2fc_abts_cleanup() holds the tgt->tgt_lock lock, and
bnx2fc_abts_cleanup() is expected to release the lock around
wait_for_completion() and reacquire it afterwards.

This patch was only compile-tested due to -ENODEV.

Cc: qlogic-storage-upstr...@qlogic.com
Cc: Christoph Hellwig 
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/scsi/bnx2fc/bnx2fc_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/bnx2fc/bnx2fc_io.c b/drivers/scsi/bnx2fc/bnx2fc_io.c
index 898461b146cc..3eed2453648c 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_io.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_io.c
@@ -1080,6 +1080,8 @@ int bnx2fc_eh_device_reset(struct scsi_cmnd *sc_cmd)
 }
 
 static int bnx2fc_abts_cleanup(struct bnx2fc_cmd *io_req)
+__releases(tgt->tgt_lock)
+__acquires(tgt->tgt_lock)
 {
struct bnx2fc_rport *tgt = io_req->tgt;
int rc = SUCCESS;
-- 
2.11.0



[PATCH 2/5] scsi: bnx2fc: convert per-CPU thread to workqueue

2017-04-10 Thread Sebastian Andrzej Siewior
The driver creates its own per-CPU threads, which are updated based on CPU
hotplug events. It is also possible to use kworkers and remove some of that
infrastructure to get the same job done while saving a few lines of code.

bnx2fc_percpu_io_thread() becomes bnx2fc_percpu_io_work(), which is
mostly the same code. The outer loop (kthread_should_stop()) gets
removed and the remaining code is shifted to the left.
In bnx2fc_process_new_cqes() the code checked ->iothread to
decide whether there is an active per-CPU thread. With the kworkers this
is no longer possible nor required. The allocation of a new work item
(via bnx2fc_alloc_work()) no longer happens with the ->fp_work_lock
lock held. It performs only a memory allocation plus initialization,
which does not require any kind of serialization. The lock is only held
while adding the new member to the fps->work_list list.

The remaining part is the removal of the CPU hotplug notifier, since that
is now taken care of by the workqueue code.

This patch was only compile-tested due to -ENODEV.

Cc: qlogic-storage-upstr...@qlogic.com
Cc: Christoph Hellwig 
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/scsi/bnx2fc/bnx2fc.h  |   2 +-
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 144 ++
 drivers/scsi/bnx2fc/bnx2fc_hwi.c  |  22 +++---
 3 files changed, 33 insertions(+), 135 deletions(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc.h b/drivers/scsi/bnx2fc/bnx2fc.h
index 4fc8ed5fe067..0279cc8de7a0 100644
--- a/drivers/scsi/bnx2fc/bnx2fc.h
+++ b/drivers/scsi/bnx2fc/bnx2fc.h
@@ -168,7 +168,7 @@ extern struct fcoe_percpu_s bnx2fc_global;
 extern struct workqueue_struct *bnx2fc_wq;
 
 struct bnx2fc_percpu_s {
-   struct task_struct *iothread;
+   struct work_struct work;
struct list_head work_list;
spinlock_t fp_work_lock;
 };
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 93b5a0012417..329922d51f8a 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -614,39 +614,32 @@ static void bnx2fc_recv_frame(struct sk_buff *skb)
 }
 
 /**
- * bnx2fc_percpu_io_thread - thread per cpu for ios
+ * bnx2fc_percpu_io_work - work per cpu for ios
  *
- * @arg:   ptr to bnx2fc_percpu_info structure
+ * @work_s:The work struct
  */
-static int bnx2fc_percpu_io_thread(void *arg)
+static void bnx2fc_percpu_io_work(struct work_struct *work_s)
 {
-   struct bnx2fc_percpu_s *p = arg;
+   struct bnx2fc_percpu_s *p;
struct bnx2fc_work *work, *tmp;
LIST_HEAD(work_list);
 
-   set_user_nice(current, MIN_NICE);
-   set_current_state(TASK_INTERRUPTIBLE);
-   while (!kthread_should_stop()) {
-   schedule();
-   spin_lock_bh(&p->fp_work_lock);
-   while (!list_empty(&p->work_list)) {
-   list_splice_init(&p->work_list, &work_list);
-   spin_unlock_bh(&p->fp_work_lock);
+   p = container_of(work_s, struct bnx2fc_percpu_s, work);
 
-   list_for_each_entry_safe(work, tmp, &work_list, list) {
-   list_del_init(&work->list);
-   bnx2fc_process_cq_compl(work->tgt, work->wqe);
-   kfree(work);
-   }
-
-   spin_lock_bh(&p->fp_work_lock);
-   }
-   __set_current_state(TASK_INTERRUPTIBLE);
+   spin_lock_bh(&p->fp_work_lock);
+   while (!list_empty(&p->work_list)) {
+   list_splice_init(&p->work_list, &work_list);
spin_unlock_bh(&p->fp_work_lock);
-   }
-   __set_current_state(TASK_RUNNING);
 
-   return 0;
+   list_for_each_entry_safe(work, tmp, &work_list, list) {
+   list_del_init(&work->list);
+   bnx2fc_process_cq_compl(work->tgt, work->wqe);
+   kfree(work);
+   }
+
+   spin_lock_bh(&p->fp_work_lock);
+   }
+   spin_unlock_bh(&p->fp_work_lock);
 }
 
static struct fc_host_statistics *bnx2fc_get_host_stats(struct Scsi_Host *shost)
@@ -2563,73 +2556,6 @@ static struct fcoe_transport bnx2fc_transport = {
.disable = bnx2fc_disable,
 };
 
-/**
- * bnx2fc_percpu_thread_create - Create a receive thread for an
- *  online CPU
- *
- * @cpu: cpu index for the online cpu
- */
-static void bnx2fc_percpu_thread_create(unsigned int cpu)
-{
-   struct bnx2fc_percpu_s *p;
-   struct task_struct *thread;
-
-   p = &per_cpu(bnx2fc_percpu, cpu);
-
-   thread = kthread_create_on_node(bnx2fc_percpu_io_thread,
-   (void *)p, cpu_to_node(cpu),
-   "bnx2fc_thread/%d", cpu);
-   /* bind thread to the cpu */
-   if (likely(!IS_ERR(thread))) {
-   kthread_bind(thread, cpu);
-   p->iothread = thread;
-   wake_up_process(thread);
-

[REEEEPOST] bnx2i + bnx2fc: convert to generic workqueue (#3)

2017-04-10 Thread Sebastian Andrzej Siewior
This is a repost to get the patches applied against v4.11-rc6. mkp's scsi
for-next tree can be merged with no conflicts.

The last repost [0] was not merged and stalled after Martin pinged Chad
[1]. He didn't even reply after tglx pinged him approx two weeks later.

Johannes Thumshirn was so kind to test the FCoE part of the old series
[2]. I did not add a tested-by tag due to the rebase.

If my memory is correct then my first attempt on this (with hotplug
threads back then) was around December 2015.

The whole series is also available at
 git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/hotplug-staging.git 
scsi/bnx2_v2

[0] http://www.spinics.net/lists/linux-scsi/msg102143.html
[1] http://www.spinics.net/lists/linux-scsi/msg102295.html
[2] http://www.spinics.net/lists/linux-scsi/msg102716.html

Sebastian



[PATCH 3/5] scsi: bnx2fc: clean up header definitions

2017-04-10 Thread Sebastian Andrzej Siewior
- All symbols which are only used within one .c file are marked static
  and removed from the bnx2fc.h file if possible.

- The declaration of bnx2fc_percpu is moved into the header file.

This patch was only compile-tested due to -ENODEV.

Cc: qlogic-storage-upstr...@qlogic.com
Cc: Christoph Hellwig 
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/scsi/bnx2fc/bnx2fc.h  |  5 +
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 19 +--
 drivers/scsi/bnx2fc/bnx2fc_hwi.c  |  6 ++
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc.h b/drivers/scsi/bnx2fc/bnx2fc.h
index 0279cc8de7a0..5b2151e7d894 100644
--- a/drivers/scsi/bnx2fc/bnx2fc.h
+++ b/drivers/scsi/bnx2fc/bnx2fc.h
@@ -172,6 +172,7 @@ struct bnx2fc_percpu_s {
struct list_head work_list;
spinlock_t fp_work_lock;
 };
+DECLARE_PER_CPU(struct bnx2fc_percpu_s, bnx2fc_percpu);
 
 struct bnx2fc_fw_stats {
u64 fc_crc_cnt;
@@ -513,7 +514,6 @@ void bnx2fc_cmd_mgr_free(struct bnx2fc_cmd_mgr *cmgr);
 void bnx2fc_get_link_state(struct bnx2fc_hba *hba);
 char *bnx2fc_get_next_rqe(struct bnx2fc_rport *tgt, u8 num_items);
 void bnx2fc_return_rqe(struct bnx2fc_rport *tgt, u8 num_items);
-int bnx2fc_get_paged_crc_eof(struct sk_buff *skb, int tlen);
 int bnx2fc_send_rrq(struct bnx2fc_cmd *aborted_io_req);
 int bnx2fc_send_adisc(struct bnx2fc_rport *tgt, struct fc_frame *fp);
 int bnx2fc_send_logo(struct bnx2fc_rport *tgt, struct fc_frame *fp);
@@ -537,7 +537,6 @@ void bnx2fc_init_task(struct bnx2fc_cmd *io_req,
 void bnx2fc_add_2_sq(struct bnx2fc_rport *tgt, u16 xid);
 void bnx2fc_ring_doorbell(struct bnx2fc_rport *tgt);
 int bnx2fc_eh_abort(struct scsi_cmnd *sc_cmd);
-int bnx2fc_eh_host_reset(struct scsi_cmnd *sc_cmd);
 int bnx2fc_eh_target_reset(struct scsi_cmnd *sc_cmd);
 int bnx2fc_eh_device_reset(struct scsi_cmnd *sc_cmd);
 void bnx2fc_rport_event_handler(struct fc_lport *lport,
@@ -570,8 +569,6 @@ struct fc_seq *bnx2fc_elsct_send(struct fc_lport *lport, u32 did,
   struct fc_frame *,
   void *),
  void *arg, u32 timeout);
-void bnx2fc_arm_cq(struct bnx2fc_rport *tgt);
-int bnx2fc_process_new_cqes(struct bnx2fc_rport *tgt);
 void bnx2fc_process_cq_compl(struct bnx2fc_rport *tgt, u16 wqe);
 struct bnx2fc_rport *bnx2fc_tgt_lookup(struct fcoe_port *port,
 u32 port_id);
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 329922d51f8a..2f66c2ea093c 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -49,7 +49,7 @@ struct workqueue_struct *bnx2fc_wq;
  * Here the io threads are per cpu but the l2 thread is just one
  */
 struct fcoe_percpu_s bnx2fc_global;
-DEFINE_SPINLOCK(bnx2fc_global_lock);
+static DEFINE_SPINLOCK(bnx2fc_global_lock);
 
 static struct cnic_ulp_ops bnx2fc_cnic_cb;
 static struct libfc_function_template bnx2fc_libfc_fcn_templ;
@@ -107,22 +107,22 @@ MODULE_PARM_DESC(debug_logging,
"\t\t0x10 - fcoe L2 fame related logs.\n"
"\t\t0xff - LOG all messages.");
 
-uint bnx2fc_devloss_tmo;
+static uint bnx2fc_devloss_tmo;
 module_param_named(devloss_tmo, bnx2fc_devloss_tmo, uint, S_IRUGO);
 MODULE_PARM_DESC(devloss_tmo, " Change devloss_tmo for the remote ports "
"attached via bnx2fc.");
 
-uint bnx2fc_max_luns = BNX2FC_MAX_LUN;
+static uint bnx2fc_max_luns = BNX2FC_MAX_LUN;
 module_param_named(max_luns, bnx2fc_max_luns, uint, S_IRUGO);
 MODULE_PARM_DESC(max_luns, " Change the default max_lun per SCSI host. Default 
"
"0x.");
 
-uint bnx2fc_queue_depth;
+static uint bnx2fc_queue_depth;
 module_param_named(queue_depth, bnx2fc_queue_depth, uint, S_IRUGO);
 MODULE_PARM_DESC(queue_depth, " Change the default queue depth of SCSI devices 
"
"attached via bnx2fc.");
 
-uint bnx2fc_log_fka;
+static uint bnx2fc_log_fka;
 module_param_named(log_fka, bnx2fc_log_fka, uint, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(log_fka, " Print message to kernel log when fcoe is "
"initiating a FIP keep alive when debug logging is enabled.");
@@ -167,7 +167,7 @@ static void bnx2fc_clean_rx_queue(struct fc_lport *lp)
spin_unlock_bh(&bg->fcoe_rx_list.lock);
 }
 
-int bnx2fc_get_paged_crc_eof(struct sk_buff *skb, int tlen)
+static int bnx2fc_get_paged_crc_eof(struct sk_buff *skb, int tlen)
 {
int rc;
spin_lock(&bnx2fc_global_lock);
@@ -1395,10 +1395,9 @@ static struct bnx2fc_hba *bnx2fc_hba_create(struct cnic_dev *cnic)
return NULL;
 }
 
-static struct bnx2fc_interface *
-bnx2fc_interface_create(struct bnx2fc_hba *hba,
-   struct net_device *netdev,
-   enum fip_state fip_mode)
+static struct bnx2fc_interface *bnx2fc_interface_create(struct bnx2fc_hba *hba,
+ struct net_device *netdev,
+ 

[PATCH for-4.4 15/16] blk-mq: Avoid memory reclaim when remapping queues

2017-04-10 Thread Sumit Semwal
From: Gabriel Krisman Bertazi 

[ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ]

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8d63c3..0d1af3e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1470,7 +1470,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&tags->page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1496,7 +1496,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
this_order);
if (page)
break;
@@ -1517,7 +1517,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers
 * to additional allocations like via ops->init_request().
 */
-   kmemleak_alloc(p, order_to_size(this_order

Re: [PATCH 0/3] Unblock SCSI devices even if the LLD is being unloaded

2017-04-10 Thread Bart Van Assche
On Fri, 2017-03-17 at 05:54 -0700, James Bottomley wrote:
> but if you want to pursue your approach fixing the
> race with module exit is a requirement.

Hello James,

Sorry that it took so long but I finally found the time to implement and
test an alternative. I will post the patches that implement that new approach.

Bart.

[PATCH v2 3/4] sd: Make synchronize cache upon shutdown asynchronous

2017-04-10 Thread Bart Van Assche
This patch avoids that sd_shutdown() hangs on the SYNCHRONIZE CACHE
command if the block layer queue has been stopped by
scsi_target_block().

Signed-off-by: Bart Van Assche 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
---
 drivers/scsi/sd.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fe0f7997074e..8e98b7684893 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1489,6 +1489,22 @@ static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
return retval;
 }
 
+/*
+ * Issue a SYNCHRONIZE CACHE command asynchronously. Since blk_cleanup_queue()
+ * waits for all commands to finish, __scsi_remove_device() will wait for the
+ * SYNCHRONIZE CACHE command to finish.
+ */
+static int sd_sync_cache_async(struct scsi_disk *sdkp)
+{
+   const struct scsi_device *sdp = sdkp->device;
+   const int timeout = sdp->request_queue->rq_timeout *
+   SD_FLUSH_TIMEOUT_MULTIPLIER;
+   const unsigned char cmd[10] = { SYNCHRONIZE_CACHE };
+
+   return scsi_execute_async(sdp, cmd, DMA_NONE, NULL, 0, timeout,
+ SD_MAX_RETRIES, 0, 0);
+}
+
 static int sd_sync_cache(struct scsi_disk *sdkp)
 {
int retries, res;
@@ -3356,6 +3372,8 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 static void sd_shutdown(struct device *dev)
 {
struct scsi_disk *sdkp = dev_get_drvdata(dev);
+   const bool stop_disk = system_state != SYSTEM_RESTART &&
+   sdkp->device->manage_start_stop;
 
if (!sdkp)
return; /* this can happen */
@@ -3365,10 +3383,13 @@ static void sd_shutdown(struct device *dev)
 
if (sdkp->WCE && sdkp->media_present) {
sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
-   sd_sync_cache(sdkp);
+   if (stop_disk)
+   sd_sync_cache(sdkp);
+   else
+   sd_sync_cache_async(sdkp);
}
 
-   if (system_state != SYSTEM_RESTART && sdkp->device->manage_start_stop) {
+   if (stop_disk) {
sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
sd_start_stop_device(sdkp, 0);
}
-- 
2.12.0



[PATCH v2 1/4] Introduce scsi_start_queue()

2017-04-10 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
---
 drivers/scsi/scsi_lib.c  | 25 +++--
 drivers/scsi/scsi_priv.h |  1 +
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f41e6b84a1bd..4893f2908edf 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2987,6 +2987,20 @@ scsi_internal_device_block(struct scsi_device *sdev, bool wait)
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block);
  
+void scsi_start_queue(struct scsi_device *sdev)
+{
+   struct request_queue *q = sdev->request_queue;
+   unsigned long flags;
+
+   if (q->mq_ops) {
+   blk_mq_start_stopped_hw_queues(q, false);
+   } else {
+   spin_lock_irqsave(q->queue_lock, flags);
+   blk_start_queue(q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+   }
+}
+
 /**
  * scsi_internal_device_unblock - resume a device after a block request
  * @sdev:  device to resume
@@ -3007,9 +3021,6 @@ int
 scsi_internal_device_unblock(struct scsi_device *sdev,
 enum scsi_device_state new_state)
 {
-   struct request_queue *q = sdev->request_queue; 
-   unsigned long flags;
-
/*
 * Try to transition the scsi device to SDEV_RUNNING or one of the
 * offlined states and goose the device queue if successful.
@@ -3027,13 +3038,7 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
 sdev->sdev_state != SDEV_OFFLINE)
return -EINVAL;
 
-   if (q->mq_ops) {
-   blk_mq_start_stopped_hw_queues(q, false);
-   } else {
-   spin_lock_irqsave(q->queue_lock, flags);
-   blk_start_queue(q);
-   spin_unlock_irqrestore(q->queue_lock, flags);
-   }
+   scsi_start_queue(sdev);
 
return 0;
 }
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index f11bd102d6d5..c7629e31a75b 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -89,6 +89,7 @@ extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern void scsi_requeue_run_queue(struct work_struct *work);
 extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
 extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern void scsi_start_queue(struct scsi_device *sdev);
 extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
 extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
 extern int scsi_init_queue(void);
-- 
2.12.0



[PATCH v2 0/4] Avoid that __scsi_remove_device() hangs

2017-04-10 Thread Bart Van Assche
Several weeks ago Israel Rukshin reported that __scsi_remove_device()
hangs while waiting for the SYNCHRONIZE CACHE command submitted by
the sd driver to finish if the block layer queue has been stopped and
does not get restarted. This patch series avoids that hang.

Bart Van Assche (4):
  Introduce scsi_start_queue()
  Introduce scsi_execute_async()
  sd: Make synchronize cache upon shutdown asynchronous
  Avoid that __scsi_remove_device() hangs

 drivers/scsi/scsi_lib.c| 114 ++---
 drivers/scsi/scsi_priv.h   |   1 +
 drivers/scsi/scsi_sysfs.c  |   9 
 drivers/scsi/sd.c  |  25 +-
 include/scsi/scsi_device.h |   4 ++
 5 files changed, 124 insertions(+), 29 deletions(-)

-- 
2.12.0



[PATCH v2 4/4] Avoid that __scsi_remove_device() hangs

2017-04-10 Thread Bart Van Assche
Since scsi_target_unblock() uses starget_for_each_device(), and since
starget_for_each_device() uses scsi_device_get(), which fails once
unloading of the LLD kernel module has started, scsi_target_unblock()
may skip devices that were affected by scsi_target_block(). Ensure that
__scsi_remove_device() does not hang for blocked SCSI devices. This
patch avoids that unloading the ib_srp kernel module can trigger the
following hang:

Call Trace:
 schedule+0x35/0x80
 schedule_timeout+0x237/0x2d0
 io_schedule_timeout+0xa6/0x110
 wait_for_completion_io+0xa3/0x110
 blk_execute_rq+0xdf/0x120
 scsi_execute+0xce/0x150 [scsi_mod]
 scsi_execute_req_flags+0x8f/0xf0 [scsi_mod]
 sd_sync_cache+0xa9/0x190 [sd_mod]
 sd_shutdown+0x6a/0x100 [sd_mod]
 sd_remove+0x64/0xc0 [sd_mod]
 __device_release_driver+0x8d/0x120
 device_release_driver+0x1e/0x30
 bus_remove_device+0xf9/0x170
 device_del+0x127/0x240
 __scsi_remove_device+0xc1/0xd0 [scsi_mod]
 scsi_forget_host+0x57/0x60 [scsi_mod]
 scsi_remove_host+0x72/0x110 [scsi_mod]
 srp_remove_work+0x8b/0x200 [ib_srp]

Reported-by: Israel Rukshin 
Signed-off-by: Bart Van Assche 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
---
 drivers/scsi/scsi_sysfs.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 82dfe07b1d47..e090c35ba6ee 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1299,6 +1299,15 @@ void __scsi_remove_device(struct scsi_device *sdev)
 * device.
 */
scsi_device_set_state(sdev, SDEV_DEL);
+   /*
+* Since scsi_target_unblock() is a no-op after unloading of the SCSI
+* LLD has started, explicitly restart the queue. Do this after the
+* device state has been changed into SDEV_DEL because
+* scsi_prep_state_check() returns BLKPREP_KILL for the SDEV_DEL state.
+* Do this before calling blk_cleanup_queue() to avoid that that
+* function encounters a stopped queue.
+*/
+   scsi_start_queue(sdev);
blk_cleanup_queue(sdev->request_queue);
cancel_work_sync(&sdev->requeue_work);
 
-- 
2.12.0



[PATCH v2 2/4] Introduce scsi_execute_async()

2017-04-10 Thread Bart Van Assche
Move the code for building a SCSI request from scsi_execute()
into scsi_build_rq(). Introduce scsi_execute_async(). This patch
does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: Hannes Reinecke 
---
 drivers/scsi/scsi_lib.c| 89 +-
 include/scsi/scsi_device.h |  4 +++
 2 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 4893f2908edf..b68699cba6a2 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -213,6 +213,73 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
__scsi_queue_insert(cmd, reason, 1);
 }
 
+static struct request *scsi_build_rq(const struct scsi_device *sdev,
+   const unsigned char *cmd, int data_direction, void *buffer,
+   unsigned bufflen, int timeout, int retries, u64 flags,
+   req_flags_t rq_flags)
+{
+   struct request *req;
+   struct scsi_request *rq;
+   int ret;
+
+   req = blk_get_request(sdev->request_queue,
+   data_direction == DMA_TO_DEVICE ?
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   if (IS_ERR(req))
+   return req;
+   rq = scsi_req(req);
+   scsi_req_init(req);
+
+   if (bufflen) {
+   ret = blk_rq_map_kern(sdev->request_queue, req,
+ buffer, bufflen, __GFP_RECLAIM);
+   if (ret) {
+   blk_put_request(req);
+   return ERR_PTR(ret);
+   }
+   }
+
+   rq->cmd_len = COMMAND_SIZE(cmd[0]);
+   memcpy(rq->cmd, cmd, rq->cmd_len);
+   req->retries = retries;
+   req->timeout = timeout;
+   req->cmd_flags |= flags;
+   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+
+   return req;
+}
+
+/**
+ * scsi_execute_async - insert a SCSI request
+ * @sdev:  scsi device
+ * @cmd:   scsi command
+ * @data_direction: data direction
+ * @buffer:data buffer
+ * @bufflen:   length of buffer
+ * @timeout:   request timeout in seconds
+ * @retries:   number of times to retry request
+ * @flags: flags for ->cmd_flags
+ * @rq_flags:  flags for ->rq_flags
+ */
+int scsi_execute_async(const struct scsi_device *sdev,
+   const unsigned char *cmd, int data_direction, void *buffer,
+   unsigned bufflen, int timeout, int retries, u64 flags,
+   req_flags_t rq_flags)
+{
+   struct request *req;
+
+   req = scsi_build_rq(sdev, cmd, data_direction, buffer, bufflen, timeout,
+   retries, flags, rq_flags);
+   if (IS_ERR(req))
+   return PTR_ERR(req);
+   /*
+* head injection *required* here otherwise quiesce won't work
+*/
+   blk_execute_rq_nowait(req->q, NULL, req, 1, NULL);
+
+   return 0;
+}
+EXPORT_SYMBOL(scsi_execute_async);
 
 /**
  * scsi_execute - insert request and wait for the result
@@ -242,24 +309,12 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
-   data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   req = scsi_build_rq(sdev, cmd, data_direction, buffer, bufflen,
+   timeout, retries, flags, rq_flags);
if (IS_ERR(req))
-   return ret;
-   rq = scsi_req(req);
-   scsi_req_init(req);
+   return PTR_ERR(req);
 
-   if (bufflen &&  blk_rq_map_kern(sdev->request_queue, req,
-   buffer, bufflen, __GFP_RECLAIM))
-   goto out;
-
-   rq->cmd_len = COMMAND_SIZE(cmd[0]);
-   memcpy(rq->cmd, cmd, rq->cmd_len);
-   req->retries = retries;
-   req->timeout = timeout;
-   req->cmd_flags |= flags;
-   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+   rq = scsi_req(req);
 
/*
 * head injection *required* here otherwise quiesce won't work
@@ -282,7 +337,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
if (sshdr)
scsi_normalize_sense(rq->sense, rq->sense_len, sshdr);
ret = req->errors;
- out:
+
blk_put_request(req);
 
return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 080c7ce9bae8..cdff28519393 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -408,6 +408,10 @@ extern const char *scsi_device_state_name(enum scsi_device_state);
 extern int scsi_is_sdev_device(const struct device *);
 extern int scsi_is_target_device(const struct device *);
 extern void scsi_sanitize_inquiry_string(unsigned char *s, int len);
+extern int scsi_execute_async(const struct scsi_device *sdev,
+  

tcm_qla2xxx and target mode not stable at all on 4.10+

2017-04-10 Thread Laurence Oberman
Hello

I have had issues with the target mode working since moving to 4.10+.
I am using a qla25xx card at 8Gbit
Latest testing with 4.11 RC6 sees the same issue.

Going back to 4.10.4 I can map targets but when I use my jammer I get into 
other issues.

Its rock solid on 4.9 with the jammer.

I will be going back now and testing from 4.9+ and later and report back but 
wanted
to make the folks on this list aware of it. 

We loop in 

[ 1654.895663] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue ().


Log

[ 1617.374267] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1617.627264] ignoring deprecated emulate_fua_read attribute
[ 1617.654380] ignoring deprecated emulate_dpo attribute
[ 1622.822301] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1627.522678] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1632.209884] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1636.895416] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1642.271975] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1645.478152] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1648.609477] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1650.192484] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1651.771651] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1653.350823] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 1653.552445] qla2xxx [:0a:00.0]-00af:4: Performing ISP error recovery - 
ha=8817c8e0.
[ 1654.802967] qla2xxx [:0a:00.0]-500a:4: LOOP UP detected (4 Gbps).
[ 1654.857970] qla2xxx [:0a:00.0]-5032:4: Invalid completion handle (0) -- 
timed-out.
[ 1654.895663] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue ().
[ 1654.933551] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (0501).
[ 1654.971256] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (5c03).
[ 1655.009369] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (6400).
[ 1655.047874] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (1c01).
[ 1655.086435] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (5834).
[ 1655.124292] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (6407).
[ 1655.162292] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (1c01).
[ 1655.200088] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (0002).
[ 1655.237939] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (5838).
[ 1684.982064] qla2xxx [:0a:00.0]-1119:4: MBX Command timeout for cmd 27, 
iocontrol=8 jiffies=1001521c0 mb[0-3]=[0x4000 0x0 0xc8f1 0xc8] mb7 0x17 
host_status 0x14 hccr 0x40
[ 1685.070674] qla2xxx [:0a:00.0]-d001:4: Firmware dump saved to temp 
buffer (4/c9001bf31000), dump status flags (0x3f).
[ 1685.126059] qla2xxx [:0a:00.0]-8033:4: Unable to reinitialize FCE (256).
[ 1685.172066] qla2xxx [:0a:00.0]-8034:4: Unable to reinitialize EFT (258).
[ 1715.253118] qla2xxx [:0a:00.0]-1015:4: cmd=0x69, waited 30049 msecs
[ 1715.284860] qla2xxx [:0a:00.0]-1119:4: MBX Command timeout for cmd 69, 
iocontrol=0 jiffies=10015981f mb[0-3]=[0x4001 0x4953 0x5020 0x2532] mb7 0x1 
host_status 0x40018002 hccr 0x0
[ 1715.362716] qla2xxx [:0a:00.0]-d009:4: Firmware has been previously 
dumped (c9001bf31000) -- ignoring request.
[ 1715.414674] qla2xxx [:0a:00.0]-101e:4: Mailbox cmd timeout occurred, 
cmd=0x69, mb[0]=0x69. Scheduling ISP abort 
[ 1715.414679] qla2xxx [:0a:00.0]-00af:4: Performing ISP error recovery - 
ha=8817c8e0.
[ 1716.713072] qla2xxx [:0a:00.0]-500a:4: LOOP UP detected (4 Gbps).
[ 1716.803096] qla2xxx [:0a:00.0]-803b:4: Firmware ready  FAILED .
[ 1717.202630] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (1c01).
[ 1717.210468] qla2xxx [:0a:00.1]-00af:6: Performing ISP error recovery - 
ha=8817c88e.
[ 1717.282428] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (5834).
[ 1717.31] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (6407).
[ 1717.359180] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (1c01).
[ 1717.396856] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (0002).
[ 1717.434517] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (5838).
[ 1717.472756] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (0502).
[ 1717.511664] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (4c0c).
[ 1717.549481] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (580c).
[ 1717.587457] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queue (0105).
[ 1717.625668] qla2xxx [:0a:00.0]-5030:4: Error entry - invalid 
handle/queu

Re: [REEEEPOST] bnx2i + bnx2fc: convert to generic workqueue (#3)

2017-04-10 Thread Chad Dupuis

On Mon, 10 Apr 2017, 5:12pm -, Sebastian Andrzej Siewior wrote:

> This is a repost to get the patches applied against v4.11-rc6. mkp's scsi
> for-next tree can be merged with no conflicts.
> 
> The last repost [0] was not merged and stalled after Martin pinged Chad
> [1]. He didn't even reply after tglx pinged him approx two weeks later.
> 
> Johannes Thumshirn was so kind to test the FCoE part of the old series
> [2]. I did not add a tested-by tag due to the rebase.
> 
> If my memory is correct then my first attempt on this (with hotplug
> threads back then) was around December 2015.
> 
> The whole series is also available at
>  git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/hotplug-staging.git 
> scsi/bnx2_v2
> 
> [0] http://www.spinics.net/lists/linux-scsi/msg102143.html
> [1] http://www.spinics.net/lists/linux-scsi/msg102295.html
> [2] http://www.spinics.net/lists/linux-scsi/msg102716.html
> 
> Sebastian
> 

Sebastian, will take a look. 


[PATCH] ibmvscsis: Do not send aborted task response

2017-04-10 Thread Bryant G. Ly
The driver is sending a response to the aborted task response
along with LIO sending the tmr response.
ibmvscsis_tgt does not send the response to the client until
release_cmd time. The reason for this was because if we did it
at queue_status time, then the client would be free to reuse the
tag for that command, but we're still using the tag until the
command is released at release_cmd time, so we chose to delay
sending the response until then. That then caused this issue, because
release_cmd is always called, even if queue_status is not.

SCSI spec says that the initiator that sends the abort task
TM NEVER gets a response to the aborted op and with the current
code it will send a response. Thus this fix will remove that response
if the TAS bit is set.

Cc:  # v4.8+
Signed-off-by: Bryant G. Ly 
---
 drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 60 +---
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
index 4bb5635..f75948a 100644
--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
@@ -1758,33 +1758,42 @@ static void ibmvscsis_send_messages(struct scsi_info *vscsi)
 
if (!(vscsi->flags & RESPONSE_Q_DOWN)) {
list_for_each_entry_safe(cmd, nxt, &vscsi->waiting_rsp, list) {
-   iue = cmd->iue;
+   /*
+* If Task Abort Status Bit is set, then don't send a
+* response.
+*/
+   if (cmd->se_cmd.transport_state & CMD_T_TAS) {
+   list_del(&cmd->list);
+   ibmvscsis_free_cmd_resources(vscsi, cmd);
+   } else {
+   iue = cmd->iue;
 
-   crq->valid = VALID_CMD_RESP_EL;
-   crq->format = cmd->rsp.format;
+   crq->valid = VALID_CMD_RESP_EL;
+   crq->format = cmd->rsp.format;
 
-   if (cmd->flags & CMD_FAST_FAIL)
-   crq->status = VIOSRP_ADAPTER_FAIL;
+   if (cmd->flags & CMD_FAST_FAIL)
+   crq->status = VIOSRP_ADAPTER_FAIL;
 
-   crq->IU_length = cpu_to_be16(cmd->rsp.len);
+   crq->IU_length = cpu_to_be16(cmd->rsp.len);
 
-   rc = h_send_crq(vscsi->dma_dev->unit_address,
-   be64_to_cpu(msg_hi),
-   be64_to_cpu(cmd->rsp.tag));
+   rc = h_send_crq(vscsi->dma_dev->unit_address,
+   be64_to_cpu(msg_hi),
+   be64_to_cpu(cmd->rsp.tag));
 
-   pr_debug("send_messages: cmd %p, tag 0x%llx, rc %ld\n",
-cmd, be64_to_cpu(cmd->rsp.tag), rc);
+   pr_debug("send_messages: cmd %p, tag 0x%llx, rc %ld\n",
+cmd, be64_to_cpu(cmd->rsp.tag), rc);
 
-   /* if all ok free up the command element resources */
-   if (rc == H_SUCCESS) {
-   /* some movement has occurred */
-   vscsi->rsp_q_timer.timer_pops = 0;
-   list_del(&cmd->list);
+   /* if all ok free up the command element resources */
+   if (rc == H_SUCCESS) {
+   /* some movement has occurred */
+   vscsi->rsp_q_timer.timer_pops = 0;
+   list_del(&cmd->list);
 
-   ibmvscsis_free_cmd_resources(vscsi, cmd);
-   } else {
-   srp_snd_msg_failed(vscsi, rc);
-   break;
+   ibmvscsis_free_cmd_resources(vscsi, cmd);
+   } else {
+   srp_snd_msg_failed(vscsi, rc);
+   break;
+   }
}
}
 
@@ -3581,9 +3590,20 @@ static int ibmvscsis_write_pending(struct se_cmd *se_cmd)
 {
struct ibmvscsis_cmd *cmd = container_of(se_cmd, struct ibmvscsis_cmd,
 se_cmd);
+   struct scsi_info *vscsi = cmd->adapter;
struct iu_entry *iue = cmd->iue;
int rc;
 
+   /*
+* If CLIENT_FAILED OR RESPONSE_Q_DOWN, then just return success
+* since LIO can't do anything about it, and we don't want to
+* attempt an srp_transfer_data.
+*/
+   

[Bug 195285] qla2xxx FW immediatly crashing after target start

2017-04-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=195285

himanshu.madh...@cavium.com (himanshu.madh...@qlogic.com) changed:

   What|Removed |Added

 CC||himanshu.madh...@qlogic.com

--- Comment #1 from himanshu.madh...@cavium.com (himanshu.madh...@qlogic.com) 
---
Hi Anthony, 

(In reply to Anthony from comment #0)
> System always become unresposive after target start with messages
> 
> qla2xxx [:07:00.0]-00fb:1: QLogic QLE2564 - PCI-Express Quad Channel 8Gb
> Fibre Channel HBA.
> qla2xxx [:07:00.0]-00fc:1: ISP2532: PCIe (5.0GT/s x8) @ :07:00.0
> hdma+ host#=1 fw=8.06.00 (90d5).
> qla2xxx [:07:00.1]-001a: : MSI-X vector count: 32.
> qla2xxx [:07:00.1]-001d: : Found an ISP2532 irq 103 iobase
> 0xb830c62d5000.
> qla2xxx [:07:00.1]-504b:2: RISC paused -- HCCR=40, Dumping firmware.
> qla2xxx [:07:00.1]-8033:2: Unable to reinitialize FCE (258).
> qla2xxx [:07:00.1]-8034:2: Unable to reinitialize EFT (258).
> qla2xxx [:07:00.1]-00af:2: Performing ISP error recovery -
> ha=88a2624e.
> qla2xxx [:07:00.1]-504b:2: RISC paused -- HCCR=40, Dumping firmware.
> 
> trying - kernel 4.11-rc5
> 
> Apr 07 23:39:58 : [ cut here ]
> Apr 07 23:39:58 : WARNING: CPU: 0 PID: 1468 at lib/dma-debug.c:519
> add_dma_entry+0x176/0x180
> Apr 07 23:39:58 : DMA-API: exceeded 7 overlapping mappings of cacheline
> 0x13e77000
> Apr 07 23:39:58 : Modules linked in: vhost_net vhost tap tun ebtable_filter
> ebtables ip6table_filter ip6_tables tcm_qla2xxx target_core_user uio targ
> Apr 07 23:39:58 :  nvme_core scsi_transport_sas
> Apr 07 23:39:58 : CPU: 0 PID: 1468 Comm: qemu-system-x86 Tainted: GW
> I 4.11.0-0.rc5.git3.1.fc27.x86_64 #1
> Apr 07 23:39:58 : Hardware name: HP ProLiant DL180 G6  , BIOS O20 07/01/2013
> Apr 07 23:39:58 : Call Trace:
> Apr 07 23:39:58 :  dump_stack+0x8e/0xd1
> Apr 07 23:39:58 :  __warn+0xcb/0xf0
> Apr 07 23:39:58 :  warn_slowpath_fmt+0x5a/0x80
> Apr 07 23:39:58 :  ? active_cacheline_read_overlap+0x2e/0x60
> Apr 07 23:39:58 :  add_dma_entry+0x176/0x180
> Apr 07 23:39:58 :  debug_dma_map_sg+0x11a/0x170
> Apr 07 23:39:58 :  nvme_queue_rq+0x513/0x950 [nvme]
> Apr 07 23:39:58 :  blk_mq_try_issue_directly+0xbb/0x110
> Apr 07 23:39:58 :  blk_mq_make_request+0x3a9/0xa70
> Apr 07 23:39:58 :  ? blk_queue_enter+0xa3/0x2c0
> Apr 07 23:39:58 :  ? blk_queue_enter+0x39/0x2c0
> Apr 07 23:39:58 :  ? generic_make_request+0xf9/0x3b0
> Apr 07 23:39:58 :  generic_make_request+0x126/0x3b0
> Apr 07 23:39:58 :  ? iov_iter_get_pages+0xc9/0x330
> Apr 07 23:39:58 :  submit_bio+0x73/0x150
> Apr 07 23:39:58 :  ? submit_bio+0x73/0x150
> Apr 07 23:39:58 :  ? bio_iov_iter_get_pages+0xe0/0x120
> Apr 07 23:39:58 :  blkdev_direct_IO+0x1f7/0x3e0
> Apr 07 23:39:58 :  ? SYSC_io_destroy+0x1d0/0x1d0
> Apr 07 23:39:58 :  ? __atime_needs_update+0x7f/0x1a0
> Apr 07 23:39:58 :  generic_file_read_iter+0x2e5/0xad0
> Apr 07 23:39:58 :  ? generic_file_read_iter+0x2e5/0xad0
> Apr 07 23:39:58 :  ? rw_copy_check_uvector+0x8a/0x180
> Apr 07 23:39:58 :  blkdev_read_iter+0x35/0x40
> Apr 07 23:39:58 :  aio_read+0xeb/0x150
> Apr 07 23:39:58 :  ? sched_clock+0x9/0x10
> Apr 07 23:39:58 :  ? sched_clock_cpu+0x11/0xc0
> Apr 07 23:39:58 :  ? __might_fault+0x3e/0x90
> Apr 07 23:39:58 :  ? __might_fault+0x3e/0x90
> Apr 07 23:39:58 :  do_io_submit+0x5f8/0x920
> Apr 07 23:39:58 :  ? do_io_submit+0x5f8/0x920
> Apr 07 23:39:58 :  SyS_io_submit+0x10/0x20
> Apr 07 23:39:58 :  ? SyS_io_submit+0x10/0x20
> Apr 07 23:39:58 :  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Apr 07 23:39:58 : RIP: 0033:0x7f73766216a7
> Apr 07 23:39:58 : RSP: 002b:7ffc9aac6108 EFLAGS: 0246 ORIG_RAX:
> 00d1
> Apr 07 23:39:58 : RAX: ffda RBX: 55617d90b900 RCX:
> 7f73766216a7
> Apr 07 23:39:58 : RDX: 7ffc9aac6120 RSI: 0002 RDI:
> 7f737780
> Apr 07 23:39:58 : RBP: 0258 R08: 7ffc9aac6440 R09:
> 55617d9a2000
> Apr 07 23:39:58 : R10: 556188f93cf0 R11: 0246 R12:
> 0280
> Apr 07 23:39:58 : R13: 0130 R14: 0001 R15:
> 0011
> Apr 07 23:39:58 : ---[ end trace 81f169903702b67d ]---

We are working internally to reproduce this issue. we'll report what we find
out from reproduction. 

Thanks,
Himanshu

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


Re: [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios

2017-04-10 Thread Anthony Youngman

s/past/paste/

On 05/04/17 15:21, Christoph Hellwig wrote:

Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.


Cheers,
Wol


Race to power off harming SATA SSDs

2017-04-10 Thread Henrique de Moraes Holschuh
Summary:

Linux properly issues the SSD prepare-to-poweroff command to SATA SSDs,
but it does not wait for long enough to ensure the SSD has carried it
through.

This causes a race between the platform power-off path, and the SSD
device.  When the SSD loses the race, its power is cut while it is still
doing its final book-keeping for poweroff.  This is known to be harmful
to most SSDs, and there is a non-zero chance of it even bricking.

Apparently, it is enough to wait a few seconds before powering off the
platform to give the SSDs enough time to fully enter STANDBY IMMEDIATE.

This issue was verified to exist on SATA SSDs made by at least Crucial
(and thus likely also Micron), Intel, and Samsung.  It was verified to
exist on several 3.x to 4.9 kernels, both distro (Debian) and also
upstream stable/longterm kernels from kernel.org.   Only x86-64 was
tested.

A proof of concept patch is attached, which was sufficient to
*completely* avoid the issue on the test set, for a period of six to
eight weeks of testing.


Details and hypothesis:

For a long while I have noticed that S.M.A.R.T-provided attributes for
SSDs related to "unit was powered off unexpectedly" under my care were
rising on several boxes, without any unexpected power cuts being
accounted for.

This has been going for a *long* time (several years, since the first
SSD I got).  But it was too rare an event for me to try to track down
the root cause...  until a friend reported his SSD was already reporting
several hundred unclean power-offs on his laptop.  That made it much
easier to track down.

Per spec (and device manuals), SCSI, SATA and ATA-attached SSDs must be
informed of an imminent poweroff to checkpoint background tasks, flush
RAM caches and close logs.  For SCSI SSDs, you must issue a
START_STOP_UNIT (stop) command.  For SATA, you must issue a STANDBY
IMMEDIATE command.  I haven't checked ATA, but it should be the same as
SATA.

In order to comply with this requirement, the Linux SCSI "sd" device
driver issues a START_STOP_UNIT command when the device is shut down [1].
For SATA SSD devices, the SCSI START_STOP_UNIT command is properly
translated by the kernel SAT layer to STANDBY IMMEDIATE for SSDs.

After issuing the command, the kernel properly waits for the device to
report that the command has been completed before it proceeds.

However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
[often?] only indicates that the device is now switching to the target
power management state, not that it has reached the target state.  Any
further device status inquires would return that it is in STANDBY mode,
even if it is still entering that state.

The kernel then continues the shutdown path while the SSD is still
preparing itself to be powered off, and it becomes a race.  When the
kernel + firmware wins, platform power is cut before the SSD has
finished (i.e. the SSD is subject to an unclean power-off).

Evidently, how often the SSD will lose the race depends on a platform
and SSD combination, and also on how often the system is powered off.
A sluggish firmware that takes its time to cut power can save the day...


Observing the effects:

An unclean SSD power-off will be signaled by the SSD device through an
increase on a specific S.M.A.R.T attribute.  These SMART attributes can
be read using the smartmontools package from www.smartmontools.org,
which should be available in just about every Linux distro.

smartctl -A /dev/sd#

The SMART attribute related to unclean power-off is vendor-specific, so
one might have to track down the SSD datasheet to know which attribute a
particular SSD uses.  The naming of the attribute also varies.

For a Crucial M500 SSD with up-to-date firmware, this would be attribute
174 "Unexpect_Power_Loss_Ct", for example.

NOTE: unclean SSD power-offs are dangerous and may brick the device in
the worst case, or otherwise harm it (reduce longevity, damage flash
blocks).  It is also not impossible to get data corruption.
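The attribute lookup above is easy to script. A hedged sketch follows:
it extracts the unexpected-power-loss counter from `smartctl -A` style
output. The attribute number 174 is the Crucial M500 mapping mentioned
above and is NOT universal, and the sample line here stands in for a
real `smartctl -A /dev/sdX` invocation, so both are assumptions:

```shell
# Sample row in smartctl -A format (stand-in for a real device read;
# attribute 174 is vendor-specific -- check your SSD's datasheet).
sample='174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 23'

# Last column of the matching attribute row is the raw counter value.
count=$(printf '%s\n' "$sample" | awk '$1 == 174 { print $NF }')
echo "unclean power-offs: $count"
```

Run this after every boot and diff the value against the previous run
to catch power-off races as they happen.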


Testing, and working around the issue:

I've asked for several Debian developers to test a patch (attached) in
any of their boxes that had SSDs complaining of unclean poweroffs.  This
gave us a test corpus of Intel, Crucial and Samsung SSDs, on laptops,
desktops, and a few workstations.

The proof-of-concept patch adds a delay of one second to the SD-device
shutdown path.

Previously, the more sensitive devices/platforms in the test set would
report at least one or two unclean SSD power-offs a month.  With the
patch, there was NOT a single increase reported after several weeks of
testing.

This is obviously not a test with 100% confidence, but it indicates very
strongly that the above analysis was correct, and that an added delay
was enough to work around the issue in the entire test set.



Fixing the issue properly:

The proof of concept patch works fine, but it "punishes" the system with
too much delay.  Also, if sd device shutdown is serialized, it will
punish systems with many /dev/

Re: Race to power off harming SATA SSDs

2017-04-10 Thread Bart Van Assche
On Mon, 2017-04-10 at 20:21 -0300, Henrique de Moraes Holschuh wrote:
> A proof of concept patch is attached

Thank you for the very detailed write-up. Sorry but no patch was attached
to the e-mail I received from you ...

Bart.

sd: wait for slow devices on shutdown path

2017-04-10 Thread Henrique de Moraes Holschuh
Author: Henrique de Moraes Holschuh 
Date:   Wed Feb 1 20:42:02 2017 -0200

sd: wait for slow devices on shutdown path

Wait 1s during suspend/shutdown for the device to settle after
we issue the STOP command.

Otherwise we race ATA SSDs to powerdown, possibly causing damage to
FLASH/data and even bricking the device.

This is an experimental patch; there are likely better ways of doing
this that don't punish non-SSDs.

Signed-off-by: Henrique de Moraes Holschuh 

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4e08d1cd..3c6d5d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3230,6 +3230,38 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
res = 0;
}
 
+   /*
+* Wait for slow devices that signal they have fully entered
+* the stopped state before they actually did it.
+*
+* This behavior is apparently allowed per-spec for ATA
+* devices, and our SAT layer does not account for it.
+* Thus, on return, the device might still be in the process
+* of entering STANDBY state.
+*
+* Worse, apparently the ATA spec also says the unit should
+* return that it is already in STANDBY state *while still
+* entering that state*.
+*
+* SSDs absolutely depend on receiving a STANDBY IMMEDIATE
+* command prior to power off for a clean shutdown (and
+* likely we don't want to send them *anything else* in-
+* between either, to be on the safe side).
+*
+* As things stand, we are racing the SSD's firmware.  If it
+* finishes first, nothing bad happens.  If it doesn't, we
+* cut power while it is still saving metadata, and not only
+* this will cause extra FLASH wear (and maybe even damage
+* some cells), it also has a non-zero chance of bricking the
+* SSD.
+*
+* Issue reported on Intel, Crucial and Micron SSDs.
+* Issue can be detected by S.M.A.R.T. signaling unexpected
+* power cuts.
+*/
+   if (!res && !start)
+   msleep(1000);
+
/* SCSI error codes must not go to the generic layer */
if (res)
return -EIO;

-- 
  Henrique Holschuh


Re: Race to power off harming SATA SSDs

2017-04-10 Thread Henrique de Moraes Holschuh
On Mon, 10 Apr 2017, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 20:21 -0300, Henrique de Moraes Holschuh wrote:
> > A proof of concept patch is attached
> 
> Thank you for the very detailed write-up. Sorry but no patch was attached
> to the e-mail I received from you ...

Indeed.  It should arrive shortly, hopefully undamaged.  Otherwise I
will resend with git-send-email :p

-- 
  Henrique Holschuh


Re: Race to power off harming SATA SSDs

2017-04-10 Thread Tejun Heo
Hello,

On Mon, Apr 10, 2017 at 08:21:19PM -0300, Henrique de Moraes Holschuh wrote:
...
> Per spec (and device manuals), SCSI, SATA and ATA-attached SSDs must be
> informed of an imminent poweroff to checkpoint background tasks, flush
> RAM caches and close logs.  For SCSI SSDs, you must issue a
> START_STOP_UNIT (stop) command.  For SATA, you must issue a STANDBY
> IMMEDIATE command.  I haven't checked ATA, but it should be the same as
> SATA.

Yeah, it's the same.  Even hard drives are expected to survive a lot
of unexpected power losses tho.  They have to do emergency head
unloads but they're designed to withstand a healthy number of them.

> In order to comply with this requirement, the Linux SCSI "sd" device
> driver issues a START_STOP_UNIT command when the device is shutdown[1].
> For SATA SSD devices, the SCSI START_STOP_UNIT command is properly
> translated by the kernel SAT layer to STANDBY IMMEDIATE for SSDs.
> 
> After issuing the command, the kernel properly waits for the device to
> report that the command has been completed before it proceeds.
> 
> However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
> [often?] only indicates that the device is now switching to the target
> power management state, not that it has reached the target state.  Any
> further device status inquiries would return that it is in STANDBY mode,
> even if it is still entering that state.
> 
> The kernel then continues the shutdown path while the SSD is still
> preparing itself to be powered off, and it becomes a race.  When the
> kernel + firmware wins, platform power is cut before the SSD has
> finished (i.e. the SSD is subject to an unclean power-off).

At that point, the device is fully flushed and in terms of data
integrity should be fine with losing power at any point anyway.

> Evidently, how often the SSD will lose the race depends on a platform
> and SSD combination, and also on how often the system is powered off.
> A sluggish firmware that takes its time to cut power can save the day...
> 
> 
> Observing the effects:
> 
> An unclean SSD power-off will be signaled by the SSD device through an
> increase on a specific S.M.A.R.T attribute.  These SMART attributes can
> be read using the smartmontools package from www.smartmontools.org,
> which should be available in just about every Linux distro.
> 
>   smartctl -A /dev/sd#
> 
> The SMART attribute related to unclean power-off is vendor-specific, so
> one might have to track down the SSD datasheet to know which attribute a
> particular SSD uses.  The naming of the attribute also varies.
> 
> For a Crucial M500 SSD with up-to-date firmware, this would be attribute
> 174 "Unexpect_Power_Loss_Ct", for example.
> 
> NOTE: unclean SSD power-offs are dangerous and may brick the device in
> the worst case, or otherwise harm it (reduce longevity, damage flash
> blocks).  It is also not impossible to get data corruption.

I get that the incrementing counters might not be pretty but I'm a bit
skeptical about this being an actual issue.  Because if that were
true, the device would be bricking itself from any sort of power
losses be that an actual power loss, battery rundown or hard power off
after crash.

> Testing, and working around the issue:
> 
> I've asked for several Debian developers to test a patch (attached) in
> any of their boxes that had SSDs complaining of unclean poweroffs.  This
> gave us a test corpus of Intel, Crucial and Samsung SSDs, on laptops,
> desktops, and a few workstations.
> 
> The proof-of-concept patch adds a delay of one second to the SD-device
> shutdown path.
> 
> Previously, the more sensitive devices/platforms in the test set would
> report at least one or two unclean SSD power-offs a month.  With the
> patch, there was NOT a single increase reported after several weeks of
> testing.
> 
> This is obviously not a test with 100% confidence, but it indicates very
> strongly that the above analysis was correct, and that an added delay
> was enough to work around the issue in the entire test set.
> 
> 
> 
> Fixing the issue properly:
> 
> The proof of concept patch works fine, but it "punishes" the system with
> too much delay.  Also, if sd device shutdown is serialized, it will
> punish systems with many /dev/sd devices severely.
> 
> 1. The delay needs to happen only once right before powering down for
>hibernation/suspend/power-off.  There is no need to delay per-device
>for platform power off/suspend/hibernate.
> 
> 2. A per-device delay needs to happen before signaling that a device
>can be safely removed when doing controlled hotswap (e.g. when
>deleting the SD device due to a sysfs command).
> 
> I am unsure how much *total* delay would be enough.  Two seconds seems
> like a safe bet.
> 
> Any comments?  Any clues on how to make the delay "smarter" to trigger
> only once during platform shutdown, but still trigger per-device when
> doing per-device hotswapping ?

So, if this is actually an issue, sure, we

Re: [PATCHv6 2/8] sd: Return SUCCESS in sd_eh_action() after device offline

2017-04-10 Thread Bart Van Assche
On Thu, 2017-04-06 at 15:36 +0200, Hannes Reinecke wrote:
> If sd_eh_action() decides to take the device offline there is
> no point in returning FAILED, as taking the device offline
> is the ultimate step in SCSI EH anyway.
> So further escalation via SCSI EH is not likely to make a
> difference and we can as well return SUCCESS.

Reviewed-by: Bart Van Assche 

Re: [PATCHv6 3/8] scsi: always send command aborts

2017-04-10 Thread Bart Van Assche
On Thu, 2017-04-06 at 15:36 +0200, Hannes Reinecke wrote:
> When a command has timed out we always should be sending an
> abort; with the previous code a failed abort might signal
> SCSI EH to start, and all other timed out commands will
> never be aborted, even though they might belong to a
> different ITL nexus.

Reviewed-by: Bart Van Assche 

Re: Race to power off harming SATA SSDs

2017-04-10 Thread James Bottomley
On Tue, 2017-04-11 at 08:52 +0900, Tejun Heo wrote:
[...]
> > Any comments?  Any clues on how to make the delay "smarter" to 
> > trigger only once during platform shutdown, but still trigger per
> > -device when doing per-device hotswapping ?
> 
> So, if this is actually an issue, sure, we can try to work around;
> however, can we first confirm that this has any other consequences
> than a SMART counter being bumped up?  I'm not sure how meaningful
> that is in itself.

Seconded; especially as the proposed patch is way too invasive: we run
single threaded on shutdown and making every disk wait 1s is going to
drive enterprises crazy.  I'm with Tejun: If the device replies GOOD to
SYNCHRONIZE CACHE, that means we're entitled to assume all written data
is safely on non-volatile media and any "essential housekeeping" can be
redone if the power goes away.

James




Re: [PATCHv6 5/8] scsi: make eh_eflags persistent

2017-04-10 Thread Bart Van Assche
On Thu, 2017-04-06 at 15:36 +0200, Hannes Reinecke wrote:
> +is invoked to schedule an asynchrous abort.
 ^^
Sorry that I hadn't noticed this before but if you have to repost this patch
please fix the spelling of this word.

Bart.

Re: [PATCHv6 8/8] scsi: inline command aborts

2017-04-10 Thread Bart Van Assche
On Thu, 2017-04-06 at 15:36 +0200, Hannes Reinecke wrote:
> The block layer always calls the timeout function from a workqueue
> context, so there is no need to have yet another workqueue for
> running command aborts.
> 
> [ ... ]
> @@ -271,10 +266,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>   rtn = host->hostt->eh_timed_out(scmd);
>  
>   if (rtn == BLK_EH_NOT_HANDLED) {
> - if (scsi_abort_command(scmd) != SUCCESS) {
> + int ret;
> +
> + ret = scsi_abort_command(scmd);
> + if (ret == FAILED) {
>   set_host_byte(scmd, DID_TIME_OUT);
>   scsi_eh_scmd_add(scmd);
> - }
> + } else if (ret == FAST_IO_FAIL)
> + rtn = BLK_EH_RESET_TIMER;
>   }

Has this patch been tested with the traditional block layer? For the
traditional block layer scsi_times_out() is called with the queue lock
held. Does this patch cause .eh_abort_handler(), a function that may
sleep, to be called with the queue lock held?

Bart.

[PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira
This patch series resolves a problem in which all paths of a multipath device
became _permanently_ failed after a storage system had moved both controllers
into a _temporarily_ unavailable state (that is SCSI_ACCESS_STATE_UNAVAILABLE).

This happened because once scsi_dh_alua had set the 'pg->state' to that value,
any IO coming to that PG via alua_prep_fn() would be immediately failed there.

It was possible to confirm that IO coming to that PG by another function path
(e.g., SG_IO) would perform normally once that PG's respective storage system
controller had transitioned back to an active state.

- Patch 1 essentially resolves that problem by allowing IO requests coming in
  the SCSI_ACCESS_STATE_UNAVAILABLE to actually proceed in alua_prep_fn(). It
  also schedules a recheck in alua_check_sense() to update pg->state properly.
  The problem/debug test-case is included in its commit message for reference.

- Patch 2 and Patch 3 address uncertainty & potentially incorrect assumptions
  when trying to reconcile the alua: RTPG information in the kernel logs with
  the actual port groups state at a given point in time and to multipath/path
  checkers status/failed/reinstated messages, since scsi_dh_alua could update
  the PG state for the 'other' PG (i.e., not the PG by which the RTPG request
  was sent to) but only present an updated state message for the 'current' PG.

- Patch 4 silences the scsi_dh_alua messages about RTPG state/information for
  the unavailable state if it is no news (i.e., not a transition to/out-of it),
  only keeping the first and (potentially) last message (when it is some news).
  That's because during the period in which the unavailable state is in place,
  the path checkers will naturally have to go through alua_check_sense() path,
  which schedules a recheck and thus alua_rtpg() goes through the sdev_printk.

This patch series has been tested with the 4.11-rc4 kernel.

For documentation purposes, I'll reply to this cover letter with the analysis
of such cases of this problem, and the accompanying messages from kernel logs.

Mauricio Faria de Oliveira (4):
  scsi: scsi_dh_alua: allow I/O in the target port unavailable state
  scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg() sdev_printk
  scsi: scsi_dh_alua: print changes to RTPG state of other PGs too
  scsi: scsi_dh_alua: do not print target port group state if it remains unavailable

 drivers/scsi/device_handler/scsi_dh_alua.c | 99 ++
 1 file changed, 88 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH 1/4] scsi: scsi_dh_alua: allow I/O in the target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira
According to SPC-4 (5.15.2.4.5 Unavailable state), the unavailable
state may (or may not) transition to other states (e.g., microcode
downloading or hardware error, which may be temporary or permanent
conditions, respectively).

But, scsi_dh_alua currently fails the I/O requests early once that
state is established (in alua_prep_fn()), which provides no chance
for path checkers going through that function path to really check
whether the path actually still fails I/O requests or recovered to
an active state.

This might cause device-mapper multipath to fail all paths to a
storage system that moves its controllers to the unavailable state
for firmware upgrades, and never recover, even though the storage
system upgrades one controller at a time and brings each back online.

Then I/O requests are blocked indefinitely due to queue_if_no_path
but the underlying individual paths are fully operational, and can
be verified as such through other function paths (e.g., SG_IO):

# multipath -l
mpatha (360050764008100dac100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 1:0:1:0 sdf 8:80  failed undef running
| `- 2:0:1:0 sdn 8:208 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:0:0 sdb 8:16  failed undef running
  `- 2:0:0:0 sdj 8:144 failed undef running

# strace -e read \
sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
2>&1 | grep 512
read(3, 0x3fff7ba8, 512) = -1 EIO (Input/output error)

# strace -e ioctl \
sg_dd if=/dev/sdj of=/dev/null bs=512 count=1 iflag=direct \
blk_sgio=1 \
2>&1 | grep 512
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[10]=[28, 00, 00, 00,
00, 00, 00, 00, 01, 00], <...>) = 0

So, allow I/O to target port (groups) in the unavailable state, so the
path checkers can actually check them, and schedule a recheck whenever
the unavailable state is detected so pg->state is updated properly
(further SCSI I/O error messages are then silenced through alua_prep_fn()).

Once a path checker eventually detects an active state again, the port
group state will be updated by the path activation call, alua_activate(),
as it schedules an alua_rtpg() check.

Signed-off-by: Mauricio Faria de Oliveira 
Reported-by: Naresh Bannoth 
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index c01b47e5b55a..5e5a33cac951 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -431,6 +431,20 @@ static int alua_check_sense(struct scsi_device *sdev,
alua_check(sdev, false);
return NEEDS_RETRY;
}
+   if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0c) {
+   /*
+* LUN Not Accessible - target port in unavailable state.
+*
+* It may (not) be possible to transition to other states;
+* the transition might take a while or not happen at all,
+* depending on the storage system model, error type, etc.
+*
+* Do not retry, so failover to another target port can occur.
+* Schedule a recheck to update state for other functions.
+*/
+   alua_check(sdev, true);
+   return SUCCESS;
+   }
break;
case UNIT_ATTENTION:
if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) {
@@ -1057,6 +1071,8 @@ static void alua_check(struct scsi_device *sdev, bool force)
  *
  * Fail I/O to all paths not in state
  * active/optimized or active/non-optimized.
+ * Allow I/O to all paths in state unavailable
+ * so path checkers can actually check them.
  */
 static int alua_prep_fn(struct scsi_device *sdev, struct request *req)
 {
@@ -1072,6 +1088,8 @@ static int alua_prep_fn(struct scsi_device *sdev, struct request *req)
rcu_read_unlock();
if (state == SCSI_ACCESS_STATE_TRANSITIONING)
ret = BLKPREP_DEFER;
+   else if (state == SCSI_ACCESS_STATE_UNAVAILABLE)
+   req->rq_flags |= RQF_QUIET;
else if (state != SCSI_ACCESS_STATE_OPTIMAL &&
 state != SCSI_ACCESS_STATE_ACTIVE &&
 state != SCSI_ACCESS_STATE_LBA) {
-- 
1.8.3.1



[PATCH 4/4] scsi: scsi_dh_alua: do not print target port group state if it remains unavailable

2017-04-10 Thread Mauricio Faria de Oliveira
Path checkers will check all paths to a target port group in the
unavailable state more often (as those paths are 'failed'), possibly for
a long or indefinite period of time, or for a large number of paths.

That might end up flooding the kernel log with the scsi_dh_alua target
port group state messages, which are triggered by the rtpg recheck
scheduled in alua_check_sense().

So, modify alua_rtpg() not to print such message when that unavailable
state remains (i.e., the target port group state did not transition to
or from the unavailable state). All other cases continue to be printed.

Signed-off-by: Mauricio Faria de Oliveira 
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index c2c9173fd883..8025e024c011 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -554,6 +554,17 @@ static void alua_rtpg_print(struct scsi_device *sdev, struct alua_port_group *pg
 }
 
 /*
+ * alua_state_remains - Whether an RTPG state remains the same across two values.
+ * @state: the state value to check for.
+ * @old_state: the old state value.
+ * @new_state: the new state value.
+ */
+static bool alua_state_remains(int state, int old_state, int new_state)
+{
+   return ((old_state == state) && (new_state == state));
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -571,6 +582,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
unsigned int tpg_desc_tbl_off;
unsigned char orig_transition_tmo;
unsigned long flags;
+   int orig_state;
 
if (!pg->expiry) {
unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;
@@ -656,6 +668,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
goto retry;
}
 
+   orig_state = pg->state;
+
orig_transition_tmo = pg->transition_tmo;
if ((buff[4] & RTPG_FMT_MASK) == RTPG_FMT_EXT_HDR && buff[5] != 0)
pg->transition_tmo = buff[5];
@@ -689,6 +703,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
!(tmp_pg->flags & ALUA_PG_RUNNING)) {
struct alua_dh_data *h;
struct scsi_device *tmp_pg_sdev = NULL;
+   int tmp_pg_orig_state = tmp_pg->state;
 
tmp_pg->state = desc[0] & 0x0f;
tmp_pg->pref = desc[0] >> 7;
@@ -704,8 +719,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
 * is not running, tmp_pg->rtpg_sdev might be NULL.
 * Use another/one associated scsi_device to print
 * its RTPG information.
+* Don't print it if state remains 'unavailable'.
 */
-   if ((tmp_pg != pg) && !tmp_pg_sdev) {
+   if ((tmp_pg != pg) && !tmp_pg_sdev &&
+   !alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE,
+   tmp_pg_orig_state, tmp_pg->state)) {
tmp_pg_sdev = h->sdev;
alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL);
}
@@ -722,7 +740,11 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
}
 
spin_lock_irqsave(&pg->lock, flags);
-   alua_rtpg_print(sdev, pg, &valid_states);
+
+   /* Print RTPG information (except if state remains 'unavailable'). */
+   if (likely(!alua_state_remains(SCSI_ACCESS_STATE_UNAVAILABLE,
+   orig_state, pg->state)))
+   alua_rtpg_print(sdev, pg, &valid_states);
 
switch (pg->state) {
case SCSI_ACCESS_STATE_TRANSITIONING:
-- 
1.8.3.1



[PATCH 2/4] scsi: scsi_dh_alua: create alua_rtpg_print() for alua_rtpg() sdev_printk

2017-04-10 Thread Mauricio Faria de Oliveira
Factor out the sdev_printk() statement with the RTPG information
in alua_rtpg() into a new function, alua_rtpg_print(), that will
also be used in the following patch.

The only functional difference is that the 'valid_states' value
is now referenced via a pointer, and can be NULL (optional), in
which case the 'valid_states' information is not printed.  That
is for the following patch too.

Signed-off-by: Mauricio Faria de Oliveira 
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 43 ++
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 5e5a33cac951..0d3508f2e285 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -523,6 +523,37 @@ static int alua_tur(struct scsi_device *sdev)
 }
 
 /*
+ * alua_rtpg_print - Print REPORT TARGET GROUP STATES information
+ * @sdev: the device evaluated (or associated with this port group).
+ * @pg: the port group the information is associated with.
+ * @valid_states: pointer to valid_states value.
+ *(optional; e.g., obtained via another port group)
+ *
+ * Must be called with pg->lock held.
+ */
+static void alua_rtpg_print(struct scsi_device *sdev, struct alua_port_group *pg,
+   int *valid_states)
+{
+   if (valid_states)
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred",
+   (*valid_states) & TPGS_SUPPORT_TRANSITION ? 'T' : 't',
+   (*valid_states) & TPGS_SUPPORT_OFFLINE ? 'O' : 'o',
+   (*valid_states) & TPGS_SUPPORT_LBA_DEPENDENT ? 'L' : 'l',
+   (*valid_states) & TPGS_SUPPORT_UNAVAILABLE ? 'U' : 'u',
+   (*valid_states) & TPGS_SUPPORT_STANDBY ? 'S' : 's',
+   (*valid_states) & TPGS_SUPPORT_NONOPTIMIZED ? 'N' : 'n',
+   (*valid_states) & TPGS_SUPPORT_OPTIMIZED  ?  'A' : 'a');
+   else
+   sdev_printk(KERN_INFO, sdev,
+   "%s: port group %02x state %c %s\n",
+   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
+   pg->pref ? "preferred" : "non-preferred");
+}
+
+/*
  * alua_rtpg - Evaluate REPORT TARGET GROUP STATES
  * @sdev: the device to be evaluated.
  *
@@ -679,17 +710,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
}
 
spin_lock_irqsave(&pg->lock, flags);
-   sdev_printk(KERN_INFO, sdev,
-   "%s: port group %02x state %c %s supports %c%c%c%c%c%c%c\n",
-   ALUA_DH_NAME, pg->group_id, print_alua_state(pg->state),
-   pg->pref ? "preferred" : "non-preferred",
-   valid_states&TPGS_SUPPORT_TRANSITION?'T':'t',
-   valid_states&TPGS_SUPPORT_OFFLINE?'O':'o',
-   valid_states&TPGS_SUPPORT_LBA_DEPENDENT?'L':'l',
-   valid_states&TPGS_SUPPORT_UNAVAILABLE?'U':'u',
-   valid_states&TPGS_SUPPORT_STANDBY?'S':'s',
-   valid_states&TPGS_SUPPORT_NONOPTIMIZED?'N':'n',
-   valid_states&TPGS_SUPPORT_OPTIMIZED?'A':'a');
+   alua_rtpg_print(sdev, pg, &valid_states);
 
switch (pg->state) {
case SCSI_ACCESS_STATE_TRANSITIONING:
-- 
1.8.3.1



[PATCH 3/4] scsi: scsi_dh_alua: print changes to RTPG state of other PGs too

2017-04-10 Thread Mauricio Faria de Oliveira
Currently, alua_rtpg() can change the 'state' and 'preferred'
values for the current port group _and_ for other port groups
present in the response buffer/descriptors.

However, it reports such changes _only_ for the current port
group (i.e., only for 'pg' but not for 'tmp_pg').

This might cause uncertainty and confusion when going through
the kernel logs for analyzing/debugging scsi_dh_alua behavior,
which is not helpful during support and development scenarios.

So, print such changes for other port groups than the current
one.

This requires finding a scsi_device to call sdev_printk() with for
that other port group. Using 'tmp_pg->rtpg_sdev' is not an option
because in that 'if' conditional 'tmp_pg' is not in the ALUA_PG_RUNNING
state, so the pointer may be NULL. Instead, the for-loop over the
device-handler structures attached to 'tmp_pg' is used to find a valid
scsi_device associated with that port group.

Signed-off-by: Mauricio Faria de Oliveira 
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 0d3508f2e285..c2c9173fd883 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -688,6 +688,7 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
if ((tmp_pg == pg) ||
!(tmp_pg->flags & ALUA_PG_RUNNING)) {
struct alua_dh_data *h;
+   struct scsi_device *tmp_pg_sdev = NULL;
 
tmp_pg->state = desc[0] & 0x0f;
tmp_pg->pref = desc[0] >> 7;
@@ -697,6 +698,17 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
/* h->sdev should always be valid */
BUG_ON(!h->sdev);
h->sdev->access_state = desc[0];
+
+   /*
+* If tmp_pg is not the running pg, and its worker
+* is not running, tmp_pg->rtpg_sdev might be NULL.
+* Use another/one associated scsi_device to print
+* its RTPG information.
+*/
+   if ((tmp_pg != pg) && !tmp_pg_sdev) {
+   tmp_pg_sdev = h->sdev;
+   alua_rtpg_print(tmp_pg_sdev, tmp_pg, NULL);
+   }
}
rcu_read_unlock();
}
-- 
1.8.3.1



Re: [PATCH 0/4] scsi: scsi_dh_alua: handle target port unavailable state

2017-04-10 Thread Mauricio Faria de Oliveira

On 04/10/2017 10:17 PM, Mauricio Faria de Oliveira wrote:

For documentation purposes, I'll reply to this cover letter with the analysis
of such cases of this problem, and the accompanying messages from kernel logs.


Here it goes, for anyone interested.
Scenario: 4 LUNs, 2 target port groups (PGs), 2 paths per PG.

mpatha (360050764008100dac100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw

|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:1:0 sdf 8:80  active undef running
| `- 2:0:1:0 sdn 8:208 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:0:0 sdb 8:16  active undef running
  `- 2:0:0:0 sdj 8:144 active undef running

This LUN's active port group becomes Unavailable.
The ongoing I/O requests sense it via SCSI command errors.

	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: [sdf] tag#0 Add. Sense: Logical unit not accessible, target port in unavailable state
	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:1:0: [sdn] tag#0 Add. Sense: Logical unit not accessible, target port in unavailable state


Accordingly, dm-mpath fails both paths:

	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:80.
	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:208.


The SCSI command error has gone through alua_check_sense().
It scheduled an alua_rtpg(), which confirms its port group
(for 1:0:1:0) and the other (for 2:0:0:0) are unavailable.

	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state U non-preferred
	Apr  6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state U non-preferred supports tolusna


Failing over, dm-mpath switched path groups to the other/enabled port group.


Notice that further SCSI command errors _happen_, but are not _reported_,
as alua_rtpg() has set pg->state = unavailable before more I/O requests
eventually reached alua_prep_fn(), which sets them with RQF_QUIET.

Accordingly, dm-mpath fails both paths:

	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:144.
	Apr  6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:16.


The path checkers are running, and constantly receiving senses of the
unavailable state (seen through debug messages), and alua_rtpg() rechecks
are scheduled, but no repeated messages for 'state U' are printed, as it
has not changed at all.

After a few minutes, both port groups become active
(optimized/non-optimized), and dm-mpath reinstates all paths:

	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:16.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:80.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:144.
	Apr  6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath: Reinstating path 8:208.


The dm-mpath reinstate calls alua_activate() -> alua_rtpg() for both
paths in the active port group.
The first alua_rtpg() worker is still running when the second
alua_rtpg() is scheduled, so pg->rtpg_sdev does not change, and it is
reported as 1:0:1:0 twice (but called for 1:0:1:0 and 2:0:1:0).


	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state N non-preferred
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state A non-preferred supports tolusna
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01 state N non-preferred
	Apr  6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00 state A non-preferred supports tolusna




...


mpathb (360050764008100dac101) dm-1 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler' 
hwhandler='1 alua' wp=rw

|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:0:1 sdc 8:32  active undef running
| `- 2:0:0:1 sdk 8:160 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 1:0:1:1 sdg 8:96  active undef running
  `- 2:0:1:1 sdo 8:224 active undef running


Similarly,

	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: [sdk] tag#2 Add. Sense: Logical unit not accessible, target port in unavailable state
	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:0:1: [sdc] tag#5 Add. Sense: Logical unit not accessible, target port in unavailable state

	Apr  5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:160.
	Apr  5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing path 8:32.


In this case alua_rtpg() does not print messages for the 'other' port group
(without 'supports'), because the other PG worker was running at the time.
There was an alua_rtpg() scheduled due to alua_check_sense() for 2:0:0:1,
and another for alua_activate() of 1:0:1:1 (from the 'enabled' path group).

	Apr  5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00 state U non-preferred supports tolusna
	Apr  5 19:5

Re: Race to power off harming SATA SSDs

2017-04-10 Thread Henrique de Moraes Holschuh
On Tue, 11 Apr 2017, Tejun Heo wrote:
> > The kernel then continues the shutdown path while the SSD is still
> > preparing itself to be powered off, and it becomes a race.  When the
> > kernel + firmware wins, platform power is cut before the SSD has
> > finished (i.e. the SSD is subject to an unclean power-off).
> 
> At that point, the device is fully flushed and in terms of data
> integrity should be fine with losing power at any point anyway.

All bets are off at this point, really.

We issued a command that explicitly orders the SSD to checkpoint and
stop all background tasks, and flush *everything* including invisible
state (device data, stats, logs, translation tables, flash metadata,
etc)...  and then cut its power before it finished.

> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue.  Because if that were

As an *example* I know of because I tracked it personally, Crucial SSDs
models from a few years ago were known to eventually brick on any
platforms where they were being subject to repeated unclean shutdowns,
*Windows included*.  There are some threads on their forums about it.
Firmware revisions made it harder to happen, but still...

> true, the device would be bricking itself from any sort of power
> losses be that an actual power loss, battery rundown or hard power off
> after crash.

Bricking is a worst-case, really.  I guess they learned to keep the
device always in a will-not-brick state using append-only logs for
critical state or something, so it really takes very nasty flash damage
to exactly the wrong place to render it unusable.

> > Fixing the issue properly:
> > 
> > The proof of concept patch works fine, but it "punishes" the system with
> > too much delay.  Also, if sd device shutdown is serialized, it will
> > punish systems with many /dev/sd devices severely.
> > 
> > 1. The delay needs to happen only once right before powering down for
> >hibernation/suspend/power-off.  There is no need to delay per-device
> >for platform power off/suspend/hibernate.
> > 
> > 2. A per-device delay needs to happen before signaling that a device
> >can be safely removed when doing controlled hotswap (e.g. when
> >deleting the SD device due to a sysfs command).
> > 
> > I am unsure how much *total* delay would be enough.  Two seconds seems
> > like a safe bet.
> > 
> > Any comments?  Any clues on how to make the delay "smarter" to trigger
> > only once during platform shutdown, but still trigger per-device when
> > doing per-device hotswapping ?
> 
> So, if this is actually an issue, sure, we can try to work around;
> however, can we first confirm that this has any other consequences
> than a SMART counter being bumped up?  I'm not sure how meaningful
> that is in itself.

I have no idea how to confirm an SSD is being either less, or more
damaged by the "STANDBY-IMMEDIATE and cut power too early", when
compared with "sudden power cut".  At least not without actually
damaging the SSDs using three groups (normal power cuts,
STANDBY-IMMEDIATE + power cut, control group).

A "SSD power cut test" search on duckduckgo shows several papers and
testing reports on the first results page.  I don't think there is any
doubt whatsoever that your typical consumer SSD *can* get damaged by a
"sudden power cut" so badly that it is actually noticed by the user.

That FLASH itself gets damaged or can have stored data corrupted by
power cuts at bad times is quite clear:

http://cseweb.ucsd.edu/users/swanson/papers/DAC2011PowerCut.pdf

SSDs do a lot of work to recover from that without data loss.  You won't
notice it easily unless that recovery work *fails*.

-- 
  Henrique Holschuh


Re: Race to power off harming SATA SSDs

2017-04-10 Thread Henrique de Moraes Holschuh
On Mon, 10 Apr 2017, James Bottomley wrote:
> On Tue, 2017-04-11 at 08:52 +0900, Tejun Heo wrote:
> [...]
> > > Any comments?  Any clues on how to make the delay "smarter" to
> > > trigger only once during platform shutdown, but still trigger
> > > per-device when doing per-device hotswapping ?
> > 
> > So, if this is actually an issue, sure, we can try to work around;
> > however, can we first confirm that this has any other consequences
> > than a SMART counter being bumped up?  I'm not sure how meaningful
> > that is in itself.
> 
> Seconded; especially as the proposed patch is way too invasive: we run

It is a proof of concept thing.  It even says so in the patch commit
log, and in the cover text.

I don't want a one-second delay per device.  I never proposed that,
either.  In fact, I *specifically* asked for something else in the
paragraph you quoted.

I would much prefer a one- or two-second delay per platform *power
off*.  And that's for platforms that do ACPI-like heavy-duty S3/S4/S5
like x86/x86-64.  Opportunistic high-frequency suspend on mobile likely
requires no such handling.

The per-device delay would be needed only for hotplug removal (device
delete), and that's just because some hardware powers down bays (like
older thinkpads with ATA-compatible bays, and some industrial systems).

-- 
  Henrique Holschuh


[PATCH] ipr: do not set DID_PASSTHROUGH on CHECK CONDITION

2017-04-10 Thread Mauricio Faria de Oliveira
On a dual controller setup with multipath enabled, some MEDIUM ERRORs
caused both paths to be failed, thus I/O got queued/blocked since the
'queue_if_no_path' feature is enabled by default on IPR controllers.

This example disabled 'queue_if_no_path' so the I/O failure is seen at
the sg_dd program.  Notice that after the sg_dd test-case, both paths
are in 'failed' state, and both path/priority groups are in 'enabled'
state (not 'active') -- which would block I/O with 'queue_if_no_path'.

# sg_dd if=/dev/dm-2 bs=4096 count=1 dio=1 verbose=4 blk_sgio=0
<...>
read(unix): count=4096, res=-1
sg_dd: reading, skip=0 : Input/output error
<...>

# dmesg
[...] sd 2:2:16:0: [sds] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[...] sd 2:2:16:0: [sds] Sense Key : Medium Error [current]
[...] sd 2:2:16:0: [sds] Add. Sense: Unrecovered read error - recommend rewrite the data
[...] sd 2:2:16:0: [sds] CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
[...] blk_update_request: I/O error, dev sds, sector 0
[...] device-mapper: multipath: Failing path 65:32.
<...>
[...] device-mapper: multipath: Failing path 65:224.

# multipath -l
1IBM_IPR-0_59C2AE001F80 dm-2 IBM ,IPR-0   59C2AE00
size=5.2T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 2:2:16:0 sds  65:32  failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdae 65:224 failed undef running

This is not the desired behavior. The dm-multipath code explicitly checks
for the MEDIUM ERROR case (and a few others) so as not to fail the path
(e.g., I/O to other sectors could potentially happen without problems).
See dm-mpath.c :: do_end_io_bio() -> noretry_error() !->! fail_path().

The problem trace is:

1) ipr_scsi_done()  // SENSE KEY/CHECK CONDITION detected, go to..
2) ipr_erp_start()  // ipr_is_gscsi() and masked_ioasc OK, go to..
3) ipr_gen_sense()  // masked_ioasc is IPR_IOASC_MED_DO_NOT_REALLOC,
// so set DID_PASSTHROUGH.

4) scsi_decide_disposition()  // check for DID_PASSTHROUGH and return
  // early on, faking a DID_OK.. *instead*
  // of reaching scsi_check_sense().

  // Had it reached the latter, that would
  // set host_byte to DID_MEDIUM_ERROR.

5) scsi_finish_command()
6) scsi_io_completion()
7) __scsi_error_from_host_byte()  // That would be converted to -ENODATA
<...>
8) dm_softirq_done()
9) multipath_end_io()
10) do_end_io()
11) noretry_error()  // And that is checked in dm-mpath :: noretry_error()
 // which would cause fail_path() not to be called.

With this patch applied, the I/O is failed but the paths are not.  This
multipath device continues accepting more I/O requests without blocking.
(and notice the different host byte/driver byte handling per SCSI layer).

# dmesg
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] sd 2:2:7:0: [sdaf] Done: SUCCESS Result: hostbyte=0x13 driverbyte=DRIVER_OK
[...] sd 2:2:7:0: [sdaf] CDB: Read(10) 28 00 00 00 00 00 00 00 10 00
[...] sd 2:2:7:0: [sdaf] Sense Key : Medium Error [current]
[...] sd 2:2:7:0: [sdaf] Add. Sense: Unrecovered read error - recommend rewrite the data
[...] blk_update_request: critical medium error, dev sdaf, sector 0
[...] blk_update_request: critical medium error, dev dm-6, sector 0
[...] Buffer I/O error on dev dm-6, logical block 0, async page read

# multipath -l 1IBM_IPR-0_59C2AE001F80
1IBM_IPR-0_59C2AE001F80 dm-6 IBM ,IPR-0   59C2AE00
size=5.2T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 2:2:7:0  sdaf 65:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 1:2:7:0  sdh  8:112  active undef running

Signed-off-by: Mauricio Faria de Oliveira 
---

P.S.:  The switch-case checking this condition predates linux.git, so I'd
   like to ask Brian for a review too, as I imagine he probably might
   have worked with this back then (if memory may serve one this well :-)

 drivers/scsi/ipr.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index b29afafc2885..1012674d9dc5 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -6293,7 +6293,12 @@ static void ipr_erp_start(struct ipr_ioa_cfg *ioa_cfg,
break;
case IPR_