[PATCH] libsas: fix "sysfs group not found" warnings at port teardown time
Praveen reports: --- After some debugging this is what I have found sas_phye_loss_of_signal gets triggered on phy_event from mvsas sas_phye_loss_of_signal calls sas_deform_port sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev) sas_deform_port calls sas_port_delete sas_port_delete calls sas_port_delete_link sysfs_remove_group: kobject 'port-X:Y' sas_port_delete calls device_del sysfs_remove_group: kobject 'port-X:Y' sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT) sas_destruct_devices calls sas_rphy_delete sas_rphy_delete calls scsi_remove_device scsi_remove_device calls __scsi_remove_device __scsi_remove_device calls bsg_unregister_queue bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0' Since X:0:0:0 falls under port-X:Y (which got deleted during sas_port_delete), this call results in the warning. All the later warnings in the dmesg output I sent earlier are trying to delete objects under port-X:Y. Since port-X:Y got recursively deleted, all these calls result in warnings. Since, the PHY and DISC events are processed in two different work queues (and one triggers the other), is there any way other than checking if the object exists in sysfs (in device_del) before deleting? [ cut here ] WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0() sysfs group 818b97e0 not found for kobject '2:0:4:0' [..] CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: PW O 3.16.7-ckt9-logicube-ng.3 #1 Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015 Workqueue: scsi_wq_2 sas_destruct_devices [libsas] 0009 8151cd18 88011b35bcd8 810687b7 88011a661400 88011b35bd28 8800c6e5e968 88028810 8800c89f2c00 8106881c 81733b68 0028 Call Trace: [] ? dump_stack+0x41/0x51 [] ? warn_slowpath_common+0x77/0x90 [] ? warn_slowpath_fmt+0x4c/0x50 [] ? device_del+0x40/0x1c0 [] ? device_unregister+0x1a/0x70 [] ? bsg_unregister_queue+0x5e/0xb0 [] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod] --- It appears we've always been double deleting the devices below sas_port, but recent sysfs changes now exposes this problem. Libsas should delete all the devices from rphy down before deleting the parent port. Cc: Reported-by: Praveen Murali Tested-by: Praveen Murali Signed-off-by: Dan Williams --- drivers/scsi/libsas/sas_discover.c |6 +++--- drivers/scsi/libsas/sas_port.c |1 - 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de66252fa2..a4db770fe8b0 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct *work) clear_bit(DISCE_DESTRUCT, &port->disc.pending); list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) { + struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent); + list_del_init(&dev->disco_list_node); sas_remove_children(&dev->rphy->dev); sas_rphy_delete(dev->rphy); sas_unregister_common_dev(port, dev); + sas_port_delete(sas_port); } } @@ -400,9 +403,6 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone) list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node) sas_unregister_dev(port, dev); - - port->port->rphy = NULL; - } void sas_device_set_phy(struct domain_device *dev, struct sas_port *port) diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c index d3c5297c6c89..9a25ae3a52a4 100644 --- a/drivers/scsi/libsas/sas_port.c +++ b/drivers/scsi/libsas/sas_port.c @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int gone) if (port->num_phys == 1) { sas_unregister_domain_devices(port, gone); - sas_port_delete(port->port); port->port = NULL; } else { sas_port_delete_phy(port->port, phy->phy); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] libsas: fix "sysfs group not found" warnings at port teardown time
Praveen reports: After some debugging this is what I have found sas_phye_loss_of_signal gets triggered on phy_event from mvsas sas_phye_loss_of_signal calls sas_deform_port sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev) sas_deform_port calls sas_port_delete sas_port_delete calls sas_port_delete_link sysfs_remove_group: kobject 'port-X:Y' sas_port_delete calls device_del sysfs_remove_group: kobject 'port-X:Y' sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT) sas_destruct_devices calls sas_rphy_delete sas_rphy_delete calls scsi_remove_device scsi_remove_device calls __scsi_remove_device __scsi_remove_device calls bsg_unregister_queue bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0' Since X:0:0:0 falls under port-X:Y (which got deleted during sas_port_delete), this call results in the warning. All the later warnings in the dmesg output I sent earlier are trying to delete objects under port-X:Y. Since port-X:Y got recursively deleted, all these calls result in warnings. Since, the PHY and DISC events are processed in two different work queues (and one triggers the other), is there any way other than checking if the object exists in sysfs (in device_del) before deleting? [ cut here ] WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0() sysfs group 818b97e0 not found for kobject '2:0:4:0' [..] CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: PW O 3.16.7-ckt9-logicube-ng.3 #1 Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015 Workqueue: scsi_wq_2 sas_destruct_devices [libsas] 0009 8151cd18 88011b35bcd8 810687b7 88011a661400 88011b35bd28 8800c6e5e968 88028810 8800c89f2c00 8106881c 81733b68 0028 Call Trace: [] ? dump_stack+0x41/0x51 [] ? warn_slowpath_common+0x77/0x90 [] ? warn_slowpath_fmt+0x4c/0x50 [] ? device_del+0x40/0x1c0 [] ? device_unregister+0x1a/0x70 [] ? bsg_unregister_queue+0x5e/0xb0 [] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod] It appears we've always been double deleting the devices below sas_port, but recent sysfs changes now exposes this problem. Libsas should delete all the devices from rphy down before deleting the parent port. Cc: Reported-by: Praveen Murali Tested-by: Praveen Murali Signed-off-by: Dan Williams --- v2: drop the "---" separators that will confuse git-am. Thanks Luis! drivers/scsi/libsas/sas_discover.c |6 +++--- drivers/scsi/libsas/sas_port.c |1 - 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de66252fa2..a4db770fe8b0 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct *work) clear_bit(DISCE_DESTRUCT, &port->disc.pending); list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) { + struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent); + list_del_init(&dev->disco_list_node); sas_remove_children(&dev->rphy->dev); sas_rphy_delete(dev->rphy); sas_unregister_common_dev(port, dev); + sas_port_delete(sas_port); } } @@ -400,9 +403,6 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone) list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node) sas_unregister_dev(port, dev); - - port->port->rphy = NULL; - } void sas_device_set_phy(struct domain_device *dev, struct sas_port *port) diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c index d3c5297c6c89..9a25ae3a52a4 100644 --- a/drivers/scsi/libsas/sas_port.c +++ b/drivers/scsi/libsas/sas_port.c @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int gone) if (port->num_phys == 1) { sas_unregister_domain_devices(port, gone); - sas_port_delete(port->port); port->port = NULL; } else { sas_port_delete_phy(port->port, phy->phy); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
Praveen reports: After some debugging this is what I have found sas_phye_loss_of_signal gets triggered on phy_event from mvsas sas_phye_loss_of_signal calls sas_deform_port sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev) sas_deform_port calls sas_port_delete sas_port_delete calls sas_port_delete_link sysfs_remove_group: kobject 'port-X:Y' sas_port_delete calls device_del sysfs_remove_group: kobject 'port-X:Y' sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT) sas_destruct_devices calls sas_rphy_delete sas_rphy_delete calls scsi_remove_device scsi_remove_device calls __scsi_remove_device __scsi_remove_device calls bsg_unregister_queue bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0' Since X:0:0:0 falls under port-X:Y (which got deleted during sas_port_delete), this call results in the warning. All the later warnings in the dmesg output I sent earlier are trying to delete objects under port-X:Y. Since port-X:Y got recursively deleted, all these calls result in warnings. Since, the PHY and DISC events are processed in two different work queues (and one triggers the other), is there any way other than checking if the object exists in sysfs (in device_del) before deleting? WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0() sysfs group 818b97e0 not found for kobject '2:0:4:0' [..] CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: PW O 3.16.7-ckt9-logicube-ng.3 #1 Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015 Workqueue: scsi_wq_2 sas_destruct_devices [libsas] 0009 8151cd18 88011b35bcd8 810687b7 88011a661400 88011b35bd28 8800c6e5e968 88028810 8800c89f2c00 8106881c 81733b68 0028 Call Trace: [] ? dump_stack+0x41/0x51 [] ? warn_slowpath_common+0x77/0x90 [] ? warn_slowpath_fmt+0x4c/0x50 [] ? device_del+0x40/0x1c0 [] ? device_unregister+0x1a/0x70 [] ? bsg_unregister_queue+0x5e/0xb0 [] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod] It appears we've always been double deleting the devices below sas_port, but recent sysfs changes now exposes this problem. Libsas should delete all the devices from rphy down before deleting the parent port. Cc: Reported-by: Praveen Murali Tested-by: Praveen Murali Signed-off-by: Dan Williams --- drivers/scsi/libsas/sas_discover.c |6 +++--- drivers/scsi/libsas/sas_port.c |1 - 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de66252fa2..a4db770fe8b0 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct *work) clear_bit(DISCE_DESTRUCT, &port->disc.pending); list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) { + struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent); + list_del_init(&dev->disco_list_node); sas_remove_children(&dev->rphy->dev); sas_rphy_delete(dev->rphy); sas_unregister_common_dev(port, dev); + sas_port_delete(sas_port); } } @@ -400,9 +403,6 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone) list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node) sas_unregister_dev(port, dev); - - port->port->rphy = NULL; - } void sas_device_set_phy(struct domain_device *dev, struct sas_port *port) diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c index d3c5297c6c89..9a25ae3a52a4 100644 --- a/drivers/scsi/libsas/sas_port.c +++ b/drivers/scsi/libsas/sas_port.c @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int gone) if (port->num_phys == 1) { sas_unregister_domain_devices(port, gone); - sas_port_delete(port->port); port->port = NULL; } else { sas_port_delete_phy(port->port, phy->phy); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
Praveen reports: After some debugging this is what I have found sas_phye_loss_of_signal gets triggered on phy_event from mvsas sas_phye_loss_of_signal calls sas_deform_port sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev) sas_deform_port calls sas_port_delete sas_port_delete calls sas_port_delete_link sysfs_remove_group: kobject 'port-X:Y' sas_port_delete calls device_del sysfs_remove_group: kobject 'port-X:Y' sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT) sas_destruct_devices calls sas_rphy_delete sas_rphy_delete calls scsi_remove_device scsi_remove_device calls __scsi_remove_device __scsi_remove_device calls bsg_unregister_queue bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0' Since X:0:0:0 falls under port-X:Y (which got deleted during sas_port_delete), this call results in the warning. All the later warnings in the dmesg output I sent earlier are trying to delete objects under port-X:Y. Since port-X:Y got recursively deleted, all these calls result in warnings. Since, the PHY and DISC events are processed in two different work queues (and one triggers the other), is there any way other than checking if the object exists in sysfs (in device_del) before deleting? WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0() sysfs group 818b97e0 not found for kobject '2:0:4:0' [..] CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: PW O 3.16.7-ckt9-logicube-ng.3 #1 Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015 Workqueue: scsi_wq_2 sas_destruct_devices [libsas] 0009 8151cd18 88011b35bcd8 810687b7 88011a661400 88011b35bd28 8800c6e5e968 88028810 8800c89f2c00 8106881c 81733b68 0028 Call Trace: [] ? dump_stack+0x41/0x51 [] ? warn_slowpath_common+0x77/0x90 [] ? warn_slowpath_fmt+0x4c/0x50 [] ? device_del+0x40/0x1c0 [] ? device_unregister+0x1a/0x70 [] ? bsg_unregister_queue+0x5e/0xb0 [] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod] It appears we've always been double deleting the devices below sas_port, but recent sysfs changes now exposes this problem. Libsas should delete all the devices from rphy down before deleting the parent port. Cc: Reported-by: Praveen Murali Tested-by: Praveen Murali Reviewed-by: Hannes Reinecke Signed-off-by: Dan Williams --- Hi Andrew, Can you take this through -mm. It's been on linux-scsi for a couple months, and Hannes has reviewed it. drivers/scsi/libsas/sas_discover.c |6 +++--- drivers/scsi/libsas/sas_port.c |1 - 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de66252fa2..a4db770fe8b0 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct *work) clear_bit(DISCE_DESTRUCT, &port->disc.pending); list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) { + struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent); + list_del_init(&dev->disco_list_node); sas_remove_children(&dev->rphy->dev); sas_rphy_delete(dev->rphy); sas_unregister_common_dev(port, dev); + sas_port_delete(sas_port); } } @@ -400,9 +403,6 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone) list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node) sas_unregister_dev(port, dev); - - port->port->rphy = NULL; - } void sas_device_set_phy(struct domain_device *dev, struct sas_port *port) diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c index d3c5297c6c89..9a25ae3a52a4 100644 --- a/drivers/scsi/libsas/sas_port.c +++ b/drivers/scsi/libsas/sas_port.c @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int gone) if (port->num_phys == 1) { sas_unregister_domain_devices(port, gone); - sas_port_delete(port->port); port->port = NULL; } else { sas_port_delete_phy(port->port, phy->phy); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
On Wed, Jul 22, 2015 at 11:28 AM, James Bottomley wrote: > On Wed, 2015-06-17 at 23:22 -0400, Dan Williams wrote: >> Praveen reports: >> >> After some debugging this is what I have found >> >> sas_phye_loss_of_signal gets triggered on phy_event from mvsas >> sas_phye_loss_of_signal calls sas_deform_port >> sas_deform_port posts a DISCE_DESTRUCT event >> (sas_unregister_domain_devices-> sas_unregister_dev) >> sas_deform_port calls sas_port_delete >> sas_port_delete calls sas_port_delete_link >> sysfs_remove_group: kobject 'port-X:Y' >> sas_port_delete calls device_del >> sysfs_remove_group: kobject 'port-X:Y' >> >> sas_destruct_devices gets triggered for the destruct event >> (DISCE_DESTRUCT) >> sas_destruct_devices calls sas_rphy_delete >> sas_rphy_delete calls scsi_remove_device >> scsi_remove_device calls __scsi_remove_device >> __scsi_remove_device calls bsg_unregister_queue >> bsg_unregister_queue -> device_unregister >> -> device_del -> sysfs_remove_group: kobject 'X:0:0:0' >> >> Since X:0:0:0 falls under port-X:Y (which got deleted during >> sas_port_delete), this call results in the warning. All the later >> warnings in the dmesg output I sent earlier are trying to delete objects >> under port-X:Y. Since port-X:Y got recursively deleted, all these calls >> result in warnings. Since, the PHY and DISC events are processed in two >> different work queues (and one triggers the other), is there any way >> other than checking if the object exists in sysfs (in device_del) before >> deleting? >> >> WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0() >> sysfs group 818b97e0 not found for kobject '2:0:4:0' >> [..] >> CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: PW O >> 3.16.7-ckt9-logicube-ng.3 #1 >> Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, >> BIOS 4.6.5 01/23/2015 >> Workqueue: scsi_wq_2 sas_destruct_devices [libsas] >> 0009 8151cd18 88011b35bcd8 810687b7 >> 88011a661400 88011b35bd28 8800c6e5e968 88028810 >> 8800c89f2c00 8106881c 81733b68 0028 >> Call Trace: >> [] ? dump_stack+0x41/0x51 >> [] ? warn_slowpath_common+0x77/0x90 >> [] ? warn_slowpath_fmt+0x4c/0x50 >> [] ? device_del+0x40/0x1c0 >> [] ? device_unregister+0x1a/0x70 >> [] ? bsg_unregister_queue+0x5e/0xb0 >> [] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod] >> >> It appears we've always been double deleting the devices below sas_port, >> but recent sysfs changes now exposes this problem. Libsas should delete >> all the devices from rphy down before deleting the parent port. > > There's a missing description of the fix here. > > So we make the DISCE_DESTROY event delete the port as well as > all the underlying devices > ? "We make DISCE_DESTROY responsible for deleting the child devices as well as the port." == "Libsas should delete all the devices from rphy down before deleting the parent port." >> Cc: >> Reported-by: Praveen Murali >> Tested-by: Praveen Murali >> Signed-off-by: Dan Williams >> --- >> drivers/scsi/libsas/sas_discover.c |6 +++--- >> drivers/scsi/libsas/sas_port.c |1 - >> 2 files changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/scsi/libsas/sas_discover.c >> b/drivers/scsi/libsas/sas_discover.c >> index 60de66252fa2..a4db770fe8b0 100644 >> --- a/drivers/scsi/libsas/sas_discover.c >> +++ b/drivers/scsi/libsas/sas_discover.c >> @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct >> *work) >> clear_bit(DISCE_DESTRUCT, &port->disc.pending); >> >> list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) >> { >> + struct sas_port *sas_port = >> dev_to_sas_port(dev->rphy->dev.parent); >> + > > Do you need this? isn't what you've elaborately got here as sas_port, > simply port->port? Yes, it's just an elaborate workaround since port->port is already torn down. > Assuming you don't NULL that out (see below
Re: [for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
On Mon, Jul 27, 2015 at 8:17 AM, James Bottomley wrote: > On Wed, 2015-07-22 at 14:08 -0700, Dan Williams wrote: >> On Wed, Jul 22, 2015 at 11:28 AM, James Bottomley >> wrote: >> > >> >> } >> >> >> >> void sas_device_set_phy(struct domain_device *dev, struct sas_port *port) >> >> diff --git a/drivers/scsi/libsas/sas_port.c >> >> b/drivers/scsi/libsas/sas_port.c >> >> index d3c5297c6c89..9a25ae3a52a4 100644 >> >> --- a/drivers/scsi/libsas/sas_port.c >> >> +++ b/drivers/scsi/libsas/sas_port.c >> >> @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int >> >> gone) >> >> >> >> if (port->num_phys == 1) { >> >> sas_unregister_domain_devices(port, gone); >> >> - sas_port_delete(port->port); >> >> port->port = NULL; >> >> } else { >> >> sas_port_delete_phy(port->port, phy->phy); >> >> >> > >> > This should become >> > >> > if (port->num_phys == 1) >> > sas_unregister_domain_device(port, gone); >> > >> > sas_port_delete_phy(port->port, phy->phy); >> > >> > So we end up with a port scheduled for destruction with no phys rather >> > than making the last phy association hang around until the DISCE >> > workqueue runs. >> >> Sounds ok in theory. > > It's not really a choice. The specific problem you've introduced with > this patch is failure to cope with link flutter: a deform and form event > queued sequentially. In the new scheme you're trying to introduce, the > destruct event gets queued from the deform but behind the form and the > link flutter results in a dead link. I thought just forcing a zero phy > port would fix this, but it won't, either the destruct has to run in the > context of the deform event or the form has to be queued later than the > destruct. I think coupled with the changes above, there needs to be > > if (port->port) { > /* dying port, requeue form event */ > resend the PORTE_BYTES_DMAED event > return > } > > inside the unmatched port loop in sas_port_form() if nothing is found as > well to close this. I think it's too late. Once the lldd has triggered libsas to start tear down I seem to recall the lldd has the expectation that a new PORTE_BYTES_DMAED triggers the creation of a new port instance for that phy. Once the flutter reaches libsas the race is already lost and the port needs to be torn down, but I would need to take a closer look. >> I don't have a libsas environment handy, I worked with Praveen to >> validate the version as submitted if you want to re-work it. > > A couple of days ago, this was so urgent as to have to go outside the > usual patch process ... now it's not important enough for you to bother > working on it; which is it? Neither, it was a reviewed patch that was idling in the process. I'm still of the opinion that pinging Andrew in a case like this *is* the expected process, unless there's a place I can check that a patch is still in the application queue? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
On Mon, Jul 27, 2015 at 10:11 AM, James Bottomley wrote: > On Mon, 2015-07-27 at 08:48 -0700, Dan Williams wrote: >> On Mon, Jul 27, 2015 at 8:17 AM, James Bottomley >> wrote: >> > On Wed, 2015-07-22 at 14:08 -0700, Dan Williams wrote: >> >> On Wed, Jul 22, 2015 at 11:28 AM, James Bottomley >> >> wrote: >> >> > >> >> >> } >> >> >> >> >> >> void sas_device_set_phy(struct domain_device *dev, struct sas_port >> >> >> *port) >> >> >> diff --git a/drivers/scsi/libsas/sas_port.c >> >> >> b/drivers/scsi/libsas/sas_port.c >> >> >> index d3c5297c6c89..9a25ae3a52a4 100644 >> >> >> --- a/drivers/scsi/libsas/sas_port.c >> >> >> +++ b/drivers/scsi/libsas/sas_port.c >> >> >> @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int >> >> >> gone) >> >> >> >> >> >> if (port->num_phys == 1) { >> >> >> sas_unregister_domain_devices(port, gone); >> >> >> - sas_port_delete(port->port); >> >> >> port->port = NULL; >> >> >> } else { >> >> >> sas_port_delete_phy(port->port, phy->phy); >> >> >> >> >> > >> >> > This should become >> >> > >> >> > if (port->num_phys == 1) >> >> > sas_unregister_domain_device(port, gone); >> >> > >> >> > sas_port_delete_phy(port->port, phy->phy); >> >> > >> >> > So we end up with a port scheduled for destruction with no phys rather >> >> > than making the last phy association hang around until the DISCE >> >> > workqueue runs. >> >> >> >> Sounds ok in theory. >> > >> > It's not really a choice. The specific problem you've introduced with >> > this patch is failure to cope with link flutter: a deform and form event >> > queued sequentially. In the new scheme you're trying to introduce, the >> > destruct event gets queued from the deform but behind the form and the >> > link flutter results in a dead link. I thought just forcing a zero phy >> > port would fix this, but it won't, either the destruct has to run in the >> > context of the deform event or the form has to be queued later than the >> > destruct. I think coupled with the changes above, there needs to be >> > >> > if (port->port) { >> > /* dying port, requeue form event */ >> > resend the PORTE_BYTES_DMAED event >> > return >> > } >> > >> > inside the unmatched port loop in sas_port_form() if nothing is found as >> > well to close this. >> >> I think it's too late. Once the lldd has triggered libsas to start >> tear down I seem to recall the lldd has the expectation that a new >> PORTE_BYTES_DMAED triggers the creation of a new port instance for >> that phy. Once the flutter reaches libsas the race is already lost >> and the port needs to be torn down, but I would need to take a closer >> look. > > I don't understand your reasoning. The expectation is that > PORTE_BYTES_DMAED leads to port formation. The proposal detects that > this event precedes DISCE_DESTRUCT for the port and requeues the event, > now after DISC_DESTRUCT, so it gets acted on. Where is the problem? > Ah, I read "flutter" and mistakenly thought "debounce". You're addressing the case where we're fully committed to the port going down, but a "port up" event is injected between the "port down" and "destruct" events. Yes, I agree re-queuing needs to happen, however don't we now have queue re-order problem with respect to a new "port down" event? It seems events need to be held off in-order while the subordinate DISCE events are processed. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
On Mon, Jul 27, 2015 at 11:07 AM, James Bottomley wrote: > On Mon, 2015-07-27 at 08:48 -0700, Dan Williams wrote: >> On Mon, Jul 27, 2015 at 8:17 AM, James Bottomley >> wrote: >> > On Wed, 2015-07-22 at 14:08 -0700, Dan Williams wrote: >> >> I don't have a libsas environment handy, I worked with Praveen to >> >> validate the version as submitted if you want to re-work it. >> > >> > A couple of days ago, this was so urgent as to have to go outside the >> > usual patch process ... now it's not important enough for you to bother >> > working on it; which is it? >> >> Neither, it was a reviewed patch that was idling in the process. I'm >> still of the opinion that pinging Andrew in a case like this *is* the >> expected process, unless there's a place I can check that a patch is >> still in the application queue? > > I didn't ask you to justify your process, I asked you how important you > thought the patch was mainly because of the conflicting signals you've > sent. I get that you think I should treat all your patches as important > whether you do or not, but the world doesn't quite work like that: patch > application is a process of triage. Patches, like this, which have > timing related issues potentially leading to races get looked at by me > as the last reviewer. The speed of review depends on several factors, > but one of which is what type of user visible issue is this causing. > The user visible effects of this are a nasty warning message and nothing > more, I believe? A useful indicator in this triage is how important the > submitter thinks the patch is, which was originally why I asked. > That would be a question to Praveen. It wasn't clear to me whether this sysfs backtrace was a simply a warning or eventually fatal to the box. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [for 4.1 PATCH resend] libsas: fix "sysfs group not found" warnings at port teardown time
On Mon, Jul 27, 2015 at 11:38 AM, James Bottomley wrote: > No, that seems to be the intent of the prior code. The reason port > visibility goes immediately (along with all associated phys), is that > the port is ready for reuse as soon as sas_deform_port() returns. > Destroying the subtree is left to DISCE_DESTRUCT, but since the subtree > is not visible, a new port can form even as the elements associated with > the old port are being destroyed. the DISCE_DISCOVER that populates it > will queue behind the destruct, so everything just works today (modulo > the new warning). > > It would be ideal if we could detect in the destruct queue that the > visibility is already gone and make use of it, but we can't. The patch > that causes this warning induces a mismatch between the state of the > kernfs tree and the kobjects ... we still have to call device_del which > leads to kobject_del (which triggers the warning) to bring this state > back into alignment. So the disconnected subtree we originally used to > make all this work is what's suddenly been rendered invalid. > Looking at the original report this seemed new for the 3.16 kernel. However, looking closer, that warning in sysfs_remove_group() has been present for a long time before that. I don't recall seeing it back when we were doing focused hotplug testing with the isci driver for the 3.5 / 3.6 kernels, so I believe it is newly uncovered since then. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: fix scsi_error_handler vs. scsi_host_dev_release race
On Thu, Aug 27, 2015 at 11:16 AM, wrote: > From: Michal Hocko > > b9d5c6b7ef57 ("[SCSI] cleanup setting task state in > scsi_error_handler()") has introduced a race between scsi_error_handler > and scsi_host_dev_release resulting in the hang when the device goes > away because scsi_error_handler might miss a wake up: > > CPU0CPU1 > scsi_error_handler scsi_host_dev_release > kthread_stop() > kthread_should_stop() > test_bit(KTHREAD_SHOULD_STOP) > set_bit(KTHREAD_SHOULD_STOP) > wake_up_process() > wait_for_completion() > > set_current_state(TASK_INTERRUPTIBLE) > schedule() > > The most straightforward solution seems to be to invert the ordering of > the set_current_state and kthread_should_stop. > > The issue has been noticed during reboot test on a 3.0 based kernel but > the current code seems to be affected in the same way. > > Cc: stable # 3.6+ > Reported-and-Debugged-by: Mike Mayer > Signed-off-by: Michal Hocko Acked-by: Dan Williams -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/4] block: Add iocontext priority to request
On Thu, Oct 13, 2016 at 12:53 PM, Adam Manzanares wrote: > Patch adds an association between iocontext ioprio and the ioprio of a > request. This value is set in blk_rq_set_prio which takes the request and > the ioc as arguments. If the ioc is valid in blk_rq_set_prio then the > iopriority of the request is set as the iopriority of the ioc. In > init_request_from_bio a check is made to see if the ioprio of the bio is > valid and if so then the request prio comes from the bio. > > Signed-off-by: Adam Manzananares > --- > block/blk-core.c | 4 +++- > include/linux/blkdev.h | 14 ++ > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 14d7c07..361b1b9 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1153,6 +1153,7 @@ static struct request *__get_request(struct > request_list *rl, int op, > > blk_rq_init(q, rq); > blk_rq_set_rl(rq, rl); > + blk_rq_set_prio(rq, ioc); > req_set_op_attrs(rq, op, op_flags | REQ_ALLOCED); > > /* init elvpriv */ > @@ -1656,7 +1657,8 @@ void init_request_from_bio(struct request *req, struct > bio *bio) > > req->errors = 0; > req->__sector = bio->bi_iter.bi_sector; > - req->ioprio = bio_prio(bio); > + if (ioprio_valid(bio_prio(bio))) > + req->ioprio = bio_prio(bio); Should we use ioprio_best() here? If req->ioprio and bio_prio() disagree one side has explicitly asked for a higher priority. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/4] block: Add iocontext priority to request
On Thu, Oct 13, 2016 at 1:09 PM, Jens Axboe wrote: > On 10/13/2016 02:06 PM, Dan Williams wrote: >> >> On Thu, Oct 13, 2016 at 12:53 PM, Adam Manzanares >> wrote: >>> >>> Patch adds an association between iocontext ioprio and the ioprio of a >>> request. This value is set in blk_rq_set_prio which takes the request and >>> the ioc as arguments. If the ioc is valid in blk_rq_set_prio then the >>> iopriority of the request is set as the iopriority of the ioc. In >>> init_request_from_bio a check is made to see if the ioprio of the bio is >>> valid and if so then the request prio comes from the bio. >>> >>> Signed-off-by: Adam Manzananares >>> --- >>> block/blk-core.c | 4 +++- >>> include/linux/blkdev.h | 14 ++ >>> 2 files changed, 17 insertions(+), 1 deletion(-) >>> >>> diff --git a/block/blk-core.c b/block/blk-core.c >>> index 14d7c07..361b1b9 100644 >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -1153,6 +1153,7 @@ static struct request *__get_request(struct >>> request_list *rl, int op, >>> >>> blk_rq_init(q, rq); >>> blk_rq_set_rl(rq, rl); >>> + blk_rq_set_prio(rq, ioc); >>> req_set_op_attrs(rq, op, op_flags | REQ_ALLOCED); >>> >>> /* init elvpriv */ >>> @@ -1656,7 +1657,8 @@ void init_request_from_bio(struct request *req, >>> struct bio *bio) >>> >>> req->errors = 0; >>> req->__sector = bio->bi_iter.bi_sector; >>> - req->ioprio = bio_prio(bio); >>> + if (ioprio_valid(bio_prio(bio))) >>> + req->ioprio = bio_prio(bio); >> >> >> Should we use ioprio_best() here? If req->ioprio and bio_prio() >> disagree one side has explicitly asked for a higher priority. > > > It's a good question - but if priority has been set in the bio, it makes > sense that that would take priority over the general setting for the > task/io context. So I think the patch is correct as-is. Assuming you always trust the kernel to know the right priority... -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/4] block: Add iocontext priority to request
On Thu, Oct 13, 2016 at 1:24 PM, Jens Axboe wrote: > On 10/13/2016 02:19 PM, Dan Williams wrote: >> >> On Thu, Oct 13, 2016 at 1:09 PM, Jens Axboe wrote: >>> >>> On 10/13/2016 02:06 PM, Dan Williams wrote: >>>> >>>> >>>> On Thu, Oct 13, 2016 at 12:53 PM, Adam Manzanares >>>> wrote: >>>>> >>>>> >>>>> Patch adds an association between iocontext ioprio and the ioprio of a >>>>> request. This value is set in blk_rq_set_prio which takes the request >>>>> and >>>>> the ioc as arguments. If the ioc is valid in blk_rq_set_prio then the >>>>> iopriority of the request is set as the iopriority of the ioc. In >>>>> init_request_from_bio a check is made to see if the ioprio of the bio >>>>> is >>>>> valid and if so then the request prio comes from the bio. >>>>> >>>>> Signed-off-by: Adam Manzananares >>>>> --- >>>>> block/blk-core.c | 4 +++- >>>>> include/linux/blkdev.h | 14 ++ >>>>> 2 files changed, 17 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c >>>>> index 14d7c07..361b1b9 100644 >>>>> --- a/block/blk-core.c >>>>> +++ b/block/blk-core.c >>>>> @@ -1153,6 +1153,7 @@ static struct request *__get_request(struct >>>>> request_list *rl, int op, >>>>> >>>>> blk_rq_init(q, rq); >>>>> blk_rq_set_rl(rq, rl); >>>>> + blk_rq_set_prio(rq, ioc); >>>>> req_set_op_attrs(rq, op, op_flags | REQ_ALLOCED); >>>>> >>>>> /* init elvpriv */ >>>>> @@ -1656,7 +1657,8 @@ void init_request_from_bio(struct request *req, >>>>> struct bio *bio) >>>>> >>>>> req->errors = 0; >>>>> req->__sector = bio->bi_iter.bi_sector; >>>>> - req->ioprio = bio_prio(bio); >>>>> + if (ioprio_valid(bio_prio(bio))) >>>>> + req->ioprio = bio_prio(bio); >>>> >>>> >>>> >>>> Should we use ioprio_best() here? If req->ioprio and bio_prio() >>>> disagree one side has explicitly asked for a higher priority. >>> >>> >>> >>> It's a good question - but if priority has been set in the bio, it makes >>> sense that that would take priority over the general setting for the >>> task/io context. So I think the patch is correct as-is. >> >> >> Assuming you always trust the kernel to know the right priority... > > > If it set it in the bio, it better know what it's doing. Besides, > there's nothing stopping the caller from checking the task priority when > it sets it. If we do ioprio_best(), then we are effectively preventing > anyone from submitting a bio with a lower priority than the task has > generally set. Ah, that makes sense. Move the ioprio_best() decision out to whatever code is setting bio_prio() to allow for cases where the kernel knows best. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi: libsas: fix WARN on device removal
On Wed, Nov 9, 2016 at 9:36 AM, John Garry wrote: > On 09/11/2016 12:28, John Garry wrote: >> >> On 03/11/2016 14:58, John Garry wrote: >>> >>> The following patch introduces an annoying WARN >>> when a device is removed from the SAS topology: >>> [SCSI] libsas: prevent domain rediscovery competing with ata error >>> handling >>> >> >> Are there any views on this patch? I would have thought that the parties >> who use the drivers based on libsas would be interested in fixing this >> bug. >> > > I should have added the before and after logs earlier, so the issue is > illustrated. Now attached. When a 24-port expander is unplugged we get >6k > lines of WARN on the console, lasting >30 seconds. Not nice. > I might be mistaken, but this patch seems functionally identical to this attempt: http://marc.info/?l=linux-scsi&m=143459794823595&w=2 i.e. it moves the port destruction to the workqueue and still suffers from the flutter problem: http://marc.info/?l=linux-scsi&m=143801026028006&w=2 http://marc.info/?l=linux-scsi&m=143801971131073&w=2 Perhaps we instead need to quiet this warning? http://marc.info/?l=linux-scsi&m=143802229932175&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi: libsas: fix WARN on device removal
On Wed, Nov 9, 2016 at 11:09 AM, Dan Williams wrote: > On Wed, Nov 9, 2016 at 9:36 AM, John Garry wrote: >> On 09/11/2016 12:28, John Garry wrote: >>> >>> On 03/11/2016 14:58, John Garry wrote: >>>> >>>> The following patch introduces an annoying WARN >>>> when a device is removed from the SAS topology: >>>> [SCSI] libsas: prevent domain rediscovery competing with ata error >>>> handling >>>> >>> >>> Are there any views on this patch? I would have thought that the parties >>> who use the drivers based on libsas would be interested in fixing this >>> bug. >>> >> >> I should have added the before and after logs earlier, so the issue is >> illustrated. Now attached. When a 24-port expander is unplugged we get >6k >> lines of WARN on the console, lasting >30 seconds. Not nice. >> > > I might be mistaken, but this patch seems functionally identical to > this attempt: > > http://marc.info/?l=linux-scsi&m=143459794823595&w=2 > > i.e. it moves the port destruction to the workqueue and still suffers > from the flutter problem: > > http://marc.info/?l=linux-scsi&m=143801026028006&w=2 > http://marc.info/?l=linux-scsi&m=143801971131073&w=2 > > Perhaps we instead need to quiet this warning? > > http://marc.info/?l=linux-scsi&m=143802229932175&w=2 Alternatively we need a mechanism to cancel in-flight port shutdown requests when we start re-attaching devices before queued port destruction events have run. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi: libsas: fix WARN on device removal
On Thu, Nov 17, 2016 at 7:23 AM, John Garry wrote: > On 11/11/2016 08:49, wangyijing wrote: > > I have not seen the flutter issue. I am just trying to solve the > horrible WARN dump. > However I do understand that there may be a issue related to how we > queue the events; there was a recent attempt to fix this, but it came to > nothing: > https://www.spinics.net/lists/linux-scsi/msg1.html We found libsas hotplug several problems: 1. sysfs warning calltrace(like the case you found); >>> >>> >>> Maybe you can then review my patch. >> >> >> I did it, I think your solution to fix the sysfs calltrace issue is ok, >> and what I worried about is we still need to fix >> the rest issues. So it's better if we could fix all issues one time. >> > > @Maintainers, would you be willing to accept this patch as an interim fix > for the dastardly WARN while we try to fix the flutter issue? To me this adds a bug to quiet a benign, albeit noisy, warning. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi: libsas: fix WARN on device removal
On Fri, Nov 18, 2016 at 1:00 AM, John Garry wrote: > On 18/11/2016 01:53, Dan Williams wrote: >> >> On Thu, Nov 17, 2016 at 7:23 AM, John Garry wrote: >>> >>> On 11/11/2016 08:49, wangyijing wrote: >>>>>>> >>>>>>> >>>>>>> I have not seen the flutter issue. I am just trying to solve the >>>>>>> horrible WARN dump. >>>>>>> However I do understand that there may be a issue related to how we >>>>>>> queue the events; there was a recent attempt to fix this, but it came >>>>>>> to >>>>>>> nothing: >>>>>>> https://www.spinics.net/lists/linux-scsi/msg1.html >>>>>> >>>>>> >>>>>> >>>>>> We found libsas hotplug several problems: >>>>>> 1. sysfs warning calltrace(like the case you found); >>>>> >>>>> >>>>> >>>>> Maybe you can then review my patch. >>>> >>>> >>>> >>>> I did it, I think your solution to fix the sysfs calltrace issue is ok, >>>> and what I worried about is we still need to fix >>>> the rest issues. So it's better if we could fix all issues one time. >>>> >>> >>> @Maintainers, would you be willing to accept this patch as an interim fix >>> for the dastardly WARN while we try to fix the flutter issue? >> >> >> To me this adds a bug to quiet a benign, albeit noisy, warning. >> > > What is the bug which is being added? The bug where we queue a port teardown, but see a port formation event in the meantime. > And it's a very noisy warning, as in 6K lines on the console when an > expander is unplugged. Does something like this modulate the failure? diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.cindex 60b651bfaa01..11401e5c88ba 100644 --- a/drivers/scsi/scsi_transport_sas.c +++ b/drivers/scsi/scsi_transport_sas.c @@ -262,9 +262,10 @@ static void sas_bsg_remove(struct Scsi_Host *shost, struct sas_rphy *rphy { struct request_queue *q; - if (rphy) + if (rphy) { q = rphy->q; - else + rphy->q = NULL; + } else q = to_sas_host_attrs(shost)->q; if (!q) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi: libsas: fix WARN on device removal
On Mon, Nov 21, 2016 at 7:16 AM, John Garry wrote: > @Maintainers, would you be willing to accept this patch as an interim > fix > for the dastardly WARN while we try to fix the flutter issue? To me this adds a bug to quiet a benign, albeit noisy, warning. >>> >>> What is the bug which is being added? >> >> >> The bug where we queue a port teardown, but see a port formation event >> in the meantime. > > > As I understand, this vulnerability already exists: > http://marc.info/?l=linux-scsi&m=143801026028006&w=2 > > I actually don't understand how libsas dealt with flutter (which I take to > mean a burst of up and down events) before these changes, as it can only > queue simultaneously one up and one down event per port. So, if we get a > flutter, then the events are lost and we get indeterminate state. > The events are not lost. The new problem this patch introduces is delaying sas port deletion where it was previously immediate. So now we can get into a situation where the port has gone down and can start processing a port up event before the previous deletion work has run. >> >>> And it's a very noisy warning, as in 6K lines on the console when an >>> expander is unplugged. >> >> >> Does something like this modulate the failure? I'm curious if we simply need to fix the double deletion of the sas_port bsg queue, could you try the changes below? >> >> diff --git a/drivers/scsi/scsi_transport_sas.c >> b/drivers/scsi/scsi_transport_sas.cindex >> 60b651bfaa01..11401e5c88ba 100644 >> --- a/drivers/scsi/scsi_transport_sas.c >> +++ b/drivers/scsi/scsi_transport_sas.c >> @@ -262,9 +262,10 @@ static void sas_bsg_remove(struct Scsi_Host >> *shost, struct sas_rphy *rphy >> { >> struct request_queue *q; >> >> - if (rphy) >> + if (rphy) { >> q = rphy->q; >> - else >> + rphy->q = NULL; >> + } else >> q = to_sas_host_attrs(shost)->q; >> >> if (!q) >> >> . >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] scsi, block: fix duplicate bdi name registration crashes
Warnings of the following form occur because scsi reuses a devt number while the block layer still has it referenced as the name of the bdi [1]: WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 ? kernfs_path_from_node+0x4f/0x60 sysfs_warn_dup+0x62/0x80 sysfs_create_dir_ns+0x77/0x90 kobject_add_internal+0xb2/0x350 kobject_add+0x75/0xd0 device_add+0x15a/0x650 device_create_groups_vargs+0xe0/0xf0 device_create_vargs+0x1c/0x20 bdi_register+0x90/0x240 ? lockdep_init_map+0x57/0x200 bdi_register_owner+0x36/0x60 device_add_disk+0x1bb/0x4e0 ? __pm_runtime_use_autosuspend+0x5c/0x70 sd_probe_async+0x10d/0x1c0 async_run_entry_fn+0x39/0x170 This is a brute-force fix to pass the devt release information from sd_probe() to the locations where we register the bdi, device_add_disk(), and unregister the bdi, blk_cleanup_queue(). Thanks to Omar for the quick reproducer script [2]. This patch survives where an unmodified kernel fails in a few seconds. [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Cc: James Bottomley Cc: Bart Van Assche Cc: "Martin K. Petersen" Cc: Christoph Hellwig Cc: Jens Axboe Reported-by: Omar Sandoval Signed-off-by: Dan Williams --- block/blk-core.c |1 + block/genhd.c |7 +++ drivers/scsi/sd.c | 41 + include/linux/blkdev.h |1 + include/linux/genhd.h | 17 + 5 files changed, 59 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 61ba08c58b64..950cea1e202e 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -597,6 +597,7 @@ void blk_cleanup_queue(struct request_queue *q) spin_unlock_irq(lock); bdi_unregister(&q->backing_dev_info); + put_disk_devt(q->disk_devt); /* @q is and will stay empty, shutdown and put */ blk_put_queue(q); diff --git a/block/genhd.c b/block/genhd.c index fcd6d4fae657..eb8009e928f5 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk) disk_alloc_events(disk); + /* +* Take a reference on the devt and assign it to queue since it +* must not be reallocated while the bdi is registerted +*/ + disk->queue->disk_devt = disk->disk_devt; + get_disk_devt(disk->disk_devt); + /* Register BDI before referencing it from bdev */ bdi = &disk->queue->backing_dev_info; bdi_register_owner(bdi, disk_to_dev(disk)); diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 0b09638fa39b..09405351577c 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t cookie) put_device(&sdkp->dev); } +struct sd_devt { + int idx; + struct disk_devt disk_devt; +}; + +void sd_devt_release(struct kref *kref) +{ + struct sd_devt *sd_devt = container_of(kref, struct sd_devt, + disk_devt.kref); + + spin_lock(&sd_index_lock); + ida_remove(&sd_index_ida, sd_devt->idx); + spin_unlock(&sd_index_lock); + + kfree(sd_devt); +} + /** * sd_probe - called during driver initialization and whenever a * new scsi device is attached to the system. It is called once @@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie) static int sd_probe(struct device *dev) { struct scsi_device *sdp = to_scsi_device(dev); + struct sd_devt *sd_devt; struct scsi_disk *sdkp; struct gendisk *gd; int index; @@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev) if (!sdkp) goto out; + sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL); + if (!sd_devt) + goto out_free; + gd = alloc_disk(SD_MINORS); if (!gd) - goto out_free; + goto out_free_devt; do { if (!ida_pre_get(&sd_index_ida, GFP_KERNEL)) @@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev) goto out_put; } + kref_init(&sd_devt->disk_devt.kref); + sd_devt->disk_devt.release = sd_devt_release; + sd_devt->idx = index; + gd->disk_devt = &sd_devt->disk_devt; + error = sd_format_disk_name("sd", index, gd->disk_name, DISK_NAME_LEN); if (error) { sdev_printk(KERN_WARNING, sdp, "SCSI disk (sd) name length exceeded.\n"); @@ -3170,13 +3197,14 @@ static int sd_probe(struct device *dev) r
Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes
On Sun, Jan 29, 2017 at 11:22 PM, Omar Sandoval wrote: > On Mon, Jan 30, 2017 at 08:05:52AM +0100, Hannes Reinecke wrote: >> On 01/29/2017 05:58 AM, Dan Williams wrote: >> > Warnings of the following form occur because scsi reuses a devt number >> > while the block layer still has it referenced as the name of the bdi >> > [1]: >> > >> > WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 >> > sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' >> > [..] >> > Call Trace: >> > dump_stack+0x86/0xc3 >> > __warn+0xcb/0xf0 >> > warn_slowpath_fmt+0x5f/0x80 >> > ? kernfs_path_from_node+0x4f/0x60 >> > sysfs_warn_dup+0x62/0x80 >> > sysfs_create_dir_ns+0x77/0x90 >> > kobject_add_internal+0xb2/0x350 >> > kobject_add+0x75/0xd0 >> > device_add+0x15a/0x650 >> > device_create_groups_vargs+0xe0/0xf0 >> > device_create_vargs+0x1c/0x20 >> > bdi_register+0x90/0x240 >> > ? lockdep_init_map+0x57/0x200 >> > bdi_register_owner+0x36/0x60 >> > device_add_disk+0x1bb/0x4e0 >> > ? __pm_runtime_use_autosuspend+0x5c/0x70 >> > sd_probe_async+0x10d/0x1c0 >> > async_run_entry_fn+0x39/0x170 >> > >> > This is a brute-force fix to pass the devt release information from >> > sd_probe() to the locations where we register the bdi, >> > device_add_disk(), and unregister the bdi, blk_cleanup_queue(). >> > >> > Thanks to Omar for the quick reproducer script [2]. This patch survives >> > where an unmodified kernel fails in a few seconds. >> > >> > [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 >> > [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 >> > >> > Cc: James Bottomley >> > Cc: Bart Van Assche >> > Cc: "Martin K. Petersen" >> > Cc: Christoph Hellwig >> > Cc: Jens Axboe >> > Reported-by: Omar Sandoval >> > Signed-off-by: Dan Williams >> > --- >> > block/blk-core.c |1 + >> > block/genhd.c |7 +++ >> > drivers/scsi/sd.c | 41 + >> > include/linux/blkdev.h |1 + >> > include/linux/genhd.h | 17 + >> > 5 files changed, 59 insertions(+), 8 deletions(-) >> > >> Please check the patchset from Jan Kara (cf 'BDI lifetime fix' on >> linux-block), which attempts to solve the same problem. > > Hi, Hannes, > > It's not the same problem. Jan's series fixes a bdi vs. inode lifetime > issue, this patch is for a bdi vs devt lifetime issue. Jan's series > doesn't fix the crashes caused by my reproducer script. Correct. In fact I was running Jan's patches in my baseline kernel that fails almost immediately. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes
On Mon, Jan 30, 2017 at 4:24 AM, Christoph Hellwig wrote: > Hi Dan, > > this looks mostly fine to me. A few code comments below, but except > for this there is another issue with it: We still have drivers > that share a single request_queue for multiple gendisks, so I wonder > > Also I think you probably want one patch for the block framework, > and one to switch SCSI over to it. > >> +struct disk_devt { >> + struct kref kref; >> + void (*release)(struct kref *); >> +}; >> + >> +static inline void put_disk_devt(struct disk_devt *disk_devt) >> +{ >> + if (disk_devt) >> + kref_put(&disk_devt->kref, disk_devt->release); >> +} >> + >> +static inline void get_disk_devt(struct disk_devt *disk_devt) >> +{ >> + if (disk_devt) >> + kref_get(&disk_devt->kref); >> +} > > Given that we have a user-supplied release callack I'd much rather get > rid of the kref here, use a normal atomic_t and pass the disk_devt > structure to the release callback then a kref. I'm missing something... kref is just: struct kref { atomic_t refcount; }; ...so what do we gain by open coding kref_get() and kref_put()? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] scsi, block: fix duplicate bdi name registration crashes
On Mon, Jan 30, 2017 at 4:24 AM, Christoph Hellwig wrote: > Hi Dan, > > this looks mostly fine to me. A few code comments below, but except > for this there is another issue with it: We still have drivers > that share a single request_queue for multiple gendisks, so I wonder scsi drivers or others? If those drivers can switch to dynamically allocated devt (GENHD_FL_EXT_DEVT), then they don't need this fix. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] scsi, block: fix duplicate bdi name registration crashes
Warnings of the following form occur because scsi reuses a devt number while the block layer still has it referenced as the name of the bdi [1]: WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 ? kernfs_path_from_node+0x4f/0x60 sysfs_warn_dup+0x62/0x80 sysfs_create_dir_ns+0x77/0x90 kobject_add_internal+0xb2/0x350 kobject_add+0x75/0xd0 device_add+0x15a/0x650 device_create_groups_vargs+0xe0/0xf0 device_create_vargs+0x1c/0x20 bdi_register+0x90/0x240 ? lockdep_init_map+0x57/0x200 bdi_register_owner+0x36/0x60 device_add_disk+0x1bb/0x4e0 ? __pm_runtime_use_autosuspend+0x5c/0x70 sd_probe_async+0x10d/0x1c0 async_run_entry_fn+0x39/0x170 This is a brute-force fix to pass the devt release information from sd_probe() to the locations where we register the bdi, device_add_disk(), and unregister the bdi, blk_cleanup_queue(). Thanks to Omar for the quick reproducer script [2]. This patch survives where an unmodified kernel fails in a few seconds. [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Cc: James Bottomley Cc: Bart Van Assche Cc: "Martin K. Petersen" Cc: Christoph Hellwig Cc: Jens Axboe Cc: Jan Kara Reported-by: Omar Sandoval Signed-off-by: Dan Williams --- Changes in v2: * rebased on top of Jan's bdi lifetime series * replace kref_{get,put}() with atomic_{inc,dec_and_test} (Christoph) block/blk-core.c |1 + block/genhd.c |7 +++ drivers/scsi/sd.c | 41 + include/linux/blkdev.h |1 + include/linux/genhd.h | 17 + 5 files changed, 59 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 84fabb51714a..0cd6b3c4b41c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -595,6 +595,7 @@ void blk_cleanup_queue(struct request_queue *q) spin_unlock_irq(lock); bdi_unregister(q->backing_dev_info); + put_disk_devt(q->disk_devt); /* @q is and will stay empty, shutdown and put */ blk_put_queue(q); diff --git a/block/genhd.c b/block/genhd.c index d9ccd42f3675..124499db04d6 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -612,6 +612,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk) disk_alloc_events(disk); + /* +* Take a reference on the devt and assign it to queue since it +* must not be reallocated while the bdi is registered +*/ + disk->queue->disk_devt = disk->disk_devt; + get_disk_devt(disk->disk_devt); + /* Register BDI before referencing it from bdev */ bdi = disk->queue->backing_dev_info; bdi_register_owner(bdi, disk_to_dev(disk)); diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 0b09638fa39b..102111e730ce 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t cookie) put_device(&sdkp->dev); } +struct sd_devt { + int idx; + struct disk_devt disk_devt; +}; + +void sd_devt_release(struct disk_devt *disk_devt) +{ + struct sd_devt *sd_devt = container_of(disk_devt, struct sd_devt, + disk_devt); + + spin_lock(&sd_index_lock); + ida_remove(&sd_index_ida, sd_devt->idx); + spin_unlock(&sd_index_lock); + + kfree(sd_devt); +} + /** * sd_probe - called during driver initialization and whenever a * new scsi device is attached to the system. It is called once @@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie) static int sd_probe(struct device *dev) { struct scsi_device *sdp = to_scsi_device(dev); + struct sd_devt *sd_devt; struct scsi_disk *sdkp; struct gendisk *gd; int index; @@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev) if (!sdkp) goto out; + sd_devt = kzalloc(sizeof(*sd_devt), GFP_KERNEL); + if (!sd_devt) + goto out_free; + gd = alloc_disk(SD_MINORS); if (!gd) - goto out_free; + goto out_free_devt; do { if (!ida_pre_get(&sd_index_ida, GFP_KERNEL)) @@ -3131,6 +3153,11 @@ static int sd_probe(struct device *dev) goto out_put; } + atomic_set(&sd_devt->disk_devt.count, 1); + sd_devt->disk_devt.release = sd_devt_release; + sd_devt->idx = index; + gd->disk_devt = &sd_devt->disk_devt; + error = sd_format_disk_name("sd", index, gd->disk_name, DISK_NAME_LEN); if (error) { sdev_print
[PATCH v3] scsi, block: fix duplicate bdi name registration crashes
Warnings of the following form occur because scsi reuses a devt number while the block layer still has it referenced as the name of the bdi [1]: WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 ? kernfs_path_from_node+0x4f/0x60 sysfs_warn_dup+0x62/0x80 sysfs_create_dir_ns+0x77/0x90 kobject_add_internal+0xb2/0x350 kobject_add+0x75/0xd0 device_add+0x15a/0x650 device_create_groups_vargs+0xe0/0xf0 device_create_vargs+0x1c/0x20 bdi_register+0x90/0x240 ? lockdep_init_map+0x57/0x200 bdi_register_owner+0x36/0x60 device_add_disk+0x1bb/0x4e0 ? __pm_runtime_use_autosuspend+0x5c/0x70 sd_probe_async+0x10d/0x1c0 async_run_entry_fn+0x39/0x170 This is a brute-force fix to pass the devt release information from sd_probe() to the locations where we register the bdi, device_add_disk(), and unregister the bdi, blk_cleanup_queue(). Thanks to Omar for the quick reproducer script [2]. This patch survives where an unmodified kernel fails in a few seconds. [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Cc: James Bottomley Cc: Bart Van Assche Cc: "Martin K. Petersen" Cc: Christoph Hellwig Cc: Jens Axboe Cc: Jan Kara Reported-by: Omar Sandoval Tested-by: Omar Sandoval Signed-off-by: Dan Williams --- Changes in v3: * un-inline {get,put}_disk_devt() (Bart) * added Omar's tested-by from v2 Changes in v2: * rebased on top of Jan's bdi lifetime series * replace kref_{get,put}() with atomic_{inc,dec_and_test} (Christoph) block/blk-core.c |1 + block/genhd.c | 21 + drivers/scsi/sd.c | 41 + include/linux/blkdev.h |1 + include/linux/genhd.h |8 5 files changed, 64 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 84fabb51714a..0cd6b3c4b41c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -595,6 +595,7 @@ void blk_cleanup_queue(struct request_queue *q) spin_unlock_irq(lock); bdi_unregister(q->backing_dev_info); + put_disk_devt(q->disk_devt); /* @q is and will stay empty, shutdown and put */ blk_put_queue(q); diff --git a/block/genhd.c b/block/genhd.c index d9ccd42f3675..3631cd480295 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -572,6 +572,20 @@ static void register_disk(struct device *parent, struct gendisk *disk) disk_part_iter_exit(&piter); } +void put_disk_devt(struct disk_devt *disk_devt) +{ + if (disk_devt && atomic_dec_and_test(&disk_devt->count)) + disk_devt->release(disk_devt); +} +EXPORT_SYMBOL(put_disk_devt); + +void get_disk_devt(struct disk_devt *disk_devt) +{ + if (disk_devt) + atomic_inc(&disk_devt->count); +} +EXPORT_SYMBOL(get_disk_devt); + /** * device_add_disk - add partitioning information to kernel list * @parent: parent device for the disk @@ -612,6 +626,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk) disk_alloc_events(disk); + /* +* Take a reference on the devt and assign it to queue since it +* must not be reallocated while the bdi is registered +*/ + disk->queue->disk_devt = disk->disk_devt; + get_disk_devt(disk->disk_devt); + /* Register BDI before referencing it from bdev */ bdi = disk->queue->backing_dev_info; bdi_register_owner(bdi, disk_to_dev(disk)); diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 0b09638fa39b..102111e730ce 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3067,6 +3067,23 @@ static void sd_probe_async(void *data, async_cookie_t cookie) put_device(&sdkp->dev); } +struct sd_devt { + int idx; + struct disk_devt disk_devt; +}; + +void sd_devt_release(struct disk_devt *disk_devt) +{ + struct sd_devt *sd_devt = container_of(disk_devt, struct sd_devt, + disk_devt); + + spin_lock(&sd_index_lock); + ida_remove(&sd_index_ida, sd_devt->idx); + spin_unlock(&sd_index_lock); + + kfree(sd_devt); +} + /** * sd_probe - called during driver initialization and whenever a * new scsi device is attached to the system. It is called once @@ -3088,6 +3105,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie) static int sd_probe(struct device *dev) { struct scsi_device *sdp = to_scsi_device(dev); + struct sd_devt *sd_devt; struct scsi_disk *sdkp; struct gendisk *gd; int index; @@ -3113,9 +3131,13 @@ static int sd_probe(struct device *dev) if (!sdkp) goto out; +
Re: [PATCH v3] scsi, block: fix duplicate bdi name registration crashes
On Wed, Feb 1, 2017 at 2:35 PM, Jens Axboe wrote: > On 02/01/2017 02:05 PM, Dan Williams wrote: >> Warnings of the following form occur because scsi reuses a devt number >> while the block layer still has it referenced as the name of the bdi >> [1]: >> >> WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 >> sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' >> [..] >> Call Trace: >> dump_stack+0x86/0xc3 >> __warn+0xcb/0xf0 >> warn_slowpath_fmt+0x5f/0x80 >> ? kernfs_path_from_node+0x4f/0x60 >> sysfs_warn_dup+0x62/0x80 >> sysfs_create_dir_ns+0x77/0x90 >> kobject_add_internal+0xb2/0x350 >> kobject_add+0x75/0xd0 >> device_add+0x15a/0x650 >> device_create_groups_vargs+0xe0/0xf0 >> device_create_vargs+0x1c/0x20 >> bdi_register+0x90/0x240 >> ? lockdep_init_map+0x57/0x200 >> bdi_register_owner+0x36/0x60 >> device_add_disk+0x1bb/0x4e0 >> ? __pm_runtime_use_autosuspend+0x5c/0x70 >> sd_probe_async+0x10d/0x1c0 >> async_run_entry_fn+0x39/0x170 >> >> This is a brute-force fix to pass the devt release information from >> sd_probe() to the locations where we register the bdi, >> device_add_disk(), and unregister the bdi, blk_cleanup_queue(). >> >> Thanks to Omar for the quick reproducer script [2]. This patch survives >> where an unmodified kernel fails in a few seconds. > > What is the patch against? Doesn't seem to apply cleanly for me on > master, nor the 4.11 block tree. > I built it on top of Jan's bdi fixes series [1]. I can rebase to block/for-next, just let me know which patches you want to take first. [1]: http://marc.info/?l=linux-block&m=148586843819160&w=2
Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup
On Mon, Feb 6, 2017 at 8:09 PM, Jens Axboe wrote: > On 02/06/2017 05:14 PM, James Bottomley wrote: >> On Sun, 2017-02-05 at 21:13 -0800, Dan Williams wrote: >>> On Sun, Feb 5, 2017 at 1:13 AM, Christoph Hellwig wrote: >>>> Dan, >>>> >>>> can you please quote your emails? I can't find any content >>>> inbetween all these quotes. >>> >>> Sorry, I'm using gmail, but I'll switch to attaching the logs. >>> >>> So with help from Xiaolong I was able to reproduce this, and it does >>> not appear to be a regression. We simply change the failure output of >>> an existing bug. Attached is a log of the same test on v4.10-rc7 >>> (i.e. without the recent block/scsi fixes), and it shows sda being >>> registered twice. >>> >>> "[6.647077] kobject (d5078ca4): tried to init an initialized >>> object, something is seriously wrong." >>> >>> The change that "scsi, block: fix duplicate bdi name registration >>> crashes" makes is to properly try to register sdb since the sda devt >>> is still alive. However that's not a fix because we've managed to >>> call blk_register_queue() twice on the same queue. >> >> OK, time to involve others: linux-scsi and linux-block cc'd and I've >> inserted the log below. >> >> James >> >> --- >> >> [5.969672] scsi host0: scsi_debug: version 1.86 [20160430] >> [5.969672] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0 >> [5.971895] scsi 0:0:0:0: Direct-Access Linuxscsi_debug >> 0186 PQ: 0 ANSI: 7 >> [6.006983] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 >> MB/8.00 MiB) >> [6.026965] sd 0:0:0:0: [sda] Write Protect is off >> [6.027870] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08 >> [6.066962] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, >> supports DPO and FUA >> [6.486962] sd 0:0:0:0: [sda] Attached SCSI disk >> [6.488377] sd 0:0:0:0: [sda] Synchronizing SCSI cache >> [6.489455] sd 0:0:0:0: Attached scsi generic sg0 type 0 >> [6.526982] sd 0:0:0:0: [sda] 16384 512-byte logical blocks: (8.39 >> MB/8.00 MiB) >> [6.546964] sd 0:0:0:0: [sda] Write Protect is off >> [6.547873] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08 >> [6.586963] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, >> supports DPO and FUA >> [6.647077] kobject (d5078ca4): tried to init an initialized object, >> something is seriously wrong. > > So sda is probed twice, and hilarity ensues when we try to register it > twice. I can't reproduce this, using scsi_debug and with scsi_async > enabled. > > This is running linux-next? What's your .config? > The original failure report is here: http://marc.info/?l=linux-kernel&m=148619222300774&w=2 ...but it reproduces on current mainline with the same config. I haven't spotted what makes scsi_debug behave like this.
Re: [lkp-robot] [scsi, block] 0dba1314d4: WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup
On Wed, Feb 8, 2017 at 4:08 PM, James Bottomley wrote: > On Mon, 2017-02-06 at 21:42 -0800, Dan Williams wrote: [..] >> ...but it reproduces on current mainline with the same config. I >> haven't spotted what makes scsi_debug behave like this. > > Looking at the config, it's a static debug with report luns enabled. > Is it as simple as the fact that we probe lun 0 manually to see if the > target exists, but then we don't account for the fact that we already > did this, so if it turns up again in the report lun scan, we'll probe > it again leading to a double add. If that theory is correct, this may > be the fix (compile tested only). > > James > > --- > > diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c > index 6f7128f..ba4be08 100644 > --- a/drivers/scsi/scsi_scan.c > +++ b/drivers/scsi/scsi_scan.c > @@ -1441,6 +1441,10 @@ static int scsi_report_lun_scan(struct scsi_target > *starget, int bflags, > for (lunp = &lun_data[1]; lunp <= &lun_data[num_luns]; lunp++) { > lun = scsilun_to_int(lunp); > > + if (lun == 0) > + /* already scanned LUN 0 */ > + continue; > + > if (lun > sdev->host->max_lun) { > sdev_printk(KERN_WARNING, sdev, > "lun%llu has a LUN larger than" I gave this a shot on top of linux-next, but still hit the failure. Log attached. log Description: Binary data
Re: [PATCH 1/3] libsas: modify SATA error handler
On Wed, Aug 6, 2014 at 3:50 AM, Xiangliang Yu wrote: > Hi, Dan & James > How about the patches about support for PM? > Two months had passed since I submitted the patches. > Thanks! > Did you address my review comments? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] libsas: modify SATA error handler
[ adding yuxia...@marvell.com ] On Tue, Jun 3, 2014 at 6:41 PM, Dan Williams wrote: > Hi, some notes below: > > On Thu, Apr 24, 2014 at 6:27 AM, Xiangliang Yu wrote: >> Add support for SATA port softreset and port multiplier error >> handling. > > Some more detailed notes about the approach and any caveats would be > appreciated. > >> >> Signed-off-by: Xiangliang Yu >> --- >> drivers/scsi/libsas/sas_ata.c | 226 >> - >> include/scsi/libsas.h |6 + >> 2 files changed, 231 insertions(+), 1 deletions(-) >> >> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c >> index 766098a..29a19fd 100644 >> --- a/drivers/scsi/libsas/sas_ata.c >> +++ b/drivers/scsi/libsas/sas_ata.c >> @@ -442,6 +442,226 @@ static int sas_ata_hard_reset(struct ata_link *link, >> unsigned int *class, >> return ret; >> } >> >> +static void sas_ata_freeze(struct ata_port *ap) >> +{ >> + struct domain_device *dev = ap->private_data; >> + struct sas_ha_struct *sas_ha = dev->port->ha; >> + struct Scsi_Host *host = sas_ha->core.shost; >> + struct sas_internal *i = to_sas_internal(host->transportt); >> + >> + if (i->dft->lldd_dev_freeze) >> + i->dft->lldd_dev_freeze(dev); >> +} >> + >> +static void sas_ata_thaw(struct ata_port *ap) >> +{ >> + struct domain_device *dev = ap->private_data; >> + struct sas_ha_struct *sas_ha = dev->port->ha; >> + struct Scsi_Host *host = sas_ha->core.shost; >> + struct sas_internal *i = to_sas_internal(host->transportt); >> + >> + if (i->dft->lldd_dev_thaw) >> + i->dft->lldd_dev_thaw(dev); >> +} >> + >> +static int sas_ata_wait_task_done(struct sas_task *task, unsigned long >> timeout, >> + int (*check_done)(struct sas_task *task)) >> +{ > > Why do we need a custom check_done() routine? Since we have a > sas_task we can use the normal completion infrastructure. See > smp_execute_task(). > >> + struct ata_port *ap = task->uldd_task; >> + unsigned long deadline; >> + int done; >> + >> + if (!check_done) { >> + SAS_DPRINTK("check function is null.\n"); >> + return -1; >> + } >> + >> + deadline = ata_deadline(jiffies, timeout); >> + done = check_done(task); >> + >> + while (done && time_before(jiffies, deadline)) { >> + ata_msleep(ap, 1); >> + >> + done = check_done(task); > > This can simply be: > > completion_done(&task->slow_task->completion) > >> + } >> + >> + return done; >> +} >> + >> +static int sas_ata_exec_polled_cmd(struct ata_port *ap, struct ata_taskfile >> *tf, >> + int pmp, unsigned long timeout) >> +{ >> + struct domain_device *dev = ap->private_data; >> + struct sas_ha_struct *sas_ha = dev->port->ha; >> + struct Scsi_Host *host = sas_ha->core.shost; >> + struct sas_internal *i = to_sas_internal(host->transportt); >> + struct sas_task *task = NULL; >> + int ret = -1; >> + >> + if (!i->dft->lldd_execute_task) { >> + SAS_DPRINTK("execute function is null.\n"); >> + return ret; >> + } >> + >> + task = sas_alloc_task(GFP_ATOMIC); > > I think this can be downgraded to GFP_NOIO. We're in a sleepable context. > >> + if (!task) { >> + SAS_DPRINTK("failed to alloc sas task.\n"); >> + goto fail; >> + } >> + >> + task->dev = dev; >> + task->task_proto = SAS_PROTOCOL_SATA; >> + task->uldd_task = ap; >> + >> + ata_tf_to_fis(tf, pmp, 0, (u8 *)&task->ata_task.fis); >> + task->ata_task.retry_count = 1; >> + task->task_state_flags = SAS_TASK_STATE_PENDING; >> + task->task_state_flags |= SAS_TASK_NEED_DEV_RESET; >> + >> + ret = i->dft->lldd_execute_task(task, 1, GFP_ATOMIC); > > Same here. > >> + if (ret) { >> + SAS_DPRINTK("failed to send internal task.\n"); >> + goto fail; >> + } >> + >> + if (timeout) { >> + ret
Re: [PATCH 1/3] libsas: modify SATA error handler
Some more comments below. [..] >>> + >>> + pmp = sata_srst_pmp(link); >>> + >>> + msecs = 0; >>> + now = jiffies; >>> + if (time_after(deadline, now)) >>> + msecs = jiffies_to_msecs(deadline - now); >>> + >>> + memset(&tf, 0, sizeof(struct ata_taskfile)); >>> + tf.ctl = ATA_SRST; >>> + tf.device = ATA_DEVICE_OBS; >>> + >>> + if (sas_ata_exec_polled_cmd(ap, &tf, pmp, msecs)) { >>> + ret = -EIO; >>> + goto fail; >>> + } >>> + >>> + tf.ctl &= ~ATA_SRST; >>> + sas_ata_exec_polled_cmd(ap, &tf, pmp, msecs); >> >> What about lldd's that do not know how to handle ATA_SRST? I think we >> need preparation patches to make SRST support opt-in, or teach all >> lldds how to handle these SRST sas_tasks. > > I think lldds have different actions to handle SRST because there is no > unified standard. > But it should be easy to support it. > Later, I'll submit a mvsas patch to show how to support it. Right, but what about the other lldd's? Libsas needs to check whether the lldd supports SRST before attempting to submit. [..] >>> +/* According sata 3.0 spec 13.15.4.2, enable the device port */ >>> +static int sas_ata_pmp_hard_reset(struct ata_link *link, unsigned int >>> *class, >>> + unsigned long deadline) { >>> + struct ata_port *ap = link->ap; >>> + struct domain_device *dev = ap->private_data; >>> + struct sas_ha_struct *sas_ha = dev->port->ha; >>> + struct Scsi_Host *host = sas_ha->core.shost; >>> + struct sas_internal *i = to_sas_internal(host->transportt); >>> + int ret = -1; >>> + u32 scontrol = 0; >>> + >>> + set_bit(SAS_DEV_RESET, &dev->state); >>> + >>> + ret = sata_scr_read(link, SCR_CONTROL, &scontrol); >> >> I think a comment is needed clarifying that these reads generate >> sas_tasks to a pmp and are not trying to read/write local SCR >> registers that do not exist on a SAS hba. >> > > I think the situation can't happen here. Right, that's why we need a comment, because by normally libsas lldd's do not support scr-register accesses. > >>> + if (ret) >>> + goto error; >>> + >>> + /* enable device port */ >>> + scontrol = 0x1; >>> + ret = sata_scr_write(link, SCR_CONTROL, scontrol); >>> + if (ret) >>> + goto error; >>> + >>> + ata_msleep(ap, 1); >>> + >>> + scontrol = 0x0; >>> + ret = sata_scr_write(link, SCR_CONTROL, scontrol); >>> + if (ret) >>> + goto error; >>> + >>> + ata_msleep(ap, 10); >>> + >>> + /* check device link status */ >>> + if (ata_link_offline(link)) { >>> + SAS_DPRINTK("link is offline.\n"); >>> + goto error; >>> + } >>> + >>> + /* clear X bit */ >>> + scontrol = 0x; >>> + ret = sata_scr_write(link, SCR_ERROR, scontrol); >>> + if (ret) >>> + goto error; >>> + >>> + /* may be need to wait for device sig */ >>> + ata_msleep(ap, 3); >>> + >>> + /* check device class */ >>> + if (i->dft->lldd_dev_classify) >>> + *class = i->dft->lldd_dev_classify(dev); >>> + >>> + return 0; >>> + >>> +error: >>> + SAS_DPRINTK("failed to hard reset.\n"); >>> + return ret; >>> +} >>> + >>> /* >>> * notify the lldd to forget the sas_task for this internal ata command >>> * that bypasses scsi-eh >>> @@ -551,8 +771,12 @@ void sas_ata_end_eh(struct ata_port *ap) static >>> struct ata_port_operations sas_sata_ops = { >>> .prereset = ata_std_prereset, >>> .hardreset = sas_ata_hard_reset, >>> + .softreset = sas_ata_soft_reset, >>> + .pmp_hardreset = sas_ata_pmp_hard_reset, >>> + .freeze = sas_ata_freeze, >>> + .thaw = sas_ata_thaw, >>> .postreset = ata_std_postreset, >>> - .error_handler = ata_std_error_handler, >>> + .error_handler = sata_pmp_error_handler, >>> .post_internal_cmd = sas_ata_post_internal, >>> .qc_defer = ata_std_qc_defer, >>> .qc_prep= ata_noop_qc_prep, >>> diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h index >>> ef7872c..a26466a 100644 >>> --- a/include/scsi/libsas.h >>> +++ b/include/scsi/libsas.h >>> @@ -689,6 +689,12 @@ struct sas_domain_function_template { >>> /* GPIO support */ >>> int (*lldd_write_gpio)(struct sas_ha_struct *, u8 reg_type, >>>u8 reg_index, u8 reg_count, u8 >>> *write_data); >>> + >>> + /* ATA EH functions */ >>> + void (*lldd_dev_freeze)(struct domain_device *); >> >> Why do we need to pass this all the way through to the lldd? Can we >> get away with emulating this in sas_ata.c. >> > > Because SAS HBAs spec haven't a unified standard,
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, May 31, 2016 at 1:38 AM, Wei Fang wrote: > sas_ata_strategy_handler() adds the works of the ata error handler > to system_unbound_wq. This workqueue asynchronously runs work items, > so the ata error handler will be performed concurrently on different > CPUs. In this case, ->host_failed will be decreased simultaneously in > scsi_eh_finish_cmd() on different CPUs, and become abnormal. > > It will lead to permanently inequal between ->host_failed and > ->host_busy, and scsi error handler thread won't become running. > IO errors after that won't be handled forever. > > Use atomic type for ->host_failed to fix this race. > > This fixes the problem introduced in > commit 50824d6c5657 ("[SCSI] libsas: async ata-eh"). > > Reviewed-by: Christoph Hellwig > Signed-off-by: Wei Fang Acked-by: Dan Williams -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed
On Tue, May 31, 2016 at 8:21 PM, Dan Williams wrote: > On Tue, May 31, 2016 at 1:38 AM, Wei Fang wrote: >> sas_ata_strategy_handler() adds the works of the ata error handler >> to system_unbound_wq. This workqueue asynchronously runs work items, >> so the ata error handler will be performed concurrently on different >> CPUs. In this case, ->host_failed will be decreased simultaneously in >> scsi_eh_finish_cmd() on different CPUs, and become abnormal. >> >> It will lead to permanently inequal between ->host_failed and >> ->host_busy, and scsi error handler thread won't become running. >> IO errors after that won't be handled forever. >> >> Use atomic type for ->host_failed to fix this race. >> >> This fixes the problem introduced in >> commit 50824d6c5657 ("[SCSI] libsas: async ata-eh"). >> >> Reviewed-by: Christoph Hellwig >> Signed-off-by: Wei Fang > > Acked-by: Dan Williams Should also tag this for -stable. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] badblocks: Add core badblock management code
On Tue, Dec 8, 2015 at 1:08 PM, Verma, Vishal L wrote: > On Wed, 2015-12-09 at 08:03 +1100, NeilBrown wrote: >> On Sat, Dec 05 2015, Verma, Vishal L wrote: >> > > >> > > > +int badblocks_clear(struct badblocks *bb, sector_t s, int >> > > > sectors) >> > > > +{ >> > > [...] >> > > > +#define DO_DEBUG 1 >> > > >> > > Why have this at all if it's unconditionally defined and always >> > > set. >> > >> > Neil - any reason or anything you had in mind for this? Or is it >> > just an >> > artifact and can be removed. >> >> Like the comment says: >> >> /* Allow clearing via sysfs *only* for testing/debugging. >>* Normally only a successful write may clear a badblock >>*/ >> >> The DO_DEBUG define and ifdefs are documentation identifying bits of >> code that should be removed when it all seems to be working. >> Maybe now is a good time to remove that code. >> > Hm, I think it would be nice to continue to have the ability to clear > badblocks using sysfs at least for a while more, as we test the various > error handling paths for NVDIMMS (Dan, thoughts?). > > We could either remove it later or (I'm leaning towards) make it a > config option similar to FAIL_MAKE_REQUEST and friends.. "later" as in before v4.5-rc1? We can always carry this debug feature locally for testing. We don't want userspace growing ABI attachments to this capability now that it's more than just md tooling that will see this. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/3] Badblock tracking for gendisks
Hi Jens, Are you on-board with this approach? Any concerns with me carrying this through the nvdimm tree along with our other pending error-handling enabling? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: restart list search after unlock in scsi_remove_target
On Mon, Oct 19, 2015 at 8:56 AM, Christoph Hellwig wrote: > On Mon, Oct 19, 2015 at 08:36:23AM -0700, Bart Van Assche wrote: >> Thanks for looking into this. However, I think we need a motivation in the >> patch description why this patch does not reintroduce the soft lockup >> documented in patch "scsi_remove_target: fix softlockup regression on hot >> remove" (commit bc3f02a795d3). > > Interesting. I tried to find the original report and what state > changes would cause an endless loop here. Dan, do you remember any > details about this bug report? I believe the issue I was seeing back then might have been fixed or at least modulated by "f2495e228fce [SCSI] dual scan thread bug fix" which came a few years later. The original problem was hot-remove racing hot-add and that scsi_target_reap() was not guaranteed to advance the state of the target if it was in the process of being scanned when a removal event arrived. However the comment in that change: + /* +* if we get here and the target is still in the CREATED state that +* means it was allocated but never made visible (because a scan +* turned up no LUNs), so don't call device_del() on it. +*/ ...is not what I was seeing. The target was in the CREATED state because it had not yet completed the initial scan before tear down was initiated. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Restart list search after unlock in scsi_remove_target
On Wed, Nov 4, 2015 at 2:44 PM, James Bottomley wrote: [..] > The fundamental problem with this is how have the conditions that caused > us to move away from list restart: > > commit bc3f02a795d3b4faa99d37390174be2a75d091bd > Author: Dan Williams > Date: Tue Aug 28 22:12:10 2012 -0700 > > [SCSI] scsi_remove_target: fix softlockup regression on hot remove > > Which was triggered by this bug report > > http://thread.gmane.org/gmane.linux.kernel/1348679 > > been mitigated? > I think it has because the problem that led to that report was the fact that scsi_target_destroy() did not advance the target state, but we changed that in f2495e228fce. http://marc.info/?l=linux-scsi&m=144527527308725&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] block: Add badblock management for gendisks
On Tue, Nov 24, 2015 at 12:10 PM, Verma, Vishal L wrote: > On Tue, 2015-11-24 at 14:14 -0500, Jeff Moyer wrote: >> >> I'm not sure whether it makes sense to continue without badblock >> management for the RAID code. I was hoping Neil would comment on >> that. >> >> -Jeff > > Not sure I follow? I believe I've kept all the badblocks functionality > RAID already had.. > > > On a related note, something I observed when testing with md: > > md's badblock list is per-device, and can be seen at: > /sys/block/md0/md/dev-pmem0/bad_blocks > > Now if we have badblocks in the gendisks too, there is also: > /sys/block/pmem0/bad_blocks > > The two are separate 'accounts' maintained by separate drivers (md for > the former, and pmem for the latter). This can potentially be > confusing.. > > Should we consolidate the two, i.e. make md (re)use the gendisk > badblocks for its purposes too? If we get agreement that tracking bad blocks at the gendisk is useful for more than just nvdimms then yes, I think it makes sense to move md bad_blocks to the gendisk layer. That is provided we can add a symlink to make the move transparent to existing md userspace tooling. The use cases I can envision being useful for other disks is: 1/ Bypassing / avoiding known bad blocks on disks / drivers that inject long completion latency for error handling. 2/ Simulating bad blocks for disks that can't store them locally. Similar to the md use case, if one encounters errors restoring a disk image from a backup it's useful to have the option to record the error in a bad block list and keep going. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: eSATA Drive Detection issues on mvsas
On Tue, Oct 15, 2013 at 5:15 PM, Praveen Murali wrote: > Dan/James, > Can you please take a look at this and let me know if I am at the right > place? Or point me in the right direction? As I understand, this deost not > look like an mvsas driver issue. > Looks like a latent bug in libsas to me. Commit 110dd8f1 "[SCSI] libsas: fix scr_read/write users and update the libata documentation" looks like a compile fix when the build was broken by commit 9977126c "libata: add @is_cmd to ata_tf_to_fis()" where libata changed the interface for ata_tf_to_fis(). We were passing 0 for pmp prior to that and changed to 1 here, probably a typo intending 'is_cmd to always be 1. Somehow we have gotten away with is_cmd being 0? Does the following patch work for you: diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 161c98efade9..d0fb99d5da95 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -211,7 +211,7 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc) qc->tf.nsect = 0; } - ata_tf_to_fis(&qc->tf, 1, 0, (u8*)&task->ata_task.fis); + ata_tf_to_fis(&qc->tf, qc->dev->link->pmp, 1, (u8*)&task->ata_task.fis); task->uldd_task = qc; if (ata_is_atapi(qc->tf.protocol)) { memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len); That being said I don't think anybody has really checked out port-multiplier support on libsas, but we shouldn't be setting this bit by default. -- Dan > On 10/14/2013 05:18 PM, Praveen Murali wrote: >> >> Hi, >> I have couple of external drives (Western Digital and Seagate) that have >> an eSATA interface. My Linux box with a Marvell HBA (9445) running Ubuntu >> 12.04 with 3.2.48 kernel doest not seem to detect the drive. I tried with >> the latest upstream kernel and it behaves the same. But both the drives >> detect fine if I enter the mvsas BIOS during bootup. So I have hooked up a >> SATA analyzer and this is what I found >> - When I tried to detect the drives in the mvsas BIOS, all the ATA >> commands that the bios issues have the port multiplier byte set to 0. >> - If I bootup my Linux system and then connect the drives, the first >> IDENTIFY command has the port multiplier set to 0 (this one is successful) >> and the subsequent IDENTIFY command has port multiplier set to 1 (this one >> fails). I assume the first IDENTIFY is coming from the BIOS, not Linux correct? >> - If I connect any other SATA drives I have to the HBA, all the ATA >> commands have port multiplier set to 1 but they detects and work fine. >> >> Just to rule out the port-multiplier possibility I changed the following >> line in drivers/scsi/libsas/sas_ata.c - fucntion sas_ata_qc_issue() >> >> ata_tf_to_fis(&qc->tf, 1, 0, (u8*)&task->ata_task.fis); >> >> to >> >> ata_tf_to_fis(&qc->tf, 0, 0, (u8*)&task->ata_task.fis); >> >> now all my drives seem to detect just fine. I believe, the eSATA interface >> on these external drives is a port multiplier, which is why the command >> fails. Also, the normal drives ignore this field thats why they work fine >> with port multiplier being set to either 0 or 1. >> >> Question(s): Are my above assumtions correct? If so, what is the reasoning >> behind setting the port multiplier to 1 by default in libsas layer? >> >> Thanks, >> Praveen >> >> >> -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Fix device detection issues with mvsas driver
On Thu, Jun 21, 2012 at 10:05 AM, Praveen Murali wrote: > here is the updated patch. > > On Thu, 2012-06-21 at 09:48 -0700, Dan Williams wrote: >> On Thu, Jun 21, 2012 at 9:41 AM, Praveen Murali wrote: >> > Ok, do you want me to regenerate the patch? >> >> Yes, James will be the one to ultimately apply it and would save him >> time to have a new patch. The patch you attached had a shorter >> changelog I think it would be fine to include your entire 4 paragraph >> lead in message. > Hi Praveen, Your last message prompted me to look back in my archives and I found this patch which does not appear to be upstream yet. I assume it is still needed? -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: eSATA Drive Detection issues on mvsas
On Wed, Oct 16, 2013 at 10:28 AM, Praveen Murali wrote: >> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c >> index 161c98efade9..d0fb99d5da95 100644 >> --- a/drivers/scsi/libsas/sas_ata.c >> +++ b/drivers/scsi/libsas/sas_ata.c >> @@ -211,7 +211,7 @@ static unsigned int sas_ata_qc_issue(struct >> ata_queued_cmd *qc) >> qc->tf.nsect = 0; >> } >> >> - ata_tf_to_fis(&qc->tf, 1, 0, (u8*)&task->ata_task.fis); >> + ata_tf_to_fis(&qc->tf, qc->dev->link->pmp, 1, >> (u8*)&task->ata_task.fis); >> task->uldd_task = qc; >> if (ata_is_atapi(qc->tf.protocol)) { >> memcpy(task->ata_task.atapi_packet, qc->cdb, >> qc->dev->cdb_len); >> > Hi Dan, > I tested this patch and it works great! > Thanks! I'll send it up with your Reported-by and Tested-by. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] libsas: fix usage of ata_tf_to_fis
Since commit 110dd8f19df5 "[SCSI] libsas: fix scr_read/write users and update the libata documentation" we have been passing pmp=1 and is_cmd=0 to ata_tf_to_fis(). Praveen reports that eSATA attached drives do not discover correctly. His investigation found that the BIOS was passing pmp=0 while Linux was passing pmp=1 and failing to discover the drives. Update libsas to follow the libata example of pulling the pmp setting from the ata_link and correct is_cmd to be 1 since all tf's submitted through ->qc_issue are commands. Presumably libsas lldds do not care about is_cmd as they have sideband mechanisms to perform link management. http://marc.info/?l=linux-scsi&m=138179681726990&w=2 Cc: xjtu...@gmail.com Cc: lindar_...@usish.com Cc: Lukasz Dorau Cc: Maciej Patelczyk Cc: Dave Jiang Cc: Xiangliang Yu Cc: sta...@vger.kernel.org Reported-by: Praveen Murali Tested-by: Praveen Murali Signed-off-by: Dan Williams --- drivers/scsi/libsas/sas_ata.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 161c98efade9..d0fb99d5da95 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -211,7 +211,7 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc) qc->tf.nsect = 0; } - ata_tf_to_fis(&qc->tf, 1, 0, (u8*)&task->ata_task.fis); + ata_tf_to_fis(&qc->tf, qc->dev->link->pmp, 1, (u8*)&task->ata_task.fis); task->uldd_task = qc; if (ata_is_atapi(qc->tf.protocol)) { memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Backporting 'libsas: fix timeout vs completion race' and related patches to 3.2.51 recommended?
On Fri, Oct 25, 2013 at 5:47 PM, Amit Uttamchandani wrote: > Hello, > > Tested kernel 3.4.62, which introduces: > > libsas error handling + discovery patch set > > http://lwn.net/Articles/476986/ > > Is it advisable for me to backport this to v3.2.51? > If you need reliable ata error handling via libsas then, yes. Kernels prior to 3.4 will encounter some bugs in this area. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] isci: fix reset timeout handling
Remove an erroneous BUG_ON() in the case of a hard reset timeout. If a SATA device is unable to restore the link in time we expect the port to go back to the "awaiting link-up" state. Since the timeout path notified libsas that the port is down, we want to notify libsas when it returns and expect that the port may not be in the "resetting" state. Also, once we have notified that the port is down, in-progress resets should abort immediately. Return -ENODEV from ->lldd_I_T_nexus_reset() to indicate that libata error handling should end. Cc: Reported-by: Xun Ni Tested-by: Xun Ni Signed-off-by: Dan Williams --- drivers/scsi/isci/port_config.c |5 - drivers/scsi/isci/task.c|2 +- 2 files changed, 1 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/isci/port_config.c b/drivers/scsi/isci/port_config.c index 85c77f6b802b..4b84d204db01 100644 --- a/drivers/scsi/isci/port_config.c +++ b/drivers/scsi/isci/port_config.c @@ -617,11 +617,6 @@ static void sci_apc_agent_link_up(struct isci_host *ihost, /* the phy is already the part of the port */ u32 port_state = iport->sm.current_state_id; - /* if the PORT'S state is resetting then the link up is from -* port hard reset in this case, we need to tell the port -* that link up is recieved -*/ - BUG_ON(port_state != SCI_PORT_RESETTING); port_agent->phy_ready_mask |= 1 << phy_index; sci_port_link_up(iport, iphy); } diff --git a/drivers/scsi/isci/task.c b/drivers/scsi/isci/task.c index 0d30ca849e8f..5d6fda72d659 100644 --- a/drivers/scsi/isci/task.c +++ b/drivers/scsi/isci/task.c @@ -801,7 +801,7 @@ int isci_task_I_T_nexus_reset(struct domain_device *dev) /* XXX: need to cleanup any ireqs targeting this * domain_device */ - ret = TMF_RESP_FUNC_COMPLETE; + ret = -ENODEV; goto out; } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: spinlock_irqsave() && flags (Was: pm80xx: Spinlock fix)
On Thu, Jan 2, 2014 at 2:31 AM, Suresh Thiagarajan wrote: > > > On Fri, Dec 27, 2013 at 9:48 PM, Oleg Nesterov wrote: >> On 12/24, Suresh Thiagarajan wrote: >>> >>> Below is a small pseudo code on protecting/serializing the flag for global >>> access. >>> struct temp >>> { >>> ... >>> spinlock_t lock; >>> unsigned long lock_flags; >>> }; >>> void my_lock(struct temp *t) >>> { >>>unsigned long flag; // thread-private variable as suggested >>>spin_lock_irqsave(&t->lock, flag); >>>t->lock_flags = flag; //updating inside critical section now >>> to serialize the access to flag >>> } >>> >>> void my_unlock(struct temp *t) >>> { >>>unsigned long flag = t->lock_flags; >>>t->lock_flags = 0; //clearing it before getting out of >>> critical section >>>spin_unlock_irqrestore(&t->lock, flag); >>> } >> >> Yes, this should work as a quick fix. And you do not need to clear >> ->lock_flags >> in my_unlock(). >> >> But when I look at original patch again, I no longer understand why do >> you need pm8001_ha->lock_flags at all. Of course I do not understand this >> code, I am sure I missed something, but at first glance it seems that only >> this sequence >> >> spin_unlock_irq(&pm8001_ha->lock); >> t->task_done(t); >> spin_lock_irq(&pm8001_ha->lock); >> >> should be fixed? >> >> If yes, why you can't simply do spin_unlock() + spin_lock() around >> t->task_done() ? This won't enable irqs, but spin_unlock_irqrestore() >> doesn't necessarily enables irqs too, so ->task_done() can run with >> irqs disabled? >> >> And note that the pattern above has a lot of users, perhaps it makes >> sense to start with something like the patch below? > > Thanks James, Oleg and all for your inputs. > Will start with review and testing this patch and then work/investigate to > keep shortest and clearest critical > section so that we can have the lock and unlock within the same routine. > Fwiw we solved this in libsas a while back with a similar pattern proposed by Oleg: unsigned long flags; local_irq_save(flags); spin_unlock(lock); ... spin_lock_lock(lock); local_irq_restore(flags); See commit 312d3e56119a "[SCSI] libsas: remove ata_port.lock management duties from lldds" -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] isci: fix reset timeout handling
Remove an erroneous BUG_ON() in the case of a hard reset timeout. The reset timeout handler puts the port into the "awaiting link-up" state. The timeout causes the device to be disconnected and we need to be in the awaiting link-up state to re-connect the port. The BUG_ON() made the incorrect assumption that resets never timeout and we always complete the reset in the "resetting" state. Testing this patch also uncovered that libata continues to attempt to reset the port long after the driver has torn down the context. Once the driver has committed to abandoning the link it must indicate to libata that recovery ends by returning -ENODEV from ->lldd_I_T_nexus_reset(). Cc: Cc: Maciej Patelczyk Cc: Dave Jiang Acked-by: Lukasz Dorau Reported-by: David Milburn Reported-by: Xun Ni Tested-by: Xun Ni Signed-off-by: Dan Williams --- v2: Dave reported the build warning regression the last patch caused, and I clarified the changelog a bit. drivers/scsi/isci/port_config.c |7 --- drivers/scsi/isci/task.c|2 +- 2 files changed, 1 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/isci/port_config.c b/drivers/scsi/isci/port_config.c index 85c77f6b802b..ac879745ef80 100644 --- a/drivers/scsi/isci/port_config.c +++ b/drivers/scsi/isci/port_config.c @@ -615,13 +615,6 @@ static void sci_apc_agent_link_up(struct isci_host *ihost, SCIC_SDS_APC_WAIT_LINK_UP_NOTIFICATION); } else { /* the phy is already the part of the port */ - u32 port_state = iport->sm.current_state_id; - - /* if the PORT'S state is resetting then the link up is from -* port hard reset in this case, we need to tell the port -* that link up is recieved -*/ - BUG_ON(port_state != SCI_PORT_RESETTING); port_agent->phy_ready_mask |= 1 << phy_index; sci_port_link_up(iport, iphy); } diff --git a/drivers/scsi/isci/task.c b/drivers/scsi/isci/task.c index 0d30ca849e8f..5d6fda72d659 100644 --- a/drivers/scsi/isci/task.c +++ b/drivers/scsi/isci/task.c @@ -801,7 +801,7 @@ int isci_task_I_T_nexus_reset(struct domain_device *dev) /* XXX: need to cleanup any ireqs targeting this * domain_device */ - ret = TMF_RESP_FUNC_COMPLETE; + ret = -ENODEV; goto out; } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] libata: resume in the background
On Mon, Dec 16, 2013 at 3:30 PM, Phillip Susi wrote: > Don't block the resume path waiting for the disk to > spin up. > --- > drivers/ata/libata-core.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c > index 8f856bb..4a28caf 100644 > --- a/drivers/ata/libata-core.c > +++ b/drivers/ata/libata-core.c > @@ -5430,20 +5430,18 @@ static int __ata_port_resume_common(struct ata_port > *ap, pm_message_t mesg, > static int ata_port_resume_common(struct device *dev, pm_message_t mesg) > { > struct ata_port *ap = to_ata_port(dev); > + static int dontcare; > > - return __ata_port_resume_common(ap, mesg, NULL); > + return __ata_port_resume_common(ap, mesg, &dontcare); > } I was going to comment that this leaves us open to a race to submit new i/o before recovery starts, but scsi_schedule_eh already blocks new i/o, so I think we're good. I think it deserves a comment here about why it's safe to not care. > > static int ata_port_resume(struct device *dev) > { > int rc; > > + if (pm_runtime_suspended(dev)) > + return 0; Why? If we dpm_resume() a port that was rpm_suspend()ed the state needs to be reset to active. I think this check also forces the port to resume twice, once for system resume and again for the eventual runtime_resume. > rc = ata_port_resume_common(dev, PMSG_RESUME); > - if (!rc) { > - pm_runtime_disable(dev); > - pm_runtime_set_active(dev); > - pm_runtime_enable(dev); > - } > > return rc; > } ...so, the pm_runtime_suspended() check should go and something like this folded in: diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c index 92d7797223be..a6cc107c9434 100644 --- a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -4143,5 +4143,11 @@ static void ata_eh_handle_port_resume(struct ata_port *ap) ap->pm_result = NULL; } spin_unlock_irqrestore(ap->lock, flags); + + if (rc == 0) { + pm_runtime_disable(dev); + pm_runtime_set_active(dev); + pm_runtime_enable(dev); + } } #endif /* CONFIG_PM */ -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] libata: resume in the background
On Fri, Jan 10, 2014 at 6:25 PM, Phillip Susi wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA512 > > On 01/10/2014 05:26 PM, Dan Williams wrote: >> I was going to comment that this leaves us open to a race to >> submit new i/o before recovery starts, but scsi_schedule_eh already >> blocks new i/o, so I think we're good. I think it deserves a >> comment here about why it's safe to not care. > > I don't follow, could you explain? The question I have when reading "__ata_port_resume_common(ap, mesg, &dontcare);" is what makes it safe for the port to not actually be resumed upon return. The answer I believe is that the host is guaranteed to be in the SHOST_RECOVERY state upon return and no new i/o submissions will occur until the error handler has a chance to run. >>> static int ata_port_resume(struct device *dev) { int rc; >>> >>> + if (pm_runtime_suspended(dev)) + return 0; >> >> Why? If we dpm_resume() a port that was rpm_suspend()ed the state >> needs to be reset to active. I think this check also forces the >> port to resume twice, once for system resume and again for the >> eventual runtime_resume. > > It stops the first resume by returning early, so only the second one > happens. Got it, but it looks odd. > >> ...so, the pm_runtime_suspended() check should go and something >> like this folded in: >> >> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c >> index 92d7797223be..a6cc107c9434 100644 --- >> a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -4143,5 >> +4143,11 @@ static void ata_eh_handle_port_resume(struct ata_port >> *ap) ap->pm_result = NULL; } spin_unlock_irqrestore(ap->lock, >> flags); + + if (rc == 0) { + >> pm_runtime_disable(dev); + >> pm_runtime_set_active(dev); + >> pm_runtime_enable(dev); + } } #endif /* CONFIG_PM */ > > Ahh, and that will stop the second resume, rather than the previous > change to stop the first? > > Don't we want to stop the first rather than the second? Otherwise if > the port is runtime suspended at system suspend time ( maybe no drive > connected to it? ) then there is no sense in resuming it at system > resume time. Hmm, if the drive disconnected during suspend I think we want to find that out sooner rather than later. The changelog should really read "Don't block the resume path waiting for the disk re-establish its link". Given it's async do we gain that much by deferring a one time check of the ports at resume? At the very least I think tying rpm_resume to dpm_resume like that is an additional change to defer to a separate patch. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RESEND v2 1/2] Hard disk S3 resume time optimization
On Fri, Jan 10, 2014 at 6:13 PM, Phillip Susi wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA512 > > On 01/10/2014 06:11 PM, Brandt, Todd E wrote: >> Yes yours is simpler, but it also opens a potential memory issue >> by passing a static int as the return location for the error value. >> I think it's just safer to tell the callback to attempt no return >> value at all, and for that you need to expand it into two >> arguments, one for selection, the other for the output address. > > What sort of memory issue? Also isn't there a system NULL page > somewhere that could be used? > I think the static variable is ok. We can be sure that all eh threads are torn down before libata.ko is unloaded. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RESEND v2 1/2] Hard disk S3 resume time optimization
On Mon, Jan 13, 2014 at 12:06 PM, Todd E Brandt wrote: > On Fri, Jan 10, 2014 at 07:13:11PM -0800, Dan Williams wrote: >> On Fri, Jan 10, 2014 at 6:13 PM, Phillip Susi wrote: >> > -BEGIN PGP SIGNED MESSAGE- >> > Hash: SHA512 >> > >> > On 01/10/2014 06:11 PM, Brandt, Todd E wrote: >> >> Yes yours is simpler, but it also opens a potential memory issue >> >> by passing a static int as the return location for the error value. >> >> I think it's just safer to tell the callback to attempt no return >> >> value at all, and for that you need to expand it into two >> >> arguments, one for selection, the other for the output address. >> > >> > What sort of memory issue? Also isn't there a system NULL page >> > somewhere that could be used? >> > >> >> I think the static variable is ok. We can be sure that all eh threads >> are torn down before libata.ko is unloaded. > > Actually there's one other reason. In the ata_port_request_pm function it > checks to see if there's a previous resume operation pending, and if so > it calls ata_port_wait_eh in order to wait for it to complete before > issuing the new suspend. If you just use the (int*)async parameter it > will return immediately and defer to the caller to try again, like is does > with SAS. But in our case we *don't* try again, so it would result in the > resume being skipped. There needs to be a new case where the caller wants > the call to be asynchronous, and it wants ata_port_request_pm to do its > own waiting, but doesn't care about the return value. Thus the additional > parameter. I think that is specifically for the libata case of a suspend request arriving while an async resume is still in flight. Given libata suspends are synchronous I do not think we have the reverse problem of resume requests arriving while a suspend is in flight. However, it might be worth a WARN_ON_ONCE() to document that assumption. In the libsas case suspends are asynchronous, but they are flushed by libsas before any resumes are processed, so there should not be conflicts. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RESEND v2 1/2] Hard disk S3 resume time optimization
On Mon, Jan 13, 2014 at 3:51 PM, Todd E Brandt wrote: > On Mon, Jan 13, 2014 at 12:37:01PM -0800, Dan Williams wrote: >> On Mon, Jan 13, 2014 at 12:06 PM, Todd E Brandt >> > Actually there's one other reason. In the ata_port_request_pm function it >> > checks to see if there's a previous resume operation pending, and if so >> > it calls ata_port_wait_eh in order to wait for it to complete before >> > issuing the new suspend. If you just use the (int*)async parameter it >> > will return immediately and defer to the caller to try again, like is does >> > with SAS. But in our case we *don't* try again, so it would result in the >> > resume being skipped. There needs to be a new case where the caller wants >> > the call to be asynchronous, and it wants ata_port_request_pm to do its >> > own waiting, but doesn't care about the return value. Thus the additional >> > parameter. >> >> I think that is specifically for the libata case of a suspend request >> arriving while an async resume is still in flight. Given libata > > Accoring to the comments it's for a previous resume, not a previous suspend. > > /* Previous resume operation might still be in > * progress. Wait for PM_PENDING to clear. > */ > if (ap->pflags & ATA_PFLAG_PM_PENDING) { > if (async) { > *async = -EAGAIN; > return 0; > } > ata_port_wait_eh(ap); > WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); > } Right, but that should only happen when ata_port_request_pm() is called for 'suspend' and 'resume' is in flight. I think it's a core device-power-management bug if ->resume() is called twice without an intervening ->suspend(). > I'm going to assume that it was added for a reason, so I've structured my > patch in such a way that it doesn't alter the existing logic. Removing that > particular check would be a completely different discussion and is out of > the scope of this patch. You're right that logic should stay, I don't think removing it was ever proposed? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Fri, Jan 17, 2014 at 6:57 AM, Alan Stern wrote: > On Thu, 16 Jan 2014, Todd E Brandt wrote: >> My patch makes ata port and scsi disk resume 'non-blocking' (asynchronous >> with respect to system resume). Which means once they're called the power >> manager pays no more attention to them and will complete system resume >> whenever. Both the power manager and the ata subsystems call these >> features 'asynchronous' so it can be confusing. > > I see. That's not explained very well in the patch description -- it > talks about going on to resume the next device on the list, but not > about waiting for the overall system resume to finish. > Once dpm_resume of the disk is asynchronous, is there much incremental gain by further deferring spin up? The drawback of doing on-demand resume of the disk is that you incur the full resume latency right when you need the data. System resume is a strong hint to warm the disks back up, and they will go back to sleep if unused. Of course it reduces the power savings if dpm_suspend/resume is a frequent occurrence. However, are systems that make aggressive use of system suspend shipping with rotating media, or do they ship with disks that are cheap to resume? If suspend is not frequent I wonder if it is worth the trouble/latency cost to keep the disk(s) asleep. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Fri, Jan 17, 2014 at 12:49 PM, Phillip Susi wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 1/17/2014 2:31 PM, Dan Williams wrote: >> Once dpm_resume of the disk is asynchronous, is there much >> incremental gain by further deferring spin up? The drawback of >> doing on-demand resume of the disk is that you incur the full >> resume latency right when you need the data. System resume is a >> strong hint to warm the disks back up, and they will go back to >> sleep if unused. > > It isn't a strong hint to warm the disks back up at all. Well, "at all" is overstating it. The system was idle and now it's being woken up to do some work. Inactivity timers are invalidated and now the decision becomes minimize access latency, or lazily wake up. > You *may* > access *a* disk soon after resume, then again, you may not. Either > way, it isn't good to spin up *all* disks after a resume only to have > them stop again after a few minutes when they aren't accessed. Hiding resume latency is a good reason to resume the disk as soon as possible. >It puts needless wear and tear on disks. I'm not sure spin up cycles are predictive of drive failure? >> Of course it reduces the power savings if dpm_suspend/resume is a >> frequent occurrence. However, are systems that make aggressive use >> of system suspend shipping with rotating media, or do they ship >> with disks that are cheap to resume? If suspend is not frequent I >> wonder if it is worth the trouble/latency cost to keep the disk(s) >> asleep. > > The vast majority of systems still use HDD; SSD isn't *that* popular. I'm not arguing about the attach rate of HDDs in shipping systems, I'm asking what is the system suspend / resume frequency of such systems and is it worth carrying that complexity in the kernel and injecting unnecessary resume latency for what amounts to a niche use case. > Also there is the power savings of leaving the disk in standby when > there is no need to "spin" it up, which while fast on an SSD, does > still increase power consumption and so should be avoided. > > In any case, this will have no affect for most people since ATA disks > spin up on their own by default. So the use case is even more niche... -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/2] asynchronous ata port resume
On Wed, Jan 15, 2014 at 8:38 PM, Todd E Brandt wrote: > On resume, the ATA port driver currently waits until the AHCI controller > finishes executing the port wakeup command. This patch changes the > ata_port_resume callback to issue the wakeup and then return immediately, > thus allowing the next device in the pm queue to resume. Any commands > issued to the AHCI hardware during the wakeup will be queued up and > executed once the port is physically online. Thus no information is lost. > > This patch applies to all ata resume callbacks: resume, restore, and thaw for > code simplicity. The primary benefit is to S3 resume. > > The return value from ata_resume_async is now an indicator of whether > the resume was initiated, rather than if it was completed. I'm letting the ata > driver assume control over its own error reporting in this case (which it does > already). If you look at the ata_port resume code you'll see that the > ata_port_resume callback returns the status of the ahci_port_resume callback, > which is always 0. So I don't see any harm in ignoring it. > > If somebody requests suspend while ATA port resume is still running, the > request is queued until the resume has completed. I've tested > that case very heavily. Basically if you do two suspends in a row (within > seconds of each other) you lose any performance benefit, but that's a use > case that should happen only very rarerly (and wouldn't be expected to > be of any power benefit since too many sequential suspends actually > takes more power than just letting the machine stay in run mode). > > v4: Jan 15, 2014 > [in response to Tejun Huo] > - completely removed the pm_result member of the ata_port struct > - result from ata_port_suspend/resume is now returned from the function >instead of into a pointer location > - ata_port_request_pm now only returns 0/-EAGAIN for pass/fail > > Signed-off-by: Todd Brandt > Signed-off-by: Arjan van de Ven You may want to split out the pure "do resume async" changes from the return code and api cleanup. Doubtful that anything depends on the return code but at least gives a bisection point for each conceptual change. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Fri, Jan 17, 2014 at 5:41 PM, Alan Stern wrote: > On Fri, 17 Jan 2014, Tejun Heo wrote: > >> On Fri, Jan 17, 2014 at 03:15:54PM -0500, Alan Stern wrote: >> > You will have to argue this point with Phillip. >> > >> > If necessary, we could add a sysfs attribute to force a spin-up during >> > system resume. Or you could disable runtime PM for the disk, but that >> > has its own disadvantages. >> >> Isn't immediate spin-up trivial to implement from anywhere? I'm not >> sure whether we'll end up defaulting to the lazy behavior or not but >> if we do requiring userland to echo something to sysfs to configure >> immediate spin-up feels a bit silly when userland might as well just >> issue a dummy command to force spinup. > > Then turn it around. Make the new sysfs attribute enable a lazy > resume. > >> And, yes, I agree with Dan that avoiding spinup of harddrives on >> system resume seems a bit niche in its usefulness. suspend/resume >> cycle at the very least generates logs which most likely will be >> committed to the drive sooner or later. >> >> What kind of use cases are we expecting for the lazy behavior? > > The intention is that this will help on systems with more than one > disk drive. The one containing the core OS files and the journal will > certainly spin up right away, but the others may not. > > To tell the truth, I'm not sure how useful this feature would be to > most people. But there probably are a few, like Phillip, who will be > glad to have it. That seems like sufficient justification for adding > it. > Hmm, given that an ATA "start" command is just a "read verify" I wonder if this can be done more simply than tying runtime and system resume together. Could the lazy resume knob simply cause sd_resume to be skipped? The first real command that comes down will achieve the same effect. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Fri, Jan 17, 2014 at 7:18 PM, Phillip Susi wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA512 > > On 01/17/2014 05:17 PM, Dan Williams wrote: >> Well, "at all" is overstating it. The system was idle and now >> it's being woken up to do some work. Inactivity timers are >> invalidated and now the decision becomes minimize access latency, >> or lazily wake up. > > Sure, but that work may or may not involve disk access. Think of a > laptop you are using to watch a video on youtube. You need to get up > for a minute so you suspend, come back, and continue watching... no > need for disk IO just because you paused for a few minutes. For a > single disk system you may argue that it is likely the disk will be > needed, but for multiple disk systems, that probability goes down. > >>> It puts needless wear and tear on disks. >> >> I'm not sure spin up cycles are predictive of drive failure? > > Drive makers rate them to handle only so many start/stop cycles before > they fail, which is why SMART keeps track of the count. Any time you > move mechanical parts, they wear out. They also recommend maximum intervals of spin down time. Mind you its on the order of weeks, but spinning up the drive help keeps the bearing lubricant evenly distributed. What I was getting at is do these patches have a measurable impact on drive lifetime? For example how many spin ups is this saving vs deferring in your setup. I buy the power saving impact, but wearing out the drive prematurely is harder to accept without data. >> I'm not arguing about the attach rate of HDDs in shipping systems, >> I'm asking what is the system suspend / resume frequency of such >> systems and is it worth carrying that complexity in the kernel and >> injecting unnecessary resume latency for what amounts to a niche >> use case. > > It doesn't add any latency for an SSD, or even an ATA HDD that does > not have PuiS enabled. SCSI disks are more likely to be in larger > systems where it is more likely that at least some of the drives won't > be accessed soon, and if you really want to avoid the latency, then > you can configure your user space pm scripts to disable runtime pm > prior to suspend to get the old behavior back. > Larger systems are less likely to ever sleep. I think it's a bit of an interface surprise to have pm_runtime disable have side effects only for scsi_disk devices. I think lazy_resume needs to be an explicit attribute of the disk. For ata devices any command wakes up the drive. Perhaps we could simply do the same for scsi devices. In lazy_resume mode flag the device as "needs spinup" at sd_resume time and schedule it when a command arrives. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] isci: correct erroneous for_each_isci_host macro
On Mon, Jan 20, 2014 at 8:54 AM, Lukasz Dorau wrote: > In the first place, the loop 'for' in the macro 'for_each_isci_host' > (drivers/scsi/isci/host.h:314) is incorrect, because it accesses > the 3rd element of 2 element array. After the 2nd iteration it executes > the instruction: > ihost = to_pci_info(pdev)->hosts[2] > (while the size of the 'hosts' array equals 2) and reads an > out of range element. > > In the second place, because of the following GCC v4.8 bug: > http://marc.info/?l=linux-kernel&m=138998871911336&w=2 > this loop is incorrectly compiled by GCC v4.8. > As a result, on platforms with two SCU controllers, > the loop at drivers/scsi/isci/init.c:717 is executed > more times than it can be (for i=0,1 and 2). > It causes the following oops after 'rmmod isci': > Surprised those bugs lasted this long... some comments below. > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: [] __list_add+0x1b/0xc0 > Oops: [#1] SMP > RIP: 0010:[] [] __list_add+0x1b/0xc0 > Call Trace: > [] __mutex_lock_slowpath+0x114/0x1b0 > [] mutex_lock+0x1f/0x30 > [] sas_disable_events+0x1b/0x50 [libsas] > [] sas_unregister_ha+0x18/0x60 [libsas] > [] isci_unregister+0x1e/0x40 [isci] > [] isci_pci_remove+0x5d/0x100 [isci] > [] pci_device_remove+0x3b/0xb0 > [] __device_release_driver+0x7f/0xf0 > [] driver_detach+0xa8/0xb0 > [] bus_remove_driver+0x9b/0x120 > [] driver_unregister+0x2c/0x50 > [] pci_unregister_driver+0x23/0x80 > [] isci_exit+0x10/0x1e [isci] > [] SyS_delete_module+0x16b/0x2d0 > [] ? do_notify_resume+0x61/0xa0 > [] system_call_fastpath+0x16/0x1b > > The loop has been corrected. > This patch fixes the above oops. > > Signed-off-by: Lukasz Dorau > Reviewed-by: Maciej Patelczyk > Tested-by: Lukasz Dorau > Cc: Dan Williams > Cc: Dave Jiang > Cc: > --- > drivers/scsi/isci/host.h |6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/scsi/isci/host.h b/drivers/scsi/isci/host.h > index 4911310..95adf31 100644 > --- a/drivers/scsi/isci/host.h > +++ b/drivers/scsi/isci/host.h > @@ -311,9 +311,9 @@ static inline struct Scsi_Host *to_shost(struct isci_host > *ihost) > } > > #define for_each_isci_host(id, ihost, pdev) \ > - for (id = 0, ihost = to_pci_info(pdev)->hosts[id]; \ > -id < ARRAY_SIZE(to_pci_info(pdev)->hosts) && ihost; \ > -ihost = to_pci_info(pdev)->hosts[++id]) > + for (id = 0, ihost = to_pci_info(pdev)->hosts[0]; \ No need to initialize ihost here since we assign and check it later. > +(id < ARRAY_SIZE(to_pci_info(pdev)->hosts)) && \ Just use SCI_MAX_CONTROLLERS here. > +(ihost = to_pci_info(pdev)->hosts[id]); id++) -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Sat, Jan 18, 2014 at 6:08 AM, Alan Stern wrote: > On Fri, 17 Jan 2014, Dan Williams wrote: > >> I think it's a bit of an interface surprise to have pm_runtime disable >> have side effects only for scsi_disk devices. I think lazy_resume >> needs to be an explicit attribute of the disk. For ata devices any >> command wakes up the drive. Perhaps we could simply do the same for >> scsi devices. In lazy_resume mode flag the device as "needs spinup" >> at sd_resume time and schedule it when a command arrives. > > What you have just described is essentially what runtime PM does. Deliberately so... > There's no need to implement it any other way. And this explains why > disabling runtime PM would also disable lazy resume. The need is to avoid tying rpm and dpm ops together in surprising new ways that require userspace to change if it wants to keep the default behavior of hiding resume latency as much as possible. A lazy_resume knob vs changing the default let's the few setups that care set the power-up in standby mode. It should also "perform" better given it does not require the disk to be runtime suspended prior to system suspend, it will unconditionally resume on demand. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Tue, Jan 21, 2014 at 7:44 AM, Alan Stern wrote: > On Tue, 21 Jan 2014, Dan Williams wrote: > >> On Sat, Jan 18, 2014 at 6:08 AM, Alan Stern >> wrote: >> > On Fri, 17 Jan 2014, Dan Williams wrote: >> > >> >> I think it's a bit of an interface surprise to have pm_runtime disable >> >> have side effects only for scsi_disk devices. I think lazy_resume >> >> needs to be an explicit attribute of the disk. For ata devices any >> >> command wakes up the drive. Perhaps we could simply do the same for >> >> scsi devices. In lazy_resume mode flag the device as "needs spinup" >> >> at sd_resume time and schedule it when a command arrives. >> > >> > What you have just described is essentially what runtime PM does. >> >> Deliberately so... >> >> > There's no need to implement it any other way. And this explains why >> > disabling runtime PM would also disable lazy resume. >> >> The need is to avoid tying rpm and dpm ops together in surprising new >> ways that require userspace to change if it wants to keep the default >> behavior of hiding resume latency as much as possible. A lazy_resume >> knob vs changing the default let's the few setups that care set the >> power-up in standby mode. > > That's why I proposed it. What are you getting at? ? Yes, earlier in the thread you propose a lazy_resume knob and then later say "there's no need to implement it any other way" referring to runtime PM. I took that to mean there's no need to consider any other solution besides disabling runtime PM. >> It should also "perform" better given it >> does not require the disk to be runtime suspended prior to system >> suspend, it will unconditionally resume on demand. > > None of my posts in this email thread have required the disk to be in > runtime suspend prior to the system suspend. Apologies, I was recalling one of the earlier patches. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Disk spin-up optimization during system resume
On Tue, Jan 21, 2014 at 8:18 AM, Dan Williams wrote: > On Tue, Jan 21, 2014 at 7:44 AM, Alan Stern wrote: >> On Tue, 21 Jan 2014, Dan Williams wrote: >> >>> On Sat, Jan 18, 2014 at 6:08 AM, Alan Stern >>> wrote: >>> > On Fri, 17 Jan 2014, Dan Williams wrote: >>> > >>> >> I think it's a bit of an interface surprise to have pm_runtime disable >>> >> have side effects only for scsi_disk devices. I think lazy_resume >>> >> needs to be an explicit attribute of the disk. For ata devices any >>> >> command wakes up the drive. Perhaps we could simply do the same for >>> >> scsi devices. In lazy_resume mode flag the device as "needs spinup" >>> >> at sd_resume time and schedule it when a command arrives. >>> > >>> > What you have just described is essentially what runtime PM does. >>> >>> Deliberately so... >>> >>> > There's no need to implement it any other way. And this explains why >>> > disabling runtime PM would also disable lazy resume. >>> >>> The need is to avoid tying rpm and dpm ops together in surprising new >>> ways that require userspace to change if it wants to keep the default >>> behavior of hiding resume latency as much as possible. A lazy_resume >>> knob vs changing the default let's the few setups that care set the >>> power-up in standby mode. >> >> That's why I proposed it. What are you getting at? > > ? > > Yes, earlier in the thread you propose a lazy_resume knob and then > later say "there's no need to implement it any other way" referring to > runtime PM. I took that to mean there's no need to consider any other > solution besides disabling runtime PM. The source of my contention was that userspace should not need to take new action to restore historical behavior. After taking another look at the thread I see you were likely agreeing with that, but just proposing that when lazy_resume is enabled the mechanism to maintain disk spin-down is runtime PM. That makes sense. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[LSF/MM TOPIC][ATTEND] Plumbing I/O Cache / Tiered Storage Hints
In addition to disk drives with internally tiered-storage (solid-state + magnetic media) the kernel has also grown native I/O-caching / tiered-storage implementations in bcache and dm-cache. Currently, all these solutions depend on heuristics to determine what tier the data referenced in an I/O belongs. However, the presence of hinting proposals from SCSI [1], ATA [2], and bcache [3] indicate that these devices (hardware or virtual) want to consume explicit hints indicating the value of caching data in a higher performing tier. At LSF I want to discuss options and opportunities for plumbing cache hints from userspace, through the I/O stack to devices. My colleague, Jason Akers, is also interested in this discussion as he is investigating how to exploit these hints from userspace. I participated in the LSF 2012 discussion of this topic, see that it was raised again at LSF 2013, and note that we have not settled on an enabling path. What's new for this year is an effort to set aside, for now, the deeper complexities of the device specification proposals and focus on a minimal set of hints that can be specified per-process (ionice) and maybe per-file (fadvise). My suspicion is that AIO attributes is useful for applications that want access to the full range of access hints exposed by a device. However, for the general buffered-I/O / tiered-storage case, a small set of hints achieves the bulk of the value. I am also interested in: Volatile ranges SMR Enabling Integrity passthrough -- Dan [1] T10 LBA Access Hints [2] T13 Hybrid Information Feature [3] AIO Attributes: http://marc.info/?l=linux-aio&m=136580574523674&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC][ATTEND] Plumbing I/O Cache / Tiered Storage Hints
On Tue, Jan 28, 2014 at 2:57 PM, Dan Williams wrote: > In addition to disk drives with internally tiered-storage (solid-state > + magnetic media) the kernel has also grown native I/O-caching / > tiered-storage implementations in bcache and dm-cache. > > Currently, all these solutions depend on heuristics to determine what > tier the data referenced in an I/O belongs. However, the presence of > hinting proposals from SCSI [1], ATA [2], and bcache [3] indicate that > these devices (hardware or virtual) want to consume explicit hints > indicating the value of caching data in a higher performing tier. > > At LSF I want to discuss options and opportunities for plumbing cache > hints from userspace, through the I/O stack to devices. My colleague, > Jason Akers, is also interested in this discussion as he is > investigating how to exploit these hints from userspace. I > participated in the LSF 2012 discussion of this topic, see that it was > raised again at LSF 2013, and note that we have not settled on an > enabling path. What's new for this year is an effort to set aside, > for now, the deeper complexities of the device specification proposals > and focus on a minimal set of hints that can be specified per-process > (ionice) and maybe per-file (fadvise). > > My suspicion is that AIO attributes is useful for applications that > want access to the full range of access hints exposed by a device. > However, for the general buffered-I/O / tiered-storage case, a small > set of hints achieves the bulk of the value. > > I am also interested in: > Volatile ranges > SMR Enabling > Integrity passthrough > > -- > Dan > > [1] T10 LBA Access Hints > [2] T13 Hybrid Information Feature Correction, the hinting scheme is defined by SATA-IO in SATA 3.2 > [3] AIO Attributes: http://marc.info/?l=linux-aio&m=136580574523674&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi/isci/port_config: Fix a infinite loop.
On Tue, Jul 16, 2013 at 7:54 PM, Xinghai Yu wrote: > It seems the "phy_index++;" have been placed in wrong place, without it > the while circle up will do a infinite loop. > > Signed-off-by: Xinghai Yu > --- > drivers/scsi/isci/port_config.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/scsi/isci/port_config.c b/drivers/scsi/isci/port_config.c > index cd962da..85c77f6 100644 > --- a/drivers/scsi/isci/port_config.c > +++ b/drivers/scsi/isci/port_config.c > @@ -311,9 +311,9 @@ sci_mpc_agent_validate_phy_configuration(struct isci_host > *ihost, > &ihost->phys[phy_index]); > > assigned_phy_mask |= (1 << phy_index); > + phy_index++; > } > > - phy_index++; > } > > return sci_port_configuration_agent_validate_ports(ihost, port_agent); Hmm, I'm not aware of anyone using mpc mode. Maybe we should just remove it? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] isci: fix needless ata reset escalations
isci is needlessly tying libata's hands by returning SAM_STAT_CHECK_CONDITION to some ata errors. Instead, prefer SAS_PROTO_RESPONSE to let libata (via sas_ata_task_done()) disposition the device-to-host fis. For example isci is triggering an HSM Violation where AHCI is showing a simple media error for the same bus condition: isci: ata7.00: failed command: READ VERIFY SECTOR(S) ata7.00: cmd 40/00:01:00:00:00/00:00:00:00:00/e0 tag 0 res 01/04:00:00:00:00/00:00:00:00:00/e0 Emask 0x3 (HSM violation) ahci: ata6.00: failed command: READ VERIFY SECTOR(S) ata6.00: cmd 40/00:01:00:00:00/00:00:00:00:00/e0 tag 0 res 51/40:01:00:00:00/00:00:00:00:00/e0 Emask 0x9 (media error) Note that the isci response matches this from sas_ata_task_done(): /* We saw a SAS error. Send a vague error. */ [..] dev->sata_dev.fis[3] = 0x04; /* status err */ dev->sata_dev.fis[2] = ATA_ERR; The end effect is that isci is needlessly triggering hard resets when they are not necessary. Cc: Reported-by: Xun Ni Tested-by: Nelson Cheng Acked-by: Lukasz Dorau Signed-off-by: Dan Williams --- drivers/scsi/isci/request.c |8 ++-- 1 files changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/isci/request.c b/drivers/scsi/isci/request.c index 99d2930b18c8..56e38096f0c4 100644 --- a/drivers/scsi/isci/request.c +++ b/drivers/scsi/isci/request.c @@ -2723,13 +2723,9 @@ static void isci_process_stp_response(struct sas_task *task, struct dev_to_host_ memcpy(resp->ending_fis, fis, sizeof(*fis)); ts->buf_valid_size = sizeof(*resp); - /* If the device fault bit is set in the status register, then -* set the sense data and return. -*/ - if (fis->status & ATA_DF) + /* If an error is flagged let libata decode the fis */ + if (ac_err_mask(fis->status)) ts->stat = SAS_PROTO_RESPONSE; - else if (fis->status & ATA_ERR) - ts->stat = SAM_STAT_CHECK_CONDITION; else ts->stat = SAM_STAT_GOOD; -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] isci, libsas fixes for 3.4-rc2
Hi James, Here are some collected fixes. All but patch 2 are tagged for -stable. Patch 1 and 4 have been on the list since before the 3.14 merge window, patch 2 and 3 are new. Please apply, thank you. [PATCH 1/4] isci: fix reset timeout handling [PATCH 2/4] scsi, libsas: introduce scmd_dbg() to quiet false positive "timeout" messages [PATCH 3/4] isci: fix needless ata reset escalations [PATCH 4/4] isci: correct erroneous for_each_isci_host macro drivers/scsi/isci/host.h|5 ++--- drivers/scsi/isci/port_config.c |7 --- drivers/scsi/isci/request.c |8 ++-- drivers/scsi/isci/task.c|2 +- drivers/scsi/libsas/sas_scsi_host.c |2 +- include/scsi/scsi_device.h | 12 6 files changed, 18 insertions(+), 18 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] scsi, libsas: introduce scmd_dbg() to quiet false positive "timeout" messages
libsas sometimes short circuits timeouts to force commands into error recovery. It is misleading to log that the command timed-out in sas_scsi_timed_out() when in fact it was just queued for error handling. It's also redundant in the case of a true timeout as libata eh will detect and report timeouts via it's AC_ERR_TIMEOUT facility. Given that some environments consider "timeout" errors to be indicative of impending device failure demote the sas_scsi_timed_out() timeout message to be disabled by default. This parallels ata_scsi_timed_out(). Cc: Jack Wang Cc: Anand Kumar Santhanam Cc: Sangeetha Gnanasekaran Cc: Nikith Ganigarakoppal Reported-by: Xun Ni Tested-by: Nelson Cheng Acked-by: Lukasz Dorau Signed-off-by: Dan Williams --- drivers/scsi/libsas/sas_scsi_host.c |2 +- include/scsi/scsi_device.h | 12 2 files changed, 13 insertions(+), 1 deletions(-) diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c index da3aee17faa5..25d0f127424d 100644 --- a/drivers/scsi/libsas/sas_scsi_host.c +++ b/drivers/scsi/libsas/sas_scsi_host.c @@ -862,7 +862,7 @@ out: enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd) { - scmd_printk(KERN_DEBUG, cmd, "command %p timed out\n", cmd); + scmd_dbg(cmd, "command %p timed out\n", cmd); return BLK_EH_NOT_HANDLED; } diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index d65fbec2533d..067ac9f1607c 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -235,12 +235,24 @@ struct scsi_dh_data { #define sdev_printk(prefix, sdev, fmt, a...) \ dev_printk(prefix, &(sdev)->sdev_gendev, fmt, ##a) +#define sdev_dbg(sdev, fmt, a...) \ + dev_dbg(&(sdev)->sdev_gendev, fmt, ##a) + #define scmd_printk(prefix, scmd, fmt, a...) \ (scmd)->request->rq_disk ? \ sdev_printk(prefix, (scmd)->device, "[%s] " fmt,\ (scmd)->request->rq_disk->disk_name, ##a) : \ sdev_printk(prefix, (scmd)->device, fmt, ##a) +#define scmd_dbg(scmd, fmt, a...) \ +do { \ + if ((scmd)->request->rq_disk) \ + sdev_dbg((scmd)->device, "[%s] " fmt, \ +(scmd)->request->rq_disk->disk_name, ##a);\ + else \ + sdev_dbg((scmd)->device, fmt, ##a);\ + } while (0) + enum scsi_target_state { STARGET_CREATED = 1, STARGET_RUNNING, -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] isci: correct erroneous for_each_isci_host macro
From: Lukasz Dorau In the first place, the loop 'for' in the macro 'for_each_isci_host' (drivers/scsi/isci/host.h:314) is incorrect, because it accesses the 3rd element of 2 element array. After the 2nd iteration it executes the instruction: ihost = to_pci_info(pdev)->hosts[2] (while the size of the 'hosts' array equals 2) and reads an out of range element. In the second place, this loop is incorrectly optimized by GCC v4.8 (see http://marc.info/?l=linux-kernel&m=138998871911336&w=2). As a result, on platforms with two SCU controllers, the loop is executed more times than it can be (for i=0,1 and 2). It causes kernel panic during entering the S3 state and the following oops after 'rmmod isci': BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] __list_add+0x1b/0xc0 Oops: [#1] SMP RIP: 0010:[] [] __list_add+0x1b/0xc0 Call Trace: [] __mutex_lock_slowpath+0x114/0x1b0 [] mutex_lock+0x1f/0x30 [] sas_disable_events+0x1b/0x50 [libsas] [] sas_unregister_ha+0x18/0x60 [libsas] [] isci_unregister+0x1e/0x40 [isci] [] isci_pci_remove+0x5d/0x100 [isci] [] pci_device_remove+0x3b/0xb0 [] __device_release_driver+0x7f/0xf0 [] driver_detach+0xa8/0xb0 [] bus_remove_driver+0x9b/0x120 [] driver_unregister+0x2c/0x50 [] pci_unregister_driver+0x23/0x80 [] isci_exit+0x10/0x1e [isci] [] SyS_delete_module+0x16b/0x2d0 [] ? do_notify_resume+0x61/0xa0 [] system_call_fastpath+0x16/0x1b The loop has been corrected. This patch fixes kernel panic during entering the S3 state and the above oops. Signed-off-by: Lukasz Dorau Reviewed-by: Maciej Patelczyk Tested-by: Lukasz Dorau Cc: Dave Jiang Cc: Signed-off-by: Dan Williams --- drivers/scsi/isci/host.h |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/isci/host.h b/drivers/scsi/isci/host.h index 4911310a38f5..22a9bb1abae1 100644 --- a/drivers/scsi/isci/host.h +++ b/drivers/scsi/isci/host.h @@ -311,9 +311,8 @@ static inline struct Scsi_Host *to_shost(struct isci_host *ihost) } #define for_each_isci_host(id, ihost, pdev) \ - for (id = 0, ihost = to_pci_info(pdev)->hosts[id]; \ -id < ARRAY_SIZE(to_pci_info(pdev)->hosts) && ihost; \ -ihost = to_pci_info(pdev)->hosts[++id]) + for (id = 0; id < SCI_MAX_CONTROLLERS && \ +(ihost = to_pci_info(pdev)->hosts[id]); id++) static inline void wait_for_start(struct isci_host *ihost) { -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] isci: fix reset timeout handling
Remove an erroneous BUG_ON() in the case of a hard reset timeout. The reset timeout handler puts the port into the "awaiting link-up" state. The timeout causes the device to be disconnected and we need to be in the awaiting link-up state to re-connect the port. The BUG_ON() made the incorrect assumption that resets never timeout and we always complete the reset in the "resetting" state. Testing this patch also uncovered that libata continues to attempt to reset the port long after the driver has torn down the context. Once the driver has committed to abandoning the link it must indicate to libata that recovery ends by returning -ENODEV from ->lldd_I_T_nexus_reset(). Cc: Cc: Maciej Patelczyk Cc: Dave Jiang Acked-by: Lukasz Dorau Reported-by: David Milburn Reported-by: Xun Ni Tested-by: Xun Ni Signed-off-by: Dan Williams --- drivers/scsi/isci/port_config.c |7 --- drivers/scsi/isci/task.c|2 +- 2 files changed, 1 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/isci/port_config.c b/drivers/scsi/isci/port_config.c index 85c77f6b802b..ac879745ef80 100644 --- a/drivers/scsi/isci/port_config.c +++ b/drivers/scsi/isci/port_config.c @@ -615,13 +615,6 @@ static void sci_apc_agent_link_up(struct isci_host *ihost, SCIC_SDS_APC_WAIT_LINK_UP_NOTIFICATION); } else { /* the phy is already the part of the port */ - u32 port_state = iport->sm.current_state_id; - - /* if the PORT'S state is resetting then the link up is from -* port hard reset in this case, we need to tell the port -* that link up is recieved -*/ - BUG_ON(port_state != SCI_PORT_RESETTING); port_agent->phy_ready_mask |= 1 << phy_index; sci_port_link_up(iport, iphy); } diff --git a/drivers/scsi/isci/task.c b/drivers/scsi/isci/task.c index 0d30ca849e8f..5d6fda72d659 100644 --- a/drivers/scsi/isci/task.c +++ b/drivers/scsi/isci/task.c @@ -801,7 +801,7 @@ int isci_task_I_T_nexus_reset(struct domain_device *dev) /* XXX: need to cleanup any ireqs targeting this * domain_device */ - ret = TMF_RESP_FUNC_COMPLETE; + ret = -ENODEV; goto out; } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_transport_sas: move bsg destructor into sas_rphy_remove
Hi Joe, thanks for copying me on that other thread. On Tue, Dec 3, 2013 at 10:15 AM, Joe Lawrence wrote: > The recent change in sysfs, bcdde7e221a8750f9b62b6d0bd31b72ea4ad9309 > "sysfs: make __sysfs_remove_dir() recursive" revealed an asymmetric > rphy device creation/deletion sequence in scsi_transport_sas: > > modprobe mpt2sas > sas_rphy_add > device_add A rphy->dev > device_add B sas_device transport class > device_add C sas_end_device transport class > device_add D bsg class > > rmmod mpt2sas > sas_rphy_delete > sas_rphy_remove > device_del B > device_del C > device_del A > sysfs_remove_group recursive sysfs dir removal > sas_rphy_free > device_del D warning > > where device A is the parent of B, C, and D. > > When sas_rphy_free tries to unregister the bsg request queue (device D > above), the ensuing sysfs cleanup discovers that its sysfs group has > already been removed and emits a warning, "sysfs group... not found for > kobject 'end_device-X:0'". > > Since bsg creation is a side effect of sas_rphy_add, move its > complementary removal call into sas_rphy_remove. This imposes the > following tear-down order for the devices above: D, B, C, A. > > Note the sas_device and sas_end_device transport class devices (B and C > above) are created and destroyed both via the list match traversal in > attribute_container_device_trigger, so the order in which they are > handled is fixed. This is fine as long as they are deleted before their > parent device. > > Signed-off-by: Joe Lawrence > Cc: "James E.J. Bottomley" > --- > drivers/scsi/scsi_transport_sas.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/scsi/scsi_transport_sas.c > b/drivers/scsi/scsi_transport_sas.c > index 1b681427dde0..c341f855fadc 100644 > --- a/drivers/scsi/scsi_transport_sas.c > +++ b/drivers/scsi/scsi_transport_sas.c > @@ -1621,8 +1621,6 @@ void sas_rphy_free(struct sas_rphy *rphy) > list_del(&rphy->list); > mutex_unlock(&sas_host->lock); > > - sas_bsg_remove(shost, rphy); > - > transport_destroy_device(dev); > > put_device(dev); > @@ -1681,6 +1679,7 @@ sas_rphy_remove(struct sas_rphy *rphy) > } > > sas_rphy_unlink(rphy); > + sas_bsg_remove(NULL, rphy); > transport_remove_device(dev); > device_del(dev); > } Acked-by: Dan Williams I think this is the right move, and some follow-up cleanups / fixes come to mind: 1/ Why is sas_bsg_remove() multiplexing 2 use cases? Let's just have a sas_host_bsg_remove() and a sas_device_bsg_remove(). 2/ I think sas_rphy_free() mishandles the reference counts in the sas_rphy_add() failing case. sas_rphy_free() does: transport_destroy_device(dev); /* which is just put_device(dev) */ put_device(dev); ...but if sas_rphy_add() failed we never took the reference from transport_add_device() and I expect we only have the one reference from device_initialize() to drop. We also ignore the return code from transport_add_device() which is another hole. -- Dan Commenting on this mail should refresh this in James' queue. But you may want to resend anyways given this is a few months old. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mpt2sas driver barfs when force removing a drive on 3.13.1
On Wed, Feb 12, 2014 at 9:41 AM, Richard Yao wrote: > You should use `git send-email` to send the patch to all of the people > listed by get_maintainer.pl: I see nothing wrong with the patch Joe originally submitted: http://marc.info/?l=linux-scsi&m=138609455422179&w=2 > > richard@desktop ~/devel/linux $ scripts/get_maintainer.pl --file > drivers/scsi/scsi_transport_sas.c > "James E.J. Bottomley" (maintainer:SCSI > SUBSYSTEM) > linux-scsi@vger.kernel.org (open list:SCSI SUBSYSTEM) > linux-ker...@vger.kernel.org (open list) > > If I see that James E.J. Bottomley is already on CC for this discussion, > but my recent (and first) experience submitting a patch for another > subsystem was that they are ignored unless this strict procedure is > followed. > > If the maintainer does not respond within a month after doing this, I > suggest informing Linus Torvalds via email that the subsystem maintainer > is non-responsive. > Just re-ping James, he'll get to it... -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] isci, libsas fixes for 3.4-rc2
On Sun, Feb 23, 2014 at 12:12 PM, James Bottomley wrote: > > On Thu, 2014-02-06 at 12:22 -0800, Dan Williams wrote: >> Hi James, >> >> Here are some collected fixes. All but patch 2 are tagged for -stable. >> >> Patch 1 and 4 have been on the list since before the 3.14 merge window, >> patch 2 and 3 are new. >> >> Please apply, thank you. >> >> [PATCH 1/4] isci: fix reset timeout handling >> [PATCH 2/4] scsi, libsas: introduce scmd_dbg() to quiet false positive >> "timeout" messages >> [PATCH 3/4] isci: fix needless ata reset escalations >> [PATCH 4/4] isci: correct erroneous for_each_isci_host macro > > 2,3 aren't really bug fixes; they're more enhancements (they make the > behaviour better but nothing breaks without them), so I'll queue 1,4 for > fixes and 2-3 for misc. Ok. Also Joe's patch while you're in the area: http://marc.info/?l=linux-scsi&m=138609455422179&w=2 I acked it, but it shows some additional work needed for registration error paths in scsi_transport_sas. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 1/3] libata, libsas: kill pm_result and related cleanup
Tejun says: "At least for libata, worrying about suspend/resume failures don't make whole lot of sense. If suspend failed, just proceed with suspend. If the device can't be woken up afterwards, that's that. There isn't anything we could have done differently anyway. The same for resume, if spinup fails, the device is dud and the following commands will invoke EH actions and will eventually fail. Again, there really isn't any *choice* to make. Just making sure the errors are handled gracefully (ie. don't crash) and the following commands are handled correctly should be enough." The only libata user that actually cares about the result from a suspend operation is libsas. However, it only cares about whether queuing a new operation collides with an in-flight one. All libsas does with the error is retry, but we can just let libata wait for the previous operation before continuing. While we're at it take an opportunity to replace helper functions that just do a to_ata_port(dev) conversion with macros whose names readily distinguish sync vs async (queued) operations. Reference: http://marc.info/?l=linux-scsi&m=138995409532286&w=2 Cc: Phillip Susi Cc: Alan Stern Suggested-by: Tejun Heo Signed-off-by: Todd Brandt Signed-off-by: Dan Williams --- drivers/ata/libata-core.c | 75 +++-- drivers/ata/libata-eh.c | 13 +-- drivers/scsi/libsas/sas_ata.c | 35 +++ include/linux/libata.h|9 ++--- include/scsi/libsas.h |1 - 5 files changed, 39 insertions(+), 94 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 1a3dbd1b196e..0f47436c714c 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5351,22 +5351,17 @@ bool ata_link_offline(struct ata_link *link) } #ifdef CONFIG_PM -static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, - unsigned int action, unsigned int ehi_flags, - int *async) +static void ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, + unsigned int action, unsigned int ehi_flags, + bool async) { struct ata_link *link; unsigned long flags; - int rc = 0; /* Previous resume operation might still be in * progress. Wait for PM_PENDING to clear. */ if (ap->pflags & ATA_PFLAG_PM_PENDING) { - if (async) { - *async = -EAGAIN; - return 0; - } ata_port_wait_eh(ap); WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); } @@ -5375,11 +5370,6 @@ static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, spin_lock_irqsave(ap->lock, flags); ap->pm_mesg = mesg; - if (async) - ap->pm_result = async; - else - ap->pm_result = &rc; - ap->pflags |= ATA_PFLAG_PM_PENDING; ata_for_each_link(link, ap, HOST_FIRST) { link->eh_info.action |= action; @@ -5390,16 +5380,13 @@ static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, spin_unlock_irqrestore(ap->lock, flags); - /* wait and check result */ if (!async) { ata_port_wait_eh(ap); WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); } - - return rc; } -static int __ata_port_suspend_common(struct ata_port *ap, pm_message_t mesg, int *async) +static int ata_port_suspend_common(struct ata_port *ap, pm_message_t mesg, bool async) { /* * On some hardware, device fails to respond after spun down @@ -5411,59 +5398,53 @@ static int __ata_port_suspend_common(struct ata_port *ap, pm_message_t mesg, int */ unsigned int ehi_flags = ATA_EHI_QUIET | ATA_EHI_NO_AUTOPSY | ATA_EHI_NO_RECOVERY; - return ata_port_request_pm(ap, mesg, 0, ehi_flags, async); + ata_port_request_pm(ap, mesg, 0, ehi_flags, async); + return 0; } -static int ata_port_suspend_common(struct device *dev, pm_message_t mesg) -{ - struct ata_port *ap = to_ata_port(dev); - - return __ata_port_suspend_common(ap, mesg, NULL); -} +#define ata_port_suspend_sync(ap, msg) ata_port_suspend_common((ap), (msg), false) +#define queue_ata_port_suspend(ap, msg) ata_port_suspend_common((ap), (msg), true) static int ata_port_suspend(struct device *dev) { + struct ata_port *ap = to_ata_port(dev); + if (pm_runtime_suspended(dev)) return 0; - return ata_port_suspend_common(dev, PMSG_SUSPEND); + return ata_port_suspend_sync(ap, PMSG_SUSPEND); } static int ata_port_do_freeze(struct device *dev) { + struct
[PATCH v5 3/3] scsi: async sd resume
async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the 'int (*cb)(struct device *)' parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi Cc: Alan Stern Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/scsi.c |3 + drivers/scsi/scsi_pm.c | 115 -- drivers/scsi/scsi_priv.h |1 drivers/scsi/sd.c|1 4 files changed, 95 insertions(+), 25 deletions(-) diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index d8afec8317cf..990d4a302788 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -91,6 +91,9 @@ EXPORT_SYMBOL(scsi_logging_level); ASYNC_DOMAIN(scsi_sd_probe_domain); EXPORT_SYMBOL(scsi_sd_probe_domain); +ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain); +EXPORT_SYMBOL(scsi_sd_pm_domain); + /* NB: These are exposed through /proc/scsi/scsi and form part of the ABI. * You may not alter any existing entry (although adding new ones is * encouraged once assigned by ANSI/INCITS T10 diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c index 001e9ceda4c3..1cb211bf0383 100644 --- a/drivers/scsi/scsi_pm.c +++ b/drivers/scsi/scsi_pm.c @@ -18,17 +18,52 @@ #ifdef CONFIG_PM_SLEEP +#define do_pm_op(dev, op) \ + dev->driver ? dev->driver->pm ? \ + dev->driver->pm->op(dev) : 0 : 0 + +static int do_scsi_suspend(struct device *dev) +{ + return do_pm_op(dev, suspend); +} + +static int do_scsi_freeze(struct device *dev) +{ + return do_pm_op(dev, freeze); +} + +static int do_scsi_poweroff(struct device *dev) +{ + return do_pm_op(dev, poweroff); +} + +static int do_scsi_resume(struct device *dev) +{ + return do_pm_op(dev, resume); +} + +static int do_scsi_thaw(struct device *dev) +{ + return do_pm_op(dev, thaw); +} + +static int do_scsi_restore(struct device *dev) +{ + return do_pm_op(dev, restore); +} + static int scsi_dev_type_suspend(struct device *dev, int (*cb)(struct device *)) { int err; + /* flush pending in-flight resume operations, suspend is synchronous */ + async_synchronize_full_domain(&scsi_sd_pm_domain); + err = scsi_device_quiesce(to_scsi_device(dev)); if (err == 0) { - if (cb) { - err = cb(dev); - if (err) - scsi_device_resume(to_scsi_device(dev)); - } + err = cb(dev); + if (err) + scsi_device_resume(to_scsi_device(dev)); } dev_dbg(dev, "scsi suspend: %d\n", err); return err; @@ -38,10 +73,16 @@ static int scsi_dev_type_resume(struct device *dev, int (*cb)(struct device *)) { int err = 0; - if (cb) - err = cb(dev); + err = cb(dev); scsi_device_resume(to_scsi_device(dev)); dev_dbg(dev, "scsi resume: %d\n", err); + + if (err == 0) { + pm_runtime_disable(dev); + pm_runtime_set_active(dev); + pm_runtime_enable(dev); + } + return err; } @@ -66,20 +107,50 @@ scsi_bus_suspend_common(struct device *dev, int (*cb)(struct device *)) return err; } +static void async_sdev_resume(void *dev, async_cookie_t cookie) +{ + scsi_dev_type_resume(dev, do_scsi_resume); +} + +static void async_sdev_thaw(void *dev, async_cookie_t cookie) +{ + scsi_dev_type_resume(dev, do_scsi_thaw); +} + +static void async_sdev_restore(void *dev, async_cookie_t cookie) +{ + scsi_dev_type_resume(dev, do_scsi_restore); +} + +static async_func_t to_async_sdev_resume_fn(st
[PATCH v5 0/3] Accelerate Storage Resume (2x or more)
It is not everyday that a kernel operation gets 2x faster. Nice find by Todd and his AnalyzeSuspend tool [1]. Todd is on vacation so I am taking care of these patches to make sure they get in the queue for 3.15. The significant changes from Todd's last submission [2]: 1/ Split out the pure cleanup into its own patch, and expand it to clean up libsas as well 2/ Move the entirety of scsi_device resume to an async context. As written, v4 of the patchset overlapped the scsi-start-stop command with the scsi_device_resume(). Now in v5 the queue restart is properly ordered with the completion of the scsi-start-stop command. Resume completion time as reported by: "echo devices > /sys/power/pm_test && echo mem > /sys/power/state" BEFORE: PM: resume of devices complete after 2271.097 msecs AFTER: PM: resume of devices complete after 1057.404 msecs On a system with two SSDs attached via AHCI. -- Dan [1]: Todd's testing showing up to 12x resume latency improvement in some cases: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach [2]: Todd's v4 patchset: [PATCH v4 0/2] http://marc.info/?l=linux-ide&m=138984698103487&w=2 [PATCH v4 1/2] http://marc.info/?l=linux-ide&m=138984713203515&w=2 [PATCH v4 2/2] http://marc.info/?l=linux-ide&m=138984722003539&w=2 --- Dan Williams (2): libata, libsas: kill pm_result and related cleanup scsi: async sd resume Todd Brandt (1): libata: async resume drivers/ata/libata-core.c | 75 ++- drivers/ata/libata-eh.c | 13 + drivers/scsi/libsas/sas_ata.c | 35 ++-- drivers/scsi/scsi.c |3 + drivers/scsi/scsi_pm.c| 115 - drivers/scsi/scsi_priv.h |1 drivers/scsi/sd.c |1 include/linux/libata.h|9 +-- include/scsi/libsas.h |1 9 files changed, 134 insertions(+), 119 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 2/3] libata: async resume
From: Todd Brandt Improve overall system resume time by making libata link recovery actions asynchronous relative to other resume events. Link resume operations are performed using the scsi_eh thread, so commands, particularly the sd resume start/stop command, will be held off until the device exits error handling. Libata already flushes eh with ata_port_wait_eh() in the port teardown paths, so there are no concerns with async operation colliding with the end-of-life of the ata_port object. Also, libata-core is already careful to flush in-flight pm operations before another round of pm starts on the given ata_port. Reference: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi Cc: Alan Stern Signed-off-by: Todd Brandt [djbw: rebase on cleanup patch, changelog wordsmithing] Signed-off-by: Dan Williams --- drivers/ata/libata-core.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 0f47436c714c..7719ec7d9df9 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5444,7 +5444,7 @@ static int ata_port_resume(struct device *dev) { int rc; - rc = ata_port_resume_sync(to_ata_port(dev), PMSG_RESUME); + rc = queue_ata_port_resume(to_ata_port(dev), PMSG_RESUME); if (!rc) { pm_runtime_disable(dev); pm_runtime_set_active(dev); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND] SCSI: sd: don't fail if the device doesn't recognize SYNCHRONIZE CACHE
On Mon, Mar 3, 2014 at 1:25 PM, Daniel Mack wrote: > From: Alan Stern > > Evidently some wacky USB-ATA bridges don't recognize the SYNCHRONIZE > CACHE command, as shown in this email thread: > > http://marc.info/?t=13897835622&r=1&w=2 > > The fact that we can't tell them to drain their caches shouldn't > prevent the system from going into suspend. Therefore sd_sync_cache() > shouldn't return an error if the device replies with an Invalid > Command ASC. > > Signed-off-by: Alan Stern > Reported-by: Sven Neumann > Tested-by: Daniel Mack > CC: Oliver Neukum > CC: > --- > Hi, > > this patch has been around for awhile, but hasn't gained much > attraction, and hasn't been merged anywhere yet. Which is sad, > as it fixes a bug on real hardware when going to suspend :) > > Could anyone from the SCSI people have a quick look maybe? Acked-by: Dan Williams But I agree with Tejun [1], that this likely does not go far enough. We should also be looking to fail future writes to the device or disabling the cache. Tejun's comment: "Ooh, yeah, flush failure is special. That said, I think the right way to deal with that is marking the device as failed and fail writes / flushes afterwards instead of failing suspend. It's hightly unlikely the device is in any useable state after failing flushes anyway and failing suspend has potential to lead to pretty dramatic failure conditions (device overheating in the bag would be a common one) too." [1]: http://marc.info/?l=linux-scsi&m=138998568010393&w=2 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Fri, Mar 7, 2014 at 7:42 AM, Alan Stern wrote: > On Wed, 5 Mar 2014, Dan Williams wrote: > >> async_schedule() sd resume work to allow disks and other devices to >> resume in parallel. >> >> This moves the entirety of scsi_device resume to an async context to >> ensure that scsi_device_resume() remains ordered with respect to the >> completion of the start/stop command. For the duration of the resume, >> new command submissions (that do not originate from the scsi-core) will >> be deferred (BLKPREP_DEFER). >> >> It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container >> of these operations. Like scsi_sd_probe_domain it is flushed at >> sd_remove() time to ensure async ops do not continue past the >> end-of-life of the sdev. The implementation explicitly refrains from >> reusing scsi_sd_probe_domain directly for this purpose as it is flushed >> at the end of dpm_resume(), potentially defeating some of the benefit. >> Given sdevs are quiesced it is permissible for these resume operations >> to bleed past the async_synchronize_full() calls made by the driver >> core. >> >> We defer the resolution of which pm callback to call until >> scsi_dev_type_{suspend|resume} time and guarantee that the 'int >> (*cb)(struct device *)' parameter is never NULL. With this in place the >> type of resume operation is encoded in the async function identifier. >> >> Inspired by Todd's analysis and initial proposal [2]: >> https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach >> >> Cc: Len Brown >> Cc: Phillip Susi >> Cc: Alan Stern >> Suggested-by: Todd Brandt >> [djbw: kick all resume work to the async queue] >> Signed-off-by: Dan Williams >> --- > >> --- a/drivers/scsi/scsi_pm.c >> +++ b/drivers/scsi/scsi_pm.c >> @@ -18,17 +18,52 @@ >> >> #ifdef CONFIG_PM_SLEEP >> >> +#define do_pm_op(dev, op) \ >> + dev->driver ? dev->driver->pm ? \ >> + dev->driver->pm->op(dev) : 0 : 0 > > This will crash if dev->driver->pm->op is NULL. How about making it > easier to read, too? > > #define RETURN_PM_OP(dev, op) \ > if (dev->driver && dev->driver->pm && dev->driver->pm->op) \ > return dev->driver->pm->op(dev);\ > else\ > return 0 > > static int do_scsi_suspend(struct device *dev) > { > RETURM_PM_OP(dev, suspend); > } > > etc. > > Alternatively, you could put the "dev->driver && dev->driver->pm" part > of the test directly into scsi_dev_type_suspend and > scsi_dev_type_resume, to save a little code space. Then the original > macro formulation would become sufficiently simple: one test and one > function call. I like that better than hiding too much (especially bugs) in the macro. > >> +static int do_scsi_suspend(struct device *dev) >> +{ >> + return do_pm_op(dev, suspend); >> +} >> + >> +static int do_scsi_freeze(struct device *dev) >> +{ >> + return do_pm_op(dev, freeze); >> +} >> + >> +static int do_scsi_poweroff(struct device *dev) >> +{ >> + return do_pm_op(dev, poweroff); >> +} >> + >> +static int do_scsi_resume(struct device *dev) >> +{ >> + return do_pm_op(dev, resume); >> +} >> + >> +static int do_scsi_thaw(struct device *dev) >> +{ >> + return do_pm_op(dev, thaw); >> +} >> + >> +static int do_scsi_restore(struct device *dev) >> +{ >> + return do_pm_op(dev, restore); >> +} >> + >> static int scsi_dev_type_suspend(struct device *dev, int (*cb)(struct >> device *)) >> { >> int err; >> >> + /* flush pending in-flight resume operations, suspend is synchronous */ >> + async_synchronize_full_domain(&scsi_sd_pm_domain); >> + >> err = scsi_device_quiesce(to_scsi_device(dev)); >> if (err == 0) { >> - if (cb) { >> - err = cb(dev); >> - if (err) >> - scsi_device_resume(to_scsi_device(dev)); >> - } >> + err = cb(dev); >> + if (err) >> + scsi_device_resume(to_scsi_device(dev)); >> } >> dev_dbg(dev, "scsi suspend: %d\n", err); >>
Re: [PATCH v5 3/3] scsi: async sd resume
On Fri, 2014-03-07 at 10:42 -0500, Alan Stern wrote: > > #ifdef CONFIG_PM_SLEEP > > > > +#define do_pm_op(dev, op) \ > > + dev->driver ? dev->driver->pm ? \ > > + dev->driver->pm->op(dev) : 0 : 0 > > This will crash if dev->driver->pm->op is NULL. How about making it > easier to read, too? > > #define RETURN_PM_OP(dev, op) \ > if (dev->driver && dev->driver->pm && dev->driver->pm->op) \ > return dev->driver->pm->op(dev);\ > else\ > return 0 > > static int do_scsi_suspend(struct device *dev) > { > RETURM_PM_OP(dev, suspend); > } > > etc. > > Alternatively, you could put the "dev->driver && dev->driver->pm" part > of the test directly into scsi_dev_type_suspend and > scsi_dev_type_resume, to save a little code space. Then the original > macro formulation would become sufficiently simple: one test and one > function call. [..] > Given that this function is used in only one place; I would put it > inline. > Ended up putting the dev->driver->pm lookup into scsi_dev_type_{suspend| resume} directly, and open coding the pm->op is NULL check. Moved the lookup of the async function inline. 8< Subject: scsi: async sd resume From: Dan Williams async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi Cc: Alan Stern [alan: bug fix and clean up suggestion] Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/scsi.c |3 + drivers/scsi/scsi_pm.c | 119 +++--- drivers/scsi/scsi_priv.h |1 drivers/scsi/sd.c|1 4 files changed, 95 insertions(+), 29 deletions(-) diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index d8afec8317cf..990d4a302788 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -91,6 +91,9 @@ EXPORT_SYMBOL(scsi_logging_level); ASYNC_DOMAIN(scsi_sd_probe_domain); EXPORT_SYMBOL(scsi_sd_probe_domain); +ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain); +EXPORT_SYMBOL(scsi_sd_pm_domain); + /* NB: These are exposed through /proc/scsi/scsi and form part of the ABI. * You may not alter any existing entry (although adding new ones is * encouraged once assigned by ANSI/INCITS T10 diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c index 001e9ceda4c3..06fc787e7787 100644 --- a/drivers/scsi/scsi_pm.c +++ b/drivers/scsi/scsi_pm.c @@ -18,35 +18,77 @@ #ifdef CONFIG_PM_SLEEP -static int scsi_dev_type_suspend(struct device *dev, int (*cb)(struct device *)) +static int do_scsi_suspend(struct device *dev, const struct dev_pm_ops *pm) { + return pm && pm->suspend ? pm->suspend(dev) : 0; +} + +static int do_scsi_freeze(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->freeze ? pm->freeze(dev) : 0; +} + +static int do_scsi_poweroff(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->poweroff ? pm->poweroff(dev) : 0; +} + +static int do_scsi_resume(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->resume ? pm->resume(dev) : 0; +} + +static int do_scsi_thaw(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->thaw ? pm->thaw(dev) : 0; +} + +static int
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, Mar 10, 2014 at 1:43 PM, Tejun Heo wrote: > On Fri, Mar 07, 2014 at 06:52:06PM -0800, Dan Williams wrote: >> From: Dan Williams >> >> async_schedule() sd resume work to allow disks and other devices to >> resume in parallel. >> >> This moves the entirety of scsi_device resume to an async context to >> ensure that scsi_device_resume() remains ordered with respect to the >> completion of the start/stop command. For the duration of the resume, >> new command submissions (that do not originate from the scsi-core) will >> be deferred (BLKPREP_DEFER). >> >> It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container >> of these operations. Like scsi_sd_probe_domain it is flushed at >> sd_remove() time to ensure async ops do not continue past the >> end-of-life of the sdev. The implementation explicitly refrains from >> reusing scsi_sd_probe_domain directly for this purpose as it is flushed >> at the end of dpm_resume(), potentially defeating some of the benefit. >> Given sdevs are quiesced it is permissible for these resume operations >> to bleed past the async_synchronize_full() calls made by the driver >> core. >> >> We defer the resolution of which pm callback to call until >> scsi_dev_type_{suspend|resume} time and guarantee that the callback >> parameter is never NULL. With this in place the type of resume >> operation is encoded in the async function identifier. >> >> Inspired by Todd's analysis and initial proposal [2]: >> https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach > > The only thing which is a bit concerning is that this doesn't have any > throttling mechanism for simultaneous wakeups. Would this be able to > blow up the PSU if used on a machine with a lot of spindles? Good point. The primary benefit is completing userspace resume without needlessly waiting for the disk. For now I think it would be enough to have a mutex to maintain one disk at a time. We can follow on later with something more complex to enable a max simultaneous spin-up tunable. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, Mar 10, 2014 at 1:59 PM, Alan Stern wrote: > On Mon, 10 Mar 2014, Dan Williams wrote: > >> > The only thing which is a bit concerning is that this doesn't have any >> > throttling mechanism for simultaneous wakeups. Would this be able to >> > blow up the PSU if used on a machine with a lot of spindles? >> >> Good point. The primary benefit is completing userspace resume >> without needlessly waiting for the disk. For now I think it would be >> enough to have a mutex to maintain one disk at a time. We can follow >> on later with something more complex to enable a max simultaneous >> spin-up tunable. > > Why? The existing code doesn't have anything like that. > Why follow up later, or why maintain one disk at a time? I know at least the SCSI driver I maintained had this as a per-low-level driver tunable. Seems to be a good candidate for attempting a unified platform-level tunable. Mind you I would not be looking to implement it anytime soon. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, Mar 10, 2014 at 9:19 PM, James Bottomley wrote: > On Mon, 2014-03-10 at 16:59 -0400, Alan Stern wrote: >> On Mon, 10 Mar 2014, Dan Williams wrote: >> >> > > The only thing which is a bit concerning is that this doesn't have any >> > > throttling mechanism for simultaneous wakeups. Would this be able to >> > > blow up the PSU if used on a machine with a lot of spindles? >> > >> > Good point. The primary benefit is completing userspace resume >> > without needlessly waiting for the disk. For now I think it would be >> > enough to have a mutex to maintain one disk at a time. We can follow >> > on later with something more complex to enable a max simultaneous >> > spin-up tunable. >> >> Why? The existing code doesn't have anything like that. > > This only becomes a problem if enterprise class systems want to suspend > and resume. Last I heard, this wasn't on the cards. > > We also don't usually need to bother with in-kernel staggered spin ups > because if the device can blow the PSU by doing simultaneous spinups, > staggering is usually hard jumpered in the system. The main reason for > this is that there's no way for us to work out the PSU topology to do > the stagger in kernel, so the enterprise likes to be sure. > In that case shouldn't we limit it to a disk at a time? Enterprise doesn't care and home brew folks that both put a lot of disks in a system and suspend/resume them are less likely to burn themselves. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, Mar 10, 2014 at 10:16 PM, James Bottomley wrote: > On Mon, 2014-03-10 at 21:47 -0700, Dan Williams wrote: >> On Mon, Mar 10, 2014 at 9:19 PM, James Bottomley >> wrote: >> > On Mon, 2014-03-10 at 16:59 -0400, Alan Stern wrote: >> >> On Mon, 10 Mar 2014, Dan Williams wrote: >> >> >> >> > > The only thing which is a bit concerning is that this doesn't have any >> >> > > throttling mechanism for simultaneous wakeups. Would this be able to >> >> > > blow up the PSU if used on a machine with a lot of spindles? >> >> > >> >> > Good point. The primary benefit is completing userspace resume >> >> > without needlessly waiting for the disk. For now I think it would be >> >> > enough to have a mutex to maintain one disk at a time. We can follow >> >> > on later with something more complex to enable a max simultaneous >> >> > spin-up tunable. >> >> >> >> Why? The existing code doesn't have anything like that. >> > >> > This only becomes a problem if enterprise class systems want to suspend >> > and resume. Last I heard, this wasn't on the cards. >> > >> > We also don't usually need to bother with in-kernel staggered spin ups >> > because if the device can blow the PSU by doing simultaneous spinups, >> > staggering is usually hard jumpered in the system. The main reason for >> > this is that there's no way for us to work out the PSU topology to do >> > the stagger in kernel, so the enterprise likes to be sure. >> > >> >> In that case shouldn't we limit it to a disk at a time? Enterprise >> doesn't care and home brew folks that both put a lot of disks in a >> system and suspend/resume them are less likely to burn themselves. > > Where do you get that idea from? the enterprise is amazingly sensitive > to system start times; they'd demand my head on a plate if we started a > disk at a time and their start times went up by 10x. That's why we > leave it to hard wiring ... they can start in batches the maximum number > allowable and thus get the shortest start up times. Just considering the dpm_resume path, not device discovery... > In the long game, though this whole debate is moot: setups with hard > wired start times adhere to them regardless of what the system does, so > they ignore start unit commands. Systems without hard wired start times > can usually be started at once, so us introducing a delay is unnecessary > in either case. Ok, then I'll let the patch stand as is. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, Mar 10, 2014 at 11:29 PM, James Bottomley wrote: >> > In the long game, though this whole debate is moot: setups with hard >> > wired start times adhere to them regardless of what the system does, so >> > they ignore start unit commands. Systems without hard wired start times >> > can usually be started at once, so us introducing a delay is unnecessary >> > in either case. >> >> Ok, then I'll let the patch stand as is. > > Sounds good. > > James > > Well, one more chirp about this. If the user has disabled async scanning by CONFIG_SCSI_SCAN_ASYNC=n or scsi_mod.scan != "async" then resume should follow suit. I'll include this in the next rev. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 0/3] Accelerate Storage Resume (2x or more)
Changes since v5: * Per Tejun, fixed up the naming of the ata_port helper routines * Cleaned up the return value of ata_port_pm helper routines to be 'void' * Per Tejun, added a check of whether async scanning is disabled to gate whether async resume will be performed out of caution for systems that do not handle staggered spin-up outside of the kernel. Clarified this in the changelog and added commentary in the code. === It is not everyday that a kernel operation gets > 2x faster! Great find by Todd and his AnalyzeSuspend tool [1]. Todd is on vacation so I am taking care of these patches to make sure they get in the queue for 3.15. The significant changes from Todd's last submission [2]: 1/ Split out the pure cleanup into its own patch, and expand it to clean up libsas as well 2/ Move the entirety of scsi_device resume to an async context. As written, v4 of the patchset overlapped the scsi-start-stop command with the scsi_device_resume(). Now in v5 the queue restart is properly ordered with the completion of the scsi-start-stop command. Resume completion time as reported by: "echo devices > /sys/power/pm_test && echo mem > /sys/power/state" BEFORE: PM: resume of devices complete after 2271.097 msecs AFTER: PM: resume of devices complete after 1057.404 msecs On a system with two SSDs attached via AHCI. [1]: Todd's testing showing up to 12x resume latency improvement in some cases: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach [2]: Todd's v4 patchset: [PATCH v4 0/2] http://marc.info/?l=linux-ide&m=138984698103487&w=2 [PATCH v4 1/2] http://marc.info/?l=linux-ide&m=138984713203515&w=2 [PATCH v4 2/2] http://marc.info/?l=linux-ide&m=138984722003539&w=2 --- Dan Williams (2): libata, libsas: kill pm_result and related cleanup scsi: async sd resume Todd Brandt (1): libata: async resume drivers/ata/libata-core.c | 135 ++--- drivers/ata/libata-eh.c | 13 +--- drivers/scsi/Kconfig |3 + drivers/scsi/libsas/sas_ata.c | 35 ++- drivers/scsi/scsi.c |3 + drivers/scsi/scsi_pm.c| 128 ++- drivers/scsi/scsi_priv.h |2 + drivers/scsi/scsi_scan.c |2 - drivers/scsi/sd.c |1 include/linux/libata.h| 11 +-- include/scsi/libsas.h |1 11 files changed, 180 insertions(+), 154 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 2/3] libata: async resume
From: Todd Brandt Improve overall system resume time by making libata link recovery actions asynchronous relative to other resume events. Link resume operations are performed using the scsi_eh thread, so commands, particularly the sd resume start/stop command, will be held off until the device exits error handling. Libata already flushes eh with ata_port_wait_eh() in the port teardown paths, so there are no concerns with async operation colliding with the end-of-life of the ata_port object. Also, libata-core is already careful to flush in-flight pm operations before another round of pm starts on the given ata_port. Reference: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi Cc: Alan Stern Signed-off-by: Todd Brandt [djbw: rebase on cleanup patch, changelog wordsmithing] Signed-off-by: Dan Williams --- drivers/ata/libata-core.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 66110ed2c1c0..c37eb02c7136 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5450,7 +5450,7 @@ static void ata_port_resume_async(struct ata_port *ap, pm_message_t mesg) static int ata_port_pm_resume(struct device *dev) { - ata_port_resume(to_ata_port(dev), PMSG_RESUME); + ata_port_resume_async(to_ata_port(dev), PMSG_RESUME); pm_runtime_disable(dev); pm_runtime_set_active(dev); pm_runtime_enable(dev); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 3/3] scsi: async sd resume
async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. There is a concern that async resume could trigger PSU overload. In the enterprise, storage enclosures enforce staggered spin-up regardless of what the kernel does making async scanning safe by default. Outside of that context a user can disable asynchronous scanning via a kernel command line or CONFIG_SCSI_SCAN_ASYNC. Honor that setting when deciding whether to do resume asynchronously. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi [alan: bug fix and clean up suggestion] Acked-by: Alan Stern Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/Kconfig |3 + drivers/scsi/scsi.c |3 + drivers/scsi/scsi_pm.c | 128 -- drivers/scsi/scsi_priv.h |2 + drivers/scsi/scsi_scan.c |2 - drivers/scsi/sd.c|1 6 files changed, 109 insertions(+), 30 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index c8bd092fc945..02832d64d918 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -263,6 +263,9 @@ config SCSI_SCAN_ASYNC You can override this choice by specifying "scsi_mod.scan=sync" or async on the kernel's command line. + Note that this setting also affects whether resuming from + system suspend will be performed asynchronously. + menu "SCSI Transports" depends on SCSI diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index d8afec8317cf..990d4a302788 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -91,6 +91,9 @@ EXPORT_SYMBOL(scsi_logging_level); ASYNC_DOMAIN(scsi_sd_probe_domain); EXPORT_SYMBOL(scsi_sd_probe_domain); +ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain); +EXPORT_SYMBOL(scsi_sd_pm_domain); + /* NB: These are exposed through /proc/scsi/scsi and form part of the ABI. * You may not alter any existing entry (although adding new ones is * encouraged once assigned by ANSI/INCITS T10 diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c index 001e9ceda4c3..7454498c4091 100644 --- a/drivers/scsi/scsi_pm.c +++ b/drivers/scsi/scsi_pm.c @@ -18,35 +18,77 @@ #ifdef CONFIG_PM_SLEEP -static int scsi_dev_type_suspend(struct device *dev, int (*cb)(struct device *)) +static int do_scsi_suspend(struct device *dev, const struct dev_pm_ops *pm) { + return pm && pm->suspend ? pm->suspend(dev) : 0; +} + +static int do_scsi_freeze(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->freeze ? pm->freeze(dev) : 0; +} + +static int do_scsi_poweroff(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->poweroff ? pm->poweroff(dev) : 0; +} + +static int do_scsi_resume(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->resume ? pm->resume(dev) : 0; +} + +static int do_scsi_thaw(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->thaw ? pm->thaw(dev) : 0; +} + +static int do_scsi_restore(struct device *dev, const struct dev_pm_ops *pm) +{ + return pm && pm->restore ? pm->restore(dev) : 0; +} + +static int scsi_dev_type_suspend(struct device *dev, + int (*cb)(struct device *, const struct dev_pm_ops *)) +{ + const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; int err; + /* flush pending in-flight resume operations, suspend is synchronous */ + async_synchronize_full_domain(&scsi_sd_pm_domain); + err = scsi_device_quiesce(to_scsi_device(
[PATCH v6 1/3] libata, libsas: kill pm_result and related cleanup
Tejun says: "At least for libata, worrying about suspend/resume failures don't make whole lot of sense. If suspend failed, just proceed with suspend. If the device can't be woken up afterwards, that's that. There isn't anything we could have done differently anyway. The same for resume, if spinup fails, the device is dud and the following commands will invoke EH actions and will eventually fail. Again, there really isn't any *choice* to make. Just making sure the errors are handled gracefully (ie. don't crash) and the following commands are handled correctly should be enough." The only libata user that actually cares about the result from a suspend operation is libsas. However, it only cares about whether queuing a new operation collides with an in-flight one. All libsas does with the error is retry, but we can just let libata wait for the previous operation before continuing. Other cleanups include: 1/ Unifying all ata port pm operations on an ata_port_pm_ prefix 2/ Marking all ata port pm helper routines as returning void, only ata_port_pm_ entry points need to fake a 0 return value. 3/ Killing ata_port_{suspend|resume}_common() in favor of calling ata_port_request_pm() directly 4/ Killing the wrappers that just do a to_ata_port() conversion 5/ Clearly marking the entry points that do async operations with an _async suffix. Reference: http://marc.info/?l=linux-scsi&m=138995409532286&w=2 Cc: Phillip Susi Cc: Alan Stern Suggested-by: Tejun Heo Signed-off-by: Todd Brandt Signed-off-by: Dan Williams --- drivers/ata/libata-core.c | 135 ++--- drivers/ata/libata-eh.c | 13 +--- drivers/scsi/libsas/sas_ata.c | 35 ++- include/linux/libata.h| 11 +-- include/scsi/libsas.h |1 5 files changed, 71 insertions(+), 124 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 1a3dbd1b196e..66110ed2c1c0 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5351,22 +5351,17 @@ bool ata_link_offline(struct ata_link *link) } #ifdef CONFIG_PM -static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, - unsigned int action, unsigned int ehi_flags, - int *async) +static void ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, + unsigned int action, unsigned int ehi_flags, + bool async) { struct ata_link *link; unsigned long flags; - int rc = 0; /* Previous resume operation might still be in * progress. Wait for PM_PENDING to clear. */ if (ap->pflags & ATA_PFLAG_PM_PENDING) { - if (async) { - *async = -EAGAIN; - return 0; - } ata_port_wait_eh(ap); WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); } @@ -5375,11 +5370,6 @@ static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, spin_lock_irqsave(ap->lock, flags); ap->pm_mesg = mesg; - if (async) - ap->pm_result = async; - else - ap->pm_result = &rc; - ap->pflags |= ATA_PFLAG_PM_PENDING; ata_for_each_link(link, ap, HOST_FIRST) { link->eh_info.action |= action; @@ -5390,87 +5380,81 @@ static int ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, spin_unlock_irqrestore(ap->lock, flags); - /* wait and check result */ if (!async) { ata_port_wait_eh(ap); WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); } - - return rc; } -static int __ata_port_suspend_common(struct ata_port *ap, pm_message_t mesg, int *async) +/* + * On some hardware, device fails to respond after spun down for suspend. As + * the device won't be used before being resumed, we don't need to touch the + * device. Ask EH to skip the usual stuff and proceed directly to suspend. + * + * http://thread.gmane.org/gmane.linux.ide/46764 + */ +static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET +| ATA_EHI_NO_AUTOPSY +| ATA_EHI_NO_RECOVERY; + +static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg) { - /* -* On some hardware, device fails to respond after spun down -* for suspend. As the device won't be used before being -* resumed, we don't need to touch the device. Ask EH to skip -* the usual stuff and proceed directly to suspend. -* -* http://thread.gmane.org/gmane.linux.ide/46764 -*/ - unsigned int ehi_flags = ATA_EHI_QUIET | ATA_EHI_NO_AUTOPSY | -
Re: [PATCH v5 3/3] scsi: async sd resume
On Sat, Mar 15, 2014 at 2:47 PM, James Bottomley wrote: > On Fri, 2014-03-14 at 13:11 -0700, Dan Williams wrote: >> On Mon, Mar 10, 2014 at 11:29 PM, James Bottomley >> wrote: >> >> > In the long game, though this whole debate is moot: setups with hard >> >> > wired start times adhere to them regardless of what the system does, so >> >> > they ignore start unit commands. Systems without hard wired start times >> >> > can usually be started at once, so us introducing a delay is unnecessary >> >> > in either case. >> >> >> >> Ok, then I'll let the patch stand as is. >> > >> > Sounds good. >> > >> > James >> > >> > >> >> Well, one more chirp about this. If the user has disabled async >> scanning by CONFIG_SCSI_SCAN_ASYNC=n or scsi_mod.scan != "async" then >> resume should follow suit. I'll include this in the next rev. > > Hm, OK, if this is tied at the hip to async scanning, why do you need > another async domain for it? Why not just use the current async > scanning domain ... it will actually probably resolve a few nasty (but > wholly manufactured) races where scanning races with suspend. I considered that initially, but it ends up destroying most/all of the benefit of doing it asynchronously. This is due to the fact that scsi_sd_probe_domain is flushed by the async_synchronize_full() performed in dpm_resume(). We want userspace to resume while the disk may still be starting. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Sun, 2014-03-16 at 11:21 -0700, James Bottomley wrote: > On Sat, 2014-03-15 at 16:35 -0700, Dan Williams wrote: > > On Sat, Mar 15, 2014 at 2:47 PM, James Bottomley > > wrote: > > > On Fri, 2014-03-14 at 13:11 -0700, Dan Williams wrote: > > >> On Mon, Mar 10, 2014 at 11:29 PM, James Bottomley > > >> wrote: > > >> >> > In the long game, though this whole debate is moot: setups with hard > > >> >> > wired start times adhere to them regardless of what the system > > >> >> > does, so > > >> >> > they ignore start unit commands. Systems without hard wired start > > >> >> > times > > >> >> > can usually be started at once, so us introducing a delay is > > >> >> > unnecessary > > >> >> > in either case. > > >> >> > > >> >> Ok, then I'll let the patch stand as is. > > >> > > > >> > Sounds good. > > >> > > > >> > James > > >> > > > >> > > > >> > > >> Well, one more chirp about this. If the user has disabled async > > >> scanning by CONFIG_SCSI_SCAN_ASYNC=n or scsi_mod.scan != "async" then > > >> resume should follow suit. I'll include this in the next rev. > > > > > > Hm, OK, if this is tied at the hip to async scanning, why do you need > > > another async domain for it? Why not just use the current async > > > scanning domain ... it will actually probably resolve a few nasty (but > > > wholly manufactured) races where scanning races with suspend. > > > > I considered that initially, but it ends up destroying most/all of the > > benefit of doing it asynchronously. This is due to the fact that > > scsi_sd_probe_domain is flushed by the async_synchronize_full() > > performed in dpm_resume(). We want userspace to resume while the disk > > may still be starting. > > OK, finally got it, the new domain doesn't participate in > async_synchronize_full() but scsi_sd_probe_domain does (and has to > because of the device and module operations). That might actually be > worth a comment somewhere. Fair enough, see the comment added to the declaration of scsi_sd_pm_domain below. BTW, these patches are independent. Patch 1 and 2 can go through Tejun/libata and this one can go through you/scsi by itself. 8<-- Subject: scsi: async sd resume From: Dan Williams async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. There is a concern that async resume could trigger PSU overload. In the enterprise, storage enclosures enforce staggered spin-up regardless of what the kernel does making async scanning safe by default. Outside of that context a user can disable asynchronous scanning via a kernel command line or CONFIG_SCSI_SCAN_ASYNC. Honor that setting when deciding whether to do resume asynchronously. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi [alan: bug fix and clean up suggestion] Acked-by: Alan Stern Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/Kconfig |3 + drivers/scsi/scsi.c |9 +++ drivers/scsi/scsi_pm.c | 128 -- drivers/scsi/scsi_priv.h |2 + drivers/scsi/scsi_scan.c |2 - drivers/scsi/sd.c|1 6 files changed, 115 insertions(+), 30 deletions(-) diff --git
Re: [PATCH v6 1/3] libata, libsas: kill pm_result and related cleanup
On Tue, Mar 18, 2014 at 9:03 AM, Tejun Heo wrote: > On Fri, Mar 14, 2014 at 01:52:48PM -0700, Dan Williams wrote: >> Tejun says: >> "At least for libata, worrying about suspend/resume failures don't make >>whole lot of sense. If suspend failed, just proceed with suspend. If >>the device can't be woken up afterwards, that's that. There isn't >>anything we could have done differently anyway. The same for resume, if >>spinup fails, the device is dud and the following commands will invoke >>EH actions and will eventually fail. Again, there really isn't any >>*choice* to make. Just making sure the errors are handled gracefully >>(ie. don't crash) and the following commands are handled correctly >>should be enough." >> >> The only libata user that actually cares about the result from a suspend >> operation is libsas. However, it only cares about whether queuing a new >> operation collides with an in-flight one. All libsas does with the >> error is retry, but we can just let libata wait for the previous >> operation before continuing. >> >> Other cleanups include: >> 1/ Unifying all ata port pm operations on an ata_port_pm_ prefix >> 2/ Marking all ata port pm helper routines as returning void, only >>ata_port_pm_ entry points need to fake a 0 return value. >> 3/ Killing ata_port_{suspend|resume}_common() in favor of calling >>ata_port_request_pm() directly >> 4/ Killing the wrappers that just do a to_ata_port() conversion >> 5/ Clearly marking the entry points that do async operations with an >> _async suffix. >> >> Reference: http://marc.info/?l=linux-scsi&m=138995409532286&w=2 >> >> Cc: Phillip Susi >> Cc: Alan Stern >> Suggested-by: Tejun Heo >> Signed-off-by: Todd Brandt >> Signed-off-by: Dan Williams > > Can somebody ack the sas part? > Hmm I'd just as soon self-ack at this point. I'm cleaning up code I wrote a couple years ago, and libsas bugs at this point are self-inflicted. Top-5 in drivers/scsi/libsas by changeset in the past 5 years: 75 Dan Williams 12 James Bottomley 9 Linus Torvalds 7 Tejun Heo 4 Sergei Shtylyov -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -next] libata.h: add stub for ata_sas_port_resume
On Wed, 2014-03-19 at 13:58 -0400, Tejun Heo wrote: > On Wed, Mar 19, 2014 at 10:46:25AM -0700, Randy Dunlap wrote: > > From: Randy Dunlap > > > > Fix build error when CONFIG_PM is not enabled by adding a stub > > function in . > > > > drivers/scsi/libsas/sas_ata.c: In function 'sas_resume_sata': > > drivers/scsi/libsas/sas_ata.c:756:3: error: implicit declaration of > > function 'ata_sas_port_resume' [-Werror=implicit-function-declaration] > > > > Signed-off-by: Randy Dunlap > > Reported-by: Jim Davis > > --- > > include/linux/libata.h |3 +++ > > 1 file changed, 3 insertions(+) > > > > --- linux-next-20140319.orig/include/linux/libata.h > > +++ linux-next-20140319/include/linux/libata.h > > @@ -1148,6 +1148,9 @@ static inline void ata_sas_port_suspend( > > static inline void ata_sas_port_async_resume(struct ata_port *ap) > > { > > } > > +static inline void ata_sas_port_resume(struct ata_port *ap) > > +{ > > +} > > #endif > > extern int ata_ratelimit(void); > > extern void ata_msleep(struct ata_port *ap, unsigned int msecs); > > Applied to libata/for-3.15. > > Thanks. Agh, my mistake... one more incremental fixlet: 8<-- Subject: libata: remove unused ata_sas_port_async_resume() stub From: Dan Williams Commit bc6e7c4b0d1a "libata, libsas: kill pm_result and related cleanup" renamed ata_sas_port_async_resume() to ata_sas_port_resume(), but missed a CONFIG_PM=n stub conversion. Randy fixed that up in commit a5a6569959fc "libata.h: add stub for ata_sas_port_resume", but missed the deletion of the now unused ata_sas_port_async_resume() routine. Cc: Randy Dunlap Signed-off-by: Dan Williams --- include/linux/libata.h |3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/include/linux/libata.h b/include/linux/libata.h index 52723789b991..1de36be64df4 100644 --- a/include/linux/libata.h +++ b/include/linux/libata.h @@ -1145,9 +1145,6 @@ extern void ata_sas_port_resume(struct ata_port *ap); static inline void ata_sas_port_suspend(struct ata_port *ap) { } -static inline void ata_sas_port_async_resume(struct ata_port *ap) -{ -} static inline void ata_sas_port_resume(struct ata_port *ap) { } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deadlock in usb-storage error handling
On Wed, Mar 19, 2014 at 1:31 PM, Alan Stern wrote: > On Wed, 19 Mar 2014, Andreas Reis wrote: > >> I've uploaded a dmesg with the new debugging patch to bugzilla: >> https://bugzilla.kernel.org/attachment.cgi?id=130041 > > Thanks. I have now managed to reproduce many of the features of this > problem on my own computer. > > James, I will need your help (or help from somebody who understands the > SCSI error handler) to figure out how this problem should be fixed. > > Basically, usb-storage deadlocks when the SCSI error handler invokes > the eh_device_reset_handler callback while a command is running. The > command has timed out and will never complete normally, because the > device's firmware has crashed. But usb-storage's device-reset routine > waits for the current command to finish, which brings everything to a > standstill. > > Is this design wrong? That is, should the device-reset routine wait > for currently executing commands to finish, or should it abort them, or > what? > > Or should the SCSI error handler abort the running command before > invoking the eh_device_reset_handler callback? > > For the record, and in case anyone is curious, here's the detailed > sequence of events during my test: > > sd issues a READ(10) command. For whatever reason, the device > goes nuts and the command times out. > > scsi_times_out() calls scsi_abort_command(), which queues an > abort request. > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > succeeds in aborting the READ. > > The READ command is retried (I didn't trace through the details > of this). The retry fails with a Unit Attention (SK=6, > ASC=0x29, Reset or Bus Device Reset Occurred). > > The READ command is retried a second time, and it times out > again. > > This time around, scsi_times_out() calls scsi_abort_command() > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > still set). > > As a result, scsi_error_handler() calls scsi_unjam_host(), > which calls scsi_eh_get_sense(). > > That routine calls scsi_request_sense(), which goes into > scsi_send_eh_cmnd(). > > The calls to shost->hostt->queuecommand() all fail, because the > READ command is still running and usb-storage has a queue > depth of 1. The error messages produced by these failures are > disconcerting but not dangerous. > > Since the REQUEST SENSE command was never issued, > scsi_eh_get_sense() returns 0. > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > for the only command on work_q is clear. > scsi_eh_test_devices() returns 0 because check_list is empty > and work_q isn't. > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > routine ends up calling scsi_eh_bus_device_reset(), at which > point usb-storage deadlocks as described above. > > (On Andreas's system, the first READ retry times out as opposed to the > second retry as on my computer. I doubt this makes any difference.) > > I can't tell if this is all working as intended or if it went off the > tracks somewhere. > > Thanks for any guidance. > It seems to me we need an ->eh_strategy_handler() that understands the usb transport and will escalate to device reset around the time scsi_abort_command() fails. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 3/3] scsi: async sd resume
On Mon, 2014-03-17 at 11:20 -0700, Dan Williams wrote: > On Sun, 2014-03-16 at 11:21 -0700, James Bottomley wrote: > > On Sat, 2014-03-15 at 16:35 -0700, Dan Williams wrote: > > > On Sat, Mar 15, 2014 at 2:47 PM, James Bottomley > > > > Hm, OK, if this is tied at the hip to async scanning, why do you need > > > > another async domain for it? Why not just use the current async > > > > scanning domain ... it will actually probably resolve a few nasty (but > > > > wholly manufactured) races where scanning races with suspend. > > > > > > I considered that initially, but it ends up destroying most/all of the > > > benefit of doing it asynchronously. This is due to the fact that > > > scsi_sd_probe_domain is flushed by the async_synchronize_full() > > > performed in dpm_resume(). We want userspace to resume while the disk > > > may still be starting. > > > > OK, finally got it, the new domain doesn't participate in > > async_synchronize_full() but scsi_sd_probe_domain does (and has to > > because of the device and module operations). That might actually be > > worth a comment somewhere. > > Fair enough, see the comment added to the declaration of > scsi_sd_pm_domain below. > > BTW, these patches are independent. Patch 1 and 2 can go through > Tejun/libata and this one can go through you/scsi by itself. > ...and Tejun has now applied the ata bits for 3.15. James, when you get a chance, please apply this latest version of patch 3 or otherwise let me know your disposition. Regards, Dan (no changes since the version I sent on the 17th) 8<-- scsi: async sd resume From: Dan Williams async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. There is a concern that async resume could trigger PSU overload. In the enterprise, storage enclosures enforce staggered spin-up regardless of what the kernel does making async scanning safe by default. Outside of that context a user can disable asynchronous scanning via a kernel command line or CONFIG_SCSI_SCAN_ASYNC. Honor that setting when deciding whether to do resume asynchronously. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi [alan: bug fix and clean up suggestion] Acked-by: Alan Stern Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/Kconfig |3 + drivers/scsi/scsi.c |9 +++ drivers/scsi/scsi_pm.c | 128 -- drivers/scsi/scsi_priv.h |2 + drivers/scsi/scsi_scan.c |2 - drivers/scsi/sd.c|1 6 files changed, 115 insertions(+), 30 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index c8bd092fc945..02832d64d918 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -263,6 +263,9 @@ config SCSI_SCAN_ASYNC You can override this choice by specifying "scsi_mod.scan=sync" or async on the kernel's command line. + Note that this setting also affects whether resuming from + system suspend will be performed asynchronously. + menu "SCSI Transports" depends on SCSI diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index d8afec8317cf..1b345bf41a91 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -91,6 +91,15 @@ EXPORT_SYMBOL(scsi_logging_level); ASYNC_DOMAIN(scsi_sd_probe_domain); EXPORT_SYMBOL(scsi_sd_probe_domain); +/* + * Separate domain (from scsi_sd_probe_domain) to maximize the benefit of + * asynchronous system resume operations. It i
[GIT PULL] async scsi resume for 3.15
Hi Linus, James might still be in the process of sending this your way. However, given the proximity to -rc1, my reasoning for sending this directly is: 1/ It provides a tangible speed up for a non-esoteric use case (laptop resume): https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach 2/ You already pulled the first half of this enabling from Tejun. Quoting Tejun's ATA pull request: "Dan finishes the patchset to make libata PM operations asynchronous. Combined with one patch being routed through scsi, this should speed resume measurably." 3/ As far as I can tell it is acceptable to James: http://marc.info/?l=linux-scsi&m=139499409510791&w=2 http://marc.info/?l=linux-scsi&m=139508044602605&w=2 http://marc.info/?l=linux-scsi&m=139536062515216&w=2 4/ I promised Todd I would get it upstream before he returns from vacation. Please pull, thank you. -- Dan The following changes since commit 455c6fdbd219161bd09b1165f11699d6d73de11c: Linux 3.14 (2014-03-30 20:40:15 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci async-scsi-resume for you to fetch changes up to 3c31b52f96f7b559d950b16113c0f68c72a1985e: scsi: async sd resume (2014-04-10 15:30:35 -0700) -------- Dan Williams (1): scsi: async sd resume drivers/scsi/Kconfig | 3 ++ drivers/scsi/scsi.c | 9 drivers/scsi/scsi_pm.c | 128 --- drivers/scsi/scsi_priv.h | 2 + drivers/scsi/scsi_scan.c | 2 +- drivers/scsi/sd.c| 1 + 6 files changed, 115 insertions(+), 30 deletions(-) 8<- >From 3c31b52f96f7b559d950b16113c0f68c72a1985e Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Thu, 10 Apr 2014 15:30:35 -0700 Subject: [PATCH] scsi: async sd resume async_schedule() sd resume work to allow disks and other devices to resume in parallel. This moves the entirety of scsi_device resume to an async context to ensure that scsi_device_resume() remains ordered with respect to the completion of the start/stop command. For the duration of the resume, new command submissions (that do not originate from the scsi-core) will be deferred (BLKPREP_DEFER). It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container of these operations. Like scsi_sd_probe_domain it is flushed at sd_remove() time to ensure async ops do not continue past the end-of-life of the sdev. The implementation explicitly refrains from reusing scsi_sd_probe_domain directly for this purpose as it is flushed at the end of dpm_resume(), potentially defeating some of the benefit. Given sdevs are quiesced it is permissible for these resume operations to bleed past the async_synchronize_full() calls made by the driver core. We defer the resolution of which pm callback to call until scsi_dev_type_{suspend|resume} time and guarantee that the callback parameter is never NULL. With this in place the type of resume operation is encoded in the async function identifier. There is a concern that async resume could trigger PSU overload. In the enterprise, storage enclosures enforce staggered spin-up regardless of what the kernel does making async scanning safe by default. Outside of that context a user can disable asynchronous scanning via a kernel command line or CONFIG_SCSI_SCAN_ASYNC. Honor that setting when deciding whether to do resume asynchronously. Inspired by Todd's analysis and initial proposal [2]: https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach Cc: Len Brown Cc: Phillip Susi [alan: bug fix and clean up suggestion] Acked-by: Alan Stern Suggested-by: Todd Brandt [djbw: kick all resume work to the async queue] Signed-off-by: Dan Williams --- drivers/scsi/Kconfig | 3 ++ drivers/scsi/scsi.c | 9 drivers/scsi/scsi_pm.c | 128 --- drivers/scsi/scsi_priv.h | 2 + drivers/scsi/scsi_scan.c | 2 +- drivers/scsi/sd.c| 1 + 6 files changed, 115 insertions(+), 30 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index c8bd092..02832d6 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -263,6 +263,9 @@ config SCSI_SCAN_ASYNC You can override this choice by specifying "scsi_mod.scan=sync" or async on the kernel's command line. + Note that this setting also affects whether resuming from + system suspend will be performed asynchronously. + menu "SCSI Transports" depends on SCSI diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index d8afec8..1b345bf 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -91,6 +91,15 @@ EXPORT_SYMBOL(scsi_logging_level); ASYNC_DOMAIN(scsi_sd_probe_domain); EXPORT_SYMBOL(scsi_sd_probe_dom
Re: [GIT PULL] async scsi resume for 3.15
On Fri, Apr 11, 2014 at 11:20 AM, James Bottomley wrote: > On Thu, 2014-04-10 at 18:24 -0700, Dan Williams wrote: >> Hi Linus, >> >> James might still be in the process of sending this your way. However, >> given the proximity to -rc1, my reasoning for sending this directly is: >> >> 1/ It provides a tangible speed up for a non-esoteric use case (laptop >> resume): >> >> >> https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach >> >> 2/ You already pulled the first half of this enabling from Tejun. >> Quoting Tejun's ATA pull request: >> >> "Dan finishes the patchset to make libata PM operations >>asynchronous. Combined with one patch being routed through scsi, >>this should speed resume measurably." >> >> 3/ As far as I can tell it is acceptable to James: >> >> http://marc.info/?l=linux-scsi&m=139499409510791&w=2 >> http://marc.info/?l=linux-scsi&m=139508044602605&w=2 >> http://marc.info/?l=linux-scsi&m=139536062515216&w=2 >> >> 4/ I promised Todd I would get it upstream before he returns from >> vacation. >> >> Please pull, thank you. > > OK, so this missed the SCSI cutoff window because the series wasn't > finalised until 17 March and, originally, I thought Linus would cut us > off at -rc7 which was 16 March. In reality we got an extra week, but I > spent that ironing out a few late arriving bugs in the tree itself > rather than integrating features. > > For drivers, I'm willing to buck the release protocols a bit because > they're self contained, but for core features, they are supposed to be > all finalised by -rc5. Plus Linus bites me every time I do it even for > drivers, so I was actually letting the scars heal a bit. > > I also don't see this in linux-next, unless I'm not looking in the right > place, so it would be a bit of a risk adding it just before -rc1. > [forgive the resend... google apps converted the last one to html] Correct, I did not push to -next as not to collide with an appearance in scsi-misc. However, it is on korg, so the 0-day kbuild robot has seen it. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please help me to review the patch about supporting SATA PM in LIBSAS
On Wed, Apr 16, 2014 at 8:07 PM, xiangliang yu wrote: > hi, > > The patch is support SATA PM device and can find all devices that is > attached to PM. > Until now, i have tested the identified, hot-plug and IO function and > result is ok except one mvsas timeout issue. > i'll continue to debug mvsas issue, but i don't know whether the > libsas code of the patch is ok. > So, please help me to review the patch if you feel free. thanks! > the patch is as below: Can you help me with a bit more description on the design of the patch? Details like why you decided to implement pieces the way you did, and comments on the difficulties / trade-offs you ran into would really be helpful for reviewing. Not just for me, but for anyone else on the mailing list who might be able to offer some review cycles if they had a small amount of background on the libsas/libata details. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please help me to review the patch about supporting SATA PM in LIBSAS
On Thu, Apr 17, 2014 at 9:03 AM, Dan Williams wrote: > On Wed, Apr 16, 2014 at 8:07 PM, xiangliang yu wrote: >> hi, >> >> The patch is support SATA PM device and can find all devices that is >> attached to PM. >> Until now, i have tested the identified, hot-plug and IO function and >> result is ok except one mvsas timeout issue. >> i'll continue to debug mvsas issue, but i don't know whether the >> libsas code of the patch is ok. >> So, please help me to review the patch if you feel free. thanks! >> the patch is as below: > > Can you help me with a bit more description on the design of the > patch? Details like why you decided to implement pieces the way you > did, and comments on the difficulties / trade-offs you ran into would > really be helpful for reviewing. Not just for me, but for anyone else > on the mailing list who might be able to offer some review cycles if > they had a small amount of background on the libsas/libata details. > ...also, please resend as an attachment so that the whitespace remains in tact. Thanks. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.2.57 regression: isci driver broken: Unable to reset I T nexus?
[ adding Ben ] On Mon, Apr 28, 2014 at 10:22 AM, Ondrej Zary wrote: > On Monday 28 April 2014 18:51:44 Jiang, Dave wrote: >> On Mon, 2014-04-28 at 16:28 +, Ondrej Zary wrote: >> > On Monday 28 April 2014 17:50:29 Jiang, Dave wrote: >> > > On Mon, 2014-04-28 at 13:03 +0200, Ondrej Zary wrote: >> > > > Hello, >> > > > just upgraded a server running 3.2.54-2 to 3.2.57-3 (Debian Wheezy) >> > > > and it does not boot anymore because of isci driver breakage. >> > > >> > > I would not run anything less than 3.8 for the isci controller. 3.2 is >> > > VERY old for that particular driver and likely very unstable. The >> > > product version of that driver plus libsas started with 3.8. Also I'm >> > > concerned that you aren't using the platform OEM parameters. You need >> > > to turn your OROM or EFI driver on for the SAS controller. >> > >> > It's a Cisco UCS C22 M3 server with a crappy LSI fakeraid that cannot >> > even be disabled. It was a pain to make it boot properly - had to use >> > dmraid. But it has been working fine since then (2012). Until now. >> >> Yes but just because it has been working doesn't mean it is a good idea >> to run unstable code You need the driver updates and the libsas >> updates for it to function properly. Does this fail on 3.14? If it is >> that patch I have a feeling it may be interacting badly with whatever is >> was in 3.2 libsas that may not be a problem with latest kernels It >> is odd to see all those hard resets however Did you have them when >> it was working for you? > > Didn't know that it was unstable - it worked with no problems, better than > some products marked as stable :) > 3.13 works fine - I've installed it from wheezy-backports to work-around the > bug. > > The log from working 3.2.54 is below (at the end) - there's one reset for each > port. > I think the right answer for 3.2 is to drop commit 584ec1226519 "isci: fix reset timeout handling". libsas and its libata interaction went through significant overhaul after 3.2 so it's not surprising that a change to reset handling regresses like this. Ideally there would be a backport of latest libsas available for 3.2, but no one to my knowledge is working on that. -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] libsas: add support for port multiplier
On Fri, May 2, 2014 at 8:36 AM, Tejun Heo wrote: > On Thu, Apr 24, 2014 at 09:27:03PM +0800, Xiangliang Yu wrote: >> This patch set will support SATA port multiplier(PM) in LIBSAS. >> >> LIBSAS is need to implement several key handling to support SATA PM: >> First,low level driver notify libsas that SATA PM is attached to HBA port. >> Then, LIBSAS will need to schedule SATA PMP error handler to scan SATA device >> that is attached to PM, so we must implemet port softreset and PMP reset >> handling. >> If find SATA devices, we should add the devices into SAS transport layer. >> In order to find SAS domain device we must implemet device type >> transformation >> : scsi device -> (port number) -> ata device -> domain device. Then, we can >> execute IO operation. >> >> In these patches, patch 1 is change SATA error handler. >> And patch 2 will add SATA device into SAS layer and implement device type >> transformation. Patch 3 implement SATA functional interface in mvsas. > > Please properly thread the patchset from the next posting. > > Dan, can you please take a look at this? > Yeah, it's on my plate to look at... -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND] scsi_transport_sas: move bsg destructor into sas_rphy_remove
On Fri, May 2, 2014 at 12:01 PM, Joe Lawrence wrote: > On Wed, 2 Apr 2014 16:48:52 -0400 > Joe Lawrence wrote: > >> On Wed, 2 Apr 2014 16:40:41 -0400 >> Joe Lawrence wrote: >> >> > Since bsg creation is a side effect of sas_rphy_add, move its >> > complementary removal call into sas_rphy_remove. >> >> Hello James, >> >> This a resend of a patch posted to quiet an end-device sysfs warning in >> its device deletion path when removing the mpt2sas driver: >> >> (initial report) >> sysfs group not found for kobject on mpt2sas unload >> http://marc.info/?t=13849746004 >> >> (patch + ACK + comments) >> http://marc.info/?t=13860945551 >> >> (gentoo, LSI repro) >> mpt2sas driver barfs when force removing a drive on 3.13.1 >> http://marc.info/?t=13912235131 >> >> Dan Williams had a few other suggestions for cleanup in this area, but >> those could be handled in a subsequent patch. This one missed 3.14 >> inclusion and the warning still occurs in Linus's tree as of today >> (post scsi/block merge) hence the repost. > > Ping? Just reproduced on 3.15-rc3. Also found this: > > (~500 repro count on Fedora Problem Tracker) > http://retrace.fedoraproject.org/faf/problems/1623421/ > Hmm, looks like the re-send is missing my ack. Feel free to add: Acked-by: Dan Williams James, do you want me to shepherd sas infrastructure bits (libsas/scsi_transport_sas) along in my own tree, or...? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] libsas: modify SATA error handler
On Tue, Jun 3, 2014 at 12:13 AM, Xiangliang Yu wrote: > Hi, > > How about the patch? > I'll take a look today. Sorry for the latency. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html