Re: [PATCH] virtio_scsi: don't select CONFIG_BLK_DEV_INTEGRITY
On 23/04/2015 20:10, Christoph Hellwig wrote: > T10 PI is just another optional feature, LLDDs should work without > the infrastructure. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/Kconfig | 1 - > drivers/scsi/virtio_scsi.c | 11 ++- > 2 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig > index b021bcb..896bea6 100644 > --- a/drivers/scsi/Kconfig > +++ b/drivers/scsi/Kconfig > @@ -1743,7 +1743,6 @@ config SCSI_BFA_FC > config SCSI_VIRTIO > tristate "virtio-scsi support" > depends on VIRTIO > - select BLK_DEV_INTEGRITY > help >This is the virtual HBA driver for virtio. If the kernel will >be used in a virtual machine, say Y or M. > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index f164f24..0f2bb5b 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -501,6 +501,7 @@ static void virtio_scsi_init_hdr(struct virtio_device > *vdev, > cmd->crn = 0; > } > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > static void virtio_scsi_init_hdr_pi(struct virtio_device *vdev, > struct virtio_scsi_cmd_req_pi *cmd_pi, > struct scsi_cmnd *sc) > @@ -524,6 +525,7 @@ static void virtio_scsi_init_hdr_pi(struct virtio_device > *vdev, > blk_rq_sectors(rq) * > bi->tuple_size); > } > +#endif > > static int virtscsi_queuecommand(struct virtio_scsi *vscsi, >struct virtio_scsi_vq *req_vq, > @@ -546,11 +548,14 @@ static int virtscsi_queuecommand(struct virtio_scsi > *vscsi, > > BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE); > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > if (virtio_has_feature(vscsi->vdev, VIRTIO_SCSI_F_T10_PI)) { > virtio_scsi_init_hdr_pi(vscsi->vdev, &cmd->req.cmd_pi, sc); > memcpy(cmd->req.cmd_pi.cdb, sc->cmnd, sc->cmd_len); > req_size = sizeof(cmd->req.cmd_pi); > - } else { > + } else > +#endif > + { All good so far. > virtio_scsi_init_hdr(vscsi->vdev, &cmd->req.cmd, sc); > memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len); > req_size = sizeof(cmd->req.cmd); > @@ -1002,6 +1007,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE; > shost->nr_hw_queues = num_queues; > > +#ifdef VIRTIO_SCSI_F_T10_PI This symbol is always defined; it is part of the uapi/ header. I think you wanted #ifdef CONFIG_BLK_DEV_INTEGRITY here as well. However, you can remove this #ifdef completely. This is because... > if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) { ... this condition should always be false unless the feature is negotiated, and the feature should not be negotiated when CONFIG_BLK_DEV_INTEGRITY is disabled. Maybe you can add a "WARN_ON(!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY))" depending on your taste. > host_prot = SHOST_DIF_TYPE1_PROTECTION | > SHOST_DIF_TYPE2_PROTECTION | > SHOST_DIF_TYPE3_PROTECTION | > SHOST_DIX_TYPE1_PROTECTION | Also, if you leave the #ifdef, host_prot is now unused if CONFIG_BLK_DEV_INTEGRITY is disabled. If you prefer to keep the #ifdef you should also change host_prot to a macro (e.g. VIRTIO_SCSI_HOST_PROT) to silence the compiler warning. > @@ -1010,6 +1016,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > scsi_host_set_prot(shost, host_prot); > scsi_host_set_guard(shost, SHOST_DIX_GUARD_CRC); > } > +#endif > > err = scsi_add_host(shost, &vdev->dev); > if (err) > @@ -1090,7 +1097,9 @@ static struct virtio_device_id id_table[] = { > static unsigned int features[] = { > VIRTIO_SCSI_F_HOTPLUG, > VIRTIO_SCSI_F_CHANGE, > +#ifdef VIRTIO_SCSI_F_T10_PI This one definitely should be #ifdef CONFIG_BLK_DEV_INTEGRITY. Thanks, Paolo > VIRTIO_SCSI_F_T10_PI, > +#endif > }; > > static struct virtio_driver virtio_scsi_driver = { > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] virtio_scsi: don't select CONFIG_BLK_DEV_INTEGRITY
On 27/04/2015 14:56, Christoph Hellwig wrote: > T10 PI is just another optional feature, LLDDs should work without > the infrastructure. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/Kconfig | 1 - > drivers/scsi/virtio_scsi.c | 11 ++- > 2 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig > index b021bcb..896bea6 100644 > --- a/drivers/scsi/Kconfig > +++ b/drivers/scsi/Kconfig > @@ -1743,7 +1743,6 @@ config SCSI_BFA_FC > config SCSI_VIRTIO > tristate "virtio-scsi support" > depends on VIRTIO > - select BLK_DEV_INTEGRITY > help >This is the virtual HBA driver for virtio. If the kernel will >be used in a virtual machine, say Y or M. > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index f164f24..285f775 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -501,6 +501,7 @@ static void virtio_scsi_init_hdr(struct virtio_device > *vdev, > cmd->crn = 0; > } > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > static void virtio_scsi_init_hdr_pi(struct virtio_device *vdev, > struct virtio_scsi_cmd_req_pi *cmd_pi, > struct scsi_cmnd *sc) > @@ -524,6 +525,7 @@ static void virtio_scsi_init_hdr_pi(struct virtio_device > *vdev, > blk_rq_sectors(rq) * > bi->tuple_size); > } > +#endif > > static int virtscsi_queuecommand(struct virtio_scsi *vscsi, >struct virtio_scsi_vq *req_vq, > @@ -546,11 +548,14 @@ static int virtscsi_queuecommand(struct virtio_scsi > *vscsi, > > BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE); > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > if (virtio_has_feature(vscsi->vdev, VIRTIO_SCSI_F_T10_PI)) { > virtio_scsi_init_hdr_pi(vscsi->vdev, &cmd->req.cmd_pi, sc); > memcpy(cmd->req.cmd_pi.cdb, sc->cmnd, sc->cmd_len); > req_size = sizeof(cmd->req.cmd_pi); > - } else { > + } else > +#endif > + { > virtio_scsi_init_hdr(vscsi->vdev, &cmd->req.cmd, sc); > memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len); > req_size = sizeof(cmd->req.cmd); > @@ -1002,6 +1007,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE; > shost->nr_hw_queues = num_queues; > > +#ifdef CONFIG_BLK_DEV_INTEGRITY > if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) { > host_prot = SHOST_DIF_TYPE1_PROTECTION | > SHOST_DIF_TYPE2_PROTECTION | > SHOST_DIF_TYPE3_PROTECTION | > SHOST_DIX_TYPE1_PROTECTION | > @@ -1010,6 +1016,7 @@ static int virtscsi_probe(struct virtio_device *vdev) > scsi_host_set_prot(shost, host_prot); > scsi_host_set_guard(shost, SHOST_DIX_GUARD_CRC); > } > +#endif > > err = scsi_add_host(shost, &vdev->dev); > if (err) > @@ -1090,7 +1097,9 @@ static struct virtio_device_id id_table[] = { > static unsigned int features[] = { > VIRTIO_SCSI_F_HOTPLUG, > VIRTIO_SCSI_F_CHANGE, > +#ifdef CONFIG_BLK_DEV_INTEGRITY > VIRTIO_SCSI_F_T10_PI, > +#endif > }; > > static struct virtio_driver virtio_scsi_driver = { > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()
_t rc = 0; > unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES > / sizeof(struct pages *); > + unsigned int flags = FOLL_REMOTE; > > /* Work out address and page range required */ > if (len == 0) > return 0; > nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1; > > + if (vm_write) > + flags |= FOLL_WRITE; > + > while (!rc && nr_pages && iov_iter_count(iter)) { > int pages = min(nr_pages, max_pages_per_loop); > size_t bytes; > @@ -104,8 +108,7 @@ static int process_vm_rw_single_vec(unsigned long addr, >* current/current->mm >*/ > pages = __get_user_pages_unlocked(task, mm, pa, pages, > - vm_write, 0, process_pages, > - FOLL_REMOTE); > + process_pages, flags); > if (pages <= 0) > return -EFAULT; > > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c > index db96688..8035cc1 100644 > --- a/virt/kvm/async_pf.c > +++ b/virt/kvm/async_pf.c > @@ -84,7 +84,8 @@ static void async_pf_execute(struct work_struct *work) >* mm and might be done in another context, so we must >* use FOLL_REMOTE. >*/ > - __get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE); > + __get_user_pages_unlocked(NULL, mm, addr, 1, NULL, > + FOLL_WRITE | FOLL_REMOTE); > > kvm_async_page_present_sync(vcpu, apf); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 81dfc73..28510e7 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1416,10 +1416,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool > *async, bool write_fault, > down_read(¤t->mm->mmap_sem); > npages = get_user_page_nowait(addr, write_fault, page); > up_read(¤t->mm->mmap_sem); > - } else > + } else { > + unsigned int flags = FOLL_TOUCH | FOLL_HWPOISON; > + > + if (write_fault) > + flags |= FOLL_WRITE; > + > npages = __get_user_pages_unlocked(current, current->mm, addr, > 1, > -write_fault, 0, page, > -FOLL_TOUCH|FOLL_HWPOISON); > +page, flags); > + } > if (npages != 1) > return npages; > > Acked-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On 08/12/2016 00:31, Tyrel Datwyler wrote: > The first byte of each CRQ entry is used to indicate whether an entry is > a valid response or free for the VIOS to use. After processing a > response the driver sets the valid byte to zero to indicate the entry is > now free to be reused. Add a memory barrier after this write to ensure > no other stores are reordered when updating the valid byte. > > Signed-off-by: Tyrel Datwyler > --- > drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c > b/drivers/scsi/ibmvscsi/ibmvscsi.c > index d9534ee..2f5b07e 100644 > --- a/drivers/scsi/ibmvscsi/ibmvscsi.c > +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c > @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data) > while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) { > ibmvscsi_handle_crq(crq, hostdata); > crq->valid = VIOSRP_CRQ_FREE; > + wmb(); > } > > vio_enable_interrupts(vdev); > @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data) > vio_disable_interrupts(vdev); > ibmvscsi_handle_crq(crq, hostdata); > crq->valid = VIOSRP_CRQ_FREE; > + wmb(); Should this driver use virt_wmb instead? Paolo > } else { > done = 1; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio_scsi: Implement fc_host
On 16/01/2017 17:04, Fam Zheng wrote: > + node_name = virtio_cread64(vdev, > + offsetof(struct virtio_scsi_config, primary_wwnn)); > + port_name = virtio_cread64(vdev, > + offsetof(struct virtio_scsi_config, primary_wwpn)); > + } else { > + node_name = virtio_cread64(vdev, > + offsetof(struct virtio_scsi_config, secondary_wwnn)); > + port_name = virtio_cread64(vdev, > + offsetof(struct virtio_scsi_config, secondary_wwpn)); Is the endianness correct for big-endian host here? Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] virtio-scsi: Implement FC_HOST feature
> How it this supposed to work? > You do export the WWPN/WWNN of the associated host to the guest (nb: > will get interesting for non NPIV setups ...), but virtio scsi will > still do a LUN remapping. > IE the LUNs you see on the host will be different from the LUNs > presented to the guest. This is taken care of in the host by presenting to the host all LUNs from a host's NPIV vHBA. (Libvirt probably would be the one taking care of this, because QEMU may not have enough permissions). > Plus you don't _actually_ expose the FC host, but rather the WWPN of the > host presenting the LUN. > So how do you handle LUNs from different FC hosts on the guest? I'm not sure I understand. Neither I nor Fam know this stuff very well, but we are trying to do the same as Hyper-V (and other proprietary hypervisors too). > Overall, I'm not overly happy with this approach. > You already added WWPN ids to the virtio transport, so why didn't you > update the LUN field, too, to avoid this ominous LUN remapping? Is this your old idea of adding a separate target field to commands, in order to support 64-bit LUNs? That is separate, and most FC drivers only default to 16-bit LUNs anyway. > And we really should make sure to have a single FC host in the guest > presenting all LUNs. Yes, of course. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio_scsi: Implement fc_host
On 16/01/2017 18:26, Fam Zheng wrote: >> Is the endianness correct for big-endian host here? > > I think so. The fc_host sysfs uses u64 to represent port_name and node_name, > this patch does the same, so using virtio_* helpers for these fields should > handle the endianness correctly. I was suspicious about it because they are defined as "u8 x[8]" in the virtio_scsi_config struct. So you would need to read with virtio_cread_bytes and pass the result to wwn_to_u64. For example, if you have 0x500123456789abcd this would be 0x50 0x01 0x23 0x45 0x67 0x89 0xab 0cd in virtio_scsi_config, and then virtio_cread64 would read it as a little-endian u64, 0xcdab896745230150. Maybe your QEMU patch is also writing things as little-endian 64-bit integers, rather than 8-element arrays of bytes? Paolo > Maybe we should use u64 in struct virtio_scsi_config as well? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] virtio_scsi: Implement fc_host
- Original Message - > From: "Michael S. Tsirkin" > To: "Fam Zheng" > Cc: linux-ker...@vger.kernel.org, "Paolo Bonzini" , > linux-scsi@vger.kernel.org, "James E.J. > Bottomley" , "Jason Wang" , > "Martin K. Petersen" > , stefa...@redhat.com, > virtualizat...@lists.linux-foundation.org > Sent: Thursday, January 26, 2017 11:06:27 PM > Subject: Re: [PATCH v2 2/2] virtio_scsi: Implement fc_host > > On Thu, Jan 26, 2017 at 11:41:09AM +0800, Fam Zheng wrote: > > This implements the VIRTIO_SCSI_F_FC_HOST feature by reading the config > > fields and presenting them as sysfs fc_host attributes. The config > > change handler is added here because primary_active will toggle during > > migration. > > Looks like there's active discussion on virtio tc mailing list. > It's ok to post patches meanwhile but best as RFC, > and repost after controversy is resolved. Discussion on the TC mailing list was not about the merit of the feature, only about the timing of the vote. Paolo > > > > > Signed-off-by: Fam Zheng > > --- > > drivers/scsi/virtio_scsi.c | 60 > > +- > > 1 file changed, 59 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > > index ec91bd0..1bb330c 100644 > > --- a/drivers/scsi/virtio_scsi.c > > +++ b/drivers/scsi/virtio_scsi.c > > @@ -28,6 +28,7 @@ > > #include > > #include > > #include > > +#include > > #include > > > > #define VIRTIO_SCSI_MEMPOOL_SZ 64 > > @@ -795,6 +796,15 @@ static struct scsi_host_template > > virtscsi_host_template_multi = { > > .track_queue_depth = 1, > > }; > > > > +static struct fc_function_template virtscsi_fc_template = { > > + .show_host_node_name = 1, > > + .show_host_port_name = 1, > > + .show_host_port_type = 1, > > + .show_host_port_state = 1, > > +}; > > + > > +static struct scsi_transport_template *virtscsi_fc_transport_template; > > + > > #define virtscsi_config_get(vdev, fld) \ > > ({ \ > > typeof(((struct virtio_scsi_config *)0)->fld) __val; \ > > @@ -956,15 +966,42 @@ static int virtscsi_init(struct virtio_device *vdev, > > return err; > > } > > > > +static void virtscsi_update_fc_host_attrs(struct virtio_device *vdev) > > +{ > > + struct Scsi_Host *shost = vdev->priv; > > + u8 node_name[8], port_name[8]; > > + > > + if (virtscsi_config_get(vdev, primary_active)) { > > + virtio_cread_bytes(vdev, > > + offsetof(struct virtio_scsi_config, primary_wwnn), > > + &node_name, 8); > > + virtio_cread_bytes(vdev, > > + offsetof(struct virtio_scsi_config, primary_wwpn), > > + &port_name, 8); > > + } else { > > + virtio_cread_bytes(vdev, > > + offsetof(struct virtio_scsi_config, secondary_wwnn), > > + &node_name, 8); > > + virtio_cread_bytes(vdev, > > + offsetof(struct virtio_scsi_config, secondary_wwpn), > > + &port_name, 8); > > + } > > This is racy, isn't it? You need to wrap this in a generation check > otherwise read can race with primary_active changing. > And you might want a wrapper to virtio_cread_bytes that does not > include generation check. > > > + fc_host_node_name(shost) = wwn_to_u64(node_name); > > + fc_host_port_name(shost) = wwn_to_u64(port_name); > > + fc_host_port_type(shost) = FC_PORTTYPE_NPORT; > > + fc_host_port_state(shost) = FC_PORTSTATE_ONLINE; > > +} > > + > > static int virtscsi_probe(struct virtio_device *vdev) > > { > > - struct Scsi_Host *shost; > > + struct Scsi_Host *shost = NULL; > > struct virtio_scsi *vscsi; > > int err; > > u32 sg_elems, num_targets; > > u32 cmd_per_lun; > > u32 num_queues; > > struct scsi_host_template *hostt; > > + bool fc_host_enabled; > > > > if (!vdev->config->get) { > > dev_err(&vdev->dev, "%s failure: config access disabled\n", > > @@ -987,6 +1024,9 @@ static int virtscsi_probe(struct virtio_device *vdev) > > if (!shost) > > return -ENOMEM; > > > > + fc_host_enabled = virtio_has_feature(vdev, VIRTIO_SCSI_F_FC_HOST); > > + if (fc_host_enabled) > > +
Re: [PATCH] virtio_scsi: check on resp->sense_len instead of 'sense_buffer'
Il 18/07/2014 16:57, Ming Lei ha scritto: > - if (sc->sense_buffer) { > + if (resp->sense_len) { In the (unlikely) case that sc->sense_buffer == NULL, you'd pass a NULL to memcpy. If you want, you can change this if to if (sc->sense_buffer && resp->sense_len) but frankly it seems like slightly pointless churn to me. Paolo > memcpy(sc->sense_buffer, resp->sense, > min_t(u32, resp->sense_len, VIRTIO_SCSI_SENSE_SIZE)); > - if (resp->sense_len) > - set_driver_byte(sc, DRIVER_SENSE); > + set_driver_byte(sc, DRIVER_SENSE); > } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 23/08/2014 16:52, Hans de Goede ha scritto: > Hi All, > > Now that the UAS driver is no longer marked as CONFIG_BROKEN, > I'm getting quite a few bug reports about issues with UAS drives. > > One if the issues is that there might be a number of bugs in the > abort handling path, as I don't think that was ever tested properly. > > So I'm wondering is there a way to test the abort path with a real > drive? E.G. submit some command which is known to take a significant > amount of time, and then abort it right after submitting ? You could have some luck with QEMU's UAS emulation. If you set QEMU's I/O throttling options low enough (e.g. 100KB/sec), and then use dd with iflag=direct and a largish block size (a few MBs), the guest should abort its I/O soon enough. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 25/08/2014 12:28, Bart Van Assche ha scritto: > > From SPC-4: "7.5.8 Control mode page [ ... ] A task aborted status (TAS) > bit set to zero specifies that aborted commands shall be terminated by > the device server without any response to the application client. A TAS > bit set to one specifies that commands aborted by the actions of an I_T > nexus other than the I_T nexus on which the command was received shall > be completed with TASK ABORTED status (see SAM-5)." Note the "aborted by the actions of an I_T nexus other than the I_T nexus on which the command was received". In practice, this means that TASK ABORTED should only happen if you use the CLEAR TASK SET tmf and TST is not set to 001b (i.e. _not_ to "per I_T nexus") in the Control mode page. It should never happen for a pen drive. Setting TASK ABORTED aside, the important part is that an abort can do one of two things: - complete the command, and then eh_abort should return after the driver has noticed the completion and called the ->scsi_done callback for the Scsi_Cmnd*. - abort the command, and then the driver should never call the ->scsi_done callback for the Scsi_Cmnd*. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 25/08/2014 13:26, Hans de Goede ha scritto: > Thanks Bart and Paolo, your insights into this are greatly appreciated. > > So with uas there are separate usb transaction for cmd, data in, data out > and sense for each tag. At the time of abort, usually one of data in / data > out and a sense usb transaction will be outstanding. > > There already is logic in the driver to kill the data in / out transactions > if a sense gets returned (usually with an error) before they are done. > > So if I'm reading this correctly, then on a successful abort, the sense > transaction (if not already completed by the target) should be cancelled as > it will never complete, correct ? Yes. More precisely, once the response IU comes back for the abort, there should be no more IUs for that task. Figure 13 ("Multiple Command Example") in the UAS spec actually shows that. At least, that's what it should be like in theory. I suspect firmware bugs will abound in this area, but at least you shouldn't be actively expecting an IU for an aborted task. > I wish the uas spec contained some more details on this, but it is very > vague wrt task management (well it is vague in general, task management just > is extra hard to test). That may complicate usage of the spec, but it's not necessarily a bad thing; a SCSI transport should heavily rely on the SCSI architecture and thus the SAM spec, and vagueness can be a hint that the transport was done right. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 28/08/2014 14:04, Hannes Reinecke ha scritto: >> >> Setting TASK ABORTED aside, the important part is that an abort can do >> one of two things: >> >> - complete the command, and then eh_abort should return after the driver >> has noticed the completion and called the ->scsi_done callback for the >> Scsi_Cmnd*. >> >> - abort the command, and then the driver should never call the >> ->scsi_done callback for the Scsi_Cmnd*. >> > In practice we rely on the latter behaviour; when ->scsi_done is called > while the command is under eh_abort _really bad things_ > will happen. > As soon as eh_abort is called control is transferred back to the > SCSI midlayer, so any LLDD should never send completions for these > commands back to the midlayer. No, this is wrong. I think we have sorted it out a couple of months ago. virtio-scsi for example (due to QEMU quirks) will do the former more often than not. Ignoring scsi_eh_done which is just as harmless, ->scsi_done does nothing more than calling blk_complete_request. If the command is under abort, it has already been marked as complete by the block layer's timeout timer---see blk_rq_timed_out_timer and blk_rq_check_expired---or by blk_abort_request. Then, blk_complete_request will do nothing because its call to blk_mark_rq_complete returns true. All this, of course, as long as ->scsi_done is called _before_ eh_abort returns. Otherwise, occasions abound for uses-after-free, which is what virtio-scsi got until commit 8faeb529b2da (virtio-scsi: fix various bad behavior on aborted requests, 2014-06-04). Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 28/08/2014 14:26, Hans de Goede ha scritto: >> > Then, blk_complete_request will do nothing because its call to >> > blk_mark_rq_complete returns true. >> > >> > All this, of course, as long as ->scsi_done is called _before_ eh_abort >> > returns. > What about calling scsi_done after eh_abort if eh_abort returned FAILED? I invoke the fifth amendment. :) Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 28/08/2014 16:17, Hannes Reinecke ha scritto: >> > As mentioned earlier, as soon as SCSI EH is invoked control > is assumed to be transferred back to the SCSI midlayer. > How the midlayer interprets any return value from the various eh_XX > callbacks is immaterial to the LLDD. > > So even if the eh_abort returns FAILED control is still with the SCSI > midlayer, so the earlier statements apply. > IE the command will be short-circuited by the block layer anyway if > ->scsi_done() is called. As I parsed it, the question is not whether the short-circuiting will happen. It's whether you will have use-after-free bugs or not if you call ->scsi_done() after eh_abort returns FAILED. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 28/08/2014 17:50, Elliott, Robert (Server Storage) ha scritto: > Is the block layer prevented from issuing a new command with the > same tag before the error handling is finished? Tags are chosen by the LLDs, so it's up to it to pick the right tags. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging scsi abort handling ?
Il 29/08/2014 08:08, Hannes Reinecke ha scritto: >> > No. > FAILED for any eh_abort_cmd() means that the TMF hasn't been sent. > So the midlayer escalates to the next EH step. > The command will only ever be re-issued once EH completes. Then the answer to Hans's question is yes. It is legal to call ->scsi_done() after the eh_abort handler returns FAILED. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
Il 09/10/2014 06:14, Nicholas A. Bellinger ha scritto: > AFAICT from qemu code, the ioctl VHOST_SCSI_CLEAR_ENDPOINT is always > called during shutdown in order to release the endpoint and drop this > new configfs dependency. As far as I can see, the only path leading to the ioctl is vhost_scsi_set_status->vhost_scsi_stop. That only happens if the guest driver resets the device upon shutdown, or via vhost_scsi_unrealize as you pointed out. But unrealize() is only called when a device is hot-unplugged. It does not happen if you close QEMU with SIGTERM, ctrl-c, or with the "quit" command, because no attempt is done to bring down the VM data structures (or free memory, or close file descriptors) in case of a fatal exit. The kernel should do that for us. Besides that... > The question is, what happens when qemu crashes..? Is there currently > an assurance that VHOST_SCSI_CLEAR_ENDPOINT is called via the normal > VirtioDeviceClass->unrealize() when qemu exits abnormally..? ... of course nothing is called if you SIGKILL QEMU. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
Il 09/10/2014 10:49, Paolo Bonzini ha scritto: > > It does not happen if you close QEMU with SIGTERM, ctrl-c, or with the > "quit" command, because no attempt is done to bring down the VM data > structures (or free memory, or close file descriptors) in case of a > fatal exit. The kernel should do that for us. ... and in the case of vhost-scsi, doesn't it do that when vhost_scsi_release calls vhost_scsi_clear_endpoint? Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Application error handling with write-back caching
On 10/05/2016 16:16, James Bottomley wrote: > > If "is performed" just means "completes", maybe with an error, the > > application would have to resubmit write requests and then try to > > flush the write cache again. > > > > I'm not aware of applications that keep acknowledged write data > > around until the cache flush completion in order to retry writes. > > I think you may be misunderstanding the nature of the returned error. > It will be permanent and fatal and usually signal that the device has > a failed sector that can't be remapped and so the device itself has for > most purposes failed. The only recovery is if you happen to have RAID, > in which case the RAID layer will mostly take care of it. What about a SPACE ALLOCATION FAILED error or a similar error that can be fixed by administrator actions (or just by a concurrent process doing an UNMAP)? Would a subsequent cache flush cause data loss? Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Application error handling with write-back caching
On 10/05/2016 19:31, James Bottomley wrote: > > What about a SPACE ALLOCATION FAILED error or a similar error that > > can be fixed by administrator actions (or just by a concurrent > > process doing an UNMAP)? Would a subsequent cache flush cause data > > loss? > > You're now asking about how these are handled? It's not a SCSI > problem. I believe if you look at the various layers, RAID would still > treat it as fatal and fail the drive and so would most filesystems. > The AEN warnings for TP are reported, but the admin has to sort it out > before they become a fatal error. Thanks, fatal errors are fine I guess. We were worried that the next SYNCHRONIZE CACHE would succeed and throw away the writes because it has already "performed a write medium operation". POSIX fsync is pretty underspecified in this respect too; gluster has been throwing away those writes for a long time! It stopped now because we explained the issue to them, but it's pointless if the next layer below does the same---hence Stefan's question. (In our case the next layer is not the page cache, because we generally use O_DIRECT. Evicting dirty pages from the page cache would be okay if the process(es) that wrote them are SIGKILLed, but in general it would be a problem too). Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist
On 06/06/2016 16:22, Hannes Reinecke wrote: > So either we dig into what went wrong with qemu 0.8, or we figure out > from which qemu version things start to behave nicely, and blacklist > earlier versions. > > > > Either way, this patch is wrong. > > > > If we can identify which versions work, we can update it. Otherwise > > I think we have to be conservative. > > So far we just had this single report where the upstream kernel didn't > work correctly with a (really old) version of qemu. > Hardly justifying blacklisting qemu CD-ROM in general. To further complicate the matter there are two QEMU MMC devices: 1) ATAPI - vendor "QEMU" / product name "QEMU CD-ROM" before QEMU 0.10.0 - vendor "QEMU" / product name "QEMU DVD-ROM" since QEMU 0.10.0 2) native SCSI - vendor "QEMU" / product name "QEMU CD-ROM" VPD in the SCSI CD-ROM probably has always worked, but I would blacklist up to 0.11 inclusive just to be safe. Those versions are dead anyway. VPD in the ATAPI CD-ROM is newer, and that's where the bug was reported on: > [4.439488] ata2.00: ATAPI: QEMU CD-ROM, 0.8.2, max UDMA/100 > [4.443649] ata2.00: configured for MWDMA2 > [4.450267] scsi 1:0:0:0: CD-ROMQEMU QEMU CD-ROM 0.8. > PQ: 0 ANSI: 5 > [4.464317] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 > frozen > [4.464319] ata2.00: BMDMA stat 0x5 > [4.464339] ata2.00: cmd a0/01:00:00:00:01/00:00:00:00:00/a0 tag 0 dma > 16640 in > [4.464339] Inquiry 12 01 00 00 ff 00res > 48/20:02:00:24:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation) > [4.464341] ata2.00: status: { DRDY DRQ } For ATAPI, you have to blacklist all versions up to 2.2 inclusive. This gives: - QEMU / QEMU CD-ROM / 0.8.(this is IDE and SCSI) - QEMU / QEMU CD-ROM / 0.9.(this is IDE and SCSI) - QEMU / QEMU CD-ROM / 0.10(this is SCSI only) - QEMU / QEMU CD-ROM / 0.11(this is SCSI only) - QEMU / QEMU DVD-ROM / 0.8. (this is IDE only) - QEMU / QEMU DVD-ROM / 0.9. (this is IDE only) - QEMU / QEMU DVD-ROM / 0.10 (this is IDE only) - QEMU / QEMU DVD-ROM / 0.11 (this is IDE only) - QEMU / QEMU DVD-ROM / 0.12 (this is IDE only) - QEMU / QEMU DVD-ROM / 0.13 (this is IDE only) - QEMU / QEMU DVD-ROM / 0.14 (this is IDE only) - QEMU / QEMU DVD-ROM / 0.15 (this is IDE only) - QEMU / QEMU DVD-ROM / 1.0(this is IDE only) - QEMU / QEMU DVD-ROM / 1.1(this is IDE only) - QEMU / QEMU DVD-ROM / 1.2(this is IDE only) - QEMU / QEMU DVD-ROM / 1.3(this is IDE only) - QEMU / QEMU DVD-ROM / 1.4(this is IDE only) - QEMU / QEMU DVD-ROM / 1.5(this is IDE only) - QEMU / QEMU DVD-ROM / 1.6(this is IDE only) - QEMU / QEMU DVD-ROM / 1.7(this is IDE only) - QEMU / QEMU DVD-ROM / 2.0(this is IDE only) - QEMU / QEMU DVD-ROM / 2.1(this is IDE only) - QEMU / QEMU DVD-ROM / 2.2(this is IDE only) Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist
On 06/06/2016 17:41, John Snow wrote: > On 06/06/2016 11:05 AM, Paolo Bonzini wrote: >> For ATAPI, you have to blacklist all versions up to 2.2 inclusive. >> >> This gives: >> >> - QEMU / QEMU CD-ROM / 0.8.(this is IDE and SCSI) >> - QEMU / QEMU CD-ROM / 0.9.(this is IDE and SCSI) >> - QEMU / QEMU CD-ROM / 0.10(this is SCSI only) >> - QEMU / QEMU CD-ROM / 0.11(this is SCSI only) >> - QEMU / QEMU DVD-ROM / 0.8. (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.9. (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.10 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.11 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.12 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.13 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.14 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 0.15 (this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.0(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.1(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.2(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.3(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.4(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.5(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.6(this is IDE only) >> - QEMU / QEMU DVD-ROM / 1.7(this is IDE only) >> - QEMU / QEMU DVD-ROM / 2.0(this is IDE only) >> - QEMU / QEMU DVD-ROM / 2.1(this is IDE only) >> - QEMU / QEMU DVD-ROM / 2.2(this is IDE only) >> > > If this bug is caused by a missing VPD response, Paolo's version history > here is correct for upstream versions. > > Various downstreams may have backported the VPD fix to older versions, > we need to be careful not to block those, too ... so targeting the core > behavior seems like the more strictly correct, easily maintainable solution. I think this is not practical. I'm okay with the big hammer if an algorithmic fix is not feasible; but otherwise it does seem a better idea than blacklisting based on inquiry data... Thanks, Paolo > Why not just dynamically blacklist devices that fail to respond to VPD > inquiries? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist
On 06/06/2016 17:47, John Snow wrote: > > > Various downstreams may have backported the VPD fix to older versions, > > > we need to be careful not to block those, too ... so targeting the core > > > behavior seems like the more strictly correct, easily maintainable > > > solution. > > > > I think this is not practical. I'm okay with the big hammer if an > > algorithmic fix is not feasible; but otherwise it does seem a better > > idea than blacklisting based on inquiry data... > > You think the more practical solution is a SCSI driver that can hang > because of an incorrect/missing response and to maintain a carefully > curated blacklist to work around this behavior? The best solution would be an algorithmic fix, perhaps predicated by some kind of quirk bit. A carefully curated blacklist is impossible because you cannot account for a zillion downstreams, most of which probably don't change the inquire vendor/product data; version numbers are awful. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
I'm not sure what is more ridiculous, whether the seven pings or the lack of review... Paolo Il 06/02/2013 16:15, Paolo Bonzini ha scritto: > This series regards the whitelist that is used for the SG_IO ioctl. This > whitelist has three problems: > > * the bitmap of allowed commands is designed for MMC devices (roughly, > "play/burn CDs without requiring root") but some opcodes overlap across SCSI > device classes and have different meanings for different classes. > > * also because the bitmap of allowed commands is designed for MMC devices > only, some commands are missing even though they are generally useful and > not insecure. At least not more insecure than anything else you can > do if you have access to /dev/sdX or /dev/stX nodes. > > * the whitelist can be disabled per-process but not per-disk. In addition, > the required capability (CAP_SYS_RAWIO) gives access to a range of other > resources, enough to make it insecure. > > The series corrects these problems. Patches 1-4 solve the first problem, > which also has an assigned CVE, by using different bitmaps for the various > device classes. Patches 5-11 solve the second by adding more commands > to the bitmaps. Patches 12 and 13 solve the third, and were already > posted but ignored by the maintainers despite multiple pings. > > Note: checkpatch hates the formatting of the command table. I know about > this, > and ensured that there are no errors in the rest of the code. The current > formatting is IMHO quite handy, and roughly based on the files available > from the SCSI standard body. > > Ok for the next merge window? > > Paolo > > v1->v2: remove 2 MMC commands and 6 SBC commands (see patches 6 and 9 > for details). Added patch 14 and added a few more scanner > commands based on SANE (scanners are not whitelisted by default, > also were not in v1, but this makes it possible to opt into the > whitelist out of paranoia). Removed C++ comments. Removed the > large #if 0'd list of commands that the kernel does not pass > though. Marked blk_set_cmd_filter_defaults as __init. > > > Paolo Bonzini (14): > sg_io: pass request_queue to blk_verify_command > sg_io: reorganize list of allowed commands > sg_io: use different default filters for each device class > sg_io: resolve conflicts between commands assigned to multiple > classes (CVE-2012-4542) > sg_io: whitelist a few more commands for rare & obsolete device types > sg_io: whitelist another command for multimedia devices > sg_io: whitelist a few more commands for media changers > sg_io: whitelist a few more commands for tapes > sg_io: whitelist a few more commands for disks > sg_io: whitelist a few obsolete commands > sg_io: mark blk_set_cmd_filter_defaults as __init > sg_io: remove remnants of sysfs SG_IO filters > sg_io: introduce unpriv_sgio queue flag > sg_io: use unpriv_sgio to disable whitelisting for scanners > > Documentation/block/queue-sysfs.txt |8 + > block/blk-sysfs.c | 33 +++ > block/bsg.c |2 +- > block/scsi_ioctl.c | 369 > ++- > drivers/scsi/scsi_scan.c| 14 ++- > drivers/scsi/sg.c |6 +- > include/linux/blkdev.h |8 +- > include/linux/genhd.h |9 - > include/scsi/scsi.h |3 + > 9 files changed, 344 insertions(+), 108 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 11:32, Tejun Heo ha scritto: > On Wed, May 22, 2013 at 08:35:54AM +0200, Paolo Bonzini wrote: >> I'm not sure what is more ridiculous, whether the seven pings or the >> lack of review... > > So, ummm, I don't know what Jens is thinking but at this point I'm > basically waiting for someone else to pick it up as review to return > ratio is too low to continue. It doesn't seem like I can get the > series into a shape I can ack with reasonable amount of effort. Then please say so. I didn't find any comment in your review that I missed. > My memory is kinda hazy now but here are two review points that came > to my mind before giving up. > > * The response that I got after asking for justification basically > boiled down to "it has to". Whatever that means. For patches 1-4, it means that you're allowed to write to media when a file is opened for reading. The patches fix this. For patches 5-12, it means that you currently need root-equivalent privileges (CAP_SYS_RAWIO) to do "regular business" on any SCSI device that is not a CD-ROM or a tape or a disk. For patches 13-14, it means that you currently need root-equivalent privileges (CAP_SYS_RAWIO) to do operations on SCSI devices that require some level of trust, hence there is no way to confine this to a single device. But all this is in the cover letter, I'm just paraphrasing. > * In the patch series, fixes and feature changes are still mixed in > order. I gave up after this. Bugfixes are in patch 1-4. The patches first introduce the new table format without any semantic change, then they introduce per-class filters while still leaving the conflicting commands accessible with O_RDONLY, and finally fix the bug. If you have any better ideas, please tell me. I did try to optimize for reviewability and bisectability, if I screwed up I'd like to hear why. Whitelisting of extra commands is in patch 5-10. Additional related changes are in patches 11-14. Again, all this is in the cover letter. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 12:02, Tejun Heo ha scritto: > On Wed, May 22, 2013 at 11:53:30AM +0200, Paolo Bonzini wrote: >> Il 22/05/2013 11:32, Tejun Heo ha scritto: >>> On Wed, May 22, 2013 at 08:35:54AM +0200, Paolo Bonzini wrote: >>>> I'm not sure what is more ridiculous, whether the seven pings or the >>>> lack of review... >>> >>> So, ummm, I don't know what Jens is thinking but at this point I'm >>> basically waiting for someone else to pick it up as review to return >>> ratio is too low to continue. It doesn't seem like I can get the >>> series into a shape I can ack with reasonable amount of effort. >> >> Then please say so. I didn't find any comment in your review that I missed. > > Well, I've tried that multiple times and didn't get the results that I > was expecting each time, so doing it all over again felt pointless. > Even now, you just repeat what you've been saying and I'd have to > fight through each and every point. Yes, because I have no idea what _your_ point is. Let's look at the first submission. Patch 1 - acked by you. Patch 2 - discussions on the formatting. Every comment of yours has been accounted for, except for one. I wrote: > If you want opcodes visible, you can make them the comments, right? Yes, like "/* 0x00 */ CONSTANT, MASK". I still have a slight preference for the opcodes because if the constant ends up wrong, the head-scratching would be higher than if the opcode is wrong (the opcode is what you see in the dumps). You didn't answer; v2 was posted 15 days after the end of the v1 thread, so you had enough time to post more comments and have them addressed in v2. Yo me that means "fair enough". Patch 3-5 - no comment. Patch 6 - long discussion, ending with "The vast majority of the commands are added because Linux itself is using them", and with me removing some commands from the list according to your request. Patch 7-13 - no comment. Cover letter has no comment either. So you haven't commented on most patches or on the cover letter, and now you ask about clarification generically rather than about specific points of the commit messages or the cover letter. There is only sensible conclusion I can make. Namely, that you haven't even read those commit messages. So, I'm sorry if you did, but I don't have a crystal ball to understand what you found wrong, and there's nothing I can do about it. > It just doesn't feel worth the > effort. It'd be far less effort to just slurp the patches and > regurgitate them myself. If there is a fundamental misunderstanding, that wouldn't help anyway. We would have the same discussion in reverse when I review your patches. > I don't care that much about the changes > right now, so I'm just waiting for either someone else picking it up > or my yield with you somehow magically improving and the next refresh > addressing most of the issues. Well, so far there was just one pass, and not even a full one. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
> OK, let me try. I did draw straws with Jens at LSF to see who would > look at this and he lost, but the complexity of the patch set probably > makes it hard for him to find the time. Thanks. > The first problem, which Tejun already pointed out is that you've > combined a "bug fix" with a large feature set in such a way that they > can't be separated, so saying the first four patches fix a bug isn't > helpful. Fair enough. Though to be precise, this is a combined patchset also because previous attempts to get similar patches reviewed proceeded with glacial pace. For example, the unpriv_sgio patch had been sent in November 2012 (https://patchwork.kernel.org/patch/1735871/). It got two acks from Tejun, and was never applied despite two pings. My attempt to include UNMAP and WRITE SAME into the whitelist (even before I found out by chance the overlap) dates back to July 2012, and was only reviewed two months after. > The second problem is that it's not at all clear what the bug actually > is. You have to wade through tons of red hat bugzillas before you come > up with the fact that there's one command which we allow users to send > which is ambiguous: GPCMD_READ_SUBCHANNEL which has the same opcode as > UNMAP on a disk. No, you don't need to. Just read the commit message of patch 4. > Once you finally work this out, you wonder what all > the fuss is about because UNMAP is advisory ... even if an unprivileged > user can now send the command, it can't be used to damage any data or > even get access to any data, so there doesn't seem to be an actual bug > to fix at all. This doesn't look advisory to me: $ sudo chmod 444 /dev/sdb $ sha1sum /dev/sdb e15720aa0d26856f0486867634b6737c30ea7346 /dev/sdb $ echo -n $'%\x16%\x10%%\x40%' | tr % \\0 | \ sg_raw --readonly /dev/sdb -s24 42 00 00 00 00 00 00 00 18 00 $ sha1sum /dev/sdb aab335ccc669bbbc85d727870b5941dc3581f025 /dev/sdb It doesn't look readonly, either. > The various committees do try hard to ensure there's no opcode > overlap ... although they don't always succeed as you see with the UNMAP > above, so I'm not at all sure we need the huge complexity of per scsi > device type command filter lists, which is what the rest of the feature > additions is about. That was introduced because Tejun didn't want to enable more than the bare minimum for MMC devices. He was worried about hypothetical scenarios with vendors recycling opcodes in a way that is not suitable for exposing to unprivileged userland (he reinforced this, for example, here: http://article.gmane.org/gmane.linux.kernel/1429863). > The third problem is that in order to verify that all the code motion > doesn't actually introduce a bug, you have to wade through about seven > patches ... the patch split really isn't at all conducive to reviewing > this critical piece. Not true. Each patch can be reviewed independently. The big and hard-to- review movement is all in patch 2. The next patches can be tedious to review, but that does not meen hard. Just do "grep ^[-+].*sgio_bitmap_set patchNN | sort -k2 -k1" for example > Finally, the patch for the feature I think you actually want, which is > 13/14, could have been implemented fairly simply as a single patch and > doesn't have to be part of this series. It was, and it was ignored. I sent it together because of the common dependency on the first patch. However, it is not the only feature I need; that patch should be just for things like reservations or vendor-specific commands. I also need that SG_IO works well without any privileges, neither CAP_SYS_RAWIO (needed for a process to bypass the whitelist) neither CAP_SYS_ADMIN (needed for a process to disable the whitelist for others as in patch 13/14). I need that at least for disks, tapes and media changers. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 15:41, Tejun Heo ha scritto: > On Wed, May 22, 2013 at 12:23:56PM +0200, Paolo Bonzini wrote: >> Yes, because I have no idea what _your_ point is. > > Isolate the actual fixes and just submit them as it seems impossible > for you to provide proper justifications for the things you want to > add. Quoting myself on January 26, 2013: "The vast majority of the commands are added because Linux itself is using them". Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 16:30, Tejun Heo ha scritto: > * Separate fixes from additions. Transform existing code so that the > visible behavior doesn't change but the required fix can be > implemented on top. Explicitly note what's going on in the commit > messages. Been there, done that. Have you read the commit messages *at all*? Patch 2: "Right now, there is still just one bitmap and the mask is ignored, so there is no semantic change yet". Patch 3: "This patch already introduces some semantic change, albeit very limited; [...] a few commands are now forbidden for devices of type other than TYPE_ROM, where they are reserved or vendor-specific". > * Fix the frigging CVE bug that you've been waving around and do > *just* that. Perhaps you've missed that this is v2, not v27. Tell me a place in the original review where you've told me to split the series. > * Add the frigging "count me out" feature that you want for your use > case. It isn't controversial and is what you need and the > maintainer can apply to the point where [s]he thinks acceptable. Sure. Except I had sent it six months ago, and it lied unreviewed despite your acks for two months. It is in the archives waiting to be picked up. > * If for whatever reason you have to add more command codes to the > exception table, do them with explicit justifications. How the hell > "the vast majority of the commands are added because Linux itself is > using them" a proper justification? How are they used for what > reason and why is adding them beneficial? For example, WRITE SAME is used to discard blocks. If a Linux guest wants to discard blocks, it may send WRITE SAME. If a disk advertises support for WRITE SAME, it is not nice if WRITE SAME then fails because of a stupid whitelist that was designed for CD-ROMs. And another disk instead works because it uses UNMAP instead of WRITE SAME. It's a support nightmare, perhaps the cause is obvious to me by the time it reaches me, but that takes time---which is wasted time for everyone involved. > How many times have I > asked you to give at least some useful use cases? And WTF is "vast > majority", what about others then? The others are not in v2, or are sent by udev, or were added to the standard for a reason and applications are using them (e.g. COMPARE AND WRITE). > Why do you need this at all if > you have the "count me out" knob in the first place? Because the "count me out" knob needs privileges. It opens up way, way more than what these patches do. And the frigging patch to make the whole whitelisting userspace-configurable with finer-grain (but still with appropriate capabilities) was nacked. By you, for God's sake. http://permalink.gmane.org/gmane.linux.kernel/1378071 > You first > built that command list by scanning the spec and just adding the > commands that seemed "right" to you. And then in v2 I stated that I removed some disk commands because of discussion with you. But of course you don't know, because you've not read the damn commit messages. > I have near-zero confidence in > your perception of the relationship between the specs and actual > world. Thankyouverymuch. Perhaps you should have read the commit messages (oh sorry, have I said it already?) and seen that it's not about commands that seemed "right": Only commands that affect the medium are added. Commands that affect other state of the LUN are all privileged, with the sole exception of START STOP UNIT (which has always been allowed for all file descriptors. I do not really agree with that and it's probably an artifact of when /dev/cdrom had r--r--r-- permissions, but I'm not trying to change that. > So, stop quoting and repeating yourself. You're overdoing yourself on > that department already. Try to listen and understand for a change. Guy, calm down. We're two. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 17:03, Theodore Ts'o ha scritto: > Paolo, > > I'll probably regret butting my head into this, but it might be > helpful if you talk about your particular use case which is driving > your desire to make these changes. Ted, thank you very much. I understand that my discussion with Tejun is leading nowhere, and any outside help can only improve the situation. I hope you won't regret it. :) > For example, what do you think the > SG_IO whitelist _should_ be used, and why should it be made more > general? What's the use case that is being impaired by the current > state of how sg_io whitelists are being handled? Thanks for writing the questions down. I hope you don't mind the verbosity; perhaps we can use the opportunity to rewind the discussion. First of all, I'll note that SG_IO and block-device-specific ioctls both have their place. My usecase for SG_IO is virtualization, where I need to pass information from the LUN to the virtual machine with as much fidelity as possible if I choose to virtualize at the SCSI level. In that case, I'll use SG_IO. If people are okay with virtualizing at a higher-level, I'll use ioctls(BLK*) or fallocate (it depends on whether my target is a block device or a file). It depends on the application, the user, the context,... But in general, a program that cares about things like sense data must use SG_IO, a program that only cares about high-level concepts will use the ioctl. Now, SG_IO whitelist started so that you could "burn CDs without being root". But that can be extended to other device types; the SG_IO whitelist provides a way to allow low-level operations on devices while ensuring: a) respect of file permissions and no need for special privileges; b) no disruption of the devices or the storage fabric (for disks, no disruption beyond what would be possible anyway with read/write or other ioctls). There are some non-disk devices that (like CD-ROMs) have their own set of commands and can only be accessed via /dev/sg, for example media changers. Tapes also have some functionality that is not accessible via /dev/st. It would be useful (for virt, but I'm sure there are other usecases) to make the common uses of these devices possible without privileges, just by granting the appropriate permissions to the uids who will operate on them. This is why the SG_IO whitelist should be widened for these devices. But even for block devices, it should be made more general because the limitations in the current list are not really justified, except as an artifact of how the whitelist developed (again, "burn CDs without being root"). I think that there's no reason to forbid unprivileged programs from doing with SG_IO what they could do with other ioctls. By extension, if it makes sense to add an unprivileged ioctl in the future (e.g. for atomic compare and write) it should also be available from now via SG_IO. > Secondly, when you are trying to get a security vulnerability fixed, > it's helpful if you give the precise nature of the problem, and what > the an attacker can do with it. I think you are worried that if an > attacker has read-only access, they can still send the UNMAP command > which may (since it is advisory) result in a block no longer > containing valid data, such that a read will return zero's or some > other undefined garbage. Yes? Yes. > Now consider that if this is a high-priority fix, it's important to > make the patch as small as possible, since distributions (like your > employer) may want to backport the patch to older kernels. And > distribution release engineers will appreciate things if the patch is > as small as possible, making the _minimum_ necessary changes to fix > said security exposure. Generally, a series of 14 patches is __not__ > the minimum necessary patch. It is not a high-priority fix, or Red Hat would have agreed on an embargo date with other distros, and done all the security stuff that they routinely do. Still, it is a fix that I would like to get in, if only because Red Hat's policy is to get things upstream as much as possible. In fact, I would have very much preferred to get things upstream first. Unfortunately, this was not really possible in this case; FWIW, the RHEL kernel already has something very similar to the first 4 patches in v1 of this series. The reason for posting it all together, it's that frankly getting patches into block/ is an absolute pain. I know it's not a good reason, and I have certainly compounded my own mistakes. But guys, you have a review problem if things are allowed to sit in the mailing list for two months without a single reply. It's really the first time (and I've been working on free software for a _long_ time, as both volunteer and employee, maintainer and contributor) that I felt the helpless because I was not in the frequent contributors "clique". I know that's just an impression and doesn't match reality, but emotions do count and they make it really hard for
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 16:07, Paolo Bonzini ha scritto: >> Finally, the patch for the feature I think you actually want, which is >> 13/14, could have been implemented fairly simply as a single patch and >> doesn't have to be part of this series. > > It was, and it was ignored. I sent it together because of the common > dependency on the first patch. > > However, it is not the only feature I need; that patch should be just for > things > like reservations or vendor-specific commands. I also need that SG_IO works > well without any privileges, neither CAP_SYS_RAWIO (needed for a process to > bypass the whitelist) neither CAP_SYS_ADMIN (needed for a process to disable > the whitelist for others as in patch 13/14). I need that at least for disks, > tapes and media changers. In fact, I'd much rather go back to userspace-configurable filters (http://thread.gmane.org/gmane.linux.scsi/77783/focus=1378071); then policy can be implemented entirely in userspace based on INQUIRY data. There was a patch here for the bitmaps: http://permalink.gmane.org/gmane.linux.kernel/1378071 IIRC turning it into a full implementation requires exposing the queue parameters for all SCSI devices in sysfs, even for those devices that are not visible as block devices. (Again IIRC) non-block devices such as tapes do not have a /sys/bus/scsi/devices/h:c:i:l/block directory, hence they also have no block/queue to export the whitelist knobs. You can then add a kernel config knob that would enable only the bare minimum set of commands (basically INQUIRY, REPORT LUNS, TEST UNIT READY; maybe also some of these: REQUEST SENSE, START STOP UNIT, MODE SENSE, LOG SENSE). If people know that udev (or something else) is new enough to know how to initialize the required bitmaps, they can enable the knob. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 18:32, Martin K. Petersen ha scritto: >>>>>> "Paolo" == Paolo Bonzini writes: > > Paolo> First of all, I'll note that SG_IO and block-device-specific > Paolo> ioctls both have their place. My usecase for SG_IO is > Paolo> virtualization, where I need to pass information from the LUN to > Paolo> the virtual machine with as much fidelity as possible if I choose > Paolo> to virtualize at the SCSI level. > > Now there's your problem! Several people told you way back that the SCSI > virt approach was a really poor choice. The SG_IO permissions problem is > a classic "Doctor, it hurts when I do this". Unfortunately, it's not me who does this; I'm the doctor who was told it hurts. You have hardware providers selling cloud services that want to run their own custom backup services from within a VM, which entails having vendor-specific commands run from within a VM. Or you have people that run clusters that are half-physical and half-virtual and want to use the same /dev/disk/by-id paths in both cases; perhaps, with NPIV, they want to use one zoning approach for both physical and virtual machines. Someone else they want to backup to tapes from a VM (for example s390 people who just put everything in a VM, so the distinction of physical and virtual makes no sense for them). Some people use virtual machines as sandboxes, and want to burn the ISOs from the same VMs where they download the ISOs. Some people have vendor utilities that only run under Windows, and want to run them in a VM. Yes, it hurts when they do this. But I'm not really in the position to say "don't do that", especially if the reaction would be to pick another hypervisor than KVM. > The kernel's fundamental task is to provide abstraction between > applications and intricacies of hardware. The right way to solve the > problem would have been to provide a better device abstraction built on > top of the block/SCSI infrastructure we already have in place. If you > need more fidelity, add fidelity to the block layer instead of punching > a giant hole through it. That would require implementing: - a interface to get rich error information - a bunch of ioctls or syscalls to expose every single command, for example extended copy or reservations - a ioctl interface to media changers, similar to /dev/st - ??? and what to do about vendor specific commands, etc.? With all this to be done in the kernel, having an implementation of tape and media changer SCSI targets in the virtual machine monitor ends up being the easiest part. Some of these two subtopics have been the subject of proverbially many LWN.net articles. The implications on userspace ABI are immense, and so is the complexity of the task. There is more than a temptation to take a shortcut, and it's not by chance IMO that all of VMware, Hyper-V and Xen did the same (though for Xen it's not upstream) before me. > I seem to recall that reservations were part of your motivation for > going the SCSI route in the first place. Reservations are the main motivation for the possibility to bypass the whitelist, the other being vendor-specific commands. UNMAP/WRITE SAME/COMPARE AND WRITE are the main motivation for per-class whitelists. > A better approach would have > been to create a generic reservations mechanism that could be exposed to > the guest. And then let the baremetal kernel worry about the appropriate > way to communicate with the physical hardware. Just like we've done with > reads and writes, discard, write same, etc. I agree for some cases. For example, I did mean to send a patch to add ioctls for BLKPING (test unit ready) and BLKCMPXCHG. I haven't done yet also because the huge latency in the review of this series wasn't exactly encouraging me. But the kernel is not the right place to provide a C API wrapper for the whole SCSI standard. > The fact that burning CDs requires SG_IO in the first place is just a > symptom that we got that interface totally wrong. cat iso.img > /dev/sr0 > would have been much more in line with how Unix works... In theory, but how do you do things like formatting, picking one of the gazillion burning methods or media? Reality is, you'd have a bunch of ioctls and a program to use them. "cat /dev/sg2 > page.jpg" would also be nice for scanners, but are you going to put SANE into the kernel? Again, the SG_IO shortcut is a necessity more than a temptation. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 20:11, Theodore Ts'o ha scritto: > On Wed, May 22, 2013 at 07:00:14PM +0200, Paolo Bonzini wrote: >> You have hardware providers selling cloud services that want to run >> their own custom backup services from within a VM, which entails having >> vendor-specific commands run from within a VM. Or you have people that >> run clusters that are half-physical and half-virtual and want to use the >> same /dev/disk/by-id paths in both cases; perhaps, with NPIV, they want >> to use one zoning approach for both physical and virtual machines. >> Someone else they want to backup to tapes from a VM (for example s390 >> people who just put everything in a VM, so the distinction of physical >> and virtual makes no sense for them). Some people use virtual machines >> as sandboxes, and want to burn the ISOs from the same VMs where they >> download the ISOs. Some people have vendor utilities that only run >> under Windows, and want to run them in a VM. > > So is this hypothetical or do you have a real customer in mind? All of these come from real customers. > If it's not theoretical, how does the cloud service control who has > access to the CD burner, and how are the disks loaded into the CD > burner? CD burning would be used in a VM that runs on your local workstation, so the VM gets access to the CD burner under your desk. There was also a developer of a CD burning tool that wanted to test it inside BSD, Solaris and Windows VMs; the idea is the same. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 22:19, Theodore Ts'o ha scritto: > On Wed, May 22, 2013 at 09:37:54PM +0200, Paolo Bonzini wrote: >>> If it's not theoretical, how does the cloud service control who has >>> access to the CD burner, and how are the disks loaded into the CD >>> burner? >> >> CD burning would be used in a VM that runs on your local workstation, so >> the VM gets access to the CD burner under your desk. There was also a >> developer of a CD burning tool that wanted to test it inside BSD, >> Solaris and Windows VMs; the idea is the same. > > So in both cases all of the VM's and the host OS are within the same > trust boundary. This simplifies the security requirements than in the > more generic cloud server caser where the VM's are mutually > suspicious. This simplifies the requirements of what we need to push > into the kernel, yes? What do you mean by "push into the kernel"? (Anyway the CD burner case is really the only one that the current whitelist covers completely. I was just listing it as a use case for SG_IO in the context as virtualization). Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 22:39, Tejun Heo ha scritto: > Hey, > > On Wed, May 22, 2013 at 05:53:34PM +0200, Paolo Bonzini wrote: >> I do listen to review feedback, but I also expect the other side to >> listen to me, ask me what is not clear, and possess some knowledge of >> the domain that he's reviewing patches for. All of which, quite >> frankly, I have not seen in this case. > > Heh, nice one. As we've talked on RH mailing list, I don't doubt this > has been a pretty bad experience for you but it also has been one of > the worst review experiences for me too. I can imagine. Sorry about that. > I'm not saying you don't listen to reviews at all but the reception > definitely feels very low-gain. Frankly, I can say the same with s/reviews/explanations/... > Anyways, at this point, the easiest way to make forward progress is > completely separating out security fix from the rest along with the > "count me out" knob, which should be able to cover most of the > described use cases anyway. Let's please do further modifications to > the filtering table as a separate step. Okay, we seem to have reverted (both of us) to a more civil tone, and I appreciate setting a way forward. I'll send three separate series in the next few days. I guess it's okay to send the common patch twice. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 22/05/2013 21:30, Tejun Heo ha scritto: > The thing is that the behavior change is now implemented in an > inactive form by #2 and then flipped on by #3. #2 both change the > format and the content of the table. This should have been like the > following. > > #2: Convert to the new table for mat but set all bits for all commands > as that's what the old table looks like in the new format. Ok, that's easy. I will still need a bulk change like s/__set_bit/sgio_bitmap_set/g, but it should still be relatively trivial. > #3: Implement per-type filtering. Now this wouldn't change any > behavior as the new table is the same as the old one. This can be > rolled into #2. Ok, I prefer to keep it separate though. > #4: Update the filtering table to fix the security issue and *just* > the identified security issue with UNMAP. This way, the behavior > change is limited to the identified security issue and distros can > backport upto this to address the security issue. > > #5: Update the filtering table further so that commands which don't > make sense for the class are not allowed. Note that this has some > possibility of breaking existing users and we do want this in a > separate patch from #4. Sure, I can do that. Compared to the way I split the patches, it would have "patch 3" and "patch 4" reversed. > Anyways, my point was that you wanna > put that right after the security fix, not at the end of the series > with contentious changes in the middle. That way, the security fix > and the part that your use case requires can be picked up separately. > In fact, at this point, it chould be helpful to make them separate > patchset. I would just make it a separate series, since it isn't contentious. > So, as I've said multiple times, let's *please* do this case-by-base - > build the command table gradually with explicit use cases at each > stage because SG_IO is very easy to get wrong for both the users and > the hardware devices and the kernel isn't policing the content of > what's being issued, because it is very nice to be able to know > exactly the intents behind such table contents later, and because we > wanna debate whether we even want to support specific use cases. It's > one thing to allow WRITE SAME passthrough and it's a different thing > to try to allow all commands which may be issued by udev. Ok, so I can split it in 10 patches one per command, but at some point I wonder if it is overkill. For example, for disks: - WRITE AND VERIFY(16) is needed to support >2TB disks, and the corresponding 12-byte CDB is whitelisted already. I didn't get reports about _these_ command but I do get bug reports about >2TB disks. SYNCHRONIZE CACHE(16) is similarly the 16-byte extension of another 10-byte command. - I added ATA PASS-THROUGH(16) because ATA PASS-THROUGH(12) is present; using the (16) version is preferrable because (12) conflicts with the destructive MMC command BLANK, see the sg_sat_identify man page. - WRITE SAME(16), WRITE SAME(10), UNMAP are needed for discard. - COMPARE AND WRITE is used by cluster software. Should this be four separate patches? Or is it enough to write this in the commit message? I honestly find this level of detail to be way way higher than the normal requirement of kernel patches, but I'm happy to comply. >>> Why do you need this at all if >>> you have the "count me out" knob in the first place? >> >> Because the "count me out" knob needs privileges. It opens up way, way > > So does giving out access to the device node itself. No, it doesn't. You can use SCM_RIGHTS, and pass a file descriptor for the device node to an unprivileged program. You can choose the users/groups that are allowed to access the device. In either case, the privileged action is limited in time or in scope. The count-me-out knob affects all processes that use the device node, and won't be cleaned up properly if you SIGKILL the (privileged) process that sets it. So if you can avoid it, you should. > While it could be cumbersome, requiring privileges there for arbitration > isn't a new thing. > >> more than what these patches do. > > So, your use case wouldn't work with this? There are many use cases, I listed some in my reply to Martin. Sometimes you have trust over the guest and can use count-me-out. But in some cases you don't, and yet the current whitelist is not enough (e.g. tapes). > I did read. Here's the patch description from patch 5. > > Start cleaning up the table, moving out of the way four rare & > obsolete device types: printers, communication devices (network > cards) and scanners (both TYPE_PROCESSOR and TYPE_SCANNER). Add > missing commands for these four device types. > > No real discussion or justifications for those "missing commands". For scanners, the justification is in the patch: I actually looked at SANE and listed each single command that is used. For the printer commands, there is exactly one additio
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 23/05/2013 00:17, Tejun Heo ha scritto: > Then let's make it fit the use case better. I really can't see much > point in crafting the cdb filter when you basically have to entrust > the device to the user anyway. Let's either trust the user with the > device or not. I'm very doubtful that the filtered access via SG_IO > can be reliable or secure enough. Let's please avoid extending a > broken thing. Sorry to say that, but "I'm very doubtful that..." is just conspiracy theory. It is not broken. I'm not _that_ clueless, if it were broken I wouldn't have had users use it in production. > One more thing, is it really necessary to have finer granularity than > provided by file permissions? What would be the use case? Do you > expect to have multiple - two - differing levels of access with and > without SG_IO? No, I don't. I want four levels: 1) no access; 2) read-only access; 3) read-write whitelisted access; 4) generic access; but it's indeed fine to assume that 3 and 4 will never be given together to the same disk. The important point is that 2 and 3 should not require any privileges except for opening the file. With the opt-out knob, you still need a long-lived privileged process in order to set the knob back to "no access", and that's undesirable. Long-lived privileged processes can be SIGKILLed and leave things open for misuse; instead, if I need something privileged I want to confine it to a helper that opens the file and passes back the file descriptor. > for the same user, it's pointless to give out SG_IO access to > processes while denying for other processes. As long as ptrace can > be attached, hijacking such fd is easy. Making it per-device should > be suitable enough, no? Yes, and that's what I did. Such hijacking is why a kernel whitelist is necessary in untrusted cases (i.e. you cannot just implement it in userspace). >> There are many use cases, I listed some in my reply to Martin. >> Sometimes you have trust over the guest and can use count-me-out. But >> in some cases you don't, and yet the current whitelist is not enough >> (e.g. tapes). > > Can you elaborate? Why can't a tape device be entrusted to the user? In general, any device may or may not be entrusted to the user. In this respect, tapes or disks have no difference. But while the current whitelist is almost okay for disks, it is not usable for tapes. Too many essential commands are missing; this is why extending the whitelist to cover other device types is important for me. And since you don't want to open new commands to all classes with no distinction (which I understand), the only choice is per-class whitelists. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 23/05/2013 11:02, Tejun Heo ha scritto: > On Thu, May 23, 2013 at 09:45:42AM +0200, Paolo Bonzini wrote: >> Il 23/05/2013 00:17, Tejun Heo ha scritto: >>> Then let's make it fit the use case better. I really can't see much >>> point in crafting the cdb filter when you basically have to entrust >>> the device to the user anyway. Let's either trust the user with the >>> device or not. I'm very doubtful that the filtered access via SG_IO >>> can be reliable or secure enough. Let's please avoid extending a >>> broken thing. >> >> Sorry to say that, but "I'm very doubtful that..." is just conspiracy >> theory. >> >> It is not broken. I'm not _that_ clueless, if it were broken I wouldn't >> have had users use it in production. > > No no, I'm not talking about it not working for the users - it's just > passing the commands, it of course works. I'm doubting about it being > a worthy security isolation layer. cdb filtering (of any form really) > has always been something on the border. It always was something we > got stuck with due to lack of other immediate options. I understood correctly then. :) I agree it's not the best, but it's not completely broken either. It has bugs, and that's what these patches try to fix. >>> One more thing, is it really necessary to have finer granularity than >>> provided by file permissions? What would be the use case? Do you >>> expect to have multiple - two - differing levels of access with and >>> without SG_IO? >> >> No, I don't. I want four levels: >> >> 1) no access; >> 2) read-only access; >> 3) read-write whitelisted access; >> 4) generic access; >> >> but it's indeed fine to assume that 3 and 4 will never be given together >> to the same disk. The important point is that 2 and 3 should not >> require any privileges except for opening the file. > > I'm wondering whether combining 3 into 4 would be good enough. No, it wouldn't. I learnt it the hard way (by having a patch nacked :)) at http://thread.gmane.org/gmane.linux.kernel/1311887. >> With the opt-out knob, you still need a long-lived privileged process in >> order to set the knob back to "no access", and that's undesirable. >> Long-lived privileged processes can be SIGKILLed and leave things open >> for misuse; instead, if I need something privileged I want to confine it >> to a helper that opens the file and passes back the file descriptor. > > Hmmm? Somebody has to give out the access rights anyway, be it a udev > rule or polkit. While not having to depend on them could be nice, I > don't think involving userland is a big deal. It's already involved > in closely related capacities anyway. Polkit need not do anything to give out the access rights, it only has to just confirm the credentials. A helper setgid-disk can consult polkit, open the file, pass the file descriptor back, and exit. We already do something similar for tap devices. >>>> There are many use cases, I listed some in my reply to Martin. >>>> Sometimes you have trust over the guest and can use count-me-out. But >>>> in some cases you don't, and yet the current whitelist is not enough >>>> (e.g. tapes). >>> >>> Can you elaborate? Why can't a tape device be entrusted to the user? > > I was curious why they wouldn't be able to use count-me-out knob. The > white list was nasty but we did it anyway because there were some > horrible commands which, ISTR, could even affect other devices sharing > the bus and at least at the beginning the list of commands which > should be allowed were pretty compact. Exactly. Though these commands do not really affect other devices sharing the bus, they affect other initiators trying to use the device. And these commands are generic, so they apply to disks as well as tapes. Unfortunately, a purely userspace whitelist would be too easy to circumvent. > Whitelisting those commands is likely to be mostly harmless most of > the time but I would be surprised if there aren't devices out in the > wild which can be bricked / affected adversely with carefully crafted > commands within the allowed cdbs. The level of security isolation > which can be provided by command code filtering is somewhat flimsy, > which is why cdb filtering was always seen as a kludge, which in turn > is why there's reluctance against extending it to more use cases. Yes, I understand that. But I don't believe it is a serious concern. In the case of tapes, for example, you're not talking about 10
[PATCH v3 part1 3/4] sg_io: use different default filters for each device class
Store the filters in a 256-entry array, and pick an appropriate filter for SCSI devices. Apart from SCSI disks, SG_IO is supported for CCISS, ide-floppy and virtio-blk devices; TYPE_DISK (which is zero, i.e. the default) is more appropriate for these devices than TYPE_ROM. However, all lists are still the same, so there is no semantic change in this patch. Cc: sta...@gnu.org Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 14 +- drivers/scsi/scsi_scan.c | 2 ++ include/linux/blkdev.h | 2 +- 3 files changed, 8 insertions(+), 10 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 21ddf17..6e18156 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -35,8 +35,8 @@ #include struct blk_cmd_filter { - unsigned long read_ok[BLK_SCSI_CMD_PER_LONG]; - unsigned long write_ok[BLK_SCSI_CMD_PER_LONG]; + u32 read_ok[BLK_SCSI_MAX_CMDS]; + u32 write_ok[BLK_SCSI_MAX_CMDS]; }; static struct blk_cmd_filter blk_default_cmd_filter; @@ -117,7 +117,7 @@ static int sg_emulated_host(struct request_queue *q, int __user *p) static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) { #define sgio_bitmap_set(cmd, rw) \ - __set_bit((cmd), filter->rw##_ok) + filter->rw##_ok[(cmd)] = ~0; /* Basic read-only commands */ sgio_bitmap_set(TEST_UNIT_READY, read); @@ -210,16 +210,12 @@ int blk_verify_command(struct request_queue *q, if (capable(CAP_SYS_RAWIO)) return 0; - /* if there's no filter set, assume we're filtering everything out */ - if (!filter) - return -EPERM; - /* Anybody who can open the device can do a read-safe command */ - if (test_bit(cmd[0], filter->read_ok)) + if (filter->read_ok[cmd[0]] & (1 << q->sgio_type)) return 0; /* Write-safe commands require a writable open */ - if (test_bit(cmd[0], filter->write_ok) && has_write_perm) + if (has_write_perm && filter->write_ok[cmd[0]] & (1 << q->sgio_type)) return 0; return -EPERM; diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 3e58b22..86940f3 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -782,6 +782,8 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result, sdev->removable = (inq_result[1] & 0x80) >> 7; } + sdev->request_queue->sgio_type = sdev->type; + switch (sdev->type) { case TYPE_RBC: case TYPE_TAPE: diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4fca347..5e18969 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -257,7 +257,6 @@ struct blk_queue_tag { }; #define BLK_SCSI_MAX_CMDS (256) -#define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8)) struct queue_limits { unsigned long bounce_pfn; @@ -410,6 +409,7 @@ struct request_queue { */ unsigned intsg_timeout; unsigned intsg_reserved_size; + unsigned char sgio_type; int node; #ifdef CONFIG_BLK_DEV_IO_TRACE struct blk_trace*blk_trace; -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 part1 4/4] sg_io: resolve conflicts between commands assigned to multiple classes (CVE-2012-4542)
Some SCSI commands can be sent to disks via SG_IO even by unprivileged users. Unfortunately, some opcodes overlap across SCSI device classes and have different meanings for different classes. Four of them can be used for read-only file descriptors on MMC, but should be limited to descriptors opened for read-write on SBC: - READ SUBCHANNEL <-> UNMAP (destructive, but no control on written data) - GET PERFORMANCE <-> ERASE (not really a problem, no one supports ERASE anyway) - READ DISC INFORMATION <-> XPWRITE (not commonly implemented but most dangerous) - PLAY AUDIO TI <-> SANITIZE (a very new command) In addition, REPORT KEY's opcode A4h is used in SPC for SET TARGET PORT GROUPS and various other management commands, and should be blocked for everything except CD-ROMs and the like. To fix this, the series modifies the bitmap entries for these five commands. This is the smallest change that fixes this bug. Cc: sta...@gnu.org Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 6e18156..7a1d9f6 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -199,6 +199,32 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(GPCMD_SET_STREAMING, write); sgio_bitmap_set(GPCMD_SET_READ_AHEAD, write); #undef sgio_bitmap_set + + /* +* Treat specially those commands that have a different meaning +* for disks: READ SUBCHANNEL conflicts with UNMAP. +*/ + filter->read_ok[GPCMD_READ_SUBCHANNEL] &= ~(1 << TYPE_DISK); + filter->write_ok[GPCMD_READ_SUBCHANNEL] |= (1 << TYPE_DISK); + + /* PLAY AUDIO TI conflicts with SANITIZE. */ + filter->read_ok[GPCMD_PLAY_AUDIO_TI] &= ~((1 << TYPE_DISK) | (1 << TYPE_RBC)); + filter->write_ok[GPCMD_PLAY_AUDIO_TI] |= (1 << TYPE_DISK) | (1 << TYPE_RBC); + + /* READ DISC INFORMATION conflicts with XPWRITE. */ + filter->read_ok[GPCMD_READ_DISC_INFO] &= ~(1 << TYPE_DISK); + filter->write_ok[GPCMD_READ_DISC_INFO] |= (1 << TYPE_DISK); + + /* GET PERFORMANCE conflicts with ERASE. */ + filter->read_ok[GPCMD_GET_PERFORMANCE] &= ~(1 << TYPE_MOD); + filter->write_ok[GPCMD_GET_PERFORMANCE] |= (1 << TYPE_MOD); + + /* +* REPORT KEY conflicts with many management commands under operation +* code 0xA4, enable it only for MMC devices. +*/ + filter->read_ok[GPCMD_REPORT_KEY] = (1 << TYPE_ROM); + filter->write_ok[GPCMD_REPORT_KEY] = (1 << TYPE_ROM); } int blk_verify_command(struct request_queue *q, -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 part3 1/7] sg_io: complete transition to per-class lists of allowed commands
After this patch, a few commands are forbidden for devices of type other than TYPE_ROM, where they are reserved, vendor-specific. This avoids that future version of the standards introduce unwanted conficts. One command (READ CAPACITY) was listed twice in the old table (once as READ_CAPACITY, once as GPCMD_READ_CDVD_CAPACITY), hence the new table has one entry less than the old one. The old table has 72 entries, the new one has 71. Note that TYPE_SCANNER is not subject to the whitelist, so it is not included at all. Of course, checkpatch hates this table. It has long lines and non-standard spacing. IMO the improved readability trumps the problems reported by checkpatch. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 207 +++- include/scsi/scsi.h | 3 + 2 files changed, 128 insertions(+), 82 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 0af98fe..5d2143f 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -116,88 +116,131 @@ static int sg_emulated_host(struct request_queue *q, int __user *p) static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) { -#define sgio_bitmap_set(cmd, rw) \ - filter->rw##_ok[(cmd)] = ~0; - - /* Basic read-only commands */ - sgio_bitmap_set(TEST_UNIT_READY, read); - sgio_bitmap_set(REQUEST_SENSE, read); - sgio_bitmap_set(READ_6, read); - sgio_bitmap_set(READ_10, read); - sgio_bitmap_set(READ_12, read); - sgio_bitmap_set(READ_16, read); - sgio_bitmap_set(READ_BUFFER, read); - sgio_bitmap_set(READ_DEFECT_DATA, read); - sgio_bitmap_set(READ_CAPACITY, read); - sgio_bitmap_set(READ_LONG, read); - sgio_bitmap_set(INQUIRY, read); - sgio_bitmap_set(MODE_SENSE, read); - sgio_bitmap_set(MODE_SENSE_10, read); - sgio_bitmap_set(LOG_SENSE, read); - sgio_bitmap_set(START_STOP, read); - sgio_bitmap_set(GPCMD_VERIFY_10, read); - sgio_bitmap_set(VERIFY_16, read); - sgio_bitmap_set(REPORT_LUNS, read); - sgio_bitmap_set(SERVICE_ACTION_IN, read); - sgio_bitmap_set(RECEIVE_DIAGNOSTIC, read); - sgio_bitmap_set(MAINTENANCE_IN, read); - sgio_bitmap_set(GPCMD_READ_BUFFER_CAPACITY, read); - - /* Audio CD commands */ - sgio_bitmap_set(GPCMD_PLAY_CD, read); - sgio_bitmap_set(GPCMD_PLAY_AUDIO_10, read); - sgio_bitmap_set(GPCMD_PLAY_AUDIO_MSF, read); - sgio_bitmap_set(GPCMD_PLAY_AUDIO_TI, read); - sgio_bitmap_set(GPCMD_PAUSE_RESUME, read); - - /* CD/DVD data reading */ - sgio_bitmap_set(GPCMD_READ_CD, read); - sgio_bitmap_set(GPCMD_READ_CD_MSF, read); - sgio_bitmap_set(GPCMD_READ_DISC_INFO, read); - sgio_bitmap_set(GPCMD_READ_CDVD_CAPACITY, read); - sgio_bitmap_set(GPCMD_READ_DVD_STRUCTURE, read); - sgio_bitmap_set(GPCMD_READ_HEADER, read); - sgio_bitmap_set(GPCMD_READ_TRACK_RZONE_INFO, read); - sgio_bitmap_set(GPCMD_READ_SUBCHANNEL, read); - sgio_bitmap_set(GPCMD_READ_TOC_PMA_ATIP, read); - sgio_bitmap_set(GPCMD_REPORT_KEY, read); - sgio_bitmap_set(GPCMD_SCAN, read); - sgio_bitmap_set(GPCMD_GET_CONFIGURATION, read); - sgio_bitmap_set(GPCMD_READ_FORMAT_CAPACITIES, read); - sgio_bitmap_set(GPCMD_GET_EVENT_STATUS_NOTIFICATION, read); - sgio_bitmap_set(GPCMD_GET_PERFORMANCE, read); - sgio_bitmap_set(GPCMD_SEEK, read); - sgio_bitmap_set(GPCMD_STOP_PLAY_SCAN, read); - - /* Basic writing commands */ - sgio_bitmap_set(WRITE_6, write); - sgio_bitmap_set(WRITE_10, write); - sgio_bitmap_set(WRITE_VERIFY, write); - sgio_bitmap_set(WRITE_12, write); - sgio_bitmap_set(WRITE_VERIFY_12, write); - sgio_bitmap_set(WRITE_16, write); - sgio_bitmap_set(WRITE_LONG, write); - sgio_bitmap_set(WRITE_LONG_2, write); - sgio_bitmap_set(ERASE, write); - sgio_bitmap_set(GPCMD_MODE_SELECT_10, write); - sgio_bitmap_set(MODE_SELECT, write); - sgio_bitmap_set(LOG_SELECT, write); - sgio_bitmap_set(GPCMD_BLANK, write); - sgio_bitmap_set(GPCMD_CLOSE_TRACK, write); - sgio_bitmap_set(GPCMD_FLUSH_CACHE, write); - sgio_bitmap_set(GPCMD_FORMAT_UNIT, write); - sgio_bitmap_set(GPCMD_REPAIR_RZONE_TRACK, write); - sgio_bitmap_set(GPCMD_RESERVE_RZONE_TRACK, write); - sgio_bitmap_set(GPCMD_SEND_DVD_STRUCTURE, write); - sgio_bitmap_set(GPCMD_SEND_EVENT, write); - sgio_bitmap_set(GPCMD_SEND_KEY, write); - sgio_bitmap_set(GPCMD_SEND_OPC, write); - sgio_bitmap_set(GPCMD_SEND_CUE_SHEET, write); - sgio_bitmap_set(GPCMD_SET_SPEED, write); - sgio_bitmap_set(GPCMD_PREVENT_ALLOW_MEDIUM_REMOVAL, write); - sgio_bitmap_set(GPCMD_LOAD_UNLOAD, write); - sgio_bitmap_set(GPCMD_SET_STREAMING, write); - sg
[PATCH v3 part3 3/7] sg_io: cleanup and complete whitelist for rare device types
Start cleaning up the table, moving out of the way four rare & obsolete device types: printers, communication devices (network cards), and processor devices. This patch is included mostly for tidiness, so that flags for obsolete device types do not clutter the other entries. However, it adds two commands for printers: SLEW AND PRINT and SYNCHRONIZE BUFFER. These are present even in SCSI-2 (dated 1994). Processor devices could also do EXTENDED COPY operations, but these should not be allowed for unprivileged users because they can read/write to other devices. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 41 - 1 file changed, 32 insertions(+), 9 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 8f0344f..41bbd93 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -142,7 +142,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x03, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* REQUEST SENSE */ sgio_bitmap_set(0x12, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* INQUIRY */ sgio_bitmap_set(0x1A, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* MODE SENSE(6) */ - sgio_bitmap_set(0x1B, D|T|L| W|R|O|M|A| B|K|V|F , read); /* START STOP UNIT */ + sgio_bitmap_set(0x1B, D|T|W|R|O|M|A| B|K|V|F , read); /* START STOP UNIT */ sgio_bitmap_set(0x1C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* RECEIVE DIAGNOSTIC RESULTS */ sgio_bitmap_set(0x2B, D|T|W|R|O|M| K , read); /* SEEK(10) */ sgio_bitmap_set(0x3C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* READ BUFFER */ @@ -161,21 +161,21 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) /* read */ - sgio_bitmap_set(0x08, D|T| P|W| O| C, read); /* READ(6) */ + sgio_bitmap_set(0x08, D|T|W| O, read); /* READ(6) */ sgio_bitmap_set(0x25, D| W|R|O| B|K , read); /* READ CAPACITY(10) */ - sgio_bitmap_set(0x28, D| W|R|O| B|K|C, read); /* READ(10) */ + sgio_bitmap_set(0x28, D| W|R|O| B|K , read); /* READ(10) */ sgio_bitmap_set(0x2F, D| W|R|O, read); /* VERIFY(10) */ sgio_bitmap_set(0x37, D| O|M , read); /* READ DEFECT DATA(10) */ sgio_bitmap_set(0x3E, D| W| O, read); /* READ LONG(10) */ sgio_bitmap_set(0x88, D|T|W| O| B, read); /* READ(16) */ sgio_bitmap_set(0x8F, D|T|W| O| B, read); /* VERIFY(16) */ - sgio_bitmap_set(0xA8, D| W|R|O| C, read); /* READ(12) */ + sgio_bitmap_set(0xA8, D| W|R|O, read); /* READ(12) */ /* write */ - sgio_bitmap_set(0x04, D|T|L|R|O, write); /* FORMAT UNIT */ - sgio_bitmap_set(0x0A, D|T|L|P|W| O| C, write); /* WRITE(6) */ - sgio_bitmap_set(0x2A, D| W|R|O| B|K|C, write); /* WRITE(10) */ + sgio_bitmap_set(0x04, D|T| R|O, write); /* FORMAT UNIT */ + sgio_bitmap_set(0x0A, D|T|W| O, write); /* WRITE(6) */ + sgio_bitmap_set(0x2A, D| W|R|O| B|K , write); /* WRITE(10) */ sgio_bitmap_set(0x2E, D| W|R|O| B|K , write); /* WRITE AND VERIFY(10) */ sgio_bitmap_set(0x35, D| W|R|O| B|K , write); /* SYNCHRONIZE CACHE(10) */ sgio_bitmap_set(0x3F, D| W| O, write); /* WRITE LONG(10) */ @@ -183,11 +183,20 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x48, D| B, write); /* SANITIZE */ sgio_bitmap_set(0x51, D, write); /* XPWRITE(10) */ sgio_bitmap_set(0x8A, D|T|W| O| B, write); /* WRITE(16) */ - sgio_bitmap_set(0xAA, D| W|R|O| C, write); /* WRITE(12) */ + sgio_bitmap_set(0xAA, D| W|R|O, write); /* WRITE(12) */ sgio_bitmap_set(0xAC, O, write); /* ERASE(12) */ sgio_bitmap_set(0xAE, D| W| O, write); /* WRITE AND VERIFY(12) */ sgio_bitmap_set(0xEA, D| W| O, write); /* WRITE_LONG_2 ?? */ + /* printer */ + + sgio_bitmap_set(0x04, L, write); /* FORMAT */ + sgio_bitmap_set(0x0A, L, write); /* PRINT */ + sgio_bitmap_set(0x0B, L, write); /* SLEW AND PRINT */ + sgio_bitmap_set(0x10, L, write); /* SYNCHRONIZE BUFFER */ + sgio_bitmap_set(0x14, L, writ
[PATCH v3 part3 5/7] sg_io: cleanup and complete whitelist for media changers
Besides CD-ROMs, three more device types are interesting for SG_IO: media changers, tapes and of course disks. Starting with this patch, we will whitelist a few more commands for these devices. For media changers, enable "INITIALIZE ELEMENT STATUS" and "REQUEST VOLUME ELEMENT ADDRESS". A few changer-specific commands were already enabled by chance because they overlapped commands that are valid for other classes: "EXCHANGE MEDIUM", "SEND VOLUME TAG", "INITIALIZE ELEMENT STATUS WITH RANGE". In addition, some changers are still using an old pre-standard command for "INITIALIZE ELEMENT STATUS WITH RANGE", which is enabled as well. This makes media changers usable by unprivileged users that have access to the device node. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 27 ++- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index b11ad49..f9e719e 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -142,9 +142,9 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x03, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* REQUEST SENSE */ sgio_bitmap_set(0x12, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* INQUIRY */ sgio_bitmap_set(0x1A, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* MODE SENSE(6) */ - sgio_bitmap_set(0x1B, D|T|W|R|O|M|A| B|K|V|F , read); /* START STOP UNIT */ + sgio_bitmap_set(0x1B, D|T|W|R|O| A| B|K|V|F , read); /* START STOP UNIT */ sgio_bitmap_set(0x1C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* RECEIVE DIAGNOSTIC RESULTS */ - sgio_bitmap_set(0x2B, D|T|W|R|O|M| K , read); /* SEEK(10) */ + sgio_bitmap_set(0x2B, D|T|W|R|O|K , read); /* SEEK(10) */ sgio_bitmap_set(0x3C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* READ BUFFER */ sgio_bitmap_set(0x4D, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* LOG SENSE */ sgio_bitmap_set(0x5A, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* MODE SENSE(10) */ @@ -165,7 +165,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x25, D| W|R|O| B|K , read); /* READ CAPACITY(10) */ sgio_bitmap_set(0x28, D| W|R|O| B|K , read); /* READ(10) */ sgio_bitmap_set(0x2F, D| W|R|O, read); /* VERIFY(10) */ - sgio_bitmap_set(0x37, D| O|M , read); /* READ DEFECT DATA(10) */ + sgio_bitmap_set(0x37, D| O, read); /* READ DEFECT DATA(10) */ sgio_bitmap_set(0x3E, D| W| O, read); /* READ LONG(10) */ sgio_bitmap_set(0x88, D|T|W| O| B, read); /* READ(16) */ sgio_bitmap_set(0x8F, D|T|W| O| B, read); /* VERIFY(16) */ @@ -197,6 +197,23 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x14, L, write); /* RECOVER BUFFERED DATA */ sgio_bitmap_set(0x1B, L, write); /* STOP PRINT */ + /* media changer */ + + sgio_bitmap_set(0x07, M , read); /* INITIALIZE ELEMENT STATUS */ + sgio_bitmap_set(0x1B, M , read); /* OPEN/CLOSE IMPORT/EXPORT ELEMENT */ + sgio_bitmap_set(0x2B, M , read); /* POSITION TO ELEMENT */ + sgio_bitmap_set(0x37, M , read); /* INITIALIZE ELEMENT STATUS WITH RANGE */ + sgio_bitmap_set(0xB5, M , read); /* REQUEST VOLUME ELEMENT ADDRESS */ + + sgio_bitmap_set(0xA6, M , write); /* EXCHANGE MEDIUM */ + sgio_bitmap_set(0xB6, M , write); /* SEND VOLUME TAG */ + + /* +* The standard says that INITIALIZE ELEMENT STATUS WITH RANGE was +* mapped to 0xE7 (vendor-specific) before being standardized. +*/ + sgio_bitmap_set(0xE7, M , read); + /* MMC */ sgio_bitmap_set(0x23, R , read); /* READ FORMAT CAPACITIES */ @@ -230,9 +247,9 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0xA1, D|R|B, write); /* BLANK / ATA PASS-THROUGH(12) */ sgio_bitmap_set(0xA2, R , write); /* SEND EVENT */ sgio_bitmap_set(0xA3, R , write); /* SEND KEY */ - sgio_bitmap_set(0xA6, R| M , write); /* LOAD/UNLOAD C/DVD */ + sgio_bitmap_set(0xA6, R , write); /* LOAD/UNLOAD C/DVD */ sgi
[PATCH v3 part3 6/7] sg_io: cleanup and complete whitelist for tapes
Tapes have no problematic overlap, but quite a few commands are missing that are useful when operating tapes with /dev/sg. This patch adds commands from the SSC standards to the list. I added everything because the current whitelist is totally inapplicable to those devices; the command set is almost completely different from that of disks. The SSC standard is relatively small and does not really have the kitchen sink in them. Commands with similar meanings often have different names or slightly different characteristics for tapes vs. block devices; for example, tapes have no 10- or 12-byte CDBs for READ and WRITE. For this reason I'm separating block devices for tapes entirely. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 47 --- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index f9e719e..0bf0820 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -142,9 +142,9 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x03, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* REQUEST SENSE */ sgio_bitmap_set(0x12, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* INQUIRY */ sgio_bitmap_set(0x1A, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* MODE SENSE(6) */ - sgio_bitmap_set(0x1B, D|T|W|R|O| A| B|K|V|F , read); /* START STOP UNIT */ + sgio_bitmap_set(0x1B, D| W|R|O| A| B|K|V|F , read); /* START STOP UNIT */ sgio_bitmap_set(0x1C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* RECEIVE DIAGNOSTIC RESULTS */ - sgio_bitmap_set(0x2B, D|T|W|R|O|K , read); /* SEEK(10) */ + sgio_bitmap_set(0x2B, D| W|R|O|K , read); /* SEEK(10) */ sgio_bitmap_set(0x3C, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* READ BUFFER */ sgio_bitmap_set(0x4D, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* LOG SENSE */ sgio_bitmap_set(0x5A, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* MODE SENSE(10) */ @@ -159,22 +159,22 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x55, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, write); /* MODE SELECT(10) */ sgio_bitmap_set(0x1E, D|T|W|R|O|M| K| F , write); /* PREVENT ALLOW MEDIUM REMOVAL */ - /* read */ + /* block devices, read */ - sgio_bitmap_set(0x08, D|T|W| O, read); /* READ(6) */ + sgio_bitmap_set(0x08, D| W| O, read); /* READ(6) */ sgio_bitmap_set(0x25, D| W|R|O| B|K , read); /* READ CAPACITY(10) */ sgio_bitmap_set(0x28, D| W|R|O| B|K , read); /* READ(10) */ sgio_bitmap_set(0x2F, D| W|R|O, read); /* VERIFY(10) */ sgio_bitmap_set(0x37, D| O, read); /* READ DEFECT DATA(10) */ sgio_bitmap_set(0x3E, D| W| O, read); /* READ LONG(10) */ - sgio_bitmap_set(0x88, D|T|W| O| B, read); /* READ(16) */ - sgio_bitmap_set(0x8F, D|T|W| O| B, read); /* VERIFY(16) */ + sgio_bitmap_set(0x88, D| W| O| B, read); /* READ(16) */ + sgio_bitmap_set(0x8F, D| W| O| B, read); /* VERIFY(16) */ sgio_bitmap_set(0xA8, D| W|R|O, read); /* READ(12) */ - /* write */ + /* block devices, write */ - sgio_bitmap_set(0x04, D|T| R|O, write); /* FORMAT UNIT */ - sgio_bitmap_set(0x0A, D|T|W| O, write); /* WRITE(6) */ + sgio_bitmap_set(0x04, D|R|O, write); /* FORMAT UNIT */ + sgio_bitmap_set(0x0A, D| W| O, write); /* WRITE(6) */ sgio_bitmap_set(0x2A, D| W|R|O| B|K , write); /* WRITE(10) */ sgio_bitmap_set(0x2E, D| W|R|O| B|K , write); /* WRITE AND VERIFY(10) */ sgio_bitmap_set(0x35, D| W|R|O| B|K , write); /* SYNCHRONIZE CACHE(10) */ @@ -182,7 +182,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x42, D, write); /* UNMAP */ sgio_bitmap_set(0x48, D| B, write); /* SANITIZE */ sgio_bitmap_set(0x51, D, write); /* XPWRITE(10) */ - sgio_bitmap_set(0x8A, D|T|W| O| B, write); /* WRITE(16) */ + sgio_bitmap_set(0x8A, D| W| O| B, write); /* WRITE(16) */ sgio_bitmap_set(0xAA, D| W|R|O, write); /* WRITE(12) */ sgio_bitmap_set(0xAC, O, write); /* ERASE(12) */ sgio_bitmap_set(0xAE, D| W| O, write)
[PATCH v3 part3 7/7] sg_io: cleanup and complete whitelist for disks
This splits entries for SBC commands that conflict with MMC, and adds missing commands to the table from SBC and related standards. Only commands that affect the medium are added: - I added ATA PASS-THROUGH(16) because ATA PASS-THROUGH(12) is present; using the (16) version is preferrable because (12) conflicts with the destructive MMC command BLANK; see the sg_sat_identify man page for example. - WRITE SAME(16) and WRITE SAME(10) are used by discard and zero-out feature. For discard, the UNMAP command is already in the whitelist. For zero-out, omitting them produces "WRITE SAME failed. Manually zeroing" errors in the log. - COMPARE AND WRITE is used by cluster software. - WRITE AND VERIFY(16) is needed to support >2TB disks, and the corresponding 12-byte CDB is whitelisted already. I didn't get reports about _these_ command but I do get bug reports about >2TB disks. SYNCHRONIZE CACHE(16) is similarly the 16-byte extension of another 10-byte command. - similarly, in the case of VERIFY(12) and READ DEFECT DATA(12) the same command was already whitelisted for another command length. Commands that affect other state of the LUN are all privileged, with the sole exception of START STOP UNIT, which has always been allowed for all file descriptors. I do not really agree with that and it's probably an artifact of when /dev/cdrom had r--r--r-- permissions, but I'm not trying to change that. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 0bf0820..216cd17 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -170,6 +170,8 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x88, D| W| O| B, read); /* READ(16) */ sgio_bitmap_set(0x8F, D| W| O| B, read); /* VERIFY(16) */ sgio_bitmap_set(0xA8, D| W|R|O, read); /* READ(12) */ + sgio_bitmap_set(0xAF, D| W| O, read); /* VERIFY(12) */ + sgio_bitmap_set(0xB7, D| O, read); /* READ DEFECT DATA(12) */ /* block devices, write */ @@ -179,10 +181,18 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x2E, D| W|R|O| B|K , write); /* WRITE AND VERIFY(10) */ sgio_bitmap_set(0x35, D| W|R|O| B|K , write); /* SYNCHRONIZE CACHE(10) */ sgio_bitmap_set(0x3F, D| W| O, write); /* WRITE LONG(10) */ + sgio_bitmap_set(0x41, D, write); /* WRITE SAME(10) */ sgio_bitmap_set(0x42, D, write); /* UNMAP */ sgio_bitmap_set(0x48, D| B, write); /* SANITIZE */ sgio_bitmap_set(0x51, D, write); /* XPWRITE(10) */ + sgio_bitmap_set(0x53, D, write); /* XDWRITEREAD(10) */ + sgio_bitmap_set(0x85, D| B, write); /* ATA PASS-THROUGH(16) */ + sgio_bitmap_set(0x89, D, write); /* COMPARE AND WRITE */ sgio_bitmap_set(0x8A, D| W| O| B, write); /* WRITE(16) */ + sgio_bitmap_set(0x8E, D| W| O| B, write); /* WRITE AND VERIFY(16) */ + sgio_bitmap_set(0x91, D| W| O| B, write); /* SYNCHRONIZE CACHE(16) */ + sgio_bitmap_set(0x93, D, write); /* WRITE SAME(16) */ + sgio_bitmap_set(0xA1, D| B, write); /* ATA PASS-THROUGH(12) */ sgio_bitmap_set(0xAA, D| W|R|O, write); /* WRITE(12) */ sgio_bitmap_set(0xAC, O, write); /* ERASE(12) */ sgio_bitmap_set(0xAE, D| W| O, write); /* WRITE AND VERIFY(12) */ @@ -239,12 +249,12 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0xBD, R , read); /* MECHANISM STATUS */ sgio_bitmap_set(0xBE, R , read); /* READ CD */ - sgio_bitmap_set(0x53, D|R , write); /* RESERVE TRACK / XDWRITEREAD(10) */ + sgio_bitmap_set(0x53, R , write); /* RESERVE TRACK */ sgio_bitmap_set(0x54, R , write); /* SEND OPC INFORMATION */ sgio_bitmap_set(0x58, R , write); /* REPAIR TRACK */ sgio_bitmap_set(0x5B, R , write); /* CLOSE TRACK/SESSION */ sgio_bitmap_set(0x5D, R , write); /* SEND CUE SHEET */ - sgio_bitmap_set(0xA1, D|R|
[PATCH v3 part3 4/7] sg_io: whitelist another command for multimedia devices
Three MMC commands were never included: PLAY AUDIO(12), SERVICE ACTION IN(12), MECHANISM STATUS. Add MECHANISM STATUS, the only one that has not been obsoleted in recent versions of the standard. QEMU implements it, so it is reasonable to assume that someone is using it. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 41bbd93..b11ad49 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -219,6 +219,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0xB9, R , read); /* READ CD MSF */ sgio_bitmap_set(0xBA, R , read); /* SCAN */ sgio_bitmap_set(0xBC, R , read); /* PLAY CD */ + sgio_bitmap_set(0xBD, R , read); /* MECHANISM STATUS */ sgio_bitmap_set(0xBE, R , read); /* READ CD */ sgio_bitmap_set(0x53, D|R , write); /* RESERVE TRACK / XDWRITEREAD(10) */ -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 part3 2/7] sg_io: create separate entries for conflicting commands
Some SCSI commands were special cased at the end of the table because they overlapped across SCSI device classes, with different meanings for different classes. Instead of hacking the bits manually, use separate entries in the table. The 0xA4 opcode is blocked for non-MMC devices, even when open read-write. The other four conflicting commands have their bitmap entries split in two parts, one read-only for MMC and one read-write for the other classes. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 40 +--- 1 file changed, 9 insertions(+), 31 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 5d2143f..8f0344f 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -179,29 +179,33 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) sgio_bitmap_set(0x2E, D| W|R|O| B|K , write); /* WRITE AND VERIFY(10) */ sgio_bitmap_set(0x35, D| W|R|O| B|K , write); /* SYNCHRONIZE CACHE(10) */ sgio_bitmap_set(0x3F, D| W| O, write); /* WRITE LONG(10) */ + sgio_bitmap_set(0x42, D, write); /* UNMAP */ + sgio_bitmap_set(0x48, D| B, write); /* SANITIZE */ + sgio_bitmap_set(0x51, D, write); /* XPWRITE(10) */ sgio_bitmap_set(0x8A, D|T|W| O| B, write); /* WRITE(16) */ sgio_bitmap_set(0xAA, D| W|R|O| C, write); /* WRITE(12) */ + sgio_bitmap_set(0xAC, O, write); /* ERASE(12) */ sgio_bitmap_set(0xAE, D| W| O, write); /* WRITE AND VERIFY(12) */ sgio_bitmap_set(0xEA, D| W| O, write); /* WRITE_LONG_2 ?? */ /* MMC */ sgio_bitmap_set(0x23, R , read); /* READ FORMAT CAPACITIES */ - sgio_bitmap_set(0x42, D|R , read); /* READ SUB-CHANNEL / UNMAP !! */ + sgio_bitmap_set(0x42, R , read); /* READ SUB-CHANNEL */ sgio_bitmap_set(0x43, R , read); /* READ TOC/PMA/ATIP */ sgio_bitmap_set(0x44, T| R , read); /* READ HEADER */ sgio_bitmap_set(0x45, R , read); /* PLAY AUDIO(10) */ sgio_bitmap_set(0x46, R , read); /* GET CONFIGURATION */ sgio_bitmap_set(0x47, R , read); /* PLAY AUDIO MSF */ - sgio_bitmap_set(0x48, D|R|B, read); /* PLAY AUDIO TI / SANITIZE !! */ + sgio_bitmap_set(0x48, R , read); /* PLAY AUDIO TI */ sgio_bitmap_set(0x4A, R , read); /* GET EVENT STATUS NOTIFICATION */ sgio_bitmap_set(0x4B, R , read); /* PAUSE/RESUME */ sgio_bitmap_set(0x4E, R , read); /* STOP PLAY/SCAN */ - sgio_bitmap_set(0x51, D|R , read); /* READ DISC INFORMATION / XPWRITE(10) !! */ + sgio_bitmap_set(0x51, R , read); /* READ DISC INFORMATION */ sgio_bitmap_set(0x52, R , read); /* READ TRACK INFORMATION */ sgio_bitmap_set(0x5C, R , read); /* READ BUFFER CAPACITY */ - sgio_bitmap_set(0xA4, D|T|L|P|W|R|O|M|A|E|B|K|V|F|C, read); /* REPORT KEY / 0xA4 service actions !! */ - sgio_bitmap_set(0xAC, R|O, read); /* GET PERFORMANCE / ERASE !! */ + sgio_bitmap_set(0xA4, R , read); /* REPORT KEY */ + sgio_bitmap_set(0xAC, R , read); /* GET PERFORMANCE */ sgio_bitmap_set(0xAD, R , read); /* READ DVD STRUCTURE */ sgio_bitmap_set(0xB9, R , read); /* READ CD MSF */ sgio_bitmap_set(0xBA, R , read); /* SCAN */ @@ -242,32 +246,6 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) #undef V #undef F #undef sgio_bitmap_set - - /* -* Treat specially those commands that have a different meaning -* for disks: READ SUBCHANNEL conflicts with UNMAP. -*/ - filter->read_ok[GPCMD_READ_SUBCHANNEL] &= ~(1 << TYPE_DISK); - filter->write_ok[GPCMD_READ_SUBCHANNEL] |= (1 << TYPE_DISK); - - /* PLAY AUDIO TI conflicts with SANITIZE. */ - filter->read_ok[GPCMD_PLAY_AUDIO_TI] &= ~((1 << TYPE_DISK) | (1 << TYPE_RBC)); - filter->write_ok[GPCMD_PLAY_AUDIO_TI] |= (1 << TYPE_DISK) | (1 << TYPE_RBC); - - /* READ DISC INFORMATION conflicts with
[PATCH v3 part1 2/4] sg_io: prepare to introduce per-class command filters
To prepare for the next patches, abstract setting of an entry in the command filter behind a macro. The next patch will change the implementation of the macro. Cc: sta...@gnu.org Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/scsi_ioctl.c | 148 +++-- 1 file changed, 76 insertions(+), 72 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 96cab50..21ddf17 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -116,85 +116,89 @@ static int sg_emulated_host(struct request_queue *q, int __user *p) static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) { +#define sgio_bitmap_set(cmd, rw) \ + __set_bit((cmd), filter->rw##_ok) + /* Basic read-only commands */ - __set_bit(TEST_UNIT_READY, filter->read_ok); - __set_bit(REQUEST_SENSE, filter->read_ok); - __set_bit(READ_6, filter->read_ok); - __set_bit(READ_10, filter->read_ok); - __set_bit(READ_12, filter->read_ok); - __set_bit(READ_16, filter->read_ok); - __set_bit(READ_BUFFER, filter->read_ok); - __set_bit(READ_DEFECT_DATA, filter->read_ok); - __set_bit(READ_CAPACITY, filter->read_ok); - __set_bit(READ_LONG, filter->read_ok); - __set_bit(INQUIRY, filter->read_ok); - __set_bit(MODE_SENSE, filter->read_ok); - __set_bit(MODE_SENSE_10, filter->read_ok); - __set_bit(LOG_SENSE, filter->read_ok); - __set_bit(START_STOP, filter->read_ok); - __set_bit(GPCMD_VERIFY_10, filter->read_ok); - __set_bit(VERIFY_16, filter->read_ok); - __set_bit(REPORT_LUNS, filter->read_ok); - __set_bit(SERVICE_ACTION_IN, filter->read_ok); - __set_bit(RECEIVE_DIAGNOSTIC, filter->read_ok); - __set_bit(MAINTENANCE_IN, filter->read_ok); - __set_bit(GPCMD_READ_BUFFER_CAPACITY, filter->read_ok); + sgio_bitmap_set(TEST_UNIT_READY, read); + sgio_bitmap_set(REQUEST_SENSE, read); + sgio_bitmap_set(READ_6, read); + sgio_bitmap_set(READ_10, read); + sgio_bitmap_set(READ_12, read); + sgio_bitmap_set(READ_16, read); + sgio_bitmap_set(READ_BUFFER, read); + sgio_bitmap_set(READ_DEFECT_DATA, read); + sgio_bitmap_set(READ_CAPACITY, read); + sgio_bitmap_set(READ_LONG, read); + sgio_bitmap_set(INQUIRY, read); + sgio_bitmap_set(MODE_SENSE, read); + sgio_bitmap_set(MODE_SENSE_10, read); + sgio_bitmap_set(LOG_SENSE, read); + sgio_bitmap_set(START_STOP, read); + sgio_bitmap_set(GPCMD_VERIFY_10, read); + sgio_bitmap_set(VERIFY_16, read); + sgio_bitmap_set(REPORT_LUNS, read); + sgio_bitmap_set(SERVICE_ACTION_IN, read); + sgio_bitmap_set(RECEIVE_DIAGNOSTIC, read); + sgio_bitmap_set(MAINTENANCE_IN, read); + sgio_bitmap_set(GPCMD_READ_BUFFER_CAPACITY, read); /* Audio CD commands */ - __set_bit(GPCMD_PLAY_CD, filter->read_ok); - __set_bit(GPCMD_PLAY_AUDIO_10, filter->read_ok); - __set_bit(GPCMD_PLAY_AUDIO_MSF, filter->read_ok); - __set_bit(GPCMD_PLAY_AUDIO_TI, filter->read_ok); - __set_bit(GPCMD_PAUSE_RESUME, filter->read_ok); + sgio_bitmap_set(GPCMD_PLAY_CD, read); + sgio_bitmap_set(GPCMD_PLAY_AUDIO_10, read); + sgio_bitmap_set(GPCMD_PLAY_AUDIO_MSF, read); + sgio_bitmap_set(GPCMD_PLAY_AUDIO_TI, read); + sgio_bitmap_set(GPCMD_PAUSE_RESUME, read); /* CD/DVD data reading */ - __set_bit(GPCMD_READ_CD, filter->read_ok); - __set_bit(GPCMD_READ_CD_MSF, filter->read_ok); - __set_bit(GPCMD_READ_DISC_INFO, filter->read_ok); - __set_bit(GPCMD_READ_CDVD_CAPACITY, filter->read_ok); - __set_bit(GPCMD_READ_DVD_STRUCTURE, filter->read_ok); - __set_bit(GPCMD_READ_HEADER, filter->read_ok); - __set_bit(GPCMD_READ_TRACK_RZONE_INFO, filter->read_ok); - __set_bit(GPCMD_READ_SUBCHANNEL, filter->read_ok); - __set_bit(GPCMD_READ_TOC_PMA_ATIP, filter->read_ok); - __set_bit(GPCMD_REPORT_KEY, filter->read_ok); - __set_bit(GPCMD_SCAN, filter->read_ok); - __set_bit(GPCMD_GET_CONFIGURATION, filter->read_ok); - __set_bit(GPCMD_READ_FORMAT_CAPACITIES, filter->read_ok); - __set_bit(GPCMD_GET_EVENT_STATUS_NOTIFICATION, filter->read_ok); - __set_bit(GPCMD_GET_PERFORMANCE, filter->read_ok); - __set_bit(GPCMD_SEEK, filter->read_ok); - __set_bit(GPCMD_STOP_PLAY_SCAN, filter->read_ok); + sgio_bitmap_set(GPCMD_READ_CD, read); + sgio_bitmap_set(GPCMD_READ_CD_MSF, read); + sgio_bitmap_set(GPCMD_READ_DISC_INFO, read); + sgio_bitmap_set(GPCMD_READ_CDVD_CAPACITY, read); + sgio_bitmap_set(GPCMD_READ_DVD_STRUCTURE, read); + sgio_bitmap_set(GPCMD_READ_HEADER,
[PATCH v3 part2] Add per-device sysfs knob to enable unrestricted, unprivileged SG_IO
Privilege restrictions for SG_IO right now apply without distinction to all devices, based on the single capability CAP_SYS_RAWIO. This is a very broad capability, and makes it difficult to give SG_IO access to trusted clients that need access to persistent reservations, trim/discard, or vendor-specific commands. One problem here is that CAP_SYS_RAWIO allows to escape a partition and issue commands that affect the full disk, thus making DAC almost useless. This series solves a case which happens with virtualization. It happens often that the guest's operation requires privileged commands (the most common being persistent reservations or vendor-specific commands); at the same time, the virtual machine monitor should run as confined as possible and giving it CAP_SYS_RAWIO is not possible. A new queue flag will let unprivileged users send any SG_IO command to the device, without any filtering. This queue flag can then be set on selected devices. This patch depends on, and conflicts with, the CVE-2012-4542 fix that I have just sent. Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- Documentation/block/queue-sysfs.txt | 8 block/blk-sysfs.c | 33 + block/scsi_ioctl.c | 4 +++- include/linux/blkdev.h | 3 +++ 4 files changed, 47 insertions(+), 1 deletion(-) diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index e54ac1d..341e781 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -133,6 +133,14 @@ control of this block device to that new IO scheduler. Note that writing an IO scheduler name to this file will attempt to load that IO scheduler module, if it isn't already present in the system. +unpriv_sgio (RW) + +When a process runs without CAP_SYS_RAWIO, access to some SCSI commands +with the SG_IO ioctl is restricted. If this option is '0', the whitelist +is applied for all file descriptors belonging to unprivileged processes. +If this option is '1', the whitelist is only applied for file descriptors +that are opened read-only; other file descriptors can send all SCSI commands, +and no restrictions are applied by the kernel. Jens Axboe , February 2009 diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 5efc5a6..e6f0021 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -215,6 +215,32 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page) return queue_var_show(max_hw_sectors_kb, (page)); } +static ssize_t +queue_show_unpriv_sgio(struct request_queue *q, char *page) +{ + int bit; + bit = test_bit(QUEUE_FLAG_UNPRIV_SGIO, &q->queue_flags); + return queue_var_show(bit, page); +} +static ssize_t +queue_store_unpriv_sgio(struct request_queue *q, const char *page, size_t count) +{ + unsigned long val; + ssize_t ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + ret = queue_var_store(&val, page, count); + spin_lock_irq(q->queue_lock); + if (val) + queue_flag_set(QUEUE_FLAG_UNPRIV_SGIO, q); + else + queue_flag_clear(QUEUE_FLAG_UNPRIV_SGIO, q); + spin_unlock_irq(q->queue_lock); + return ret; +} + #define QUEUE_SYSFS_BIT_FNS(name, flag, neg) \ static ssize_t \ queue_show_##name(struct request_queue *q, char *page) \ @@ -405,6 +431,12 @@ static struct queue_sysfs_entry queue_nonrot_entry = { .store = queue_store_nonrot, }; +static struct queue_sysfs_entry queue_unpriv_sgio_entry = { + .attr = {.name = "unpriv_sgio", .mode = S_IRUGO | S_IWUSR }, + .show = queue_show_unpriv_sgio, + .store = queue_store_unpriv_sgio, +}; + static struct queue_sysfs_entry queue_nomerges_entry = { .attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR }, .show = queue_nomerges_show, @@ -447,6 +479,7 @@ static struct attribute *default_attrs[] = { &queue_discard_max_entry.attr, &queue_discard_zeroes_data_entry.attr, &queue_write_same_max_entry.attr, + &queue_unpriv_sgio_entry.attr, &queue_nonrot_entry.attr, &queue_nomerges_entry.attr, &queue_rq_affinity_entry.attr, diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 238c8f6..0af98fe 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -241,7 +241,9 @@ int blk_verify_command(struct request_queue *q, return 0; /* Write-safe commands require a writable open */ - if (has_write_perm && filter->write_ok[cmd[0]] & (1 << q->sgio_type)) + if (has_write_perm && + (blk_queue_unpriv_s
[PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Adjust the blk_verify_command function to let it look at per-queue data. This will be done in the next patch. Acked-by: Tejun Heo Cc: sta...@gnu.org Cc: FUJITA Tomonori Cc: Doug Gilbert Cc: "James E.J. Bottomley" Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe Signed-off-by: Paolo Bonzini --- block/bsg.c| 2 +- block/scsi_ioctl.c | 7 --- drivers/scsi/sg.c | 3 ++- include/linux/blkdev.h | 3 ++- 4 files changed, 9 insertions(+), 6 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 420a5a9..dedd83c 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -187,7 +187,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq, return -EFAULT; if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) { - if (blk_verify_command(rq->cmd, has_write_perm)) + if (blk_verify_command(q, rq->cmd, has_write_perm)) return -EPERM; } else if (!capable(CAP_SYS_RAWIO)) return -EPERM; diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index a5ffcc9..96cab50 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -197,7 +197,8 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) __set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok); } -int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm) +int blk_verify_command(struct request_queue *q, + unsigned char *cmd, fmode_t has_write_perm) { struct blk_cmd_filter *filter = &blk_default_cmd_filter; @@ -226,7 +227,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq, { if (copy_from_user(rq->cmd, hdr->cmdp, hdr->cmd_len)) return -EFAULT; - if (blk_verify_command(rq->cmd, mode & FMODE_WRITE)) + if (blk_verify_command(q, rq->cmd, mode & FMODE_WRITE)) return -EPERM; /* @@ -473,7 +474,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, if (in_len && copy_from_user(buffer, sic->data + cmdlen, in_len)) goto error; - err = blk_verify_command(rq->cmd, mode & FMODE_WRITE); + err = blk_verify_command(q, rq->cmd, mode & FMODE_WRITE); if (err) goto error; diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index df5e961..533e789 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -218,11 +218,12 @@ static void sg_put_dev(Sg_device *sdp); static int sg_allow_access(struct file *filp, unsigned char *cmd) { struct sg_fd *sfp = filp->private_data; + struct request_queue *q = sfp->parentdp->device->request_queue; if (sfp->parentdp->device->type == TYPE_SCANNER) return 0; - return blk_verify_command(cmd, filp->f_mode & FMODE_WRITE); + return blk_verify_command(q, cmd, filp->f_mode & FMODE_WRITE); } static int get_exclude(Sg_device *sdp) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 2fdb4a4..4fca347 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1089,7 +1089,8 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block, gfp_mask); } -extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm); +extern int blk_verify_command(struct request_queue *q, + unsigned char *cmd, fmode_t has_write_perm); enum blk_default_limits { BLK_MAX_SEGMENTS= 128, -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 24/05/2013 03:44, Tejun Heo ha scritto: > On Thu, May 23, 2013 at 11:47:25AM +0200, Paolo Bonzini wrote: >>> No no, I'm not talking about it not working for the users - it's just >>> passing the commands, it of course works. I'm doubting about it being >>> a worthy security isolation layer. cdb filtering (of any form really) >>> has always been something on the border. It always was something we >>> got stuck with due to lack of other immediate options. >> >> I understood correctly then. :) I agree it's not the best, but it's not >> completely broken either. It has bugs, and that's what these patches >> try to fix. > > The same filtering table being applied to different classes of > hardware is a software bug, but my point is that the practive > essentially entrusts non-insignificant part of security enforcement to > the hardware itself. The variety of hardware in question is very wide > and significant portion has historically been known to be flaky. Unproven theory, and contradicted by actual practice. Bugs are more common in the handling of borderline conditions, not in the handling of unimplemented commands. If you want to be secure aginst buggy firmware, the commands you have to block are READ and WRITE. Check out the list of existing USB quirks. >>> I'm wondering whether combining 3 into 4 would be good enough. >> >> No, it wouldn't. I learnt it the hard way (by having a patch nacked :)) >> at http://thread.gmane.org/gmane.linux.kernel/1311887. > > Of course you can't do that by adding dangerous commands to the > existing filtering table which is allowed by default. I'm saying > "count me out" knob could be good enough. Neither is perfect but at > least the latter doesn't affect the default cases. You need to allow more commands. The count-me-out knob allows all commands. You cannot always allow all commands. Ergo, you cannot always use the count-me-out knob. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Il 24/05/2013 09:36, James Bottomley ha scritto: > On Thu, 2013-05-23 at 15:58 +0200, Paolo Bonzini wrote: >> Adjust the blk_verify_command function to let it look at per-queue >> data. This will be done in the next patch. > > This is not a bug fix. This is an enabler for your complex and to my > mind dubious rework of the SG_IO command filter. I'm running out of > ways to say please don't mix bug fixes with features, because this > redesignating of the original patch set as part 1 and parts 2,3 doesn't > satisfy the requirement. I made it part 1/2/3 because parts 2/3 depend on part 1. It makes dependency tracking easier, at least in my mind. If you have another solution that does not require passing request_queue to blk_verify_command, I'm all ears. > Does anyone in the real world actually care about this bug? Yes, or I would move on and not waste so much time on this. Paolo > because if > not perhaps we can just remove the confusion and consider this as a > feature set. If there's someone who actually cares, please lets just do > the bug fix first and argue about the feature later. > > James > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Il 24/05/2013 09:50, James Bottomley ha scritto: > On Fri, 2013-05-24 at 09:43 +0200, Paolo Bonzini wrote: >> Il 24/05/2013 09:36, James Bottomley ha scritto: >>> On Thu, 2013-05-23 at 15:58 +0200, Paolo Bonzini wrote: >>>> Adjust the blk_verify_command function to let it look at per-queue >>>> data. This will be done in the next patch. >>> >>> This is not a bug fix. This is an enabler for your complex and to my >>> mind dubious rework of the SG_IO command filter. I'm running out of >>> ways to say please don't mix bug fixes with features, because this >>> redesignating of the original patch set as part 1 and parts 2,3 doesn't >>> satisfy the requirement. >> >> I made it part 1/2/3 because parts 2/3 depend on part 1. It makes >> dependency tracking easier, at least in my mind. >> >> If you have another solution that does not require passing request_queue >> to blk_verify_command, I'm all ears. > > That's a circular response that doesn't answer the question. The actual > question is: what is simple fix for the bug that isn't entangled with > enabling the SG_IO per device type whitelist feature. > >>> Does anyone in the real world actually care about this bug? >> >> Yes, or I would move on and not waste so much time on this. > > Fine, so produce a simple fix for this bug which we can discuss that's > not tied to this feature. Honestly, I have no idea how this is even possible. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 24/05/2013 10:02, Tejun Heo ha scritto: > On Fri, May 24, 2013 at 4:13 PM, Paolo Bonzini wrote: >>> The same filtering table being applied to different classes of >>> hardware is a software bug, but my point is that the practive >>> essentially entrusts non-insignificant part of security enforcement to >>> the hardware itself. The variety of hardware in question is very wide >>> and significant portion has historically been known to be flaky. >> >> Unproven theory, and contradicted by actual practice. Bugs are more >> common in the handling of borderline conditions, not in the handling of >> unimplemented commands. >> >> If you want to be secure aginst buggy firmware, the commands you have to >> block are READ and WRITE. Check out the list of existing USB quirks. > > Well, I'd actually much prefer disabling CDB whitelisting for all !MMC > devices if at all possible. You're basically arguing that because what > we have is already broken, it should be okay to break it further. > Also, RW commands having more quirks doesn't necessarily indicate that > they tend to be more broken. They just get hammered on a lot in > various ways so problems on those commands tend to be more noticeable. I agree intuition may not count, and it's perfectly possible that firmware writers forgot a "break;" or put the wrong location in a jump table, so that unimplemented commands give interesting results. However, the _fact_ is that this might happen anyway with the buttload of commands that are already enabled by the whitelist and that most disks will never implement. >> You need to allow more commands. >> The count-me-out knob allows all commands. >> You cannot always allow all commands. >> Ergo, you cannot always use the count-me-out knob. > > The thing is that both approaches aren't perfect here so you can make > similar type of argument from the other side. If the system wants to > give out raw hardware access to VMs, requiring it to delegate the > device fully could be reasonable. No, it is not unfortunately. Allowing to do discards is one thing, allowing to disrupt the settings of a SAN is another. You can only delegate the device fully in these cases: (a) of course, if the guest is trusted; (b) if QEMU is running as a confined user, then you can get by with a userspace whitelist. (Otherwise you're just a ptrace away from arbitrary access, as you pointed out). Unfortunately, there are _real_ problems that this patches fix, such as log spews due to blocked commands, and these happen even if neither of the above conditions is true. Is there anything else I can do? Sure, I can check for the presence of the whitelist and hack the VPD pages to hide features that I know the whitelist will block. Is it the right thing to do? In my opinion no. It makes no sense to have raw device access _disable_ features compared to emulation. > Not ideal but widening direct > command access by default is pretty nasty too. I actually agree, and that's why I'm trying to balance the widening and restricting. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Il 24/05/2013 10:03, James Bottomley ha scritto: > > >>> Does anyone in the real world actually care about this bug? > >> > >> Yes, or I would move on and not waste so much time on this. >>> > > >>> > > Fine, so produce a simple fix for this bug which we can discuss that's >>> > > not tied to this feature. >> > >> > Honestly, I have no idea how this is even possible. > Really? It looks to me like a simple block on the commands for disk > devices in the opcode switch would do it (with a corresponding change to > sg.c:sg_allow_access). Which switch? What I can do is something like this in blk_verify_command: if (q->sgio_type == TYPE_ROM) return 0; if (rq->cmd[0] == 0xA4) return -EPERM; if (!is_write && (req->cmd[0] == ... || rq->cmd[0] == ...)) return -EPERM; But then the particular patch you're replying to is still necessary, and you're slowing down blk_verify_command. It may be fine for stable and -rc, but IMHO it calls for a better implementation in 3.11. (Besides, I did it like this because it is what Tejun suggested). Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 24/05/2013 11:07, Tejun Heo ha scritto: > On Fri, May 24, 2013 at 5:31 PM, Paolo Bonzini wrote: >> I agree intuition may not count, and it's perfectly possible that >> firmware writers forgot a "break;" or put the wrong location in a jump >> table, so that unimplemented commands give interesting results. > > It's not just unimplemented commands. Exposing any new command exposes > its borderline problems together with it. For commands that are used by Linux already, the right way to fix the problems is not obscuring the commands from userspace's view. You can hit the same problems with ioctls or even with normal operation of the device. >> However, the _fact_ is that this might happen anyway with the buttload >> of commands that are already enabled by the whitelist and that most >> disks will never implement. > > Yes and that's why the whitelist is generally frowned upon. It's > inherently fragile. These devices simply aren't designed and > implemented to be exposed to lesser security domains directly. It's > true that it's already kinda broken that way but as I wrote before > it's a vicious cycle and we don't wanna keep building on top of it. > This expansion is gonna increase the usage of whitelisting which will > in turn attract further use cases which are likely to call for even > more expansion. And prohibiting the extension of whitelists is gonna increase the usage of unpriv_sgio and less-secure userspace whitelists. Anvil, meet hammer. >>> The thing is that both approaches aren't perfect here so you can make >>> similar type of argument from the other side. If the system wants to >>> give out raw hardware access to VMs, requiring it to delegate the >>> device fully could be reasonable. >> >> No, it is not unfortunately. Allowing to do discards is one thing, >> allowing to disrupt the settings of a SAN is another. You can only >> delegate the device fully in these cases: >> >> (a) of course, if the guest is trusted; >> (b) if QEMU is running as a confined user; > > If the bulk of filtering can be solved with userland whitelisting as a > confined user, it should be possible to resolve peripheral problems > like log messages in simpler way, no? Can you please elaborate on the > log message problem? Who's spewing those messages? For example: if (bdev_write_same(bdev)) { unsigned char bdn[BDEVNAME_SIZE]; if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, ZERO_PAGE(0))) return 0; bdevname(bdev, bdn); pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn); } return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); The device exposes the ability to zero out LUN blocks, but the command is not whitelisted and it fails. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Il 24/05/2013 10:32, Paolo Bonzini ha scritto: > Il 24/05/2013 10:03, James Bottomley ha scritto: >>>>>>>>>> Does anyone in the real world actually care about this bug? >>>>>>>> >>>>>>>> Yes, or I would move on and not waste so much time on this. >>>>>> >>>>>> Fine, so produce a simple fix for this bug which we can discuss that's >>>>>> not tied to this feature. >>>> >>>> Honestly, I have no idea how this is even possible. >> Really? It looks to me like a simple block on the commands for disk >> devices in the opcode switch would do it (with a corresponding change to >> sg.c:sg_allow_access). > > Which switch? What I can do is something like this in blk_verify_command: > > if (q->sgio_type == TYPE_ROM) > return 0; > if (rq->cmd[0] == 0xA4) > return -EPERM; > if (!is_write && > (req->cmd[0] == ... || rq->cmd[0] == ...)) > return -EPERM; > > But then the particular patch you're replying to is still necessary, > and you're slowing down blk_verify_command. It may be fine for stable > and -rc, but IMHO it calls for a better implementation in 3.11. Ok, so I did a patch along these lines. And it's just as ugly as everything else that I've been posting in these threads. Yes, perhaps it has a redeeming grace in that it is fine for <= 3.10, but that's it. Because actually I agree with you. The rework of the SG_IO command filter _is_ dubious to say the least; 300 lines of default, immutable policy don't belong in the kernel. So why am I posting this crap? Because I have to work around the nack of the generic sysfs bitmap patches, which would have beatifully solved all of this. In fact, you had proposed that approach. I posted it in September 2012. Then (as usual) silence for one month until it was quickly dismissed by Tejun. *You and Jens* failed to review patches, and this rathole is what that led to. It's unpleasant for me as it is for everyone else. Yes, you and Jens are busy, we all are. But *you and Jens* are the maintainers. Please make a decision instead of drawing straws, so that we can all go back to our regular business. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 part1 1/4] sg_io: pass request_queue to blk_verify_command
Il 25/05/2013 06:14, James Bottomley ha scritto: > On Fri, 2013-05-24 at 10:32 +0200, Paolo Bonzini wrote: >> Il 24/05/2013 10:03, James Bottomley ha scritto: >>>>>>>>>>> Does anyone in the real world actually care about this bug? >>>>>>>>> >>>>>>>>> Yes, or I would move on and not waste so much time on this. >>>>>>> >>>>>>> Fine, so produce a simple fix for this bug which we can discuss that's >>>>>>> not tied to this feature. >>>>> >>>>> Honestly, I have no idea how this is even possible. >>> Really? It looks to me like a simple block on the commands for disk >>> devices in the opcode switch would do it (with a corresponding change to >>> sg.c:sg_allow_access). >> >> Which switch? What I can do is something like this in blk_verify_command: > > not in blk_verify_command: outside of it, in the three places it's used. In other words, if (blk_verify_command(...) || blk_verify_command_with_queue(q, ...)) We must have different taste. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 25/05/2013 07:27, Christoph Hellwig ha scritto: > On Fri, May 24, 2013 at 09:35:02PM -0700, James Bottomley wrote: >> I'll go along with this. I'm also wondering what the problem would be >> if we just allowed all commands on either CAP_SYS_RAWIO or opening the >> device for write, so we just defer to the filesystem permissions and >> restricted read only opens to the basic all device opcodes. > > I've been out of this area for a bit, but the problem used to be that > you could send destructive commands to a partition. The right fix > for that would be to not allow SG_IO on partitions at all, just > wondering if anything would be broken by this. Linus wanted to keep that for CAP_SYS_RAWIO. We found two uses of SG_IO on partitions: zfs-fuse used SYNCHRONIZE CACHE; some proprietary driver used TEST UNIT READY. Really, the solution is to make the bitmaps configurable in userspace. It is no less secure than unpriv_sgio. Then the kernel can be configured at build-time to have either an MMC bitmap and a basic whitelist of a dozen commands. We can even avoid working around those few conflicting opcodes; if you're paranoid you can just configure your kernel right. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 25/05/2013 09:11, Christoph Hellwig ha scritto: > > Linus wanted to keep that for CAP_SYS_RAWIO. We found two uses of SG_IO > > on partitions: zfs-fuse used SYNCHRONIZE CACHE; some proprietary driver > > used TEST UNIT READY. > > > > Really, the solution is to make the bitmaps configurable in userspace. > > It is no less secure than unpriv_sgio. Then the kernel can be > > configured at build-time to have either an MMC bitmap and a basic > > whitelist of a dozen commands. We can even avoid working around those > > few conflicting opcodes; if you're paranoid you can just configure your > > kernel right. > > Keep it simple. Allowing SG_IO for CAP_SYS_RAWIO probably is fine, > allowing it for permissions only clearly isn't. All the per-command > filetering is just complete bullshit and the kind of bloat that > eventually will make Linux unmaintainable. What I'm proposing is: - by default, only allow INQUIRY / REPORT LUNS / TEST UNIT READY. Keep the old whitelist available for backwards-compatibility, configurable at build-time. - let CAP_SYS_ADMIN users set a different per-device whitelist. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 25/05/2013 10:37, Tejun Heo ha scritto: > Hey, James. > > On Fri, May 24, 2013 at 09:35:02PM -0700, James Bottomley wrote: >>> Well, I'd actually much prefer disabling CDB whitelisting for all !MMC >>> devices if at all possible. >> >> I'll go along with this. I'm also wondering what the problem would be > > Don't think we can. It'd be a behavior change clearly visible to > userland at this point. We can (and even for MMC) if it is a build-time configuration knob. It would satisfy those people who want the CVE fixed, as long as userspace gets some configurability. > * Fix the security bug. I don't really care how it's fixed as long as > the amount of whitelisted commands goes down not up. > > * It's not like we can remove the filter for !MMC devices at this > point, so I think it makes sense to make it per-class so that we can > *remove* commands which aren't relevant for the device type. Also, > we probably wanna add read blinking comment yelling that no further > commands should be added. > > * Merge the patch to give out SG_IO access along with write access, so > the use cases which want to give out SG_IO access can do so > explicitly and be fully responsible for the device. This makes > sense to me. If one wants to be allowed to issue raw commands to > the hardware, one takes the full responsibility. That's not possible; it would make it impossible to do things like using a privileged helper to open the file and passing it back via SCM_RIGHTS to an unprivileged program (running as the user). This is the ptrace attack that you mentioned. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))
Il 25/05/2013 14:48, Tejun Heo ha scritto: >>> * Merge the patch to give out SG_IO access along with write access, so >>> > > the use cases which want to give out SG_IO access can do so >>> > > explicitly and be fully responsible for the device. This makes >>> > > sense to me. If one wants to be allowed to issue raw commands to >>> > > the hardware, one takes the full responsibility. >> > >> > That's not possible; it would make it impossible to do things like using >> > a privileged helper to open the file and passing it back via SCM_RIGHTS >> > to an unprivileged program (running as the user). This is the ptrace >> > attack that you mentioned. > I have no idea what you're talking about. I'm describing the same > thing you implemented and posted. Ok, I think we need a rewind. I'll try to post what I mean next week. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 3/4] block: add back command filter modification via sysfs
This adds two new sysfs attributes to the queue kobject. The attributes allow reading and writing the whitelist of unprivileged commands. This is again a bit different from what was removed in commit 018e044 (block: get rid of queue-private command filter, 2009-06-26), but the idea is the same. One difference is that it does not use a separate kobject. Also, the supported sysfs syntax is a bit more expressive: it includes ranges, the ability to replace all of the filter with a single command, and does not force usage of hexadecimal. Since the names are different, and the old ones were anyway never really enabled, the different API is not a problem. Cc: linux-scsi@vger.kernel.org Signed-off-by: Paolo Bonzini --- Documentation/block/queue-sysfs.txt | 16 ++ block/Kconfig | 10 block/blk-sysfs.c | 41 ++ block/scsi_ioctl.c | 106 include/linux/blkdev.h | 21 +++ 5 files changed, 194 insertions(+) diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index e54ac1d..1152b38 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -133,6 +133,22 @@ control of this block device to that new IO scheduler. Note that writing an IO scheduler name to this file will attempt to load that IO scheduler module, if it isn't already present in the system. +sgio_read_filter (RW) +- +When read, this file will display a list of SCSI commands (i.e. values of +the first byte of a CDB) that are always available for unprivileged users +(via /dev/bsg, /dev/sgNN, or ioctls such as SG_IO and CDROM_SEND_PACKET). +When written, the list of commands will be modified. By default it +will be completely replaced; writing a string that begins with '+' will +add new commands, and writing a string that begins with '-' will remove +some commands. Ranges of commands are supported, for example '0x00-0xff'. + +sgio_write_filter (RW) +-- +When read, this file will display a list of SCSI commands (i.e. values of +the first byte of a CDB) that are available for unprivileged users +when the block device is open for writing. Writing to this file behaves +as for sgio_read_filter. Jens Axboe , February 2009 diff --git a/block/Kconfig b/block/Kconfig index a7e40a7..e89d6a2 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -65,6 +65,16 @@ config BLK_DEV_BSG If unsure, say Y. +config BLK_DEV_SG_FILTER_SYSFS + bool "Customizable SG_IO filters in sysfs" + default y + help + Saying Y here will let you use sysfs to customize the list + of SCSI commands that are available (via /dev/sg, /dev/bsg or + ioctls such as SG_IO) to unprivileged users. + + If unsure, say Y. + config BLK_DEV_BSGLIB bool "Block layer SG support v4 helper lib" default n diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 2539f8d..3e4ffeb 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -429,6 +429,43 @@ static struct queue_sysfs_entry queue_random_entry = { .store = queue_store_random, }; +#ifdef CONFIG_BLK_DEV_SG_FILTER_SYSFS +static ssize_t queue_sgio_filter_read_show(struct request_queue *q, char *page) +{ + return blk_filter_show(q, page, READ); +} + +static ssize_t queue_sgio_filter_write_show(struct request_queue *q, +char *page) +{ + return blk_filter_show(q, page, WRITE); +} + +static ssize_t queue_sgio_filter_read_store(struct request_queue *q, + const char *page, size_t count) +{ + return blk_filter_store(q, page, count, READ); +} + +static ssize_t queue_sgio_filter_write_store(struct request_queue *q, +const char *page, size_t count) +{ + return blk_filter_store(q, page, count, WRITE); +} + +static struct queue_sysfs_entry queue_sgio_filter_read_entry = { + .attr = { .name = "sgio_filter_read", .mode = S_IRUGO | S_IWUSR }, + .show = queue_sgio_filter_read_show, + .store = queue_sgio_filter_read_store, +}; + +static struct queue_sysfs_entry queue_sgio_filter_write_entry = { + .attr = {.name = "sgio_filter_write", .mode = S_IRUGO | S_IWUSR }, + .show = queue_sgio_filter_write_show, + .store = queue_sgio_filter_write_store, +}; +#endif + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -452,6 +489,10 @@ static struct attribute *default_attrs[] = { &queue_rq_affinity_entry.attr, &queue_iostats_entry.attr, &queue_random_entry.attr, +#ifdef CONFIG_BLK_DEV_SG_FILTER_SYSFS + &queue_sgio_filter_read_entry.attr, + &queue_sgio_filter_write_entry.attr, +#endif
[RFC PATCH 1/4] block: add back queue-private command filter
The command filter used to be mutable via sysfs, but this was broken and backed out. Let's add it back. This patch adds the infrastructure for filtering, but unlike the old code this one just adds a pointer to request_queue, so as to make it cheaper in the majority of cases where no special filtering is desired. This is a partial (and massaged) revert of commit 018e044 (block: get rid of queue-private command filter, 2009-06-26). Cc: linux-scsi@vger.kernel.org Signed-off-by: Paolo Bonzini --- block/blk-sysfs.c | 2 ++ block/bsg.c| 2 +- block/scsi_ioctl.c | 17 + drivers/scsi/sg.c | 4 +++- include/linux/blkdev.h | 10 +- 5 files changed, 20 insertions(+), 15 deletions(-) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 5efc5a6..2539f8d 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -530,6 +530,8 @@ static void blk_release_queue(struct kobject *kobj) blkcg_exit_queue(q); + kfree(q->cmd_filter); + if (q->elevator) { spin_lock_irq(q->queue_lock); ioc_clear_queue(q); diff --git a/block/bsg.c b/block/bsg.c index 420a5a9..fe16cae 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -187,7 +187,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq, return -EFAULT; if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) { - if (blk_verify_command(rq->cmd, has_write_perm)) + if (blk_verify_command(q->cmd_filter, rq->cmd, has_write_perm)) return -EPERM; } else if (!capable(CAP_SYS_RAWIO)) return -EPERM; diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index a5ffcc9..22b925f 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -34,11 +34,6 @@ #include #include -struct blk_cmd_filter { - unsigned long read_ok[BLK_SCSI_CMD_PER_LONG]; - unsigned long write_ok[BLK_SCSI_CMD_PER_LONG]; -}; - static struct blk_cmd_filter blk_default_cmd_filter; /* Command group 3 is reserved and should never be used. */ @@ -197,17 +192,15 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) __set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok); } -int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm) +int blk_verify_command(struct blk_cmd_filter *filter, + unsigned char *cmd, fmode_t has_write_perm) { - struct blk_cmd_filter *filter = &blk_default_cmd_filter; - /* root can do any command. */ if (capable(CAP_SYS_RAWIO)) return 0; - /* if there's no filter set, assume we're filtering everything out */ if (!filter) - return -EPERM; + filter = &blk_default_cmd_filter; /* Anybody who can open the device can do a read-safe command */ if (test_bit(cmd[0], filter->read_ok)) @@ -226,7 +219,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq, { if (copy_from_user(rq->cmd, hdr->cmdp, hdr->cmd_len)) return -EFAULT; - if (blk_verify_command(rq->cmd, mode & FMODE_WRITE)) + if (blk_verify_command(q->cmd_filter, rq->cmd, mode & FMODE_WRITE)) return -EPERM; /* @@ -473,7 +466,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, if (in_len && copy_from_user(buffer, sic->data + cmdlen, in_len)) goto error; - err = blk_verify_command(rq->cmd, mode & FMODE_WRITE); + err = blk_verify_command(q->cmd_filter, rq->cmd, mode & FMODE_WRITE); if (err) goto error; diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index df5e961..5f64202 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -218,11 +218,13 @@ static void sg_put_dev(Sg_device *sdp); static int sg_allow_access(struct file *filp, unsigned char *cmd) { struct sg_fd *sfp = filp->private_data; + struct request_queue *q = sfp->parentdp->device->request_queue; if (sfp->parentdp->device->type == TYPE_SCANNER) return 0; - return blk_verify_command(cmd, filp->f_mode & FMODE_WRITE); + return blk_verify_command(q->cmd_filter, + cmd, filp->f_mode & FMODE_WRITE); } static int get_exclude(Sg_device *sdp) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 2fdb4a4..9a8434d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -259,6 +259,11 @@ struct blk_queue_tag { #define BLK_SCSI_MAX_CMDS (256) #define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8)) +struct blk_cmd_filter { + unsigned long read_ok[BLK_SCSI_CMD_PER_LONG]; + unsigned long write_ok[BLK_SCSI_CMD_PER_LONG]; +}; + struct queue_li
[RFC PATCH 2/4] scsi: create an all-zero filter for scanners
Using /dev/sg for scanners is blocked from unprivileged users. Reimplement this using customizable command filters, so that the sysfs knobs will work in this case too. Cc: linux-scsi@vger.kernel.org Signed-off-by: Paolo Bonzini --- drivers/scsi/scsi_scan.c | 8 +++- drivers/scsi/sg.c| 3 --- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 3e58b22..de5751b 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -783,13 +783,19 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result, } switch (sdev->type) { + case TYPE_SCANNER: + sdev->request_queue->cmd_filter = + kzalloc(sizeof(struct blk_cmd_filter), GFP_ATOMIC); + if (sdev->request_queue->cmd_filter == NULL) + return SCSI_SCAN_NO_RESPONSE; + /* fallthrough */ + case TYPE_RBC: case TYPE_TAPE: case TYPE_DISK: case TYPE_PRINTER: case TYPE_MOD: case TYPE_PROCESSOR: - case TYPE_SCANNER: case TYPE_MEDIUM_CHANGER: case TYPE_ENCLOSURE: case TYPE_COMM: diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 5f64202..f7f5c11 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -220,9 +220,6 @@ static int sg_allow_access(struct file *filp, unsigned char *cmd) struct sg_fd *sfp = filp->private_data; struct request_queue *q = sfp->parentdp->device->request_queue; - if (sfp->parentdp->device->type == TYPE_SCANNER) - return 0; - return blk_verify_command(q->cmd_filter, cmd, filp->f_mode & FMODE_WRITE); } -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/4] SG_IO filtering via sysfs and minimal whitelist
On Wed, 12 Sep 2012 09:05:41 +0100, James Bottomley wrote: > This is why the whole filter thing was mutable via sysfs. That way the > admin could set this up per device. It sounds like this is what you > want to fix, rather than opening up more holes in an already leaky > security apparatus. The ideal is that we would be much more restrictive > by default and give root the ability to override this both globally and > per-device to conform to whatever policy it has for the virtual > environments. > > The patch which removed all of the sysfs pieces was: > > commit 018e0446890661504783f92388ecce7138c1566d > Author: Jens Axboe > Date: Fri Jun 26 16:27:10 2009 +0200 > > block: get rid of queue-private command filter > > So that's probably the place to start for putting it back properly. [https://lkml.org/lkml/2012/9/12/76] We've been running in circles for nine months now. Let's restart from the maintainer's suggestion, which was probably dismissed too quickly. This is still not a complete solution, because /dev/sgN does not have access to its queue object. Still, it can be a base for discussion. If accepted (in a complete form with access to the queue object for non-block devices), this series removes the need to fix the opcode conflicts as far as I'm concerned. We could just consider that a feature of CONFIG_BLK_DEV_SG_FILTER_MMC. Previously posted at https://lkml.org/lkml/2012/9/25/397 (short thread). Rebased just fine, this one is compile-tested only. Paolo Paolo Bonzini (4): block: add back queue-private command filter scsi: create an all-zero filter for scanners block: add back command filter modification via sysfs scsi: lock out SG_IO by default to unprivileged users Documentation/block/queue-sysfs.txt | 16 + block/Kconfig | 22 ++ block/blk-sysfs.c | 43 block/bsg.c | 2 +- block/scsi_ioctl.c | 131 +++- drivers/scsi/scsi_scan.c| 8 ++- drivers/scsi/sg.c | 7 +- include/linux/blkdev.h | 31 - 8 files changed, 238 insertions(+), 22 deletions(-) -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 4/4] scsi: lock out SG_IO by default to unprivileged users
Move policy on SG_IO access to userspace. Some level of checking at the kernel level (compared to just having an opt-out for the default whitelist) is needed because any userspace whitelist is too easily circumvented by doing ptrace on the program that has a file descriptor open. This would be a regression for those who are using Unix permissions, security modules or the device cgroup to confine programs. A meaningful whitelist can then be set by udev, for example. Signed-off-by: Paolo Bonzini --- This is not yet usable, because sg devices do not have a link to the queue object. block/Kconfig | 12 block/scsi_ioctl.c | 10 ++ 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/block/Kconfig b/block/Kconfig index e89d6a2..67d1f68 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -75,6 +75,18 @@ config BLK_DEV_SG_FILTER_SYSFS If unsure, say Y. +config BLK_DEV_SG_FILTER_MMC + bool "Backwards-compatible default SG_IO filter" + default y + help + Saying Y here will let you send several commands (mostly based + on the MMC command set) by default to SCSI devices via /dev/sg, + /dev/bsg or ioctls such as SG_IO) even for unprivileged users. + Saying N will restrict the set of commands that unprivileged + users can send to the bare minimum. + + If unsure, say Y. + config BLK_DEV_BSGLIB bool "Block layer SG support v4 helper lib" default n diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 5db39b5..079dfa2 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -113,9 +113,12 @@ static int sg_emulated_host(struct request_queue *q, int __user *p) static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) { memset(filter, 0, sizeof(*filter)); - - /* Basic read-only commands */ + __set_bit(INQUIRY, filter->read_ok); __set_bit(TEST_UNIT_READY, filter->read_ok); + __set_bit(REPORT_LUNS, filter->read_ok); + +#ifdef BLK_DEV_SG_FILTER_MMC + /* Basic read-only commands */ __set_bit(REQUEST_SENSE, filter->read_ok); __set_bit(READ_6, filter->read_ok); __set_bit(READ_10, filter->read_ok); @@ -125,14 +128,12 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) __set_bit(READ_DEFECT_DATA, filter->read_ok); __set_bit(READ_CAPACITY, filter->read_ok); __set_bit(READ_LONG, filter->read_ok); - __set_bit(INQUIRY, filter->read_ok); __set_bit(MODE_SENSE, filter->read_ok); __set_bit(MODE_SENSE_10, filter->read_ok); __set_bit(LOG_SENSE, filter->read_ok); __set_bit(START_STOP, filter->read_ok); __set_bit(GPCMD_VERIFY_10, filter->read_ok); __set_bit(VERIFY_16, filter->read_ok); - __set_bit(REPORT_LUNS, filter->read_ok); __set_bit(SERVICE_ACTION_IN, filter->read_ok); __set_bit(RECEIVE_DIAGNOSTIC, filter->read_ok); __set_bit(MAINTENANCE_IN, filter->read_ok); @@ -193,6 +194,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter) __set_bit(GPCMD_LOAD_UNLOAD, filter->write_ok); __set_bit(GPCMD_SET_STREAMING, filter->write_ok); __set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok); +#endif } int blk_verify_command(struct blk_cmd_filter *filter, -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/9] virtio_scsi: Enable new EH timeout handler
Il 10/06/2013 03:40, Hannes Reinecke ha scritto: > Signed-off-by: Hannes Reinecke > --- > drivers/scsi/virtio_scsi.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 2168258..1efd219 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -654,6 +654,11 @@ static int virtscsi_abort(struct scsi_cmnd *sc) > return virtscsi_tmf(vscsi, cmd); > } > > +static enum blk_eh_timer_return virtscsi_timedout(struct scsi_cmnd *scmd) > +{ > + return scsi_abort_command(scmd); > +} > + > static int virtscsi_target_alloc(struct scsi_target *starget) > { > struct virtio_scsi_target_state *tgt = > @@ -683,6 +688,7 @@ static struct scsi_host_template > virtscsi_host_template_single = { > .queuecommand = virtscsi_queuecommand_single, > .eh_abort_handler = virtscsi_abort, > .eh_device_reset_handler = virtscsi_device_reset, > + .eh_timed_out = virtscsi_timedout, > > .can_queue = 1024, > .dma_boundary = UINT_MAX, > @@ -699,6 +705,7 @@ static struct scsi_host_template > virtscsi_host_template_multi = { > .queuecommand = virtscsi_queuecommand_multi, > .eh_abort_handler = virtscsi_abort, > .eh_device_reset_handler = virtscsi_device_reset, > + .eh_timed_out = virtscsi_timedout, > > .can_queue = 1024, > .dma_boundary = UINT_MAX, > Acked-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/9] virtio-scsi: Implement TMF timeout
Il 10/06/2013 03:40, Hannes Reinecke ha scritto: > Any TMF might be take longer as expected, or not return at all. > So we need to use 'wait_for_completion_timeout' when sending > a TMF to protect against these cases. > > Cc: Paolo Bonzini > Signed-off-by: Hannes Reinecke > --- > drivers/scsi/virtio_scsi.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 1efd219..abfc684 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -30,6 +30,7 @@ > #define VIRTIO_SCSI_MEMPOOL_SZ 64 > #define VIRTIO_SCSI_EVENT_LEN 8 > #define VIRTIO_SCSI_VQ_BASE 2 > +#define VIRTIO_SCSI_TMF_TIMEOUT 30 > > /* Command queue element */ > struct virtio_scsi_cmd { > @@ -597,8 +598,10 @@ static int virtscsi_tmf(struct virtio_scsi *vscsi, > struct virtio_scsi_cmd *cmd) > GFP_NOIO) < 0) > goto out; > > - wait_for_completion(&comp); > - if (cmd->resp.tmf.response == VIRTIO_SCSI_S_OK || > + if (wait_for_completion_timeout(&comp, > + VIRTIO_SCSI_TMF_TIMEOUT * HZ) == 0) > + ret = FAILED; > + else if (cmd->resp.tmf.response == VIRTIO_SCSI_S_OK || > cmd->resp.tmf.response == VIRTIO_SCSI_S_FUNCTION_SUCCEEDED) > ret = SUCCESS; > > Acked-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/4] SG_IO filtering via sysfs and minimal whitelist
Il 27/05/2013 15:50, Paolo Bonzini ha scritto: > > We've been running in circles for nine months now. Let's restart from > the maintainer's suggestion, which was probably dismissed too quickly. > > This is still not a complete solution, because /dev/sgN does not have > access to its queue object. Still, it can be a base for discussion. > > If accepted (in a complete form with access to the queue object for > non-block devices), this series removes the need to fix the opcode > conflicts as far as I'm concerned. We could just consider that a > feature of CONFIG_BLK_DEV_SG_FILTER_MMC. > > Previously posted at https://lkml.org/lkml/2012/9/25/397 (short thread). > Rebased just fine, this one is compile-tested only. > > Paolo RFC didn't get many Cs. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NVMe: Add nvme-scsi.c (was Re: [PULL REQUEST] NVMe driver updates)
Il 09/05/2013 22:20, Matthew Wilcox ha scritto: > NVMe: Add nvme-scsi.c I couldn't find the original patch on LKML, so I'll just quote the relevant piece of code. > +int nvme_sg_io(struct nvme_ns *ns, struct sg_io_hdr __user *u_hdr) > +{ > + struct sg_io_hdr hdr; > + int retcode; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EACCES; This should be EPERM, and also it should use the same checks as other implementations of SG_IO, including: - checking if it's operating on a partition, and requiring CAP_SYS_RAWIO if so; - allowing a limited number of commands even for !capable(CAP_SYS_RAWIO). All this is done by scsi_cmd_blk_ioctl. Paolo > + if (copy_from_user(&hdr, u_hdr, sizeof(hdr))) > + return -EFAULT; > + if (hdr.interface_id != 'S') > + return -EINVAL; > + if (hdr.cmd_len > BLK_MAX_CDB) > + return -EINVAL; > + > + retcode = nvme_scsi_translate(ns, &hdr); > + if (retcode < 0) > + return retcode; > + if (retcode > 0) > + retcode = SNTI_TRANSLATION_SUCCESS; > + if (copy_to_user(__user u_hdr, &hdr, sizeof(sg_io_hdr_t)) > 0) > + return -EFAULT; > + > + return retcode; > +} -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/4] SG_IO filtering via sysfs and minimal whitelist
Il 25/06/2013 23:19, Paolo Bonzini ha scritto: > Il 27/05/2013 15:50, Paolo Bonzini ha scritto: >> >> We've been running in circles for nine months now. Let's restart from >> the maintainer's suggestion, which was probably dismissed too quickly. >> >> This is still not a complete solution, because /dev/sgN does not have >> access to its queue object. Still, it can be a base for discussion. >> >> If accepted (in a complete form with access to the queue object for >> non-block devices), this series removes the need to fix the opcode >> conflicts as far as I'm concerned. We could just consider that a >> feature of CONFIG_BLK_DEV_SG_FILTER_MMC. >> >> Previously posted at https://lkml.org/lkml/2012/9/25/397 (short thread). >> Rebased just fine, this one is compile-tested only. >> >> Paolo > > RFC didn't get many Cs. Ping^2. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] virtio-scsi: Fix virtqueue affinity setup
> vscsi->num_queues counts the number of request virtqueue which does not > include the control and event virtqueue. It is wrong to subtract > VIRTIO_SCSI_VQ_BASE from vscsi->num_queues. Reviewed-by: Paolo Bonzini > This patch fixes the following panic. > > (qemu) device_del scsi0 > > BUG: unable to handle kernel NULL pointer dereference at 0020 > IP: [] __virtscsi_set_affinity+0x6f/0x120 > PGD 0 > Oops: [#1] SMP > Modules linked in: > CPU: 0 PID: 659 Comm: kworker/0:1 Not tainted 3.11.0-rc2+ #1172 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > Workqueue: kacpi_hotplug _handle_hotplug_event_func > task: 88007bee1cc0 ti: 88007bfe4000 task.ti: 88007bfe4000 > RIP: 0010:[] [] > __virtscsi_set_affinity+0x6f/0x120 > RSP: 0018:88007bfe5a38 EFLAGS: 00010202 > RAX: 0010 RBX: 880077fd0d28 RCX: 0050 > RDX: RSI: 0246 RDI: > RBP: 88007bfe5a58 R08: 880077f6ff00 R09: 0001 > R10: 8143e673 R11: 0001 R12: 0001 > R13: 880077fd0800 R14: R15: 88007bf489b0 > FS: () GS:88007ea0() knlGS: > CS: 0010 DS: ES: CR0: 8005003b > CR2: 0020 CR3: 79f8b000 CR4: 06f0 > Stack: > 880077fd0d28 880077fd0800 0008 > 88007bfe5a78 8179b37d 88007bccc800 88007bccc800 > 88007bfe5a98 8179b3b6 88007bccc800 880077fd0d28 > Call Trace: > [] virtscsi_set_affinity+0x2d/0x40 > [] virtscsi_remove_vqs+0x26/0x50 > [] virtscsi_remove+0x82/0xa0 > [] virtio_dev_remove+0x22/0x70 > [] __device_release_driver+0x69/0xd0 > [] device_release_driver+0x2d/0x40 > [] bus_remove_device+0x116/0x150 > [] device_del+0x126/0x1e0 > [] device_unregister+0x16/0x30 > [] unregister_virtio_device+0x19/0x30 > [] virtio_pci_remove+0x36/0x80 > [] pci_device_remove+0x37/0x70 > [] __device_release_driver+0x69/0xd0 > [] device_release_driver+0x2d/0x40 > [] bus_remove_device+0x116/0x150 > [] device_del+0x126/0x1e0 > [] pci_stop_bus_device+0x9c/0xb0 > [] pci_stop_and_remove_bus_device+0x16/0x30 > [] acpiphp_disable_slot+0x8e/0x150 > [] hotplug_event_func+0xba/0x1a0 > [] ? acpi_os_release_object+0xe/0x12 > [] _handle_hotplug_event_func+0x31/0x70 > [] process_one_work+0x183/0x500 > [] worker_thread+0x122/0x400 > [] ? manage_workers+0x2d0/0x2d0 > [] kthread+0xce/0xe0 > [] ? kthread_freezable_should_stop+0x70/0x70 > [] ret_from_fork+0x7c/0xb0 > [] ? kthread_freezable_should_stop+0x70/0x70 > Code: 01 00 00 00 74 59 45 31 e4 83 bb c8 01 00 00 02 74 46 66 2e 0f 1f 84 > 00 00 00 00 00 49 63 c4 48 c1 e0 04 48 8b bc 0 > 3 10 02 00 00 <48> 8b 47 20 48 8b 80 d0 01 00 00 48 8b 40 50 48 85 c0 74 07 > be > RIP [] __virtscsi_set_affinity+0x6f/0x120 > RSP > CR2: 0020 > ---[ end trace 99679331a3775f48 ]--- > > CC: sta...@vger.kernel.org > Signed-off-by: Asias He > --- > drivers/scsi/virtio_scsi.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index 2168258..74b88ef 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -751,7 +751,7 @@ static void __virtscsi_set_affinity(struct virtio_scsi > *vscsi, bool affinity) > > vscsi->affinity_hint_set = true; > } else { > - for (i = 0; i < vscsi->num_queues - VIRTIO_SCSI_VQ_BASE; i++) > + for (i = 0; i < vscsi->num_queues; i++) > virtqueue_set_affinity(vscsi->req_vqs[i].vq, -1); > > vscsi->affinity_hint_set = false; > -- > 1.8.3.1 > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Potential out-of-bounds access in drivers/scsi/sd.c
Il 04/09/2013 16:32, Alan Stern ha scritto: > On Wed, 4 Sep 2013, Dmitry Vyukov wrote: > >> Hi, >> >> We are working on a memory error detector AddressSanitizer for Linux >> kernel >> (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel), >> it can detect use-after-free and buffer-overflow errors. > > ... > >> The code in sd_read_cache_type does the following: >> >> while (offset < len) { >> ... >> } >> ... >> if ((buffer[offset] & 0x3f) != modepage) { >> sd_printk(KERN_ERR, sdkp, "Got wrong page\n"); >> goto defaults; >> } >> >> When control leaves the while loop, offset >= len, so buffer[offset] >> reads random garbage out-of-bounds. >> It the worst case it can lead to crash, or if (buffer[offset] & 0x3f) >> happen to be == modepage, then it will read more garbage. >> >> Please help validate and triage this. > > The tool's output is correct. The patch below should fix it. > > Alan Stern > > > > Index: usb-3.11/drivers/scsi/sd.c > === > --- usb-3.11.orig/drivers/scsi/sd.c > +++ usb-3.11/drivers/scsi/sd.c > @@ -2419,7 +2419,7 @@ sd_read_cache_type(struct scsi_disk *sdk > } > } > > - if (modepage == 0x3F) { > + if (modepage == 0x3F || offset + 2 >= len) { > sd_printk(KERN_ERR, sdkp, "No Caching mode page " > "present\n"); > goto defaults; If you do this, the buggy "if" becomes dead code (the loop above doesn't have any "break", so you know that offset >= len and the new condition is always true). So the patch does indeed prevent the bug, but the code can be simplified. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI: Fix potential out-of-bounds access in drivers/scsi/sd.c
Il 06/09/2013 17:49, Alan Stern ha scritto: > This patch fixes an out-of-bounds error in sd_read_cache_type(), found > by Google's AddressSanitizer tool. When the loop ends, we know that > "offset" lies beyond the end of the data in the buffer, so no Caching > mode page was found. In theory it may be present, but the buffer size > is limited to 512 bytes. > > Signed-off-by: Alan Stern > Reported-by: Dmitry Vyukov > CC: Reviewed-by: Paolo Bonzini > > --- > > > [as1709] > > > drivers/scsi/sd.c | 11 +++ > 1 file changed, 3 insertions(+), 8 deletions(-) > > Index: usb-3.11/drivers/scsi/sd.c > === > --- usb-3.11.orig/drivers/scsi/sd.c > +++ usb-3.11/drivers/scsi/sd.c > @@ -2419,14 +2419,9 @@ sd_read_cache_type(struct scsi_disk *sdk > } > } > > - if (modepage == 0x3F) { > - sd_printk(KERN_ERR, sdkp, "No Caching mode page " > - "present\n"); > - goto defaults; > - } else if ((buffer[offset] & 0x3f) != modepage) { > - sd_printk(KERN_ERR, sdkp, "Got wrong page\n"); > - goto defaults; > - } > + sd_printk(KERN_ERR, sdkp, "No Caching mode page found\n"); > + goto defaults; > + > Page_found: > if (modepage == 8) { > sdkp->WCE = ((buffer[offset + 2] & 0x04) != 0); > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI's heuristics for enabling WRITE SAME still need work [was: dm mpath: disable WRITE SAME if it fails]
Il 21/09/2013 00:03, Martin K. Petersen ha scritto: > > The major headache here of course is that WRITE SAME is inherently > destructive. We can't just fire off one during discovery and see if it > works. For WRITE you can issue a command with a transfer length of 0 to > see if things work. But unfortunately for WRITE SAME a transfer length > of zero means "wipe the entire device". Yikes! What about WRITE SAME with an out-of-range lba? Or lba equal to the size of the disk and #blocks equal to zero. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] scsi: Fix max transfer length for 4k disks
On 29/01/2015 22:54, Brian King wrote: > The following patch fixes an issue observed with 4k sector disks > where the max_hw_sectors attribute was getting set too large in > sd_revalidate_disk. Since sdkp->max_xfer_blocks is in units > of SCSI logical blocks and queue_max_hw_sectors is in units of > 512 byte blocks, on a 4k sector disk, every time we went through > sd_revalidate_disk, we were taking the current value of > queue_max_hw_sectors and increasing it by a factor of 8. Fix > this by only shifting sdkp->max_xfer_blocks. > > Cc: stable > Signed-off-by: Brian King > --- > > drivers/scsi/sd.c |6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff -puN drivers/scsi/sd.c~sd_revalidate_4k drivers/scsi/sd.c > --- linux/drivers/scsi/sd.c~sd_revalidate_4k 2015-01-29 14:44:23.316171187 > -0600 > +++ linux-bjking1/drivers/scsi/sd.c 2015-01-29 14:51:05.846126392 -0600 > @@ -2800,9 +2800,11 @@ static int sd_revalidate_disk(struct gen >*/ > sd_set_flush_flag(sdkp); > > - max_xfer = min_not_zero(queue_max_hw_sectors(sdkp->disk->queue), > - sdkp->max_xfer_blocks); > + max_xfer = sdkp->max_xfer_blocks; > max_xfer <<= ilog2(sdp->sector_size) - 9; > + > + max_xfer = min_not_zero(queue_max_hw_sectors(sdkp->disk->queue), > + max_xfer); > blk_queue_max_hw_sectors(sdkp->disk->queue, max_xfer); > set_capacity(disk, sdkp->capacity); > sd_config_write_same(sdkp); > _ > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 29/01/2015 00:00, Christoph Hellwig wrote: > Lock the device embedded in the scsi_device to protect against > concurrent calls to ->remove. > > Signed-off-by: Christoph Hellwig I wonder if this makes this problem: https://lkml.org/lkml/2015/1/5/9 go away. Paolo > --- > drivers/scsi/scsi_scan.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c > index 983aed1..523faee 100644 > --- a/drivers/scsi/scsi_scan.c > +++ b/drivers/scsi/scsi_scan.c > @@ -1593,16 +1593,15 @@ EXPORT_SYMBOL(scsi_add_device); > > void scsi_rescan_device(struct device *dev) > { > - if (!dev->driver) > - return; > - > - if (try_module_get(dev->driver->owner)) { > + device_lock(dev); > + if (dev->driver && try_module_get(dev->driver->owner)) { > struct scsi_driver *drv = to_scsi_driver(dev->driver); > > if (drv->rescan) > drv->rescan(dev); > module_put(dev->driver->owner); > } > + device_unlock(dev); > } > EXPORT_SYMBOL(scsi_rescan_device); > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] lib/iovec: Add memcpy_fromiovec_out library function
On 30/01/2015 09:12, Nicholas A. Bellinger wrote: > From: Nicholas Bellinger > > This patch adds a new memcpy_fromiovec_out() library function which modifies > the passed *iov following memcpy_fromiovec(), but also returns the next > current > iovec pointer via **iov_out. > > This is useful for vhost ANY_LAYOUT support when guests are allowed to > generate > incoming virtio request headers combined with subsequent SGL payloads into a > single iovec. > > Cc: Michael S. Tsirkin > Cc: Paolo Bonzini > Signed-off-by: Nicholas Bellinger > --- > include/linux/uio.h | 2 ++ > lib/iovec.c | 27 +++ > 2 files changed, 29 insertions(+) > > diff --git a/include/linux/uio.h b/include/linux/uio.h > index 1c5e453..3e4473d 100644 > --- a/include/linux/uio.h > +++ b/include/linux/uio.h > @@ -136,6 +136,8 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, > __wsum *csum, struct iov_ > size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, > struct iov_iter *i); > > int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len); > +int memcpy_fromiovec_out(unsigned char *kdata, struct iovec *iov, > + struct iovec **iov_out, int len); > int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov, > int offset, int len); > int memcpy_toiovecend(const struct iovec *v, unsigned char *kdata, > diff --git a/lib/iovec.c b/lib/iovec.c > index 2d99cb4..6a813dd 100644 > --- a/lib/iovec.c > +++ b/lib/iovec.c > @@ -28,6 +28,33 @@ int memcpy_fromiovec(unsigned char *kdata, struct iovec > *iov, int len) > EXPORT_SYMBOL(memcpy_fromiovec); > > /* > + * Copy iovec to kernel, saving the current iov to *iov_out. > + * Returns -EFAULT on error. Perhaps document that iov is modified, zeroing everything in [iov, *iov_out) and possibly removing the front of *iov_out? Paolo > + */ > + > +int memcpy_fromiovec_out(unsigned char *kdata, struct iovec *iov, > + struct iovec **iov_out, int len) > +{ > + while (len > 0) { > + if (iov->iov_len) { > + int copy = min_t(unsigned int, len, iov->iov_len); > + if (copy_from_user(kdata, iov->iov_base, copy)) > + return -EFAULT; > + len -= copy; > + kdata += copy; > + iov->iov_base += copy; > + iov->iov_len -= copy; > + } > + if (!iov->iov_len) > + iov++; > + } > + *iov_out = iov; > + > + return 0; > +} > +EXPORT_SYMBOL(memcpy_fromiovec_out); > + > +/* > * Copy kernel to iovec. Returns -EFAULT on error. > */ > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 30/01/2015 02:08, Fam Zheng wrote: > On Fri, 01/30 00:11, Paolo Bonzini wrote: >> >> >> On 29/01/2015 00:00, Christoph Hellwig wrote: >>> Lock the device embedded in the scsi_device to protect against >>> concurrent calls to ->remove. >>> >>> Signed-off-by: Christoph Hellwig >> >> I wonder if this makes this problem: https://lkml.org/lkml/2015/1/5/9 go >> away. > > A quick test says yes. Great, we might want to revert that patch in 3.21. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 02/02/2015 13:59, Christoph Hellwig wrote: > On Fri, Jan 30, 2015 at 10:46:17AM +0100, Paolo Bonzini wrote: >> > Great, we might want to revert that patch in 3.21. > Is that fix in any tree yet? Seems like I missed it for the scsi > tree at least. So unless you want it for 3.19/stable we might as well > ust skip that patch. Yes, I agree. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] scsi: proper state checking and module refcount handling in scsi_device_get
On 02/02/2015 14:01, Christoph Hellwig wrote: > This effectively reverts commits 85b6c7 ("[SCSI] sd: fix cache flushing on > module removal (and individual device removal)" and dc4515ea ("scsi: always > increment reference count"). > > We now never call scsi_device_get from the shutdown path, and the fact > that we started grabbing reference there in commit 85b6c7 turned out > turned out to create more problems than it solves, and required > workarounds for workarounds for workarounds. Move back to properly checking > the device state and carefully handle module refcounting. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/scsi.c | 20 +--- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c > index 9b38299..9b7fd0b 100644 > --- a/drivers/scsi/scsi.c > +++ b/drivers/scsi/scsi.c > @@ -979,18 +979,24 @@ EXPORT_SYMBOL(scsi_report_opcode); > * Description: Gets a reference to the scsi_device and increments the use > count > * of the underlying LLDD module. You must hold host_lock of the > * parent Scsi_Host or already have a reference when calling this. > + * > + * This will fail if a device is deleted or cancelled, or when the LLD module > + * is in the process of being unloaded. > */ > int scsi_device_get(struct scsi_device *sdev) > { > - if (sdev->sdev_state == SDEV_DEL) > - return -ENXIO; > + if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL) > + goto fail; > if (!get_device(&sdev->sdev_gendev)) > - return -ENXIO; > - /* We can fail try_module_get if we're doing SCSI operations > - * from module exit (like cache flush) */ > - __module_get(sdev->host->hostt->module); > - > + goto fail; > + if (!try_module_get(sdev->host->hostt->module)) > + goto fail_put_device; > return 0; > + > +fail_put_device: > + put_device(&sdev->sdev_gendev); > +fail: > + return -ENXIO; > } > EXPORT_SYMBOL(scsi_device_get); > > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 02/02/2015 14:01, Christoph Hellwig wrote: > Lock the device embedded in the scsi_device to protect against > concurrent calls to ->remove. > > Signed-off-by: Christoph Hellwig > Acked-by: Alan Stern > --- > drivers/scsi/scsi_scan.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c > index 983aed1..523faee 100644 > --- a/drivers/scsi/scsi_scan.c > +++ b/drivers/scsi/scsi_scan.c > @@ -1593,16 +1593,15 @@ EXPORT_SYMBOL(scsi_add_device); > > void scsi_rescan_device(struct device *dev) > { > - if (!dev->driver) > - return; > - > - if (try_module_get(dev->driver->owner)) { > + device_lock(dev); > + if (dev->driver && try_module_get(dev->driver->owner)) { > struct scsi_driver *drv = to_scsi_driver(dev->driver); > > if (drv->rescan) > drv->rescan(dev); > module_put(dev->driver->owner); > } > + device_unlock(dev); > } > EXPORT_SYMBOL(scsi_rescan_device); > > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 05/03/2015 14:33, Christoph Hellwig wrote: > Any chance to get reviews for this series? Also we should at least > expedite this first patch into 4.0-rc as it fixes scanning races > in virtio_scsi. I reviewed 1 and 3, but I'm not really qualified for patch 2. Paolo > On Mon, Feb 02, 2015 at 02:01:24PM +0100, Christoph Hellwig wrote: >> Lock the device embedded in the scsi_device to protect against >> concurrent calls to ->remove. >> >> Signed-off-by: Christoph Hellwig >> Acked-by: Alan Stern >> --- >> drivers/scsi/scsi_scan.c | 7 +++ >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c >> index 983aed1..523faee 100644 >> --- a/drivers/scsi/scsi_scan.c >> +++ b/drivers/scsi/scsi_scan.c >> @@ -1593,16 +1593,15 @@ EXPORT_SYMBOL(scsi_add_device); >> >> void scsi_rescan_device(struct device *dev) >> { >> -if (!dev->driver) >> -return; >> - >> -if (try_module_get(dev->driver->owner)) { >> +device_lock(dev); >> +if (dev->driver && try_module_get(dev->driver->owner)) { >> struct scsi_driver *drv = to_scsi_driver(dev->driver); >> >> if (drv->rescan) >> drv->rescan(dev); >> module_put(dev->driver->owner); >> } >> +device_unlock(dev); >> } >> EXPORT_SYMBOL(scsi_rescan_device); >> >> -- >> 1.9.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > ---end quoted text--- > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] scsi: serialize ->rescan against ->remove
On 05/03/2015 14:37, Paolo Bonzini wrote: > > > On 05/03/2015 14:33, Christoph Hellwig wrote: >> Any chance to get reviews for this series? Also we should at least >> expedite this first patch into 4.0-rc as it fixes scanning races >> in virtio_scsi. > > I reviewed 1 and 3, but I'm not really qualified for patch 2. Christoph, any news about these patches? Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix return code for ioctl( BLKRRPART ) if device is down
On 24/03/2015 11:16, Fam Zheng wrote: > If issued right after link down, "blockdev --rereadpt" will be stuck for a > while and then return normally. Although the underlying capacity and partition > table are not correctly updated. And it means that userspace can't detect the > error at all. > > Fix this by propargating the error of "read capacity" command through the > stack, so that the ioctl could fail with -EIO. > > Fam Zheng (3): > block: Return error in rescan_partitions if revalidating disk failed > sd: Return error in sd_revalidate_disk if read capacity failed > sd: Return -EIO if read capacity failed > > block/partition-generic.c | 6 +++--- > drivers/scsi/sd.c | 22 +- > 2 files changed, 16 insertions(+), 12 deletions(-) > Reviewed-by: Paolo Bonzini Though patch 3 could be seen as a change in userspace ABI, so I'm less sure about it. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_ram: a RAM-based SCSI driver
Il 05/12/2012 17:45, Kirill A. Shutemov ha scritto: > From: "Kirill A. Shutemov" > > This driver is intended to run as fast as possible, hence the options to > discard writes and reads. It's designed to let us find latency issues > elsewhere in the storage stack (eg filesystem, block layer, scsi layer). > > There are a few different options, controlled through module parameters. > The sector size and disc capacity are load-time parameters, but the > parameters affecting performance are tweakable at runtime. > > By default, it'll allocate half a gigabyte of RAM to use as a ramdisc; > you can change this with the `capacity' module parameter. Is this that much faster than scsi-debug? The discarding options surely can be added there. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_ram: a RAM-based SCSI driver
Il 07/12/2012 12:20, Kirill A. Shutemov ha scritto: >> > Is this that much faster than scsi-debug? The discarding options surely >> > can be added there. > scsi_ram is about 9% faster (without fake_rw/throw_away_*) on my machine: There are two main differences in the data path: - scsi_debug uses locking. Shouldn't be the culprit because otherwise read and write would have different speed, but it's easy to add it to scsi_ram and check. - scsi_debug uses scsi_sg_copy_to_buffer, i.e. sg_miter_start/next/stop in kernel/scatterlist.c. Again, it should be easy to port the scsi_ram code to scsi_debug or vice versa and verify if this is the culprit. Having a completely different driver seems useless to me... Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/2] add per-device sysfs knob to enable unrestricted, unprivileged SG_IO
Il 13/11/2012 18:25, Paolo Bonzini ha scritto: > Privilege restrictions for SG_IO right now apply without distinction to > all devices, based on the single capability CAP_SYS_RAWIO. This is a very > broad capability, and makes it difficult to give SG_IO access to trusted > clients that need access to persistent reservations, trim/discard, or > vendor-specific commands. One problem here is that CAP_SYS_RAWIO allows > to escape a partition and issue commands that affect the full disk, > thus making DAC almost useless. > > For simplicity, this series attempts to solve one case only: you want > to pass through almost everything, but still run as confined as possible. > This is for example the case for virtualization, where more complex > filtering can be done just as easily in userspace, in the virtual > machine monitor. (This does mean the filter can be subverted if the > guest can escape the QEMU jail, but a more generic approach involving > a bitmap was NACKed). > > Ok for 3.8? Ping... Jens, I haven't seen any pull request for 3.8 from you. Are these patches on your radar? Tejun acked both of them. Paolo > v2->v3: change bitmap filter to boolean > > Paolo Bonzini (2): > sg_io: pass request_queue to blk_verify_command > sg_io: introduce unpriv_sgio queue flag > > block/blk-sysfs.c | 32 > block/bsg.c|2 +- > block/scsi_ioctl.c |9 + > drivers/scsi/sg.c |3 ++- > include/linux/blkdev.h |6 +- > 5 files changed, 45 insertions(+), 7 deletions(-) > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/5] virtio-scsi: introduce multiqueue support
This patch adds queue steering to virtio-scsi. When a target is sent multiple requests, we always drive them to the same queue so that FIFO processing order is kept. However, if a target was idle, we can choose a queue arbitrarily. In this case the queue is chosen according to the current VCPU, so the driver expects the number of request queues to be equal to the number of VCPUs. This makes it easy and fast to select the queue, and also lets the driver optimize the IRQ affinity for the virtqueues (each virtqueue's affinity is set to the CPU that "owns" the queue). The speedup comes from improving cache locality and giving CPU affinity to the virtqueues, which is why this scheme was selected. Assuming that the thread that is sending requests to the device is I/O-bound, it is likely to be sleeping at the time the ISR is executed, and thus executing the ISR on the same processor that sent the requests is cheap. However, the kernel will not execute the ISR on the "best" processor unless you explicitly set the affinity. This is because in practice you will have many such I/O-bound processes and thus many otherwise idle processors. Then the kernel will execute the ISR on a random processor, rather than the one that is sending requests to the device. The alternative to per-CPU virtqueues is per-target virtqueues. To achieve the same locality, we could dynamically choose the virtqueue's affinity based on the CPU of the last task that sent a request. This is less appealing because we do not set the affinity directly---we only provide a hint to the irqbalanced running in userspace. Dynamically changing the affinity only works if the userspace applies the hint fast enough. Signed-off-by: Paolo Bonzini --- v1->v2: improved comments and commit messages, added memory barriers drivers/scsi/virtio_scsi.c | 234 +-- 1 files changed, 201 insertions(+), 33 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 4f6c6a3..ca9d29d 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -26,6 +26,7 @@ #define VIRTIO_SCSI_MEMPOOL_SZ 64 #define VIRTIO_SCSI_EVENT_LEN 8 +#define VIRTIO_SCSI_VQ_BASE 2 /* Command queue element */ struct virtio_scsi_cmd { @@ -57,24 +58,57 @@ struct virtio_scsi_vq { struct virtqueue *vq; }; -/* Per-target queue state */ +/* + * Per-target queue state. + * + * This struct holds the data needed by the queue steering policy. When a + * target is sent multiple requests, we need to drive them to the same queue so + * that FIFO processing order is kept. However, if a target was idle, we can + * choose a queue arbitrarily. In this case the queue is chosen according to + * the current VCPU, so the driver expects the number of request queues to be + * equal to the number of VCPUs. This makes it easy and fast to select the + * queue, and also lets the driver optimize the IRQ affinity for the virtqueues + * (each virtqueue's affinity is set to the CPU that "owns" the queue). + * + * An interesting effect of this policy is that only writes to req_vq need to + * take the tgt_lock. Read can be done outside the lock because: + * + * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1. + * In that case, no other CPU is reading req_vq: even if they were in + * virtscsi_queuecommand_multi, they would be spinning on tgt_lock. + * + * - reads of req_vq only occur when the target is not idle (reqs != 0). + * A CPU that enters virtscsi_queuecommand_multi will not modify req_vq. + * + * Similarly, decrements of reqs are never concurrent with writes of req_vq. + * Thus they can happen outside the tgt_lock, provided of course we make reqs + * an atomic_t. + */ struct virtio_scsi_target_state { - /* Never held at the same time as vq_lock. */ + /* This spinlock ever held at the same time as vq_lock. */ spinlock_t tgt_lock; + + /* Count of outstanding requests. */ + atomic_t reqs; + + /* Currently active virtqueue for requests sent to this target. */ + struct virtio_scsi_vq *req_vq; }; /* Driver instance state */ struct virtio_scsi { struct virtio_device *vdev; - struct virtio_scsi_vq ctrl_vq; - struct virtio_scsi_vq event_vq; - struct virtio_scsi_vq req_vq; - /* Get some buffers ready for event vq */ struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN]; struct virtio_scsi_target_state *tgt; + + u32 num_queues; + + struct virtio_scsi_vq ctrl_vq; + struct virtio_scsi_vq event_vq; + struct virtio_scsi_vq req_vqs[]; }; static struct kmem_cache *virtscsi_cmd_cache; @@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf) struct virtio_scsi_cmd *cmd = buf; struct scsi_cmnd *sc = cmd->sc; struct v
[PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
Using the new virtio_scsi_add_sg function lets us simplify the queueing path. In particular, all data protected by the tgt_lock is just gone (multiqueue will find a new use for the lock). The speedup is relatively small (2-4%) but it is worthwhile because of the code simplification---both in this patches and in the next ones. Signed-off-by: Wanlong Gao Signed-off-by: Paolo Bonzini --- v1->v2: new drivers/scsi/virtio_scsi.c | 94 +++ 1 files changed, 42 insertions(+), 52 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 74ab67a..2b93b6e 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -59,11 +59,8 @@ struct virtio_scsi_vq { /* Per-target queue state */ struct virtio_scsi_target_state { - /* Protects sg. Lock hierarchy is tgt_lock -> vq_lock. */ + /* Never held at the same time as vq_lock. */ spinlock_t tgt_lock; - - /* For sglist construction when adding commands to the virtqueue. */ - struct scatterlist sg[]; }; /* Driver instance state */ @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq) spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); }; -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx, -struct scsi_data_buffer *sdb) -{ - struct sg_table *table = &sdb->table; - struct scatterlist *sg_elem; - unsigned int idx = *p_idx; - int i; - - for_each_sg(table->sgl, sg_elem, table->nents, i) - sg[idx++] = *sg_elem; - - *p_idx = idx; -} - /** - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue * @vscsi : virtio_scsi state * @cmd: command structure - * @out_num: number of read-only elements - * @in_num : number of write-only elements * @req_size : size of the request buffer * @resp_size : size of the response buffer - * - * Called with tgt_lock held. + * @gfp: flags to use for memory allocations */ -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt, -struct virtio_scsi_cmd *cmd, -unsigned *out_num, unsigned *in_num, -size_t req_size, size_t resp_size) +static int virtscsi_add_cmd(struct virtqueue *vq, + struct virtio_scsi_cmd *cmd, + size_t req_size, size_t resp_size, gfp_t gfp) { struct scsi_cmnd *sc = cmd->sc; - struct scatterlist *sg = tgt->sg; - unsigned int idx = 0; + struct scatterlist sg; + unsigned int count, count_sg; + struct sg_table *out, *in; + struct virtqueue_buf buf; + int ret; + + out = in = NULL; + + if (sc && sc->sc_data_direction != DMA_NONE) { + if (sc->sc_data_direction != DMA_FROM_DEVICE) + out = &scsi_out(sc)->table; + if (sc->sc_data_direction != DMA_TO_DEVICE) + in = &scsi_in(sc)->table; + } + + count_sg = 2 + (out ? 1 : 0) + (in ? 1 : 0); + count= 2 + (out ? out->nents : 0) + (in ? in->nents : 0); + ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp); + if (ret < 0) + return ret; /* Request header. */ - sg_set_buf(&sg[idx++], &cmd->req, req_size); + sg_init_one(&sg, &cmd->req, req_size); + virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE); /* Data-out buffer. */ - if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) - virtscsi_map_sgl(sg, &idx, scsi_out(sc)); - - *out_num = idx; + if (out) + virtqueue_add_sg(&buf, out->sgl, out->nents, DMA_TO_DEVICE); /* Response header. */ - sg_set_buf(&sg[idx++], &cmd->resp, resp_size); + sg_init_one(&sg, &cmd->resp, resp_size); + virtqueue_add_sg(&buf, &sg, 1, DMA_FROM_DEVICE); /* Data-in buffer */ - if (sc && sc->sc_data_direction != DMA_TO_DEVICE) - virtscsi_map_sgl(sg, &idx, scsi_in(sc)); + if (in) + virtqueue_add_sg(&buf, in->sgl, in->nents, DMA_FROM_DEVICE); - *in_num = idx - *out_num; + virtqueue_end_buf(&buf); + return 0; } static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, @@ -409,25 +407,20 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, struct virtio_scsi_cmd *cmd, size_t req_size, size_t resp_size, gfp_t gfp) { - unsigned int out_num, in_num; unsigned long flags; - int err; +
[PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
The virtqueue_add_buf function has two limitations: 1) it requires the caller to provide all the buffers in a single call; 2) it does not support chained scatterlists: the buffers must be provided as an array of struct scatterlist; Because of these limitations, virtio-scsi has to copy each request into a scatterlist internal to the driver. It cannot just use the one that was prepared by the upper SCSI layers. This patch adds a different set of APIs for adding a buffer to a virtqueue. The new API lets you pass the buffers piecewise, wrapping multiple calls to virtqueue_add_sg between virtqueue_start_buf and virtqueue_end_buf. virtio-scsi can then call virtqueue_add_sg 3/4 times: for the request header, for the write buffer (if present), for the response header, and finally for the read buffer (again if present). It saves the copying and the related locking. Note that this API is not needed in virtio-blk, because it does all the work of the upper SCSI layers itself in the blk_map_rq_sg call. Then it simply hands the resulting scatterlist to virtqueue_add_buf. Signed-off-by: Paolo Bonzini --- v1->v2: new drivers/virtio/virtio_ring.c | 205 ++ include/linux/virtio.h | 21 2 files changed, 226 insertions(+), 0 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index ffd7e7d..ccfa97c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -394,6 +394,211 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head) vq->vq.num_free++; } +/** + * virtqueue_start_buf - start building buffer for the other end + * @vq: the struct virtqueue we're talking about. + * @buf: a struct keeping the state of the buffer + * @data: the token identifying the buffer. + * @count: the number of buffers that will be added + * @count_sg: the number of sg lists that will be added + * @gfp: how to do memory allocations (if necessary). + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted), and that a successful call is + * followed by one or more calls to virtqueue_add_sg, and finally a call + * to virtqueue_end_buf. + * + * Returns zero or a negative error (ie. ENOSPC). + */ +int virtqueue_start_buf(struct virtqueue *_vq, + struct virtqueue_buf *buf, + void *data, + unsigned int count, + unsigned int count_sg, + gfp_t gfp) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + struct vring_desc *desc = NULL; + int head; + int ret = -ENOMEM; + + START_USE(vq); + + BUG_ON(data == NULL); + +#ifdef DEBUG + { + ktime_t now = ktime_get(); + + /* No kick or get, with .1 second between? Warn. */ + if (vq->last_add_time_valid) + WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time)) + > 100); + vq->last_add_time = now; + vq->last_add_time_valid = true; + } +#endif + + BUG_ON(count < count_sg); + BUG_ON(count_sg == 0); + + /* If the host supports indirect descriptor tables, and there is +* no space for direct buffers or there are multi-item scatterlists, +* go indirect. +*/ + head = vq->free_head; + if (vq->indirect && (count > count_sg || vq->vq.num_free < count)) { + if (vq->vq.num_free == 0) + goto no_space; + + desc = kmalloc(count * sizeof(struct vring_desc), gfp); + if (!desc) + goto error; + + /* We're about to use a buffer */ + vq->vq.num_free--; + + /* Use a single buffer which doesn't continue */ + vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT; + vq->vring.desc[head].addr = virt_to_phys(desc); + vq->vring.desc[head].len = count * sizeof(struct vring_desc); + + /* Update free pointer */ + vq->free_head = vq->vring.desc[head].next; + } + + /* Set token. */ + vq->data[head] = data; + + pr_debug("Started buffer head %i for %p\n", head, vq); + + buf->vq = _vq; + buf->indirect = desc; + buf->tail = NULL; + buf->head = head; + return 0; + +no_space: + ret = -ENOSPC; +error: + pr_debug("Can't add buf (%d) - count = %i, avail = %i\n", +ret, count, vq->vq.num_free); + END_USE(vq); + return ret; +} +EXPORT_SYMBOL_GPL(virtqueue_start_buf); + +/** + * virtqueue_add_sg - add sglist to buffer + * @buf: the struct that was passed to virtqueue_start_buf + * @sgl: the descript
[PATCH v2 4/5] virtio-scsi: pass struct virtio_scsi to virtqueue completion function
This will be needed soon in order to retrieve the per-target struct. Signed-off-by: Paolo Bonzini --- drivers/scsi/virtio_scsi.c | 17 + 1 files changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 4a3abaf..4f6c6a3 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -104,7 +104,7 @@ static void virtscsi_compute_resid(struct scsi_cmnd *sc, u32 resid) * * Called with vq_lock held. */ -static void virtscsi_complete_cmd(void *buf) +static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_cmd *cmd = buf; struct scsi_cmnd *sc = cmd->sc; @@ -165,7 +165,8 @@ static void virtscsi_complete_cmd(void *buf) sc->scsi_done(sc); } -static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf)) +static void virtscsi_vq_done(struct virtio_scsi *vscsi, struct virtqueue *vq, +void (*fn)(struct virtio_scsi *vscsi, void *buf)) { void *buf; unsigned int len; @@ -173,7 +174,7 @@ static void virtscsi_vq_done(struct virtqueue *vq, void (*fn)(void *buf)) do { virtqueue_disable_cb(vq); while ((buf = virtqueue_get_buf(vq, &len)) != NULL) - fn(buf); + fn(vscsi, buf); } while (!virtqueue_enable_cb(vq)); } @@ -184,11 +185,11 @@ static void virtscsi_req_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->req_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_cmd); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_cmd); spin_unlock_irqrestore(&vscsi->req_vq.vq_lock, flags); }; -static void virtscsi_complete_free(void *buf) +static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_cmd *cmd = buf; @@ -205,7 +206,7 @@ static void virtscsi_ctrl_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->ctrl_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_free); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_free); spin_unlock_irqrestore(&vscsi->ctrl_vq.vq_lock, flags); }; @@ -329,7 +330,7 @@ static void virtscsi_handle_event(struct work_struct *work) virtscsi_kick_event(vscsi, event_node); } -static void virtscsi_complete_event(void *buf) +static void virtscsi_complete_event(struct virtio_scsi *vscsi, void *buf) { struct virtio_scsi_event_node *event_node = buf; @@ -344,7 +345,7 @@ static void virtscsi_event_done(struct virtqueue *vq) unsigned long flags; spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags); - virtscsi_vq_done(vq, virtscsi_complete_event); + virtscsi_vq_done(vscsi, vq, virtscsi_complete_event); spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); }; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/5] virtio-scsi: redo allocation of target data
virtio_scsi_target_state is now empty, but we will find new uses for it in the next few patches. However, dropping the sglist lets us turn the array-of-pointers into a simple array, which simplifies the allocation. However, we do not leave the virtio_scsi_target_state structs in the flexible array member at the end of struct virtio_scsi, because we will place the virtqueues there in the next patches. Signed-off-by: Paolo Bonzini --- v1->v2: new drivers/scsi/virtio_scsi.c | 43 ++- 1 files changed, 14 insertions(+), 29 deletions(-) diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c index 2b93b6e..4a3abaf 100644 --- a/drivers/scsi/virtio_scsi.c +++ b/drivers/scsi/virtio_scsi.c @@ -74,7 +74,7 @@ struct virtio_scsi { /* Get some buffers ready for event vq */ struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN]; - struct virtio_scsi_target_state *tgt[]; + struct virtio_scsi_target_state *tgt; }; static struct kmem_cache *virtscsi_cmd_cache; @@ -426,7 +426,7 @@ static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt, static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) { struct virtio_scsi *vscsi = shost_priv(sh); - struct virtio_scsi_target_state *tgt = vscsi->tgt[sc->device->id]; + struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; struct virtio_scsi_cmd *cmd; int ret; @@ -474,7 +474,7 @@ out: static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd) { DECLARE_COMPLETION_ONSTACK(comp); - struct virtio_scsi_target_state *tgt = vscsi->tgt[cmd->sc->device->id]; + struct virtio_scsi_target_state *tgt = &vscsi->tgt[cmd->sc->device->id]; int ret = FAILED; cmd->comp = ∁ @@ -578,19 +578,9 @@ static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, virtscsi_vq->vq = vq; } -static struct virtio_scsi_target_state *virtscsi_alloc_tgt( - struct virtio_device *vdev, int sg_elems) +static void virtscsi_init_tgt(struct virtio_scsi_target_state *tgt) { - struct virtio_scsi_target_state *tgt; - gfp_t gfp_mask = GFP_KERNEL; - - /* We need extra sg elements at head and tail. */ - tgt = kmalloc(sizeof(*tgt), gfp_mask); - if (!tgt) - return NULL; - spin_lock_init(&tgt->tgt_lock); - return tgt; } static void virtscsi_scan(struct virtio_device *vdev) @@ -604,16 +594,12 @@ static void virtscsi_remove_vqs(struct virtio_device *vdev) { struct Scsi_Host *sh = virtio_scsi_host(vdev); struct virtio_scsi *vscsi = shost_priv(sh); - u32 i, num_targets; /* Stop all the virtqueues. */ vdev->config->reset(vdev); - num_targets = sh->max_id; - for (i = 0; i < num_targets; i++) { - kfree(vscsi->tgt[i]); - vscsi->tgt[i] = NULL; - } + kfree(vscsi->tgt); + vscsi->tgt = NULL; vdev->config->del_vqs(vdev); } @@ -654,13 +640,14 @@ static int virtscsi_init(struct virtio_device *vdev, /* We need to know how many segments before we allocate. */ sg_elems = virtscsi_config_get(vdev, seg_max) ?: 1; - for (i = 0; i < num_targets; i++) { - vscsi->tgt[i] = virtscsi_alloc_tgt(vdev, sg_elems); - if (!vscsi->tgt[i]) { - err = -ENOMEM; - goto out; - } + vscsi->tgt = kmalloc(num_targets * sizeof(vscsi->tgt[0]), GFP_KERNEL); + if (!vscsi->tgt) { + err = -ENOMEM; + goto out; } + for (i = 0; i < num_targets; i++) + virtscsi_init_tgt(&vscsi->tgt[i]); + err = 0; out: @@ -679,9 +666,7 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev) /* Allocate memory and link the structs together. */ num_targets = virtscsi_config_get(vdev, max_target) + 1; - shost = scsi_host_alloc(&virtscsi_host_template, - sizeof(*vscsi) - + num_targets * sizeof(struct virtio_scsi_target_state)); + shost = scsi_host_alloc(&virtscsi_host_template, sizeof(*vscsi)); if (!shost) return -ENOMEM; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
Hi all, this series adds multiqueue support to the virtio-scsi driver, based on Jason Wang's work on virtio-net. It uses a simple queue steering algorithm that expects one queue per CPU. LUNs in the same target always use the same queue (so that commands are not reordered); queue switching occurs when the request being queued is the only one for the target. Also based on Jason's patches, the virtqueue affinity is set so that each CPU is associated to one virtqueue. I tested the patches with fio, using up to 32 virtio-scsi disks backed by tmpfs on the host. These numbers are with 1 LUN per target. FIO configuration - [global] rw=read bsrange=4k-64k ioengine=libaio direct=1 iodepth=4 loops=20 overall bandwidth (MB/s) # of targetssingle-queuemulti-queue, 4 VCPUsmulti-queue, 8 VCPUs 1 540 626 599 2 795 965 925 4 997 13761500 8 1136 21302060 161440 22692474 241408 21792436 321515 19782319 (These numbers for single-queue are with 4 VCPUs, but the impact of adding more VCPUs is very limited). avg bandwidth per LUN (MB/s) # of targetssingle-queuemulti-queue, 4 VCPUsmulti-queue, 8 VCPUs 1 540 626 599 2 397 482 462 4 249 344 375 8 142 266 257 16 90 141 154 24 5890 101 32 4761 72 Patch 1 adds a new API to add functions for piecewise addition for buffers, which enables various simplifications in virtio-scsi (patches 2-3) and a small performance improvement of 2-6%. Patches 4 and 5 add multiqueuing. I'm mostly looking for comments on the new API of patch 1 for inclusion into the 3.9 kernel. Thanks to Wao Ganlong for help rebasing and benchmarking these patches. Paolo Bonzini (5): virtio: add functions for piecewise addition of buffers virtio-scsi: use functions for piecewise composition of buffers virtio-scsi: redo allocation of target data virtio-scsi: pass struct virtio_scsi to virtqueue completion function virtio-scsi: introduce multiqueue support drivers/scsi/virtio_scsi.c | 374 +- drivers/virtio/virtio_ring.c | 205 include/linux/virtio.h | 21 +++ 3 files changed, 485 insertions(+), 115 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/5] virtio-scsi: use functions for piecewise composition of buffers
Il 18/12/2012 14:37, Michael S. Tsirkin ha scritto: > On Tue, Dec 18, 2012 at 01:32:49PM +0100, Paolo Bonzini wrote: >> Using the new virtio_scsi_add_sg function lets us simplify the queueing >> path. In particular, all data protected by the tgt_lock is just gone >> (multiqueue will find a new use for the lock). > > vq access still needs some protection: virtio is not reentrant > by itself. with tgt_lock gone what protects vq against > concurrent add_buf calls? vq_lock. Paolo >> The speedup is relatively small (2-4%) but it is worthwhile because of >> the code simplification---both in this patches and in the next ones. >> >> Signed-off-by: Wanlong Gao >> Signed-off-by: Paolo Bonzini >> --- >> v1->v2: new >> >> drivers/scsi/virtio_scsi.c | 94 >> +++ >> 1 files changed, 42 insertions(+), 52 deletions(-) >> >> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c >> index 74ab67a..2b93b6e 100644 >> --- a/drivers/scsi/virtio_scsi.c >> +++ b/drivers/scsi/virtio_scsi.c >> @@ -59,11 +59,8 @@ struct virtio_scsi_vq { >> >> /* Per-target queue state */ >> struct virtio_scsi_target_state { >> -/* Protects sg. Lock hierarchy is tgt_lock -> vq_lock. */ >> +/* Never held at the same time as vq_lock. */ >> spinlock_t tgt_lock; >> - >> -/* For sglist construction when adding commands to the virtqueue. */ >> -struct scatterlist sg[]; >> }; >> >> /* Driver instance state */ >> @@ -351,57 +348,58 @@ static void virtscsi_event_done(struct virtqueue *vq) >> spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags); >> }; >> >> -static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx, >> - struct scsi_data_buffer *sdb) >> -{ >> -struct sg_table *table = &sdb->table; >> -struct scatterlist *sg_elem; >> -unsigned int idx = *p_idx; >> -int i; >> - >> -for_each_sg(table->sgl, sg_elem, table->nents, i) >> -sg[idx++] = *sg_elem; >> - >> -*p_idx = idx; >> -} >> - >> /** >> - * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist >> + * virtscsi_add_cmd - add a virtio_scsi_cmd to a virtqueue >> * @vscsi : virtio_scsi state >> * @cmd : command structure >> - * @out_num : number of read-only elements >> - * @in_num : number of write-only elements >> * @req_size: size of the request buffer >> * @resp_size : size of the response buffer >> - * >> - * Called with tgt_lock held. >> + * @gfp : flags to use for memory allocations >> */ >> -static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt, >> - struct virtio_scsi_cmd *cmd, >> - unsigned *out_num, unsigned *in_num, >> - size_t req_size, size_t resp_size) >> +static int virtscsi_add_cmd(struct virtqueue *vq, >> +struct virtio_scsi_cmd *cmd, >> +size_t req_size, size_t resp_size, gfp_t gfp) >> { >> struct scsi_cmnd *sc = cmd->sc; >> -struct scatterlist *sg = tgt->sg; >> -unsigned int idx = 0; >> +struct scatterlist sg; >> +unsigned int count, count_sg; >> +struct sg_table *out, *in; >> +struct virtqueue_buf buf; >> +int ret; >> + >> +out = in = NULL; >> + >> +if (sc && sc->sc_data_direction != DMA_NONE) { >> +if (sc->sc_data_direction != DMA_FROM_DEVICE) >> +out = &scsi_out(sc)->table; >> +if (sc->sc_data_direction != DMA_TO_DEVICE) >> +in = &scsi_in(sc)->table; >> +} >> + >> +count_sg = 2 + (out ? 1 : 0) + (in ? 1 : 0); >> +count= 2 + (out ? out->nents : 0) + (in ? in->nents : 0); >> +ret = virtqueue_start_buf(vq, &buf, cmd, count, count_sg, gfp); >> +if (ret < 0) >> +return ret; >> >> /* Request header. */ >> -sg_set_buf(&sg[idx++], &cmd->req, req_size); >> +sg_init_one(&sg, &cmd->req, req_size); >> +virtqueue_add_sg(&buf, &sg, 1, DMA_TO_DEVICE); >> >> /* Data-out buffer. */ >> -if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) >> -virtscsi_map_sgl(sg, &idx, scsi_out(sc)); >> - >&
Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
Il 18/12/2012 14:36, Michael S. Tsirkin ha scritto: > Some comments without arguing about whether the performance > benefit is worth it. > > On Tue, Dec 18, 2012 at 01:32:48PM +0100, Paolo Bonzini wrote: >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h >> index cf8adb1..39d56c4 100644 >> --- a/include/linux/virtio.h >> +++ b/include/linux/virtio.h >> @@ -7,6 +7,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> /** >> @@ -40,6 +41,26 @@ int virtqueue_add_buf(struct virtqueue *vq, >>void *data, >>gfp_t gfp); >> >> +struct virtqueue_buf { >> +struct virtqueue *vq; >> +struct vring_desc *indirect, *tail; > > This is wrong: virtio.h does not include virito_ring.h, > and it shouldn't by design depend on it. > >> +int head; >> +}; >> + > > Can't we track state internally to the virtqueue? > Exposing it seems to buy us nothing since you can't > call add_buf between start and end anyway. I wanted to keep the state for these functions separate from the rest. I don't think it makes much sense to move it to struct virtqueue unless virtqueue_add_buf is converted to use the new API (doesn't make much sense, could even be a tad slower). On the other hand moving it there would eliminate the dependency on virtio_ring.h. Rusty, what do you think? >> +int virtqueue_start_buf(struct virtqueue *_vq, >> +struct virtqueue_buf *buf, >> +void *data, >> +unsigned int count, >> +unsigned int count_sg, >> +gfp_t gfp); >> + >> +void virtqueue_add_sg(struct virtqueue_buf *buf, >> + struct scatterlist sgl[], >> + unsigned int count, >> + enum dma_data_direction dir); >> + > > And idea: in practice virtio scsi seems to always call sg_init_one, no? > So how about we pass in void* or something and avoid using sg and count? > This would make it useful for -net BTW. It also passes the scatterlist from the LLD. It calls sg_init_one for the request/response headers. Paolo >> +void virtqueue_end_buf(struct virtqueue_buf *buf); >> + >> void virtqueue_kick(struct virtqueue *vq); >> >> bool virtqueue_kick_prepare(struct virtqueue *vq); >> -- >> 1.7.1 >> -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support
Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto: >> -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc) >> +static int virtscsi_queuecommand(struct virtio_scsi *vscsi, >> + struct virtio_scsi_target_state *tgt, >> + struct scsi_cmnd *sc) >> { >> -struct virtio_scsi *vscsi = shost_priv(sh); >> -struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; >> struct virtio_scsi_cmd *cmd; >> +struct virtio_scsi_vq *req_vq; >> int ret; >> >> struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev); >> @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, >> struct scsi_cmnd *sc) >> BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE); >> memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len); >> >> -if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd, >> +req_vq = ACCESS_ONCE(tgt->req_vq); > > This ACCESS_ONCE without a barrier looks strange to me. > Can req_vq change? Needs a comment. Barriers are needed to order two things. Here I don't have the second thing to order against, hence no barrier. Accessing req_vq lockless is safe, and there's a comment about it, but you still want ACCESS_ONCE to ensure the compiler doesn't play tricks. It shouldn't be necessary, because the critical section of virtscsi_queuecommand_multi will already include the appropriate compiler barriers, but it is actually clearer this way to me. :) >> +if (virtscsi_kick_cmd(tgt, req_vq, cmd, >>sizeof cmd->req.cmd, sizeof cmd->resp.cmd, >>GFP_ATOMIC) == 0) >> ret = 0; >> @@ -472,6 +545,48 @@ out: >> return ret; >> } >> >> +static int virtscsi_queuecommand_single(struct Scsi_Host *sh, >> +struct scsi_cmnd *sc) >> +{ >> +struct virtio_scsi *vscsi = shost_priv(sh); >> +struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; >> + >> +atomic_inc(&tgt->reqs); > > And here we don't have barrier after atomic? Why? Needs a comment. Because we don't write req_vq, so there's no two writes to order. Barrier against what? >> +return virtscsi_queuecommand(vscsi, tgt, sc); >> +} >> + >> +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh, >> + struct scsi_cmnd *sc) >> +{ >> +struct virtio_scsi *vscsi = shost_priv(sh); >> +struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id]; >> +unsigned long flags; >> +u32 queue_num; >> + >> +/* >> + * Using an atomic_t for tgt->reqs lets the virtqueue handler >> + * decrement it without taking the spinlock. >> + * >> + * We still need a critical section to prevent concurrent submissions >> + * from picking two different req_vqs. >> + */ >> +spin_lock_irqsave(&tgt->tgt_lock, flags); >> +if (atomic_inc_return(&tgt->reqs) == 1) { >> +queue_num = smp_processor_id(); >> +while (unlikely(queue_num >= vscsi->num_queues)) >> +queue_num -= vscsi->num_queues; >> + >> +/* >> + * Write reqs before writing req_vq, matching the >> + * smp_read_barrier_depends() in virtscsi_req_done. >> + */ >> +smp_wmb(); >> +tgt->req_vq = &vscsi->req_vqs[queue_num]; >> +} >> +spin_unlock_irqrestore(&tgt->tgt_lock, flags); >> +return virtscsi_queuecommand(vscsi, tgt, sc); >> +} >> + >> static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd >> *cmd) >> { >> DECLARE_COMPLETION_ONSTACK(comp); >> @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc) >> return virtscsi_tmf(vscsi, cmd); >> } >> >> -static struct scsi_host_template virtscsi_host_template = { >> +static struct scsi_host_template virtscsi_host_template_single = { >> .module = THIS_MODULE, >> .name = "Virtio SCSI HBA", >> .proc_name = "virtio_scsi", >> -.queuecommand = virtscsi_queuecommand, >> .this_id = -1, >> +.queuecommand = virtscsi_queuecommand_single, >> +.eh_abort_handler = virtscsi_abort, >> +.eh_device_reset_handler = virtscsi_device_reset, >> + >> +.can_queue = 1024, >> +.dma_boundary = UINT_MAX, >> +.use_clustering = ENABLE_CLUSTERING, >> +}; >> + >> +static struct scsi_host_template virtscsi_host_template_multi = { >> +.module = THIS_MODULE, >> +.name = "Virtio SCSI HBA", >> +.proc_name = "virtio_scsi", >> +.this_id = -1, >> +.queuecommand = virtscsi_queuecommand_multi, >> .eh_abort_handler = virtscsi_abort, >> .eh_device_reset_handler = virtscsi_device_reset, >> >> @@ -572,16 +701,27 @@ static struct scsi_host_template >> virtscsi_host_template = { >>&__val, sizeof(__val)); \ >> }) >> >> + >> static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq, >> - str
Re: [PATCH v2 1/5] virtio: add functions for piecewise addition of buffers
Il 18/12/2012 14:59, Michael S. Tsirkin ha scritto: >>> Can't we track state internally to the virtqueue? Exposing it >>> seems to buy us nothing since you can't call add_buf between >>> start and end anyway. >> >> I wanted to keep the state for these functions separate from the >> rest. I don't think it makes much sense to move it to struct >> virtqueue unless virtqueue_add_buf is converted to use the new API >> (doesn't make much sense, could even be a tad slower). > > Why would it be slower? virtqueue_add_buf could be slower if it used the new API. That's because of the overhead of writing and reading from struct virtqueue_buf, instead of using variables in registers. >> On the other hand moving it there would eliminate the dependency >> on virtio_ring.h. Rusty, what do you think? >> >>> And idea: in practice virtio scsi seems to always call >>> sg_init_one, no? So how about we pass in void* or something and >>> avoid using sg and count? This would make it useful for -net >>> BTW. >> >> It also passes the scatterlist from the LLD. It calls sg_init_one >> for the request/response headers. > > Try adding a _single variant. You might see unrolling a loop gives > more of a benefit than this whole optimization. Makes sense, I'll try. However, note that I *do* need the infrastructure in this patch because virtio-scsi could never use a hypothetical virtqueue_add_buf_single; requests always have at least 2 buffers for the headers. However I could add virtqueue_add_sg_single and use it for those headers. The I/O buffer can keep using virtqueue_add_sg. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html