[PATCH] KVM: SVM: Pass through the host kernel's IO delay port
From: Paolo Bonzini

KVM's optimization of guest port 80 accesses was removed last May 11 in commit 99f85a. However, this probably has speed penalties. I don't have a machine to test, but the equivalent VMX patch (fdef3ad) reported a speedup of 3-5%, and on the Xen mailing list it was mentioned that passing port 80 through had positive effects on startup speed. We can enable passthrough to the same port the host kernel uses instead.

Signed-off-by: Paolo Bonzini
---
 arch/x86/kernel/io_delay.c | 1 +
 arch/x86/kvm/svm.c         | 5 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/io_delay.c b/arch/x86/kernel/io_delay.c
index a979b5b..a478cb2 100644
--- a/arch/x86/kernel/io_delay.c
+++ b/arch/x86/kernel/io_delay.c
@@ -129,3 +129,4 @@ static int __init io_delay_param(char *s)
 }
 
 early_param("io_delay", io_delay_param);
+EXPORT_SYMBOL_GPL(io_delay_type);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1f8510c..c49f4db 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -26,6 +26,7 @@
 #include
 #include
+#include
 #include
 #include
@@ -411,6 +412,10 @@ static __init int svm_hardware_setup(void)
 	iopm_va = page_address(iopm_pages);
 	memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
+	if (io_delay_type == CONFIG_IO_DELAY_TYPE_0X80)
+		clear_bit(0x80, iopm_va);
+	else if (io_delay_type == CONFIG_IO_DELAY_TYPE_0XED)
+		clear_bit(0xED, iopm_va);
 	iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
 
 	if (boot_cpu_has(X86_FEATURE_NX))
-- 
1.6.0.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
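As a sketch of what the svm.c hunk above does: the SVM I/O permission map (IOPM) is a bitmap with one bit per port; a set bit means "intercept the access", a clear bit means "let the guest access the port directly". The helpers below are illustrative stand-ins for the kernel's bitmap primitives (byte-granularity bit order, not the kernel's long-word `clear_bit`), not real KVM code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the SVM I/O permission map: one bit per port (64K ports).
 * Bit set = intercept, bit clear = pass through to hardware.
 * Illustrative stand-ins for set_bit/clear_bit/test_bit. */
#define IOPM_BYTES (65536 / 8)

static void iopm_init(uint8_t *iopm)
{
    memset(iopm, 0xff, IOPM_BYTES);   /* intercept everything, as KVM does */
}

static void iopm_pass_through(uint8_t *iopm, unsigned port)
{
    iopm[port / 8] &= (uint8_t)~(1u << (port % 8));
}

static int iopm_intercepted(const uint8_t *iopm, unsigned port)
{
    return (iopm[port / 8] >> (port % 8)) & 1;
}
```

After clearing the bit for the host's I/O delay port (0x80 here), only that port bypasses the intercept; all other ports still trap to KVM.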
Re: [PATCH] KVM: SVM: Pass through the host kernel's IO delay port
Avi Kivity wrote:
> On 06/19/2009 01:15 PM, Paolo Bonzini wrote:
>> KVM's optimization of guest port 80 accesses was removed last May 11
>> in commit 99f85a. However, this probably has speed penalties. I don't
>> have a machine to test but the equivalent VMX patch (fdef3ad) reported
>> a speedup of 3-5%, and on the Xen mailing list it was mentioned that
>> passing port 80 through had positive effects on startup speed. We can
>> enable passthrough to the same port the host kernel uses instead.
>
> Since we don't tell the guest to use 0xed, this won't help.

Won't the guest do that automatically through DMI? I think we have four cases:

1) non-buggy chipset: the host uses 0x80 and the guest cannot crash the machine; this is okay.

2) buggy chipset: both host and guest use 0xed thanks to DMI detection, and (since 0x80 is not passthrough) the guest cannot crash the machine, so this is okay.

3) the host is given explicit io_delay=0xed, but the guest isn't. However, 0x80 is not passthrough, so the guest can use port 0x80 safely even on buggy hardware.

4) a buggy chipset is not detected, so io_delay=0x80 and 0x80 is made passthrough. In principle the guest could crash the machine; however, the host will have written to 0x80 at startup, most likely causing a crash way before a guest is started. So this should not be a problem either.

Paolo
Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support
> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
> + * aggregates  one or more iosignalfd_items.  Each item points to exactly one
> + * eventfd, and can be registered to trigger on any write to the group
> + * (wildcard), or to a write of a specific value.  If more than one item is to
> + * be supported, the addr/len ranges must all be identical in the group.  If a
> + * trigger value is to be supported on a particular item, the group range must
> + * be exactly the width of the trigger.
>
>> Some duplicate spaces in the text above, apparently at random places.
>
> -ENOPARSE ;)  Can you elaborate?

I see "aggregates  one". The others are all at end of sentence, so I think that Michael was not talking about those (git grep '\*.*\.  ').

Paolo
Re: [PATCH] KVM: SVM: Pass through the host kernel's IO delay port
Marcelo Tosatti wrote:
> Paolo,
>
> As said in private email you can change the port 80 passthrough to be
> disabled only on those hosts that have broken port 80 (you can find
> the blacklist at arch/x86/kernel/io_delay.c).

Actually I read the code more and, since pv_ops is actually disabling outs to port 80 if the guest is sufficiently new, it may even make more sense for easier maintenance to remove the port 80 passthrough on Intel processors as well...

Paolo
Re: [PATCH] KVM: SVM: Pass through the host kernel's IO delay port
Marcelo Tosatti wrote:
> On Mon, Jun 22, 2009 at 07:57:22PM +0200, Paolo Bonzini wrote:
>> Actually I read the code more and, since pv_ops is actually disabling
>> outs to port 80 if the guest is sufficiently new, it may even make
>> more sense for easier maintenance to remove the port 80 passthrough
>> on Intel processors as well...
>
> But what about Windows guests?

I don't think they are using port 80 to delay ISA access, are they?

Paolo
Re: virtio scsi host draft specification, v3
On 06/12/2011 09:51 AM, Michael S. Tsirkin wrote:
>>> If a device uses more than one queue it is the responsibility of the
>>> device to ensure strict request ordering.
>>
>> Maybe I misunderstand - how can this be the responsibility of the
>> device if the device does not get the information about the original
>> ordering of the requests? For example, if the driver is crazy enough
>> to put all write requests on one queue and all barriers on another
>> one, how is the device supposed to ensure ordering?

I agree here; in fact I misread Hannes's comment as "if a driver uses more than one queue it is the responsibility of the driver to ensure strict request ordering". If you send requests to different queues, you know that those requests are independent. I don't think anything else is feasible in the virtio framework.

Paolo
Re: virtio scsi host draft specification, v3
On 06/14/2011 10:39 AM, Hannes Reinecke wrote:
> If, however, we decide to expose some details about the backend, we
> could be using the values from the backend directly. EG we could be
> forwarding the SCSI target port identifier here (if backed by real
> hardware) or creating our own SAS-type identifier when backed by qemu
> block. Then we could just query the backend via a new command on the
> controlq (eg 'list target ports') and wouldn't have to worry about any
> protocol specific details here.

Besides the controlq command, which I can certainly add, this is actually quite similar to what I had in mind (though my plan likely would not have worked, because I was expecting hierarchical LUNs to be used uniformly). So, "list target ports" would return a set of LUN values to which you can send REPORT LUNS, or something like that? I suppose that if you're using real hardware as the backing storage, the in-kernel target can provide that. For the QEMU backend I'd keep hierarchical LUNs, though of course one could add a FC or SAS bus to QEMU, each implementing its own identifier scheme. If I understand it correctly, it should remain possible to use a single host for both pass-through and emulated targets.

Would you draft the command structure, so I can incorporate it into the spec?

> Of course, when doing so we would lose the ability to freely remap
> LUNs. But then remapping LUNs doesn't gain you much imho. Plus you
> could always use the qemu block backend here if you want to hide the
> details.

And you could always use the QEMU block backend with scsi-generic if you want to remap LUNs, instead of true passthrough via the kernel target.

Paolo
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:03 PM, Christoph Hellwig wrote:
>> I agree here, in fact I misread Hannes's comment as "if a driver
>> uses more than one queue it is responsibility of the driver to
>> ensure strict request ordering". If you send requests to different
>> queues, you know that those requests are independent. I don't think
>> anything else is feasible in the virtio framework.
>
> That doesn't really fit very well with the SAM model. If we want to
> use multiple queues for a single LUN it has to be transparent to the
> SCSI command stream. Then again I don't quite see the use for that
> anyway.

Agreed; I see a use for multiple queues (MSI-X), but not for multiple queues shared by a single LUN.

Paolo
Re: virtio scsi host draft specification, v3
On 06/29/2011 12:31 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 29, 2011 at 12:06:29PM +0200, Paolo Bonzini wrote:
>> On 06/29/2011 12:03 PM, Christoph Hellwig wrote:
>>> That doesn't really fit very well with the SAM model. If we want to
>>> use multiple queues for a single LUN it has to be transparent to the
>>> SCSI command stream. Then again I don't quite see the use for that
>>> anyway.
>>
>> Agreed, I see a use for multiple queues (MSI-X), but not for multiple
>> queues shared by a single LUN.
>
> Then let's make it explicit in the spec?

What, forbid it or say ordering is not guaranteed? The latter is already explicit with the wording suggested in the thread.

Paolo
Re: virtio scsi host draft specification, v3
On 06/29/2011 11:39 AM, Stefan Hajnoczi wrote:
>>> Of course, when doing so we would lose the ability to freely remap
>>> LUNs. But then remapping LUNs doesn't gain you much imho. Plus you
>>> could always use the qemu block backend here if you want to hide
>>> the details.
>>
>> And you could always use the QEMU block backend with scsi-generic if
>> you want to remap LUNs, instead of true passthrough via the kernel
>> target.
>
> IIUC the in-kernel target always does remapping. It passes through
> individual LUNs rather than entire targets and you pick LU Numbers to
> map to the backing storage (which may or may not be a SCSI
> pass-through device). Nicholas Bellinger can confirm whether this is
> correct.

But then I don't understand. If you pick LU numbers both with the in-kernel target and with QEMU, you do not need to use e.g. WWPNs with Fibre Channel, because we are not passing through the details of the transport protocol (one day we might have virtio-fc, but more likely not). So the LUNs you use might as well be represented by hierarchical LUNs. Using NPIV with KVM would be done by mapping the same virtual N_Port ID in the host(s) to the same LU number in the guest. You might already do this now with virtio-blk, in fact.

Put another way: the virtio-scsi device is itself a SCSI target, so yes, there is a single target port identifier in virtio-scsi. But this SCSI target just passes requests down to multiple real targets, and so will let you do ALUA and all that.

Of course if I am dead wrong please correct me.

Paolo
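As background for the hierarchical-LUN discussion: SAM encodes a single-level LUN in the first two bytes of the 8-byte LUN field, using peripheral device addressing (address method 00b) for LUNs up to 255 and flat-space addressing (address method 01b, 14-bit LUN) above that. The sketch below illustrates the format only; it is not code from virtio-scsi or the kernel target:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode a single-level LUN into the 8-byte SAM LUN field.
 * LUNs < 256 use peripheral device addressing (bus 0); larger ones
 * use flat-space addressing. Remaining bytes stay zero. */
static void encode_lun(uint8_t buf[8], uint16_t lun)
{
    memset(buf, 0, 8);
    if (lun < 256) {
        buf[0] = 0x00;                  /* peripheral addressing, bus 0 */
        buf[1] = (uint8_t)lun;
    } else {
        buf[0] = 0x40 | (lun >> 8);     /* flat-space addressing */
        buf[1] = lun & 0xff;
    }
}
```

This is why a guest-visible LU number does not need to carry any transport-specific identifier such as a WWPN: the hierarchical encoding is transport-neutral.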
Re: [PATCH 1/3] iov: Add 'offset' parameter to iov_to_buf()
On 07/01/2011 09:42 AM, Hannes Reinecke wrote:
>  size_t iov_from_buf(struct iovec *iov, unsigned int iovcnt,
> -                    const void *buf, size_t size)
> +                    const void *buf, size_t offset, size_t size)

Wrong commit subject, it seems (the patch touches iov_from_buf, not iov_to_buf). :)

Paolo
Re: [PATCH 2/3] scsi: replace 'tag' with 'hba_private' pointer
On 07/01/2011 09:42 AM, Hannes Reinecke wrote:
> 'tag' is just an abstraction to identify the command from the driver.
> So we should make that explicit by replacing 'tag' with a
> driver-defined pointer 'hba_private'. This saves the lookup for
> drivers handling several commands in parallel.

This makes tracing a bit harder to follow. Perhaps you can keep the transport tag (a uint64_t) in the SCSIRequest for debugging purposes?

> Signed-off-by: Hannes Reinecke
> ---
>  hw/esp.c          |  2 +-
>  hw/lsi53c895a.c   | 17
>  hw/scsi-bus.c     | 22 +++---
>  hw/scsi-disk.c    |  5 ++---
>  hw/scsi-generic.c |  4 ++--
>  hw/scsi.h         |  8
>  hw/spapr_vscsi.c  | 41
>  hw/usb-msd.c      | 10 +-
>  trace-events      | 14 +++---
>  9 files changed, 52 insertions(+), 71 deletions(-)
>
> diff --git a/hw/esp.c b/hw/esp.c
> index 6d3f5d2..912ff89 100644
> --- a/hw/esp.c
> +++ b/hw/esp.c
> @@ -244,7 +244,7 @@ static void do_busid_cmd(ESPState *s, uint8_t *buf, uint8_t busid)
>      DPRINTF("do_busid_cmd: busid 0x%x\n", busid);
>      lun = busid & 7;
> -    s->current_req = scsi_req_new(s->current_dev, 0, lun);
> +    s->current_req = scsi_req_new(s->current_dev, lun, s);

Might as well pass NULL here. The hba_private value is basically unnecessary when the adapter doesn't support tagged command queuing.

> diff --git a/hw/usb-msd.c b/hw/usb-msd.c
> index 86582cc..4e2ea03 100644
> --- a/hw/usb-msd.c
> +++ b/hw/usb-msd.c
> @@ -216,8 +216,8 @@ static void usb_msd_transfer_data(SCSIRequest *req, uint32_t len)
>      MSDState *s = DO_UPCAST(MSDState, dev.qdev, req->bus->qbus.parent);
>      USBPacket *p = s->packet;
>
> -    if (req->tag != s->tag) {
> -        fprintf(stderr, "usb-msd: Unexpected SCSI Tag 0x%x\n", req->tag);
> +    if (req->hba_private != s) {
> +        fprintf(stderr, "usb-msd: Unexpected SCSI command 0x%p\n", req);
>      }

Same here, just pass NULL and remove these ifs.

Otherwise looks like a very good idea.

Paolo
Re: virtio scsi host draft specification, v3
On 07/01/2011 09:14 AM, Hannes Reinecke wrote:
> Actually, the kernel does _not_ do a LUN remapping.

Not the kernel, the in-kernel target. The in-kernel target can and will map hardware LUNs (target_lun in drivers/target/*) to arbitrary LUNs (mapped_lun).

>> Put in another way: the virtio-scsi device is itself a SCSI target,
>
> Argl. No way. The virtio-scsi device has to map to a single LUN.

I think we are talking about different things. By "virtio-scsi device" I meant the "virtio-scsi HBA". When I referred to a LUN as seen by the guest, I was calling it a "virtual SCSI device". So yes, we were calling things with different names. Perhaps from now on we can call them virtio-scsi {initiator,target,LUN} and have no ambiguity? I'll also modify the spec in this sense.

> The SCSI spec itself only deals with LUNs, so anything you'll read in
> there obviously will only handle the interaction between the initiator
> (read: host) and the LUN itself. However, the actual command is sent
> via an intermediate target, hence you'll always see the reference to
> the ITL (initiator-target-lun) nexus.

Yes, this I understand.

> The SCSI spec details discovery of the individual LUNs presented by a
> given target, it does _NOT_ detail the discovery of the targets
> themselves. That is being delegated to the underlying transport.

And in fact I have this in virtio-scsi too, since virtio-scsi _is_ a transport:

   When VIRTIO_SCSI_EVT_RESET_REMOVED or VIRTIO_SCSI_EVT_RESET_RESCAN
   is sent for LUN 0, the driver should ask the initiator to rescan
   the target, in order to detect the case when an entire target has
   appeared or disappeared.

   [If the device fails] to report an event due to missing buffers,
   [...] the driver should poll the logical units for unit attention
   conditions, and/or do whatever form of bus scan is appropriate for
   the guest operating system.
> In the case of NPIV it would make sense to map the virtual SCSI host
> to the backend, so that all devices presented to the virtual SCSI host
> will be presented to the backend, too. However, when doing so these
> devices will normally be referenced by their original LUN, as these
> will be presented to the guest via eg 'REPORT LUNS'.

Right.

> The above thread now tries to figure out if we should remap those LUN
> numbers or just expose them as they are. If we decide on remapping, we
> have to emulate _all_ commands referring explicitly to those LUN
> numbers (persistent reservations, anyone?).

But it seems to me that commands referring explicitly to LUN numbers most likely have to be reimplemented anyway for virtualization. I'm thinking exactly of persistent reservations. If two guests on the same host try a persistent reservation, they should conflict with each other. If reservation commands were just passed through, they would be seen as coming from the same initiator (the HBA driver or iSCSI initiator in the host OS), etc.

> If we don't, we would expose some hardware detail to the guest, but
> would save us _a lot_ of processing.

But can we afford it? And would the architecture allow that at all?

Paolo
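To illustrate the point about persistent reservations, here is a toy model (my own sketch, not virtio-scsi, QEMU, or SPC-compliant code): if the emulation keys reservations on a per-guest initiator ID instead of passing the commands through to the host HBA, two guests correctly see a conflict:

```c
#include <assert.h>

/* Toy model of a reservation keyed by a per-guest initiator ID.
 * A real implementation would follow the SPC persistent-reservation
 * semantics; this only shows why the initiator identity matters. */
enum { NO_HOLDER = -1 };

struct lun_state {
    int holder;                   /* initiator ID holding the reservation */
};

static int pr_reserve(struct lun_state *lun, int initiator)
{
    if (lun->holder != NO_HOLDER && lun->holder != initiator)
        return -1;                /* RESERVATION CONFLICT */
    lun->holder = initiator;
    return 0;
}

static void pr_release(struct lun_state *lun, int initiator)
{
    if (lun->holder == initiator)
        lun->holder = NO_HOLDER;
}
```

With plain pass-through, both guests would present the same host initiator identity to the target, so the conflict would never be reported back to the second guest.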
Re: [PATCH 2/3] scsi: replace 'tag' with 'hba_private' pointer
On 07/01/2011 03:11 PM, Hannes Reinecke wrote:
> On 07/01/2011 10:27 AM, Paolo Bonzini wrote:
>> This makes tracing a bit harder to follow. Perhaps you can keep the
>> transport tag (a uint64_t) in the SCSIRequest for debugging purposes?
>
> Hmm. The transport tag wouldn't have any meaning outside scsi-bus.c.

It depends; in vmw_pvscsi I take it from a field in the request block that is 0..255. So either you have a small tag that is recycled but stays nice, or a large tag that is unwieldy but should not be recycled ever. A pointer is unwieldy _and_ recycled, so it gives the worst of both worlds.

But I'm not very attached to this; I may even do it myself if/when I find the need. I won't ack yet because of the nit with ESP/USB, but even if you do not bother I will ack the next respin.

Paolo
Re: [PATCH 2/3] scsi: Add 'hba_private' to SCSIRequest
> -    req = vscsi_find_req(s, sreq);
> +    vscsi_req *req = sreq->hba_private;
>      int32_t res_in = 0, res_out = 0;
>
>      dprintf("VSCSI: SCSI cmd complete, r=0x%x tag=0x%x status=0x%x, req=%p\n",
> @@ -563,15 +553,14 @@
>          }
>      }
>      vscsi_send_rsp(s, req, 0, res_in, res_out);
> -    vscsi_put_req(s, req);
> +    vscsi_put_req(req);
>  }
>
>  static void vscsi_request_cancelled(SCSIRequest *sreq)
>  {
> -    VSCSIState *s = DO_UPCAST(VSCSIState, vdev.qdev, sreq->bus->qbus.parent);
> -    vscsi_req *req = vscsi_find_req(s, sreq);
> +    vscsi_req *req = sreq->hba_private;
>
> -    vscsi_put_req(s, req);
> +    vscsi_put_req(req);
>  }
>
>  static void vscsi_process_login(VSCSIState *s, vscsi_req *req)
> @@ -659,7 +648,7 @@ static int vscsi_queue_cmd(VSCSIState *s, vscsi_req *req)
>      }
>      req->lun = lun;
> -    req->sreq = scsi_req_new(sdev, req->qtag, lun);
> +    req->sreq = scsi_req_new(sdev, req->qtag, lun, req);
>      n = scsi_req_enqueue(req->sreq, srp->cmd.cdb);
>
>      dprintf("VSCSI: Queued command tag 0x%x CMD 0x%x ID %d LUN %d ret: %d\n",
> @@ -858,7 +847,7 @@ static void vscsi_got_payload(VSCSIState *s, vscsi_crq *crq)
>      }
>      if (done) {
> -        vscsi_put_req(s, req);
> +        vscsi_put_req(req);
>      }
>  }
>
> diff --git a/hw/usb-msd.c b/hw/usb-msd.c
> index 86582cc..bfea096 100644
> --- a/hw/usb-msd.c
> +++ b/hw/usb-msd.c
> @@ -216,10 +216,6 @@ static void usb_msd_transfer_data(SCSIRequest *req, uint32_t len)
>      MSDState *s = DO_UPCAST(MSDState, dev.qdev, req->bus->qbus.parent);
>      USBPacket *p = s->packet;
>
> -    if (req->tag != s->tag) {
> -        fprintf(stderr, "usb-msd: Unexpected SCSI Tag 0x%x\n", req->tag);
> -    }
> -
>      assert((s->mode == USB_MSDM_DATAOUT) == (req->cmd.mode == SCSI_XFER_TO_DEV));
>      s->scsi_len = len;
>      s->scsi_buf = scsi_req_get_buf(req);
> @@ -241,9 +237,6 @@ static void usb_msd_command_complete(SCSIRequest *req, uint32_t status)
>      MSDState *s = DO_UPCAST(MSDState, dev.qdev, req->bus->qbus.parent);
>      USBPacket *p = s->packet;
>
> -    if (req->tag != s->tag) {
> -        fprintf(stderr, "usb-msd: Unexpected SCSI Tag 0x%x\n", req->tag);
> -    }
>      DPRINTF("Command complete %d\n", status);
>      s->residue = s->data_len;
>      s->result = status != 0;
> @@ -387,7 +380,7 @@ static int usb_msd_handle_data(USBDevice *dev, USBPacket *p)
>              s->tag, cbw.flags, cbw.cmd_len, s->data_len);
>      s->residue = 0;
>      s->scsi_len = 0;
> -    s->req = scsi_req_new(s->scsi_dev, s->tag, 0);
> +    s->req = scsi_req_new(s->scsi_dev, s->tag, 0, NULL);
>      scsi_req_enqueue(s->req, cbw.cmd);
>      /* ??? Should check that USB and SCSI data transfer directions match. */

Acked-by: Paolo Bonzini

Have a nice weekend.

Paolo
Re: [PATCH 3/3] megasas: LSI Megaraid SAS emulation
On 07/02/2011 03:50 PM, Hannes Reinecke wrote:
> (And no, I will not be getting into another dog-fight with Paul B.
> here. Virtio can do without bounce buffers. AHCI can. So I fail to see
> why SCSI has to rely on bounce buffers.)

I agree, but I do see why a SCSI device might prefer to rely on bounce buffers for non-I/O commands. This is why in my last RFC series for vmw_pvscsi I let the device choose whether to force a bounce buffer or get an external iovec from the HBA.

Paolo
Re: [PATCH 3/3] megasas: LSI Megaraid SAS emulation
On 07/04/2011 08:13 AM, Hannes Reinecke wrote:
> On 07/03/2011 04:36 PM, Paolo Bonzini wrote:
>> I agree, but I do see why a SCSI device might prefer to rely on
>> bounce buffers for non-I/O commands. This is why in my last RFC
>> series for vmw_pvscsi I let the device choose whether to force a
>> bounce buffer or get an external iovec from the HBA.
>
> Yes, sure, for non-I/O commands it's perfectly okay. Most of which
> will be emulated anyway. It's bounce buffers for I/O which kill
> performance. But I seem to have missed your last RFC (I'm not reading
> qemu-devel on a regular basis ...). Care to send me a pointer to it?

Sure, http://lists.gnu.org/archive/html/qemu-devel/2011-06/msg00668.html

Paolo
Re: [PATCH 3/3] megasas: LSI Megaraid SAS emulation
On 07/04/2011 09:26 AM, Hannes Reinecke wrote:
> Cool. Exactly what I need. FWIW, feel free to add my 'Acked-by' to it.
> Any chance of getting them included?

I'm not very tied to pvscsi; I just needed an HBA that is not a joke by modern standards :) to play with the SCSI layer. There may be hope that megasas will come before pvscsi or eliminate the need for it altogether. So, feel free to pick up patches 5-8 from that series.

That said, note that scsi-generic does *not* support scatter/gather in that series, so you should still make sure the fallback paths work well. :) In pvscsi I added a qdev property to toggle scatter/gather; it was useful for benchmarking.

Paolo
Re: [PATCH 3/3] megasas: LSI Megaraid SAS emulation
On 07/04/2011 02:52 PM, Hannes Reinecke wrote:
> However, I probably will see to fix up the megasas emulation with the
> current infrastructure, get that in, and then move over to the iovec
> infrastructure.

Makes sense, also for ease of review.

> But if you promise to merge the iovec infrastructure I'm more than
> willing to fix up the scsi-generic backend.

I am certainly going to get it upstream, but maintainers would like it to have users at the time it is committed.

Paolo
Re: KVM call agenda for July 5
On 07/04/2011 09:52 AM, Juan Quintela wrote:
> Hi,
>
> Please send in any agenda items you are interested in covering.

SCSI without bounce buffers: talk now, or forever hold your peace.

Paolo
Re: [Qemu-devel] [PATCH 3/3] Avoid Wunused-but-set warnings (or errors in case of Werror)
On 07/05/2011 09:49 AM, Markus Armbruster wrote:
> If 'ret' has been used to silence compiler warnings about functions
> which have been declared with attribute __warn_unused_result__
> (e.g. write() and various other libc functions) then "(void)write()"
> is insufficient -- gcc requires the variable.

gcc being silly. Oh well. In this particular case I think that the return value should be checked. It's good if something is printed in the log saying that the reset wasn't done for some reason---even if it is just defensive.

The really silly thing is in glibc, not gcc: __warn_unused_result__ was added to fwrite, where you have no certainty that the write has been done, but not to either fclose or fflush. Oh well...

Paolo
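A small illustration of the pattern under discussion (my own example, not the QEMU patch; `do_reset` is a hypothetical stand-in for write() or whatever warn_unused_result function the patch silenced): with such an attribute, gcc warns on a bare call and, as Markus notes, even on a plain `(void)` cast; assigning to a variable silences it, but actually checking the value, as suggested above, is better:

```c
#include <stdio.h>

/* Hypothetical function declared the way glibc declares write():
 * gcc warns if the result is ignored, even via a (void) cast. */
__attribute__((warn_unused_result))
static int do_reset(void)
{
    return 0;                 /* 0 = success in this sketch */
}

/* Preferred form: check the value and log on failure (defensively),
 * which also satisfies the attribute. */
static int reset_checked(void)
{
    int ret = do_reset();
    if (ret != 0) {
        fprintf(stderr, "reset failed: %d\n", ret);
    }
    return ret;
}
```

The mere-silencing variant would be `int ret = do_reset(); (void)ret;`, which compiles cleanly but throws the information away.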
Re: [PATCH 3/5] scsi-disk: Fixup debugging statement
On 07/05/2011 01:03 PM, Hannes Reinecke wrote:
> A debugging statement wasn't converted to the new interface.
>
> Signed-off-by: Hannes Reinecke
> ---
>  hw/scsi-disk.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
> index c2a99fe..5804662 100644
> --- a/hw/scsi-disk.c
> +++ b/hw/scsi-disk.c
> @@ -1007,7 +1007,7 @@ static int32_t scsi_send_command(SCSIRequest *req, uint8_t *buf)
>      command = buf[0];
>      outbuf = (uint8_t *)r->iov.iov_base;
> -    DPRINTF("Command: lun=%d tag=0x%x data=0x%02x", lun, tag, buf[0]);
> +    DPRINTF("Command: lun=%d tag=0x%x data=0x%02x", req->lun, req->tag, buf[0]);
>
>      if (scsi_req_parse(&r->req, buf) != 0) {
>          BADF("Unsupported command length, command %x\n", command);

Acked-by: Paolo Bonzini
Re: [PATCH 4/5] scsi-disk: Mask out serial number EVPD
On 07/05/2011 01:03 PM, Hannes Reinecke wrote:
> If the serial number is not set we should mask it out in the list of
> supported VPD pages and mark it as not supported.
>
> Signed-off-by: Hannes Reinecke
> ---
>  hw/scsi-disk.c | 15 ---
>  1 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
> index 5804662..05d14ab 100644
> --- a/hw/scsi-disk.c
> +++ b/hw/scsi-disk.c
> @@ -398,7 +398,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
>                  "buffer size %zd\n", req->cmd.xfer);
>          pages = buflen++;
>          outbuf[buflen++] = 0x00; // list of supported pages (this page)
> -        outbuf[buflen++] = 0x80; // unit serial number
> +        if (s->serial)
> +            outbuf[buflen++] = 0x80; // unit serial number
>          outbuf[buflen++] = 0x83; // device identification
>          if (s->drive_kind == SCSI_HD) {
>              outbuf[buflen++] = 0xb0; // block limits
> @@ -409,8 +410,14 @@
>      }
>      case 0x80: /* Device serial number, optional */
>      {
> -        int l = strlen(s->serial);
> +        int l;
> +
> +        if (!s->serial) {
> +            DPRINTF("Inquiry (EVPD[Serial number] not supported\n");
> +            return -1;
> +        }
> +
> +        l = strlen(s->serial);
>          if (l > req->cmd.xfer)
>              l = req->cmd.xfer;
>          if (l > 20)
> @@ -1203,7 +1210,9 @@
>      if (!s->serial) {
>          /* try to fall back to value set with legacy -drive serial=... */
>          dinfo = drive_get_by_blockdev(s->bs);
> -        s->serial = qemu_strdup(*dinfo->serial ? dinfo->serial : "0");
> +        if (*dinfo->serial) {
> +            s->serial = qemu_strdup(dinfo->serial);
> +        }
>      }
>
>      if (!s->version) {

Acked-by: Paolo Bonzini

Paolo
Re: [PATCH 5/5] megasas: LSI Megaraid SAS emulation
On 07/05/2011 03:38 PM, Alexander Graf wrote:
>>> +    if (is_sgl64) {
>>> +        iov_pa = ldq_phys(pa);
>>> +    } else {
>>> +        iov_pa = ldl_phys(pa);
>>
>> These load data from memory in target endianness. Are you sure that's
>> what you want? I'd expect this to be defined as little endian
>> (especially given that ldq and ldl on the same address work).

Seems to be target endianness from the corresponding Linux code:

    if (sge_count) {
        scsi_for_each_sg(scp, os_sgl, sge_count, i) {
            mfi_sgl->sge32[i].length = sg_dma_len(os_sgl);
            mfi_sgl->sge32[i].phys_addr = sg_dma_address(os_sgl);
        }
    }
    ...
    if (sge_count) {
        scsi_for_each_sg(scp, os_sgl, sge_count, i) {
            mfi_sgl->sge64[i].length = sg_dma_len(os_sgl);
            mfi_sgl->sge64[i].phys_addr = sg_dma_address(os_sgl);
        }
    }

Note that this is _either_ an ldq or an ldl depending on what the driver told the device. It is not accessing a 64-bit value as 32-bit.

Paolo
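To make the width distinction concrete, here is a sketch with little-endian stand-ins for QEMU's ldl_phys/ldq_phys, reading from a byte buffer instead of guest physical memory (illustrative only; the actual accessors and their endianness are what the thread is debating):

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian loads from a byte buffer, standing in for
 * ldl_phys/ldq_phys. The point: a 64-bit SGL entry is read with the
 * 64-bit accessor, never as a 32-bit load of a 64-bit field. */
static uint32_t ldl_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t ldq_le(const uint8_t *p)
{
    return (uint64_t)ldl_le(p) | ((uint64_t)ldl_le(p + 4) << 32);
}

/* Pick the accessor width from the SGL format the driver selected. */
static uint64_t sgl_phys_addr(const uint8_t *entry, int is_sgl64)
{
    return is_sgl64 ? ldq_le(entry) : ldl_le(entry);
}
```

The same eight bytes decode to different addresses depending on which SGL format the driver negotiated, which is why the width flag must come from the request frame.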
Re: [Qemu-devel] [PATCH 5/5] megasas: LSI Megaraid SAS emulation
On 07/06/2011 08:20 AM, Hannes Reinecke wrote:
>> We cannot map control structures from guest memory and treat them as
>> valid request state later on.
>
> Yes, I've been working on that one already. What I'll be doing is to
> read in the sge count during 'map_sgl' and store this value internally
> (in ->iov_cnt). And during unmap I'll be using this value instead of
> the frame-provided one. That way we'll be checking the sge_count field
> only once, when we slurp in the entire frame.

Note the flags too. Perhaps it's easier to simply copy the entire frame header...

Paolo
Re: [RFC] New thread for the VM migration
On 07/14/2011 06:07 PM, Avi Kivity wrote:

> Maybe we can do this via a magic subsection whose contents are the
> hotplug event.

What about making the device list just another "thing" to be migrated
live, together with block and ram?

Paolo
Re: [GIT PULL] Native Linux KVM tool for 3.1
On 07/25/2011 09:50 AM, Sasha Levin wrote:

> Anthony had a talk on last year's KVM forum regarding the QEMU
> threading model (slide:
> http://www.linux-kvm.org/wiki/images/7/70/2010-forum-threading-qemu.pdf).
> It was suggested that the KVM part of QEMU is having a hard time
> achieving the ideal threading model due to its need to support TCG -
> something which has nothing to do with KVM itself.

No, it is not having a hard time.  The "foot in the door" that Anthony
mentions has been part of QEMU and qemu-kvm for a long time
(multi-threading is necessary to support SMP!) and works quite well.

Historically, there were three main loops:

1) QEMU single-threaded;

2) QEMU multi-threaded; clean, but buggy, untested and bitrotting;

3) qemu-kvm multi-threaded, forked from (1), ugly but robust and widely
   deployed.

In 0.15 the two multi-threaded versions have been unified by Jan Kiszka.
I have even ported (2) to Windows with little or no pain; porting to Mac
OS X interestingly is harder than Windows, because non-portable Linux
assumptions about signal handling have crept into the code (to preempt
the objections: they weren't just non-portabilities, they were latent
bugs).  Windows just does not have signals. :)

So, right now, the only difference is that QEMU is still defaulting to
the single-threaded main loop, while qemu-kvm enables multi-threading by
default.  In some time even QEMU will switch.

> Yes, this is of course worse than getting it right in the first place;

Nobody is saying the opposite.

Paolo
Re: [Qemu-devel] [PATCH] Introduce QEMU_NEW()
On 07/25/2011 04:23 PM, Blue Swirl wrote:

>> Yes. We can just make qemu_malloc use g_malloc.
>
> It would be also possible to make g_malloc() use qemu_malloc(). That
> way we could keep the tracepoints which would lose their value with
> g_malloc() otherwise.

qemu_malloc uses g_malloc => you keep tracepoints, you just do not
trace memory allocated by glib

g_malloc uses qemu_malloc => you keep and expand tracepoints, you lose
the very nicely tuned allocator

The former is much less code, however it requires qemu_malloc to be
always balanced with qemu_free (patches ready and on my github tree,
won't be sent before KVM Forum though...).

Paolo
Re: [Qemu-devel] [PATCH] Introduce QEMU_NEW()
On 07/25/2011 04:56 PM, Blue Swirl wrote:

>> qemu_malloc uses g_malloc => you keep tracepoints, you just do not
>> trace memory allocated by glib
>
> Unless the plan is to replace all qemu_malloc() calls with calls to
> g_malloc().

We can worry when the day comes... there is already another task
blocking that replacement (balancing qemu_malloc with qemu_free).

>> The former is much less code, however it requires qemu_malloc to be
>> always balanced with qemu_free (patches ready and on my github tree,
>> won't be sent before KVM Forum though...).
>
> Freeing qemu_malloc() memory with plain free() is a bug.

We have many bugs, then. :)

Paolo
Re: [RFC PATCH 2/2] separate thread for VM migration
On 07/22/2011 09:58 PM, Umesh Deshpande wrote:

> -    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
> +    qemu_mod_timer(s->timer, qemu_get_clock_ms(migration_clock) + 100);
>
>      if (s->freeze_output)
>          return;
> @@ -246,8 +246,10 @@ static void buffered_rate_tick(void *opaque)
>
>      buffered_flush(s);
>
> -    /* Add some checks around this */
>      s->put_ready(s->opaque);
> +    qemu_mutex_unlock_iothread();
> +    usleep(qemu_timer_difference(s->timer, migration_clock) * 1000);
> +    qemu_mutex_lock_iothread();
>  }

Do you really need a timer?  The timer will only run in the migration
thread, where you should have full control on when to wait.

The BufferedFile code is more general, but you can create one thread
per BufferedFile (you will only have one).  Then, since you only have
one timer, calling buffered_rate_tick repeatedly is really all that is
done in the body of the thread:

    while (s->state == MIG_STATE_ACTIVE) {
        start_time = qemu_get_clock_ms(rt_clock);
        buffered_rate_tick(s->file);

        qemu_mutex_unlock_iothread();
        usleep((qemu_get_clock_ms(rt_clock) + 100 - start_time) * 1000);
        qemu_mutex_lock_iothread();
    }

Perhaps the whole QEMUFile abstraction should be abandoned as the basis
of QEMUBufferedFile.  It is simply too heavy when you're working in a
separate thread and you can afford blocking operation.

I also am not sure about the correctness.  Here you removed the delayed
call to migrate_fd_put_notify, which is what resets freeze_output to 0:

> @@ -327,9 +320,7 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
>      if (ret == -1)
>          ret = -(s->get_error(s));
>
> -    if (ret == -EAGAIN) {
> -        qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s);
> -    } else if (ret < 0) {
> +    if (ret < 0 && ret != -EAGAIN) {
>          if (s->mon) {
>              monitor_resume(s->mon);
>          }

The call to migrate_fd_put_notify is here:

> @@ -396,6 +401,9 @@ void migrate_fd_put_ready(void *opaque)
>          }
>          s->state = state;
>          notifier_list_notify(&migration_state_notifiers);
> +    } else {
> +        migrate_fd_wait_for_unfreeze(s);
> +        qemu_file_put_notify(s->file);
>      }
>  }

But it is not clear to me what calls migrate_fd_put_ready when the
output file is frozen.

Finally, here you are busy waiting:

> @@ -340,10 +331,34 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
>      return ret;
>  }
>
> -void migrate_fd_connect(FdMigrationState *s)
> +void *migrate_run_timers(void *arg)
>  {
> +    FdMigrationState *s = arg;
>      int ret;
>
> +    qemu_mutex_lock_iothread();
> +    ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
> +                                  s->mig_state.shared);
> +    if (ret < 0) {
> +        DPRINTF("failed, %d\n", ret);
> +        migrate_fd_error(s);
> +        return NULL;
> +    }
> +
> +    migrate_fd_put_ready(s);
> +
> +    while (s->state == MIG_STATE_ACTIVE) {
> +        qemu_run_timers(migration_clock);
> +    }

which also convinces me that if you make rate limiting the main purpose
of the migration thread's main loop, most problems will go away.  You
can also use select() instead of usleep, so that you fix the problem
with qemu_file_put_notify.

Paolo
Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region
On 07/26/2011 01:26 PM, Avi Kivity wrote:

> +    while (i < view->nr) {
> +        j = i + 1;
> +        while (j < view->nr
> +               && can_merge(&view->ranges[j-1], &view->ranges[j])) {
> +            view->ranges[i].addr.size += view->ranges[j].addr.size;
> +            ++j;
> +        }
> +        ++i;
          if (j != i) {
> +            memmove(&view->ranges[i], &view->ranges[j],
> +                    (view->nr - j) * sizeof(view->ranges[j]));
> +            view->nr -= j - i;
> +        }
      }

Paolo
Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region
On 07/26/2011 01:38 PM, Avi Kivity wrote:

>>          if (j != i) {
>> +            memmove(&view->ranges[i], &view->ranges[j],
>> +                    (view->nr - j) * sizeof(view->ranges[j]));
>> +            view->nr -= j - i;
>> +        }
>>      }
>
> Seems to work both ways?

Sure, but you're pointlessly memmove-ing memory over itself.  I'm not
sure how many segments a single memory region will usually have, but
it's better to be safe.

Paolo
Re: Windows7 crashes inside the VM when starting a certain program
On 07/07/2011 07:26 AM, André Weidemann wrote:

> Hi,
> I am running Windows7 x64 in a VM which crashes after starting a
> certain game. Actually there are two games, both from the same
> company, that make the VM crash after starting them. Windows crashes
> right after starting the game. With the 1st game the screen goes black
> as usual and the cursor keeps spinning for 3-5 seconds until Windows
> crashes. With the second game I get to the 3D login screen. The game
> then crashes after logging in.
> Windows displays this error message on the first crash:
> http://pastebin.com/kMzk9Jif
> Windows then finishes writing the crash dump and restarts. I can
> reproduce Windows crashing every time I start the game while the VM
> keeps running without any problems. When Windows reboots after the
> first crash and the game is started again, the message on the
> following blue screen changes slightly and stays the same (except for
> the addresses) for every following crash:
> http://pastebin.com/jVtBc4ZH

The blue screens seem to be for the same exception, 0xC0000096 --
privileged instruction, only sometimes in user mode (but in a system
service, which also causes a blue screen) sometimes in kernel mode.

Can you open the produced dump in WinDbg and post a disassemble around
the failing instruction?

Thanks,

Paolo
Re: Windows7 crashes inside the VM when starting a certain program
On 07/28/2011 03:21 PM, Avi Kivity wrote:

>> I haven't used debuggers very much, so I hope I grabbed the correct
>> lines from the disassembly: http://pastebin.com/t3sfvmTg
>
> That's the bug check routine.  Can you go up a frame?

Or just do what Gleb suggested.  Open the dump, type "!analyze -v" and
cut-paste the address from WinDbg's output into the Disassemble window.

Paolo
Re: Windows7 crashes inside the VM when starting a certain program
On 07/28/2011 07:44 PM, André Weidemann wrote:

> Hi,
> On 28.07.2011 15:49, Paolo Bonzini wrote:
>> On 07/28/2011 03:21 PM, Avi Kivity wrote:
>>>> I haven't used debuggers very much, so I hope I grabbed the correct
>>>> lines from the disassembly: http://pastebin.com/t3sfvmTg
>>>
>>> That's the bug check routine.  Can you go up a frame?
>>
>> Or just do what Gleb suggested.  Open the dump, type "!analyze -v"
>> and cut-paste the address from WinDbg's output into the Disassemble
>> window.
>
> This is the output of "!analyze -v": http://pastebin.com/sCZSjr8m
> ...and this is the output from the disassembly window:
> http://pastebin.com/AVZuswkT

Very useful, thanks!

Paolo
Re: [RFC PATCH v2 1/3] separate thread for VM migration
On 07/29/2011 10:57 PM, Umesh Deshpande wrote:

> This patch creates a separate thread for the guest migration on the
> source side.
>
> Signed-off-by: Umesh Deshpande

Looks pretty good!  One thing that shows, is that the interface
separation between buffered_file.c and migration.c is pretty weird.
Your patch makes it somewhat worse, but it was like this before so it's
not your fault.  The good thing is that if buffered_file.c uses
threads, you can fix a large part of this and get even simpler code:

1) there is really just one way to implement migrate_fd_put_notify, and
with your simplifications it does not belong anymore in migration.c.

2) s->callback is actually not NULL exactly if s->file->frozen_output
is true, you can remove it as well;

3) buffered_close is messy because it can be called from both the
iothread (monitor->migrate_fd_cancel->migrate_fd_cleanup->qemu_fclose)
or the migration thread (after qemu_savevm_state_complete).  But
buffered_close is actually very similar to your thread function (it
does flush+wait_for_unfreeze, basically)!  So buffered_close can be
simply:

    s->closed = 1;
    ret = qemu_thread_join(s->thread);   /* doesn't exist yet :) */
    qemu_free(...);
    return ret;

Another nit is that here:

> +        if (migrate_fd_check_expire()) {
> +            buffered_rate_tick(s->file);
> +        }
> +
> +        if (s->state != MIG_STATE_ACTIVE) {
> +            break;
> +        }
> +
> +        if (s->callback) {
> +            migrate_fd_wait_for_unfreeze(s);
> +            s->callback(s);
> +        }

you can still have a busy wait.  Putting it all together, you can move
the thread function back to buffered_file.c like:

    while (!s->closed || (!s->has_error && s->buffer_size)) {
        if (s->freeze_output) {
            qemu_mutex_unlock_iothread();
            s->wait_for_unfreeze(s);
            qemu_mutex_lock_iothread();

            /* This comes from qemu_file_put_notify (via
               buffered_put_buffer---can be simplified a lot too?).  */
            s->freeze_output = 0;

            /* Test again for cancellation.  */
            continue;
        }

        int64_t current_time = qemu_get_clock_ms(rt_clock);
        if (s->expire_time > current_time) {
            struct timeval tv = { .tv_sec = 0, .tv_usec = ... };
            qemu_mutex_unlock_iothread();
            select(0, NULL, NULL, NULL, &tv);
            qemu_mutex_lock_iothread();
            s->expire_time = qemu_get_clock_ms(rt_clock) + 100;
            continue;
        }

        /* This comes from buffered_rate_tick.  */
        s->bytes_xfer = 0;
        buffered_flush(s);
        if (!s->closed) {
            s->put_ready(s->opaque);
        }
    }

    ret = s->close(s->opaque);
    ...

Does it look sane?

Paolo
Re: [RFC PATCH v2 2/3] fine grained qemu_mutex locking for migration
On 07/29/2011 10:57 PM, Umesh Deshpande wrote:

> +    qemu_mutex_unlock_iothread();
>      while (s->state == MIG_STATE_ACTIVE) {
>          if (migrate_fd_check_expire()) {
> +            qemu_mutex_lock_iothread();
>              buffered_rate_tick(s->file);
> +            qemu_mutex_unlock_iothread();
>          }
>
>          if (s->state != MIG_STATE_ACTIVE) {
> @@ -392,12 +396,11 @@ void migrate_fd_begin(void *arg)
>
>          if (s->callback) {
>              migrate_fd_wait_for_unfreeze(s);
> +            qemu_mutex_lock_iothread();
>              s->callback(s);
> +            qemu_mutex_unlock_iothread();
>          }
>      }
> -
> -out:
> -    qemu_mutex_unlock_iothread();

I think it's clearer to unlock explicitly around the waiting points
(see review of 1/3).  In fact, I think you're working around the busy
wait by accessing s->state outside the lock, right?  I don't think this
is provably safe; moving the knowledge of the thread entirely within
buffered_file.c also fixes this, because then the lifetimes of the
thread and the QEMUFile are much clearer.

Thanks,

Paolo
Re: [RFC PATCH v2 1/3] separate thread for VM migration
On 08/01/2011 11:00 PM, Umesh Deshpande wrote:

> I kept this in migration.c to call qemu_savevm_state_begin. (The way
> it is done currently, i.e. to keep access to FdMigrationState in
> migration.c.) Calling it from buffered_file.c would be inconsistent in
> that sense, or we will have to call it from the iothread before
> spawning the migration thread.

Right, I missed that.  Perhaps you can call it the first time put_ready
is called.

> Also, why is the separation between FdMigrationState and
> QEMUFileBuffered required?  Is QEMUFileBuffered designed to be used
> also for things other than migration?

No, but let's keep it this way for now.  It may be an annoyance, but it
also helps making a reusable architecture, and it can probably be
cleaned up substantially with thread support.

Paolo
Re: [RFC PATCH v3 1/4] separate thread for VM migration
On 08/11/2011 05:32 PM, Umesh Deshpande wrote:

> This patch creates a separate thread for the guest migration on the
> source side. migrate_cancel request from the iothread is handled
> asynchronously. That is, iothread submits migrate_cancel to the
> migration thread and returns, while the migration thread attends this
> request at the next iteration to terminate its execution.

Looks pretty good!  I hope you agree. :)  Just one note inside.

> Signed-off-by: Umesh Deshpande
> ---
>  buffered_file.c |   85 +--
>  buffered_file.h |    4 ++
>  migration.c     |   49 ++-
>  migration.h     |    6 ++
>  4 files changed, 82 insertions(+), 62 deletions(-)
>
> diff --git a/buffered_file.c b/buffered_file.c
> index 41b42c3..19932b6 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -16,6 +16,8 @@
>  #include "qemu-timer.h"
>  #include "qemu-char.h"
>  #include "buffered_file.h"
> +#include "migration.h"
> +#include "qemu-thread.h"
>
>  //#define DEBUG_BUFFERED_FILE
>
> @@ -28,13 +30,14 @@ typedef struct QEMUFileBuffered
>      void *opaque;
>      QEMUFile *file;
>      int has_error;
> +    int closed;
>      int freeze_output;
>      size_t bytes_xfer;
>      size_t xfer_limit;
>      uint8_t *buffer;
>      size_t buffer_size;
>      size_t buffer_capacity;
> -    QEMUTimer *timer;
> +    QemuThread thread;
>  } QEMUFileBuffered;
>
>  #ifdef DEBUG_BUFFERED_FILE
> @@ -155,14 +158,6 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, in
>          offset = size;
>      }
>
> -    if (pos == 0 && size == 0) {
> -        DPRINTF("file is ready\n");
> -        if (s->bytes_xfer <= s->xfer_limit) {
> -            DPRINTF("notifying client\n");
> -            s->put_ready(s->opaque);
> -        }
> -    }
> -
>      return offset;
>  }
>
> @@ -175,20 +170,20 @@ static int buffered_close(void *opaque)
>
>      while (!s->has_error && s->buffer_size) {
>          buffered_flush(s);
> -        if (s->freeze_output)
> +        if (s->freeze_output) {
>              s->wait_for_unfreeze(s);
> +        }
>      }

This is racy; you might end up calling buffered_put_buffer twice from
two different threads.

> -    ret = s->close(s->opaque);
> +    s->closed = 1;
>
> -    qemu_del_timer(s->timer);
> -    qemu_free_timer(s->timer);
> +    ret = s->close(s->opaque);
>
>      qemu_free(s->buffer);
> -    qemu_free(s);

... similarly, here the migration thread might end up using the
buffer.  Just set s->closed here and wait for thread completion; the
migration thread can handle the flushes, free the buffer etc.  Let the
migration thread do as much as possible, it will simplify your life.

>      return ret;
>  }
>
> +
>  static int buffered_rate_limit(void *opaque)
>  {
>      QEMUFileBuffered *s = opaque;
> @@ -228,34 +223,55 @@ static int64_t buffered_get_rate_limit(void *opaque)
>      return s->xfer_limit;
>  }
>
> -static void buffered_rate_tick(void *opaque)
> +static void *migrate_vm(void *opaque)
>  {
>      QEMUFileBuffered *s = opaque;
> +    int64_t current_time, expire_time = qemu_get_clock_ms(rt_clock) + 100;
> +    struct timeval tv = { .tv_sec = 0, .tv_usec = 10};
>
> -    if (s->has_error) {
> -        buffered_close(s);
> -        return;
> -    }
> +    qemu_mutex_lock_iothread();
>
> -    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
> +    while (!s->closed) {

... This can be in fact

    while (!s->closed || s->buffered_size)

and that alone will subsume the loop in buffered_close, no?

> +        if (s->freeze_output) {
> +            s->wait_for_unfreeze(s);
> +            s->freeze_output = 0;
> +            continue;
> +        }
>
> -    if (s->freeze_output)
> -        return;
> +        if (s->has_error) {
> +            break;
> +        }
> +
> +        current_time = qemu_get_clock_ms(rt_clock);
> +        if (!s->closed && (expire_time > current_time)) {
> +            tv.tv_usec = 1000 * (expire_time - current_time);
> +            select(0, NULL, NULL, NULL, &tv);
> +            continue;
> +        }
>
> -    s->bytes_xfer = 0;
> +        s->bytes_xfer = 0;
> +        buffered_flush(s);
>
> -    buffered_flush(s);
> +        expire_time = qemu_get_clock_ms(rt_clock) + 100;
> +        s->put_ready(s->opaque);
> +    }
>
> -    /* Add some checks around this */
> -    s->put_ready(s->opaque);
> +    if (s->has_error) {
> +        buffered_close(s);
> +    }
> +    qemu_free(s);
> +
> +    qemu_mutex_unlock_iothread();
> +
> +    return NULL;
>  }
>
>  QEMUFile *qemu_fopen_ops_buffered(void *opaque,
> -                                  size_t bytes_per_sec,
> -                                  BufferedPutFunc *put_buffer,
> -                                  BufferedPutReadyFunc *put_ready,
> -                                  BufferedWaitForUnfreezeFunc *wait_for_unfreeze,
> -                                  BufferedCloseFunc *close)
> +    size_t bytes_per_sec,
> +    BufferedPutFunc *put_buffer,
> +    BufferedPutReadyFunc *put_ready,
> +    BufferedWaitForUnfreezeFunc *wait_for_unfreeze,
> +    BufferedCloseFunc *close)
>  {
>      QEMUFileBuffered *s;
>
> @@ -267,15 -
Re: [RFC PATCH v3 3/4] lock to protect memslots
> diff --git a/buffered_file.c b/buffered_file.c
> index b64ada7..5735e18 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -243,7 +243,9 @@ static void *migrate_vm(void *opaque)
>      while (!s->closed) {
>          if (s->freeze_output) {
> +            qemu_mutex_unlock_iothread();
>              s->wait_for_unfreeze(s);
> +            qemu_mutex_lock_iothread();
>              s->freeze_output = 0;
>              continue;
>          }
> @@ -255,7 +257,9 @@ static void *migrate_vm(void *opaque)
>          current_time = qemu_get_clock_ms(rt_clock);
>          if (!s->closed && (expire_time > current_time)) {
>              tv.tv_usec = 1000 * (expire_time - current_time);
> +            qemu_mutex_unlock_iothread();
>              select(0, NULL, NULL, NULL, &tv);
> +            qemu_mutex_lock_iothread();
>              continue;
>          }

I believe these should be already in patch 1 or anyway in a separate
patch.

> diff --git a/cpus.c b/cpus.c
> index de70e02..6090c44 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -666,6 +666,7 @@ int qemu_init_main_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_mutex_init(&qemu_fair_mutex);
>      qemu_mutex_init(&qemu_global_mutex);
> +    qemu_mutex_init(&ram_list.mutex);
>      qemu_mutex_lock(&qemu_global_mutex);
>
>      qemu_thread_get_self(&io_thread);

These are only executed if CONFIG_IOTHREAD; you can probably move it
somewhere in exec.c.

> @@ -919,6 +920,17 @@ void qemu_mutex_unlock_iothread(void)
>      qemu_mutex_unlock(&qemu_global_mutex);
>  }
>
> +void qemu_mutex_lock_ramlist(void)
> +{
> +    qemu_mutex_lock(&ram_list.mutex);
> +}
> +
> +void qemu_mutex_unlock_ramlist(void)
> +{
> +    qemu_mutex_unlock(&ram_list.mutex);
> +}
> +
> +
>  static int all_vcpus_paused(void)
>  {
>      CPUState *penv = first_cpu;

And these too.

> diff --git a/exec.c b/exec.c
> index 0e2ce57..7bfb36f 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2972,6 +2972,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
>      }
>      new_block->length = size;
>
> +    qemu_mutex_lock_ramlist();
> +
>      QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next);
>
>      ram_list.phys_dirty = qemu_realloc(ram_list.phys_dirty,
> @@ -2979,6 +2981,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
>      memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
>             0xff, size >> TARGET_PAGE_BITS);
>
> +    qemu_mutex_unlock_ramlist();
> +
>      if (kvm_enabled())
>          kvm_setup_guest_memory(new_block->host, size);
>
> @@ -2996,7 +3000,9 @@ void qemu_ram_free_from_ptr(ram_addr_t addr)
>
>      QLIST_FOREACH(block, &ram_list.blocks, next) {
>          if (addr == block->offset) {
> +            qemu_mutex_lock_ramlist();
>              QLIST_REMOVE(block, next);
> +            qemu_mutex_unlock_ramlist();
>              qemu_free(block);
>              return;
>          }
> @@ -3009,7 +3015,9 @@ void qemu_ram_free(ram_addr_t addr)
>
>      QLIST_FOREACH(block, &ram_list.blocks, next) {
>          if (addr == block->offset) {
> +            qemu_mutex_lock_ramlist();
>              QLIST_REMOVE(block, next);
> +            qemu_mutex_unlock_ramlist();
>              if (block->flags & RAM_PREALLOC_MASK) {
>                  ;
>              } else if (mem_path) {
> @@ -3117,8 +3125,10 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
>          if (addr - block->offset < block->length) {
>              /* Move this entry to to start of the list.  */
>              if (block != QLIST_FIRST(&ram_list.blocks)) {
> +                qemu_mutex_lock_ramlist();
>                  QLIST_REMOVE(block, next);
>                  QLIST_INSERT_HEAD(&ram_list.blocks, block, next);
> +                qemu_mutex_unlock_ramlist();

Theoretically qemu_get_ram_ptr should be protected.  The problem is not
just accessing the ramlist, it is accessing the data underneath it
before anyone frees it.  Luckily we can set aside that problem for now,
because qemu_ram_free_from_ptr is only used by device assignment, and
device assignment makes VMs unmigratable.

If we support generic RAM hotplug/hotunplug, we would probably need
something RCU-like to protect accesses, since protecting all uses of
qemu_get_ram_ptr is not practical.

>              }
>              if (xen_mapcache_enabled()) {
>                  /* We need to check if the requested address is in the RAM
>
> diff --git a/qemu-common.h b/qemu-common.h
> index abd7a75..b802883 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -212,6 +212,8 @@ char *qemu_strndup(const char *str, size_t size);
>
>  void qemu_mutex_lock_iothread(void);
>  void qemu_mutex_unlock_iothread(void);
> +void qemu_mutex_lock_ramlist(void);
> +void qemu_mutex_unlock_ramlist(void);
>
>  int qemu_open(const char *name, int flags, ...);
>  ssize_t qemu_write_full(int fd, const void *buf, size_t count)

Paolo
Re: [RFC PATCH v3 0/4] Separate thread for VM migration
On 08/11/2011 05:32 PM, Umesh Deshpande wrote:

> Following patch series deals with VCPU and iothread starvation during
> the migration of a guest. Currently the iothread is responsible for
> performing the guest migration. It holds qemu_mutex during the
> migration and doesn't allow VCPU to enter the qemu mode and delays its
> return to the guest. The guest migration, executed as an iohandler,
> also delays the execution of other iohandlers. In the following patch
> series, the migration has been moved to a separate thread to reduce
> the qemu_mutex contention and iohandler starvation.
>
> Umesh Deshpande (4):
>   separate thread for VM migration
>   synchronous migrate_cancel
>   lock to protect memslots
>   separate migration bitmap
>
>  arch_init.c         |   26 ++
>  buffered_file.c     |  100 +--
>  buffered_file.h     |    4 ++
>  cpu-all.h           |   39 ++
>  cpus.c              |   12 ++
>  exec.c              |   74 +
>  hw/hw.h             |    5 ++-
>  migration.c         |   50 ++-
>  migration.h         |    6 ++
>  qemu-common.h       |    2 +
>  qemu-thread-posix.c |   10 +
>  qemu-thread.h       |    1 +
>  savevm.c            |   31 +++-
>  13 files changed, 280 insertions(+), 80 deletions(-)

Looks quite good.  Note that when submitting for inclusion, the last
two patches should go first, so that you only add the migration thread
after you're race free.  If there are parts that do not apply before
patches 1/2, that likely means they are in the wrong patch. :)

Thanks,

Paolo
Re: [RFC PATCH v3 1/4] separate thread for VM migration
On 08/11/2011 07:36 PM, Umesh Deshpande wrote:

> Now, migrate_fd_cleanup, buffered_close is just executed by the
> migration thread. I am not letting iothread call any migration
> cancellation related functions. Instead it just submits the request
> and waits for the migration thread to terminate itself in the next
> iteration. The reason is to avoid the call to qemu_fflush,
> qemu_savevm_state_cancel (to carry out migrate_cancel) from iothread
> while migration thread is transferring data without holding the locks.

Got it now, thanks for explaining.

Paolo
Re: [RFC PATCH v3 3/4] lock to protect memslots
On 08/11/2011 06:20 PM, Paolo Bonzini wrote:

>> +                qemu_mutex_lock_ramlist();
>>                  QLIST_REMOVE(block, next);
>>                  QLIST_INSERT_HEAD(&ram_list.blocks, block, next);
>> +                qemu_mutex_unlock_ramlist();
>
> Theoretically qemu_get_ram_ptr should be protected.  The problem is
> not just accessing the ramlist, it is accessing the data underneath it
> before anyone frees it.  Luckily we can set aside that problem for
> now, because qemu_ram_free_from_ptr is only used by device assignment
> and device assignment makes VMs unmigratable.

Hmm, rethinking about it, all the loops in exec.c should be protected
by the mutex.  That's not too good because qemu_get_ram_ptr is a hot
path for TCG.

Perhaps you can also avoid the mutex entirely, and just disable the
above most-recently-used-block optimization while migration is running.
It's not a complete solution, but it could be good enough until we have
RAM hot-plug/hot-unplug.

Paolo
Re: [RFC PATCH v3 3/4] lock to protect memslots
On 08/14/2011 11:45 PM, Umesh Deshpande wrote:

>> That's not too good because qemu_get_ram_ptr is a hot path for TCG.
>
> Looks like qemu_get_ram_ptr isn't called from the source side code of
> guest migration.

Right, but since you are accessing the list from both migration and
non-migration threads, qemu_get_ram_ptr needs to take the ram_list
lock.  The ram_list and IO thread mutexes together protect the "next"
member, and you need to hold them both when modifying the list.  See
the reply to Marcelo for a proposed solution.

Paolo
Re: [RFC PATCH v3 3/4] lock to protect memslots
On 08/15/2011 12:26 AM, Marcelo Tosatti wrote:

> Actually the previous patchset does not traverse the ramlist without
> qemu_mutex locked, which is safe versus the most-recently-used-block
> optimization.

Actually it does:

     bytes_transferred_last = bytes_transferred;
     bwidth = qemu_get_clock_ns(rt_clock);

+    if (stage != 3) {
+        qemu_mutex_lock_ramlist();
+        qemu_mutex_unlock_iothread();
+    }
+
     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;

         /* ram_save_block does traverse memory.  */
         bytes_sent = ram_save_block(f);
         bytes_transferred += bytes_sent;
         if (bytes_sent == 0) { /* no more blocks */
             break;
         }
     }

+    if (stage != 3) {
+        qemu_mutex_lock_iothread();
+        qemu_mutex_unlock_ramlist();
+    }
+
     bwidth = qemu_get_clock_ns(rt_clock) - bwidth;
     bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;

What Umesh is doing is using "either ramlist mutex or iothread mutex"
when reading the ramlist, and "both" when writing the ramlist; similar
to rwlocks done with a regular mutex per CPU---clever!  So this:

+    qemu_mutex_lock_ramlist();
     QLIST_REMOVE(block, next);
     QLIST_INSERT_HEAD(&ram_list.blocks, block, next);
+    qemu_mutex_unlock_ramlist();

is effectively upgrading the lock from read-side to write-side,
assuming that qemu_get_ram_ptr is never called from the migration
thread (which is true).

However, I propose that you put the MRU order in a separate list.  You
would still need two locks: the IO thread lock to protect the new list,
a new lock to protect the other fields in the ram_list.  For simplicity
you may skip the new lock if you assume that the migration and I/O
threads never modify the list concurrently, which is true.  And more
importantly, the MRU and migration code absolutely do not affect each
other, because indeed the migration thread does not do MRU accesses.

See the attachment for an untested patch.

Paolo

From 8579b821a2c7c4da55a4208c5df3c86e8ce2cc87 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini
Date: Fri, 12 Aug 2011 13:08:04 +0200
Subject: [PATCH] split MRU ram list

Outside the execution threads the normal, non-MRU-ized order of the
RAM blocks should always be enough.  So manage two separate lists,
which will have separate locking rules.

Signed-off-by: Paolo Bonzini
---
 cpu-all.h |    4 +++-
 exec.c    |   16 +++++++++++-----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index f5c82cd..083d9e6 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -479,8 +479,9 @@ typedef struct RAMBlock {
     ram_addr_t offset;
     ram_addr_t length;
     uint32_t flags;
-    char idstr[256];
     QLIST_ENTRY(RAMBlock) next;
+    QLIST_ENTRY(RAMBlock) next_mru;
+    char idstr[256];
 #if defined(__linux__) && !defined(TARGET_S390X)
     int fd;
 #endif
@@ -489,6 +490,7 @@ typedef struct RAMBlock {
 typedef struct RAMList {
     uint8_t *phys_dirty;
     QLIST_HEAD(, RAMBlock) blocks;
+    QLIST_HEAD(, RAMBlock) blocks_mru;
 } RAMList;
 extern RAMList ram_list;

diff --git a/exec.c b/exec.c
index 253f42c..be0f37e 100644
--- a/exec.c
+++ b/exec.c
@@ -110,7 +110,10 @@ static uint8_t *code_gen_ptr;
 int phys_ram_fd;
 static int in_migration;

-RAMList ram_list = { .blocks = QLIST_HEAD_INITIALIZER(ram_list.blocks) };
+RAMList ram_list = {
+    .blocks = QLIST_HEAD_INITIALIZER(ram_list.blocks),
+    .blocks_mru = QLIST_HEAD_INITIALIZER(ram_list.blocks_mru)
+};

 static MemoryRegion *system_memory;

@@ -2972,6 +2975,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
     new_block->length = size;

     QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next);
+    QLIST_INSERT_HEAD(&ram_list.blocks_mru, new_block, next_mru);

     ram_list.phys_dirty = qemu_realloc(ram_list.phys_dirty,
                                        last_ram_offset() >> TARGET_PAGE_BITS);
@@ -2996,6 +3000,7 @@ void qemu_ram_free_from_ptr(ram_addr_t addr)
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr == block->offset) {
             QLIST_REMOVE(block, next);
+            QLIST_REMOVE(block, next_mru);
             qemu_free(block);
             return;
         }
@@ -3009,6 +3014,7 @@ void qemu_ram_free(ram_addr_t addr)
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr == block->offset) {
             QLIST_REMOVE(block, next);
+            QLIST_REMOVE(block, next_mru);
             if (block->flags & RAM_PREALLOC_MASK) {
                 ;
             } else if (mem_path) {
@@ -3113,12 +3119,12 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
 {
     RAMBlock *block;

-    QLIST_FOREACH(block, &ram_list.blocks, next) {
+    QLIST_FOREACH(block, &ram_list.blocks_mru, next_mru) {
         if (addr - block->offset < block->length) {
Re: [RFC PATCH v3 3/4] lock to protect memslots
On 08/15/2011 01:27 PM, Umesh Deshpande wrote:
> Yes, the mru list patch would obviate the need of holding the ram_list
> mutex in qemu_get_ram_ptr.

Feel free to take it and complete it with locking then!

> Also, I was planning to protect the whole migration thread with
> iothread mutex, and ram_list mutex. (i.e. holding ram_list mutex while
> sleeping between two iterations, when we release iothread mutex). This
> will prevent the memslot block removal altogether during the migration.
> Do you see any problem with this?

No, indeed holding the ram_list mutex throughout the migration "fixes"
the problem of ram_list blocks being removed during migration.  I guess
whoever implements memory hotplug will find a way around it. :)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 3/4] lock to protect memslots
On 08/15/2011 11:15 PM, Paolo Bonzini wrote:
> On 08/15/2011 01:27 PM, Umesh Deshpande wrote:
>> Yes, the mru list patch would obviate the need of holding the ram_list
>> mutex in qemu_get_ram_ptr.
>
> Feel free to take it and complete it with locking then!
>
>> Also, I was planning to protect the whole migration thread with
>> iothread mutex, and ram_list mutex. (i.e. holding ram_list mutex while
>> sleeping between two iterations, when we release iothread mutex). This
>> will prevent the memslot block removal altogether during the migration.
>> Do you see any problem with this?
>
> No, indeed holding the ram_list mutex throughout the migration "fixes"
> the problem of ram_list blocks being removed during migration.  I guess
> whoever implements memory hotplug will find a way around it. :)

BTW, now that I think of it, do you have ideas on how to make the
migration thread do block migration?

Paolo
Re: 0.15: is glib actually needed?
On 08/16/2011 03:57 AM, Michael Tokarev wrote:
> Each time I build qemu-kvm, my build script complains that it is
> needlessly linked against libglib-2.0 without using any symbols from
> that library.  So is glib really needed for qemu-kvm?  How is it
> different from qemu-0.15?

glib is only needed for the guest agent, or on platforms that do not
have makecontext/swapcontext (the BSDs).

Paolo
Re: [RFC PATCH v4 2/5] ramlist mutex
On 08/16/2011 08:56 PM, Umesh Deshpande wrote:
> @@ -3001,8 +3016,10 @@ void qemu_ram_free_from_ptr(ram_addr_t addr)
>
>      QLIST_FOREACH(block, &ram_list.blocks, next) {
>          if (addr == block->offset) {
> +            qemu_mutex_lock_ramlist();
>              QLIST_REMOVE(block, next);
>              QLIST_REMOVE(block, next_mru);
> +            qemu_mutex_unlock_ramlist();
>              qemu_free(block);
>              return;
>          }
> @@ -3015,8 +3032,10 @@ void qemu_ram_free(ram_addr_t addr)
>
>      QLIST_FOREACH(block, &ram_list.blocks, next) {
>          if (addr == block->offset) {
> +            qemu_mutex_lock_ramlist();
>              QLIST_REMOVE(block, next);
>              QLIST_REMOVE(block, next_mru);
> +            qemu_mutex_unlock_ramlist();
>              if (block->flags & RAM_PREALLOC_MASK) {
>                  ;
>              } else if (mem_path) {

You must protect the whole QLIST_FOREACH.  Otherwise looks good.

Paolo
Re: [RFC PATCH v4 4/5] separate thread for VM migration
On 08/16/2011 08:56 PM, Umesh Deshpande wrote:
> +            qemu_mutex_lock_ramlist();

Taken locks: iothread, ramlist

> +            qemu_mutex_unlock_iothread();

Taken locks: ramlist

> +            s->wait_for_unfreeze(s);
> +            qemu_mutex_lock_iothread();

Taken locks: ramlist, iothread

You'd have a deadlock here if hypothetically you had two migrations at
the same time.

> +            qemu_mutex_unlock_ramlist();

But in general, why this locking?  The buffered file need not know
anything about the ram list and its mutex.  Only ram_save_live needs to
hold the ramlist lock---starting just before sort_ram_list and ending
just after the end of stage 3.  That should be part of patch 2.

Actually buffered_file.c should ideally not even take the iothread lock!
The code there is only copying data from a private buffer to a file
descriptor; neither is shared.  It's migrate_fd_put_buffer that should
take care of locking.  This is an example of why keeping the separation
of QEMUBufferedFile is a good idea, at least for now.

I am still kind of unconvinced about calling qemu_fclose from the
migration thread.  You still have one instance of cancellation in the
iothread, when migrate_fd_release is called.  Ideally, as soon as
migration finishes or hits an error you could trigger a bottom half that
closes the file (which in turn joins the thread).

Migration state notifiers should also be run only from the iothread.
Failure to do so (or in general the lack of a policy of what runs where)
can lead to very difficult bugs.  Not so much hard to debug in this case
(we have a global lock, so things cannot go _that_ bad), but hard to fix
without redoing everything.

However, this patch is a good start (with the locking fixed).  It will
take several incremental steps to get there, including incredible
simplification if you take into account that migration can block and
wait_for_unfreeze can disappear.  In the end it probably should be
committed as a single patch, but I'm liking the patches more and more.

Paolo
Re: [RFC PATCH v3 5/5] Making iothread block for migrate_cancel
On 08/16/2011 08:56 PM, Umesh Deshpande wrote:
> diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
> index 2bd02ef..0d18b35 100644
> --- a/qemu-thread-posix.c
> +++ b/qemu-thread-posix.c
> @@ -115,6 +115,16 @@ void qemu_cond_wait(QemuCond *cond, QemuMutex *mutex)
>          error_exit(err, __func__);
>  }
>
> +void qemu_thread_join(QemuThread thread)

Should take a pointer.

Paolo
Re: [RFC PATCH v4 2/5] ramlist mutex
On 08/19/2011 08:20 AM, Umesh Deshpande wrote:
> Or, is it okay to convert all the ramblock list traversals in exec.c
> (under iothread) to mru traversals, and probably it makes sense as the
> original list was also maintained in the mru order, whereas the
> sequence of blocks doesn't matter for the migration code.  This way we
> don't have to acquire the mutex for block list traversals.

I'm not sure... as I said, the MRU list is on a fast path and
restricting it to that fast path keeps us honest.  Also, the non-MRU
list is almost never accessed outside the migration thread, so the mutex
shouldn't be heavily contended anyway.  You can also think about (not
too clever) ways to keep the mutex unlocked while not doing
ram_save_live.

BTW, actually the migration code tries to migrate the largest blocks
first (because usually all the blocks after the first are small and
easily migrated during the _complete pass), so the order does somewhat
matter for migration.

Paolo
Re: Segfault starting vcpu thread
Il 11/07/2012 14:08, Avi Kivity ha scritto:
>>>>> specific command line or guest?
>>>>
>>>> qemu-system-x86_64
>>>
>>> Just did the same, but it's all fine here.
>>
>> Ok, I'll debug it.  Probably something stupid like a miscompile.
>
> Indeed, a simple clean build fixed it up.  Paolo, it looks like
> autodependencies are still broken.

Hmm, I think everything happened while I was away.  I'll take a look.

Paolo
Re: [PATCH] hw/virtio-scsi: Set max_target=0 during vhost-scsi operation
Il 12/07/2012 07:34, Zhi Yong Wu ha scritto:
> HI,
>
> Do we need to maintain one QEMU branch to collect all useful latest
> patches for tcm_vhost support?  You know, those patches will not get
> merged into qemu.git/master.

Never say never, but the answer to your question is yes: please apply
this patch to your vhost-scsi branch and push it to github.

Paolo
Re: [PATCH-v2] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
Il 12/07/2012 09:23, James Bottomley ha scritto:
>>> Cc: Paolo Bonzini
>>> Cc: Stefan Hajnoczi
>>> Cc: Zhi Yong Wu
>>> Cc: Christoph Hellwig
>>> Cc: Hannes Reinecke
>>> Cc: James Bottomley
>>> Signed-off-by: Nicholas Bellinger
>
> Was the change so great that it needs re-acking?

Not really, but anyway

Acked-by: Paolo Bonzini

> I assume it also now no longer applies to stable because it will reject?

Yes, stable will have to use v1.

Paolo
Re: [PATCH uq/master 2/9] event_notifier: remove event_notifier_test
Il 12/07/2012 11:10, Avi Kivity ha scritto:
> On 07/05/2012 06:16 PM, Paolo Bonzini wrote:
>> This is broken; since the eventfd is used in nonblocking mode there
>> is a race between reading and writing.
>>
>> diff --git a/event_notifier.c b/event_notifier.c
>> index 2b210f4..c339bfe 100644
>> --- a/event_notifier.c
>> +++ b/event_notifier.c
>> @@ -51,18 +51,3 @@ int event_notifier_test_and_clear(EventNotifier *e)
>>      int r = read(e->fd, &value, sizeof(value));
>>      return r == sizeof(value);
>>  }
>> -
>> -int event_notifier_test(EventNotifier *e)
>> -{
>> -    uint64_t value;
>> -    int r = read(e->fd, &value, sizeof(value));
>> -    if (r == sizeof(value)) {
>> -        /* restore previous value. */
>> -        int s = write(e->fd, &value, sizeof(value));
>> -        /* never blocks because we use EFD_SEMAPHORE.
>> -         * If we didn't we'd get EAGAIN on overflow
>> -         * and we'd have to write code to ignore it. */
>> -        assert(s == sizeof(value));
>> -    }
>> -    return r == sizeof(value);
>> -}
>
> I don't see the race.  Mind explaining?

The assertion can actually fire; there's nothing that prevents this
from happening:

    event_notifier_test()
        read(fd, &value, 8)
                                        write(fd, , 8)
        write(fd, &value, 8)

event_notifier_set will always write a 1, and it would take a large
number of writes to reach overflow :) but that may not be true of other
writers using the same file descriptor.

Then, the comment is wrong in two ways.  First, we do not use
EFD_SEMAPHORE (though even if we did, the only difference is that value
would always be one).  Second, we cannot write code to ignore EAGAIN,
because then we've lost the value.

With blocking I/O things would not be much better, because then
event_notifier_test() might block on the write.  That would be quite
surprising.

If we cared, we could implement the function more easily and correctly
with poll(), checking for POLLIN in the revents.  But I don't see a
sensible use case for it anyway.

Paolo
Re: [PATCH uq/master 2/9] event_notifier: remove event_notifier_test
Il 12/07/2012 13:04, Avi Kivity ha scritto:
> Right, it's useless.  I'll adjust the comment (and the whitespace fix)
> and apply.

Ok, thanks very much!

Paolo
[PATCH 0/2] virtio-scsi spec: event improvements
This makes some changes to the virtio-scsi event specification, so that
it is now possible to use virtio-scsi events in the implementation of
the QEMU "block_resize" command.

Thanks to Cong Meng for finally implementing virtio-scsi hotplug, which
made me look at block_resize again!

Paolo Bonzini (2):
  virtio-scsi spec: unify event structs
  virtio-scsi spec: add configuration change event

 virtio-spec.lyx | 164 +--
 1 file changed, 160 insertions(+), 4 deletions(-)

--
1.7.10.4
[PATCH 1/2] virtio-scsi spec: unify event structs
All currently defined event structs have the same fields. Simplify the driver by enforcing this also for future structs. Signed-off-by: Paolo Bonzini --- virtio-spec.lyx | 69 +++ 1 file changed, 65 insertions(+), 4 deletions(-) diff --git a/virtio-spec.lyx b/virtio-spec.lyx index 905e619..f8b214b 100644 --- a/virtio-spec.lyx +++ b/virtio-spec.lyx @@ -8207,7 +8207,20 @@ struct virtio_scsi_event { \begin_layout Plain Layout +\change_deleted 1531152142 1342440791 + ... +\change_inserted 1531152142 1342440791 +u8 lun[8]; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 1531152142 1342440791 + +u32 reason; +\change_unchanged + \end_layout \begin_layout Plain Layout @@ -8221,16 +8234,32 @@ struct virtio_scsi_event { \end_layout \begin_layout Standard -If bit 31 is set in the event field, the device failed to report an event - due to missing buffers. +If bit 31 is set in the +\series bold +event +\series default + field, the device failed to report an event due to missing buffers. In this case, the driver should poll the logical units for unit attention conditions, and/or do whatever form of bus scan is appropriate for the guest operating system. \end_layout \begin_layout Standard -Other data that the device writes to the buffer depends on the contents - of the event field. + +\change_deleted 1531152142 1342440830 +Other data that the device writes to the buffer +\change_inserted 1531152142 1342440839 +The meaning of the +\series bold +reason +\series default + field +\change_unchanged + depends on the contents of the +\series bold +event +\series default + field. 
The following events are defined: \end_layout @@ -8312,36 +8341,50 @@ status open \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + struct virtio_scsi_event_reset { \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + // Write-only part \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + u32 event; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + u8 lun[8]; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + u32 reason; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + } \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440799 + \end_layout \begin_layout Plain Layout @@ -8542,40 +8585,58 @@ status open \begin_layout Plain Layout #define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 +\change_deleted 1531152142 1342440854 + \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + struct virtio_scsi_event_an { \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + // Write-only part \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + u32 event; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + u8 lun[8]; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + u32 reason; \end_layout \begin_layout Plain Layout +\change_deleted 1531152142 1342440854 + } +\change_unchanged + \end_layout \end_inset -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] virtio-scsi spec: add configuration change event
This adds an event for changes to LUN parameters, for example capacity. These are reported in virtio-blk via configuration changes, and we want a similar functionality in virtio-scsi too. There is no list of supported parameter changes, instead we just refer to the list of sense codes in the SCSI specification. This event will usually be serviced in one of three ways: 1) call an OS service to revalidate the disk, either always or only for some specific sense codes; 2) somehow pass the sense directly to the upper-level driver; 3) inject a TEST UNIT READY command into the upper-level device, so that the OS will see the unit attention code and react. Of course a mix of the three is also possible, depending on how the driver writer prefers to have his layering violations served. Signed-off-by: Paolo Bonzini --- virtio-spec.lyx | 95 +++ 1 file changed, 95 insertions(+) diff --git a/virtio-spec.lyx b/virtio-spec.lyx index f8b214b..8d2ac9a 100644 --- a/virtio-spec.lyx +++ b/virtio-spec.lyx @@ -6995,6 +6995,21 @@ VIRTIO_SCSI_F_HOTPLUG (1) The host should enable hot-plug/hot-unplug of new LUNs and targets on the SCSI bus. +\change_inserted 1531152142 1342440342 + +\end_layout + +\begin_layout Description + +\change_inserted 1531152142 1342440768 +VIRTIO_SCSI_F_CHANGE +\begin_inset space ~ +\end_inset + +(2) The host will report changes to LUN parameters via a VIRTIO_SCSI_T_PARAM_CHA +NGE event. +\change_unchanged + \end_layout \end_deeper @@ -8673,6 +8688,86 @@ reason \begin_layout Standard When dropped events are reported, the driver should poll for asynchronous events manually using SCSI commands. 
+\change_inserted 1531152142 1342439104 + +\end_layout + +\end_deeper +\begin_layout Description + +\change_inserted 1531152142 1342440778 +LUN +\begin_inset space ~ +\end_inset + +parameter +\begin_inset space ~ +\end_inset + +change +\begin_inset space ~ +\end_inset + + +\begin_inset Newline newline +\end_inset + + +\begin_inset listings +inline false +status open + +\begin_layout Plain Layout + +\change_inserted 1531152142 1342440783 + +#define VIRTIO_SCSI_T_PARAM_CHANGE 3 +\end_layout + +\end_inset + + +\end_layout + +\begin_deeper +\begin_layout Standard + +\change_inserted 1531152142 1342440882 +By sending this event, the device signals that the configuration parameters + (for example the capacity) of a logical unit have changed. + The +\series bold +event +\series default + field is set to VIRTIO_SCSI_T_PARAM_CHANGE. + The +\series bold +lun +\series default + field addresses a logical unit in the SCSI host. +\end_layout + +\begin_layout Standard + +\change_inserted 1531152142 1342440916 +The same event is also reported as a unit attention condition. + The +\series bold +reason +\series default + field contains the additional sense code and additional sense code qualifier, + respectively in bits 0..7 and 8..15. + For example, a change in capacity will be reported as asc 0x2a, ascq 0x09 + (CAPACITY DATA HAS CHANGED). +\end_layout + +\begin_layout Standard + +\change_inserted 1531152142 1342442803 +For MMC devices (inquiry type 5) there would be some overlap between this + event and the asynchronous notification event. + For simplicity, as of this version of the specification the host must + never report this event for MMC devices. \end_layout \end_deeper -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] virtio-scsi: support block_resize
This series adds support for block_resize to virtio-scsi.  Events are
reported via a new event type.  Patches to the spec are on the list.

Paolo Bonzini (5):
  scsi-disk: removable hard disks support START/STOP
  scsi-disk: report resized disk via sense codes
  scsi: establish precedence levels for unit attention
  scsi: report parameter changes to HBA drivers
  virtio-scsi: report parameter change events

 hw/scsi-bus.c    | 67 +-
 hw/scsi-disk.c   | 21 +++--
 hw/scsi.h        |  5 +
 hw/virtio-scsi.c | 16 +
 trace-events     |  1 +
 5 files changed, 102 insertions(+), 8 deletions(-)

--
1.7.10.4
[PATCH 1/5] scsi-disk: removable hard disks support START/STOP
Support for START/STOP UNIT right now is limited to CD-ROMs.  This is
wrong, since removable hard disks (in the real world: SD card readers)
also support it in pretty much the same way.

Signed-off-by: Paolo Bonzini
---
 hw/scsi-disk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index bcec66b..42bae3b 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1251,7 +1251,7 @@ static int scsi_disk_emulate_start_stop(SCSIDiskReq *r)
     bool start = req->cmd.buf[4] & 1;
     bool loej = req->cmd.buf[4] & 2; /* load on start, eject on !start */

-    if (s->qdev.type == TYPE_ROM && loej) {
+    if ((s->features & (1 << SCSI_DISK_F_REMOVABLE)) && loej) {
         if (!start && !s->tray_open && s->tray_locked) {
             scsi_check_condition(r,
                                  bdrv_is_inserted(s->qdev.conf.bs)
--
1.7.10.4
[PATCH 2/5] scsi-disk: report resized disk via sense codes
Linux will not use these, but a very similar mechanism will be used to report the condition via virtio-scsi events. Signed-off-by: Paolo Bonzini --- hw/scsi-bus.c |5 + hw/scsi-disk.c | 15 +++ hw/scsi.h |2 ++ 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index 8b961f2..2547c50 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -1161,6 +1161,11 @@ const struct SCSISense sense_code_LUN_FAILURE = { .key = ABORTED_COMMAND, .asc = 0x3e, .ascq = 0x01 }; +/* Unit attention, Capacity data has changed */ +const struct SCSISense sense_code_CAPACITY_CHANGED = { +.key = UNIT_ATTENTION, .asc = 0x2a, .ascq = 0x09 +}; + /* Unit attention, Power on, reset or bus device reset occurred */ const struct SCSISense sense_code_RESET = { .key = UNIT_ATTENTION, .asc = 0x29, .ascq = 0x00 diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c index 42bae3b..0905446 100644 --- a/hw/scsi-disk.c +++ b/hw/scsi-disk.c @@ -1863,6 +1863,13 @@ static void scsi_destroy(SCSIDevice *dev) blockdev_mark_auto_del(s->qdev.conf.bs); } +static void scsi_disk_resize_cb(void *opaque) +{ +SCSIDiskState *s = opaque; + +scsi_device_set_ua(&s->qdev, SENSE_CODE(CAPACITY_CHANGED)); +} + static void scsi_cd_change_media_cb(void *opaque, bool load) { SCSIDiskState *s = opaque; @@ -1904,11 +1911,13 @@ static bool scsi_cd_is_medium_locked(void *opaque) return ((SCSIDiskState *)opaque)->tray_locked; } -static const BlockDevOps scsi_cd_block_ops = { +static const BlockDevOps scsi_disk_block_ops = { .change_media_cb = scsi_cd_change_media_cb, .eject_request_cb = scsi_cd_eject_request_cb, .is_tray_open = scsi_cd_is_tray_open, .is_medium_locked = scsi_cd_is_medium_locked, + +.resize_cb = scsi_disk_resize_cb, }; static void scsi_disk_unit_attention_reported(SCSIDevice *dev) @@ -1956,9 +1965,7 @@ static int scsi_initfn(SCSIDevice *dev) return -1; } -if (s->features & (1 << SCSI_DISK_F_REMOVABLE)) { -bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_cd_block_ops, s); -} 
+bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s); bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize); bdrv_iostatus_enable(s->qdev.conf.bs); diff --git a/hw/scsi.h b/hw/scsi.h index 47c3b9c..e901350 100644 --- a/hw/scsi.h +++ b/hw/scsi.h @@ -198,6 +198,8 @@ extern const struct SCSISense sense_code_IO_ERROR; extern const struct SCSISense sense_code_I_T_NEXUS_LOSS; /* Command aborted, Logical Unit failure */ extern const struct SCSISense sense_code_LUN_FAILURE; +/* LUN not ready, Capacity data has changed */ +extern const struct SCSISense sense_code_CAPACITY_CHANGED; /* LUN not ready, Medium not present */ extern const struct SCSISense sense_code_UNIT_ATTENTION_NO_MEDIUM; /* Unit attention, Power on, reset or bus device reset occurred */ -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] scsi: establish precedence levels for unit attention
When a device is resized, we will report a unit attention condition for CAPACITY DATA HAS CHANGED. However, we should ensure that this condition does not override a more important unit attention condition. Signed-off-by: Paolo Bonzini --- hw/scsi-bus.c | 52 +++- hw/scsi-disk.c |4 ++-- hw/scsi.h |1 + trace-events |1 + 4 files changed, 55 insertions(+), 3 deletions(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index 2547c50..d5e1fb0 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -1530,6 +1530,55 @@ void scsi_req_abort(SCSIRequest *req, int status) scsi_req_unref(req); } +static int scsi_ua_precedence(SCSISense sense) +{ +if (sense.key != UNIT_ATTENTION) { +return INT_MAX; +} +if (sense.asc == 0x29 && sense.ascq == 0x04) { +/* DEVICE INTERNAL RESET goes with POWER ON OCCURRED */ +return 1; +} else if (sense.asc == 0x3F && sense.ascq == 0x01) { +/* MICROCODE HAS BEEN CHANGED goes with SCSI BUS RESET OCCURRED */ +return 2; +} else if (sense.asc == 0x29 && (sense.ascq == 0x05 || sense.ascq == 0x06)) { +/* These two go with "all others". */ +; +} else if (sense.asc == 0x29 && sense.ascq <= 0x07) { +/* POWER ON, RESET OR BUS DEVICE RESET OCCURRED = 0 + * POWER ON OCCURRED = 1 + * SCSI BUS RESET OCCURRED = 2 + * BUS DEVICE RESET FUNCTION OCCURRED = 3 + * I_T NEXUS LOSS OCCURRED = 7 + */ +return sense.ascq; +} else if (sense.asc == 0x2F && sense.ascq == 0x01) { +/* COMMANDS CLEARED BY POWER LOSS NOTIFICATION */ +return 8; +} +return (sense.asc << 8) | sense.ascq; +} + +void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense) +{ +int prec1, prec2; +if (sense.key != UNIT_ATTENTION) { +return; +} +trace_scsi_device_set_ua(sdev->id, sdev->lun, sense.key, + sense.asc, sense.ascq); + +/* + * Override a pre-existing unit attention condition, except for a more + * important reset condition. 
+*/ +prec1 = scsi_ua_precedence(sdev->unit_attention); +prec2 = scsi_ua_precedence(sense); +if (prec2 < prec1) { +sdev->unit_attention = sense; +} +} + void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense) { SCSIRequest *req; @@ -1538,7 +1587,8 @@ void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense) req = QTAILQ_FIRST(&sdev->requests); scsi_req_cancel(req); } -sdev->unit_attention = sense; + +scsi_device_set_ua(sdev, sense); } static char *scsibus_get_dev_path(DeviceState *dev) diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c index 0905446..fabd0bf 100644 --- a/hw/scsi-disk.c +++ b/hw/scsi-disk.c @@ -1886,7 +1886,7 @@ static void scsi_cd_change_media_cb(void *opaque, bool load) */ s->media_changed = load; s->tray_open = !load; -s->qdev.unit_attention = SENSE_CODE(UNIT_ATTENTION_NO_MEDIUM); +scsi_device_set_ua(&s->qdev, SENSE_CODE(UNIT_ATTENTION_NO_MEDIUM)); s->media_event = true; s->eject_request = false; } @@ -1925,7 +1925,7 @@ static void scsi_disk_unit_attention_reported(SCSIDevice *dev) SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, dev); if (s->media_changed) { s->media_changed = false; -s->qdev.unit_attention = SENSE_CODE(MEDIUM_CHANGED); +scsi_device_set_ua(&s->qdev, SENSE_CODE(MEDIUM_CHANGED)); } } diff --git a/hw/scsi.h b/hw/scsi.h index e901350..4804be9 100644 --- a/hw/scsi.h +++ b/hw/scsi.h @@ -239,6 +239,7 @@ void scsi_req_abort(SCSIRequest *req, int status); void scsi_req_cancel(SCSIRequest *req); void scsi_req_retry(SCSIRequest *req); void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense); +void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense); int scsi_device_get_sense(SCSIDevice *dev, uint8_t *buf, int len, bool fixed); SCSIDevice *scsi_device_find(SCSIBus *bus, int channel, int target, int lun); diff --git a/trace-events b/trace-events index 7baa42d..90e9c2a 100644 --- a/trace-events +++ b/trace-events @@ -393,6 +393,7 @@ scsi_req_parsed(int target, int lun, int tag, int cmd, int mode, int xfer) "targ 
scsi_req_parsed_lba(int target, int lun, int tag, int cmd, uint64_t lba) "target %d lun %d tag %d command %d lba %"PRIu64 scsi_req_parse_bad(int target, int lun, int tag, int cmd) "target %d lun %d tag %d command %d" scsi_req_build_sense(int target, int lun, int tag, int key, int asc, int ascq) "target %d lun %d tag %d key %#02x asc %#02x ascq %#02x" +scsi_device_set_ua(int target, int lun, int key, int asc, int ascq) "target %d lun %d key %#02x asc %#02x ascq %#02x" scsi_report_luns(int target, int lun, int tag) "target %d l
[PATCH 4/5] scsi: report parameter changes to HBA drivers
Signed-off-by: Paolo Bonzini --- hw/scsi-bus.c | 10 ++ hw/scsi-disk.c |2 +- hw/scsi.h |2 ++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c index d5e1fb0..77aa946 100644 --- a/hw/scsi-bus.c +++ b/hw/scsi-bus.c @@ -1072,6 +1072,16 @@ int scsi_req_parse(SCSICommand *cmd, SCSIDevice *dev, uint8_t *buf) return 0; } +void scsi_device_report_change(SCSIDevice *dev, SCSISense sense) +{ +SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, dev->qdev.parent_bus); + +scsi_device_set_ua(dev, sense); +if (bus->info->change) { +bus->info->change(bus, dev, sense); +} +} + /* * Predefined sense codes */ diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c index fabd0bf..8351de5 100644 --- a/hw/scsi-disk.c +++ b/hw/scsi-disk.c @@ -1867,7 +1867,7 @@ static void scsi_disk_resize_cb(void *opaque) { SCSIDiskState *s = opaque; -scsi_device_set_ua(&s->qdev, SENSE_CODE(CAPACITY_CHANGED)); +scsi_device_report_change(&s->qdev, SENSE_CODE(CAPACITY_CHANGED)); } static void scsi_cd_change_media_cb(void *opaque, bool load) diff --git a/hw/scsi.h b/hw/scsi.h index 4804be9..380bb89 100644 --- a/hw/scsi.h +++ b/hw/scsi.h @@ -132,6 +132,7 @@ struct SCSIBusInfo { void (*cancel)(SCSIRequest *req); void (*hotplug)(SCSIBus *bus, SCSIDevice *dev); void (*hot_unplug)(SCSIBus *bus, SCSIDevice *dev); +void (*change)(SCSIBus *bus, SCSIDevice *dev, SCSISense sense); QEMUSGList *(*get_sg_list)(SCSIRequest *req); void (*save_request)(QEMUFile *f, SCSIRequest *req); @@ -240,6 +241,7 @@ void scsi_req_cancel(SCSIRequest *req); void scsi_req_retry(SCSIRequest *req); void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense); void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense); +void scsi_device_report_change(SCSIDevice *dev, SCSISense sense); int scsi_device_get_sense(SCSIDevice *dev, uint8_t *buf, int len, bool fixed); SCSIDevice *scsi_device_find(SCSIBus *bus, int channel, int target, int lun); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in 
the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] virtio-scsi: report parameter change events
Signed-off-by: Paolo Bonzini --- hw/virtio-scsi.c | 16 1 file changed, 16 insertions(+) diff --git a/hw/virtio-scsi.c b/hw/virtio-scsi.c index 83dbabd..80a47d7 100644 --- a/hw/virtio-scsi.c +++ b/hw/virtio-scsi.c @@ -27,6 +27,7 @@ /* Feature Bits */ #define VIRTIO_SCSI_F_INOUT0 #define VIRTIO_SCSI_F_HOTPLUG 1 +#define VIRTIO_SCSI_F_CHANGE 2 /* Response codes */ #define VIRTIO_SCSI_S_OK 0 @@ -63,6 +64,7 @@ #define VIRTIO_SCSI_T_NO_EVENT 0 #define VIRTIO_SCSI_T_TRANSPORT_RESET 1 #define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 +#define VIRTIO_SCSI_T_PARAM_CHANGE 3 /* Reasons for transport reset event */ #define VIRTIO_SCSI_EVT_RESET_HARD 0 @@ -554,6 +556,7 @@ static uint32_t virtio_scsi_get_features(VirtIODevice *vdev, uint32_t requested_features) { requested_features |= (1UL << VIRTIO_SCSI_F_HOTPLUG); +requested_features |= (1UL << VIRTIO_SCSI_F_CHANGE); return requested_features; } @@ -637,6 +640,18 @@ static void virtio_scsi_handle_event(VirtIODevice *vdev, VirtQueue *vq) } } +static void virtio_scsi_change(SCSIBus *bus, SCSIDevice *dev, SCSISense sense) +{ +VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus); + +if (((s->vdev.guest_features >> VIRTIO_SCSI_F_CHANGE) & 1) && +(s->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && +dev->type != TYPE_ROM) { +virtio_scsi_push_event(s, dev, VIRTIO_SCSI_T_PARAM_CHANGE, + sense.asc | (sense.ascq << 8)); +} +} + static void virtio_scsi_hotplug(SCSIBus *bus, SCSIDevice *dev) { VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus); @@ -666,6 +681,7 @@ static struct SCSIBusInfo virtio_scsi_scsi_info = { .complete = virtio_scsi_command_complete, .cancel = virtio_scsi_request_cancelled, +.change = virtio_scsi_change, .hotplug = virtio_scsi_hotplug, .hot_unplug = virtio_scsi_hot_unplug, .get_sg_list = virtio_scsi_get_sg_list, -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Add vhost-blk support
On 17/07/2012 10:29, Asias He wrote:
> So, vhost-blk at least saves ~6 syscalls for us in each request.

Are they really 6?  If I/O is coalesced by a factor of 3, for example
(i.e. each exit processes 3 requests), it's really 2 syscalls per request.

Also, is there anything we can improve?  Perhaps we can modify epoll and
ask it to clear the eventfd for us (would save 2 reads)?  Or io_getevents
(would save 1)?

> I guess you mean qemu here. Yes, in theory, qemu's block layer can be
> improved to achieve similar performance as vhost-blk or kvm tool's
> userspace virtio-blk has. But I think it makes no sense to prevent one
> solution because there is another in-theory solution called: we can do
> similar in qemu.

It depends.  Like vhost-scsi, vhost-blk has the problem of a crippled
feature set: no support for block device formats, non-raw protocols,
etc.  This makes it different from vhost-net.

So it begs the question: is it going to be used in production, or just
a useful reference tool?

Paolo
Re: [PATCH 0/5] Add vhost-blk support
On 17/07/2012 11:21, Asias He wrote:
>> It depends. Like vhost-scsi, vhost-blk has the problem of a crippled
>> feature set: no support for block device formats, non-raw protocols,
>> etc. This makes it different from vhost-net.
>
> Data-plane qemu also has this crippled feature set problem, no?

Yes, but that is just a proof of concept.  We can implement a separate
I/O thread within the QEMU block layer, and add fast paths that resemble
data-plane QEMU, without limiting the feature set.

> Does the user always choose to use block device formats like qcow2?
> What if they prefer a raw image or raw block device?

If they do, the code should hit the fast paths and be fast.  But it
should be automatic, without the need for extra knobs.  aio=threads
vs. aio=native is already one knob too many, IMHO.

>> So it begs the question: is it going to be used in production, or just
>> a useful reference tool?
>
> This should be decided by the user, I cannot speak for them. What is
> wrong with adding one option which they can decide on?

Having to explain the relative benefits to the user; having to support
the API; having to handle the transition away from one more thing when
something better comes out.

Paolo
Re: [PATCH 0/5] Add vhost-blk support
On 17/07/2012 11:45, Michael S. Tsirkin wrote:
>> So it begs the question: is it going to be used in production, or just
>> a useful reference tool?
>
> Sticking to raw already makes virtio-blk faster, doesn't it?
> In that light, vhost-blk looks to me like just another optimization
> option. Ideally I think the user just should not care where we handle
> virtio: in-kernel or in userspace. One can imagine it being
> enabled/disabled automatically if none of the features unsupported by
> it are used.

Ok, that would make more sense.  One difference between vhost-blk and
vhost-net is that for vhost-blk there are also management actions that
would trigger the switch, for example a live snapshot.

So a prerequisite for vhost-blk would be that it is possible to disable
it on the fly while the VM is running, as soon as all in-flight I/O is
completed.

(Note that, however, this is not possible for vhost-scsi, because it
really exposes different hardware to the guest.  It must not happen that
a kernel upgrade or downgrade toggles between userspace SCSI and
vhost-scsi, for example.)

>> having to support the API; having to handle the transition away from
>> one more thing when something better comes out.
>
> Well this is true for any code. If the limited feature set which
> vhost-blk can accelerate is something many people use, then
> accelerating it by 5-15% might outweigh the support costs.

It is definitely what people use if they are interested in performance.

Paolo
Re: [PATCH 0/5] Add vhost-blk support
On 17/07/2012 12:49, Michael S. Tsirkin wrote:
>> Ok, that would make more sense. One difference between vhost-blk and
>> vhost-net is that for vhost-blk there are also management actions that
>> would trigger the switch, for example a live snapshot.
>> So a prerequisite for vhost-blk would be that it is possible to disable
>> it on the fly while the VM is running, as soon as all in-flight I/O is
>> completed.
>
> It applies to vhost-net too. For example if you bring the link down,
> we switch to userspace. So vhost-net supports this switch on the fly.

Cool.

>> (Note that, however, this is not possible for vhost-scsi, because it
>> really exposes different hardware to the guest. It must not happen that
>> a kernel upgrade or downgrade toggles between userspace SCSI and
>> vhost-scsi, for example.)
>
> I would say this is not a prerequisite for merging in qemu.
> It might be a required feature for production but it
> is also solvable at the management level.

I'm thinking of level-triggered interrupts here.  You cannot make a
change in the guest and have it make completely unrelated changes to the
hardware that the guest sees.

>>>> having to support the API; having to handle the transition away from
>>>> one more thing when something better comes out.
>>>
>>> Well this is true for any code. If the limited feature set which
>>> vhost-blk can accelerate is something many people use, then
>>> accelerating it by 5-15% might outweigh the support costs.
>>
>> It is definitely what people use if they are interested in performance.
>
> In that case it seems to me we should stop using the feature set as
> an argument and focus on whether the extra code is worth the 5-15% gain.
> No one seems to have commented on that, so everyone on the list thinks
> that aspect is OK?

I would like to see a breakdown of _where_ the 5-15% lies, something
like http://www.linux-kvm.org/page/Virtio/Block/Latency.

> The kernel merge window is coming up and I would like to see whether
> any of vhost-blk / vhost-scsi is going to be actually used by userspace.
> I guess we could tag it for staging but it would be nice to avoid that.

Staging would be fine by me for both vhost-blk and vhost-scsi.

Paolo
Re: [Qemu-devel] [PATCH 2/5] scsi-disk: report resized disk via sense codes
On 17/07/2012 13:14, Kevin Wolf wrote:
>>> -    if (s->features & (1 << SCSI_DISK_F_REMOVABLE)) {
>>> -        bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_cd_block_ops, s);
>>> -    }
>>> +    bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s);
>
> Are you aware of this code?
>
>     bool bdrv_dev_has_removable_media(BlockDriverState *bs)
>     {
>         return !bs->dev || (bs->dev_ops && bs->dev_ops->change_media_cb);
>     }
>
> This means that now all SCSI disks have removable media from the
> monitor's point of view.

I remembered there was something similar but I couldn't find it.  I'll
rework the patch to have two separate sets of dev_ops.  Thanks very
much!

Paolo
Re: [PATCH 0/5] Add vhost-blk support
On 17/07/2012 14:48, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 01:03:39PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Jul 17, 2012 at 12:54 PM, Michael S. Tsirkin wrote:
>>>> Knowing the answer to that is important before anyone can say
>>>> whether this approach is good or not.
>>>>
>>>> Stefan
>>>
>>> Why is it?
>>
>> Because there might be a fix to kvmtool which closes the gap. It
>> would be embarrassing if vhost-blk was pushed just because no one
>> looked into what is actually going on.
>
> Embarrassing to whom? Is someone working on an optimization that
> makes the work in question redundant, with posting just around
> the corner? Then maybe the thing to do is just wait a bit.

Of course there is work going on to make QEMU perform better.  Not sure
about lkvm.

>> And on the flipside, hard evidence of an overhead that cannot be
>> resolved could be a good reason to do more vhost devices in the future.
>
> How can one have hard evidence of an overhead that cannot be resolved?

Since we do have two completely independent userspaces (lkvm and
data-plane QEMU), you can build up some compelling evidence of an
overhead that cannot be resolved in user space.

>> Either way, it's useful to do this before going further.
>
> I think each work should be discussed on its own merits. Maybe
> vhost-blk is just well written.

So?  What is your conclusion?  If it's just that vhost-blk is written
well, my conclusion is that the lkvm people should look into improving
their virtio-blk userspace.  We take hints from each other all the
time; for example, virtio-scsi will have unlocked kick in 3.6.

Why can't vhost-* just get into staging, and we call it a day?

Paolo
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 18/07/2012 15:42, Anthony Liguori wrote:
> If you add support for a new command, you need to provide userspace a
> way to disable this command. If you change what gets reported for VPD,
> you need to provide userspace a way to make VPD look like what it did
> in a previous version.

The QEMU target is not enforcing this to this level.  We didn't for
CD-ROM ATAPI, and we're not doing it for SCSI.

It may indeed be useful for changes to VPD pages or major features.
However, so far we've never introduced any feature that deserved it.
This is also because OSes typically don't care: they use a small subset
of the features, and all the remaining "decorations" are only needed to
be pedantically compliant with the spec.

Paolo
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 18/07/2012 21:12, Anthony Liguori wrote:
> Is that true for all OSes? Linux may handle things gracefully if UNMAP
> starts throwing errors but that doesn't mean that Windows will.

There is so much USB crap out there (not just removable media; think of
ATA-to-USB enclosures) that they have to deal with pretty much
everything.

> Windows does this with a points system and I do believe that INQUIRY
> responses from any local disks are included in this tally.

INQUIRY responses (at least vendor/product/type) should not change.

Paolo
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 19/07/2012 09:28, James Bottomley wrote:
>> INQUIRY responses (at least vendor/product/type) should not change.
>
> INQUIRY responses often change for arrays because a firmware upgrade
> enables new features, and new features have to declare themselves,
> usually in the INQUIRY data. What you mean, I think, is that previously
> exposed features in INQUIRY data, as well as strings
> (vendor/product/type, as you say), shouldn't change, but unexposed data
> (read 0 in the fields) may.

What I meant is that it's unlikely that Windows fingerprinting uses
anything but vendor/product/type, because everything else can change.

Paolo
Re: [QEMU PATCH 0/2] virtio-blk: writeback cache enable improvements
On 03/07/2012 15:20, Paolo Bonzini wrote:
> These patches let virtio-blk use the new support for toggling the cache
> mode between writethrough and writeback.
>
> The first patch introduces a new feature bit and configuration field to
> do this. The second patch disables writeback caching for guests that do
> not negotiate VIRTIO_BLK_F_WCACHE (meaning that they cannot send flush
> requests), so that limited or older guests are now safe wrt power
> losses. VIRTIO_BLK_F_FLUSH was introduced in Linux 2.6.32 (in 2009)
> and was backported to RHEL/CentOS 5.6 (in 2010).
>
> The Windows drivers (which work by emulating SCSI on top of virtio-blk)
> have bugs in this area, which I reported on the Red Hat Bugzilla as
> bugs 837321 and 837324. With these patches they will suffer a
> performance hit but gain correctness.
>
> Paolo Bonzini (2):
>   virtio-blk: support VIRTIO_BLK_F_CONFIG_WCE
>   virtio-blk: disable write cache if not negotiated

Ping - Anthony, mst?

Paolo
Re: [Qemu-devel] [PATCH 1/5] scsi-disk: removable hard disks support START/STOP
On 23/07/2012 18:44, Blue Swirl wrote:
>>> Support for START/STOP UNIT right now is limited to CD-ROMs. This is
>>> wrong, since removable hard disks (in the real world: SD card
>>> readers) also support it in pretty much the same way.
>
> I vaguely remember tuning a set of large SCSI hard disks
> (non-removable) so that they didn't all start immediately at the same
> time (which could have burned out the PSU) but only with the START UNIT
> command. I think Linux or maybe even the BIOS started the drives
> (nicely in sequence) before accessing them.

Yes, all disks support start/stop.  Removable disks support load and
eject in addition.  I'll fix the commit message.

Paolo
Re: [RFC 2/9] virtio-pci: support host notifiers in TCG mode
On 25/07/2012 00:33, Nicholas A. Bellinger wrote:
> From: Stefan Hajnoczi
>
> Normally host notifiers are only used together with vhost-net in KVM
> mode. It is occasionally useful to use vhost with TCG mode, mainly for
> testing and development. This isn't hard to achieve: simply fall back
> to notifying the host notifier manually from qemu if KVM mode is
> disabled.
>
> Signed-off-by: Stefan Hajnoczi
> Cc: Anthony Liguori
> Cc: Paolo Bonzini
> Signed-off-by: Nicholas Bellinger
> ---
>  hw/virtio-pci.c | 23 ---
>  hw/virtio.c     |  7 +++
>  2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 4e03f0b..538eef4 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -249,6 +249,25 @@ void virtio_pci_reset(DeviceState *d)
>      proxy->flags &= ~VIRTIO_PCI_FLAG_BUS_MASTER_BUG;
>  }
>
> +static void virtio_pci_queue_notify(VirtIOPCIProxy *proxy, uint32_t n)
> +{
> +    VirtQueue *vq;
> +    EventNotifier *notifier;
> +
> +    if (n >= VIRTIO_PCI_QUEUE_MAX) {
> +        return;
> +    }
> +
> +    vq = virtio_get_queue(proxy->vdev, n);
> +    notifier = virtio_queue_get_host_notifier(vq);
> +    if (event_notifier_valid(notifier)) {
> +        printf("notifying vq %u host notifier from userspace\n", n);

Debug printf.

> +        event_notifier_notify(notifier);
> +    } else {
> +        virtio_queue_notify_vq(vq);
> +    }

This can be done directly in virtio_queue_notify; there is nothing
specific to virtio-pci.

> +}
> +
>  static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
>  {
>      VirtIOPCIProxy *proxy = opaque;
> @@ -278,9 +297,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
>          vdev->queue_sel = val;
>          break;
>      case VIRTIO_PCI_QUEUE_NOTIFY:
> -        if (val < VIRTIO_PCI_QUEUE_MAX) {
> -            virtio_queue_notify(vdev, val);
> -        }
> +        virtio_pci_queue_notify(proxy, val);
>          break;
>      case VIRTIO_PCI_STATUS:
>          if (!(val & VIRTIO_CONFIG_S_DRIVER_OK)) {
> diff --git a/hw/virtio.c b/hw/virtio.c
> index d146f86..36a18b5 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -536,6 +536,11 @@ void virtio_reset(void *opaque)
>          vdev->vq[i].signalled_used = 0;
>          vdev->vq[i].signalled_used_valid = false;
>          vdev->vq[i].notification = true;
> +
> +        assert(!event_notifier_valid(&vdev->vq[i].guest_notifier));
> +        assert(!event_notifier_valid(&vdev->vq[i].host_notifier));
> +        vdev->vq[i].guest_notifier = EVENT_NOTIFIER_INITIALIZER;
> +        vdev->vq[i].host_notifier = EVENT_NOTIFIER_INITIALIZER;

Given the assertions, the assignments should be no-ops.

>      }
>  }
>
> @@ -905,6 +910,8 @@ VirtIODevice *virtio_common_init(const char *name, uint16_t device_id,
>      for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>          vdev->vq[i].vector = VIRTIO_NO_VECTOR;
>          vdev->vq[i].vdev = vdev;
> +        vdev->vq[i].guest_notifier = EVENT_NOTIFIER_INITIALIZER;
> +        vdev->vq[i].host_notifier = EVENT_NOTIFIER_INITIALIZER;
>      }
>
>      vdev->name = name;
Re: [RFC 1/9] notifier: add validity check and notify function
On 25/07/2012 00:33, Nicholas A. Bellinger wrote:
> +int event_notifier_notify(EventNotifier *e)
> +{
> +    uint64_t value = 1;
> +    int r;
> +
> +    assert(event_notifier_valid(e));
> +    r = write(e->fd, &value, sizeof(value));
> +    if (r < 0) {
> +        return -errno;
> +    }
> +    assert(r == sizeof(value));
> +    return 0;
> +}

Note we now have event_notifier_set.

> +#define EVENT_NOTIFIER_INITIALIZER ((EventNotifier){ .fd = -1 })

This is problematic when your event notifier is inside a struct, because
it is extremely easy to forget the initializer.  You have to initialize
them yourself.

Also, the right thing to test is not whether the notifier is
initialized; it is whether the notifier is actually checked in QEMU's
select() loop.  So, I would prefer avoiding event_notifier_valid and
just using a boolean (in virtio_queue_set_host_notifier_fd_handler and
virtio_queue_set_guest_notifier_fd_handler) to track whether the
notifiers are in use.

Paolo
Re: [RFC 9/9] vhost-scsi: add -vhost-scsi host device
On 25/07/2012 00:34, Nicholas A. Bellinger wrote:
> From: Stefan Hajnoczi
>
> This patch adds a new type of host device that drives the vhost_scsi
> device. The syntax to add vhost-scsi is:
>
>     qemu -vhost-scsi id=vhost-scsi0,wwpn=...,tpgt=123
>
> The virtio-scsi emulated device will make use of vhost-scsi to process
> virtio-scsi requests inside the kernel and hand them to the in-kernel
> SCSI target stack.
>
> Changelog v0 -> v1:
>
> - Add VHOST_SCSI_SET_ENDPOINT call (stefan)
> - Enable vhost notifiers for multiple queues (Zhi)
> - clear vhost-scsi endpoint on stopped (Zhi)
> - Add CONFIG_VHOST_SCSI for QEMU build configure (nab)
> - Rename vhost_vring_target -> vhost_scsi_target (mst + nab)
> - Add support for VHOST_SCSI_GET_ABI_VERSION ioctl (aliguori + nab)
>
> Signed-off-by: Stefan Hajnoczi
> Signed-off-by: Zhi Yong Wu
> Cc: Anthony Liguori
> Cc: Paolo Bonzini
> Cc: Michael S. Tsirkin
> Signed-off-by: Nicholas Bellinger
> ---
>  configure            |  10 +++
>  hw/Makefile.objs     |   1 +
>  hw/qdev-properties.c |  32 +
>  hw/qdev.h            |   3 +
>  hw/vhost-scsi.c      | 173 ++
>  hw/vhost-scsi.h      |  50 ++
>  qemu-common.h        |   1 +
>  qemu-config.c        |  16 +
>  qemu-options.hx      |   4 +
>  vl.c                 |  18 +
>  10 files changed, 308 insertions(+), 0 deletions(-)
>  create mode 100644 hw/vhost-scsi.c
>  create mode 100644 hw/vhost-scsi.h
>
> diff --git a/configure b/configure
> index cef0a71..2291e7e 100755
> --- a/configure
> +++ b/configure
> @@ -144,6 +144,7 @@ libattr=""
>  xfs=""
>
>  vhost_net="no"
> +vhost_scsi="no"
>  kvm="no"
>  gprof="no"
>  debug_tcg="no"
> @@ -489,6 +490,7 @@ Haiku)
>    usb="linux"
>    kvm="yes"
>    vhost_net="yes"
> +  vhost_scsi="yes"
>    if [ "$cpu" = "i386" -o "$cpu" = "x86_64" ] ; then
>      audio_possible_drivers="$audio_possible_drivers fmod"
>    fi
> @@ -794,6 +796,10 @@ for opt do
>    ;;
>    --enable-vhost-net) vhost_net="yes"
>    ;;
> +  --disable-vhost-net) vhost_scsi="no"
> +  ;;
> +  --enable-vhost-net) vhost_scsi="yes"
> +  ;;

The case labels are using vhost-net.

>    --disable-opengl) opengl="no"
>    ;;
>    --enable-opengl) opengl="yes"
> @@ -3059,6 +3065,7 @@ echo "posix_madvise     $posix_madvise"
>  echo "uuid support      $uuid"
>  echo "libcap-ng support $cap_ng"
>  echo "vhost-net support $vhost_net"
> +echo "vhost-scsi support $vhost_scsi"
>  echo "Trace backend     $trace_backend"
>  echo "Trace output file $trace_file-"
>  echo "spice support     $spice"
> @@ -3762,6 +3769,9 @@ case "$target_arch2" in
>    if test "$vhost_net" = "yes" ; then
>      echo "CONFIG_VHOST_NET=y" >> $config_target_mak
>    fi
> +  if test "$vhost_scsi" = "yes" ; then
> +    echo "CONFIG_VHOST_SCSI=y" >> $config_target_mak
> +  fi
>  fi
>  esac
>  case "$target_arch2" in
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index 8327e55..37a0d0e 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -161,6 +161,7 @@ obj-$(CONFIG_VIRTIO) += virtio.o virtio-blk.o virtio-balloon.o virtio-net.o
>  obj-$(CONFIG_VIRTIO) += virtio-serial-bus.o virtio-scsi.o
>  obj-$(CONFIG_SOFTMMU) += vhost_net.o
>  obj-$(CONFIG_VHOST_NET) += vhost.o
> +obj-$(CONFIG_VHOST_SCSI) += vhost-scsi.o
>  obj-$(CONFIG_REALLY_VIRTFS) += 9pfs/
>  obj-$(CONFIG_NO_PCI) += pci-stub.o
>  obj-$(CONFIG_VGA) += vga.o
> diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
> index 24b39e8..0c538d2 100644
> --- a/hw/qdev-properties.c
> +++ b/hw/qdev-properties.c
> @@ -3,6 +3,7 @@
>  #include "qerror.h"
>  #include "blockdev.h"
>  #include "hw/block-common.h"
> +#include "vhost-scsi.h"
>
>  void *qdev_get_prop_ptr(DeviceState *dev, Property *prop)
>  {
> @@ -685,6 +686,37 @@ PropertyInfo qdev_prop_vlan = {
>      .set = set_vlan,
>  };
>
> +/* --- vhost-scsi --- */
> +
> +static int parse_vhost_scsi(DeviceState *dev, Property *prop, const char *str)
> +{
> +    VHostSCSI **ptr = qdev_get_prop_ptr(dev, prop);
> +
> +    *ptr = find_vhost_scsi(str);
> +    if (*ptr == NULL) {
> +        return -ENOENT;
Re: [RFC 7/9] virtio-scsi: Start/stop vhost
On 25/07/2012 00:34, Nicholas A. Bellinger wrote:
> From: Stefan Hajnoczi
>
> This patch starts and stops vhost as the virtio device transitions
> through its status phases. Vhost can only be started once the guest
> reports its driver has successfully initialized, which means the
> virtqueues have been set up by the guest.
>
> (v2: Squash virtio-scsi: use the vhost-scsi host device from stefan)
>
> Signed-off-by: Stefan Hajnoczi
> Signed-off-by: Zhi Yong Wu
> Cc: Michael S. Tsirkin
> Cc: Paolo Bonzini
> Signed-off-by: Nicholas Bellinger

Hmm, this is not what the patch does... :)

Paolo
Re: [RFC 7/9] virtio-scsi: Start/stop vhost
On 25/07/2012 09:01, Paolo Bonzini wrote:
>> From: Stefan Hajnoczi
>>
>> This patch starts and stops vhost as the virtio device transitions
>> through its status phases. Vhost can only be started once the guest
>> reports its driver has successfully initialized, which means the
>> virtqueues have been set up by the guest.
>>
>> (v2: Squash virtio-scsi: use the vhost-scsi host device from stefan)
>>
>> Signed-off-by: Stefan Hajnoczi
>> Signed-off-by: Zhi Yong Wu
>> Cc: Michael S. Tsirkin
>> Cc: Paolo Bonzini
>> Signed-off-by: Nicholas Bellinger
>
> Hmm, this is not what the patch does... :)

Oops, the above comment was meant for patch 5, which is a one-liner that
can be squashed here.

Anyway, there is some problem with the ordering of the patches, because
this patch includes vhost-scsi.h (introduced in patch 9) and patch 5
uses VHostSCSI (defined by vhost-scsi.h).

Paolo
Re: [RFC 6/9] virtio-scsi: Open and initialize /dev/vhost-scsi
On 25/07/2012 00:34, Nicholas A. Bellinger wrote:
> From: Stefan Hajnoczi
>
> Begin adding vhost support by opening /dev/vhost-scsi.
>
> (v2: Drop legacy ->vhost_vqs[] usage)
>
> Signed-off-by: Stefan Hajnoczi
> Signed-off-by: Zhi Yong Wu
> Cc: Michael S. Tsirkin
> Cc: Paolo Bonzini
> Signed-off-by: Nicholas Bellinger
> ---
>  hw/virtio-scsi.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/hw/virtio-scsi.c b/hw/virtio-scsi.c
> index 4a787d3..dea3269 100644
> --- a/hw/virtio-scsi.c
> +++ b/hw/virtio-scsi.c
> @@ -18,6 +18,7 @@
>  #include "virtio-scsi.h"
>  #include
>  #include
> +#include "vhost.h"
>
>  #define VIRTIO_SCSI_VQ_SIZE     128
>  #define VIRTIO_SCSI_CDB_SIZE    32
> @@ -137,6 +138,9 @@ typedef struct {
>      VirtQueue *ctrl_vq;
>      VirtQueue *event_vq;
>      VirtQueue *cmd_vqs[0];
> +
> +    bool vhost_started;
> +    VHostSCSI *vhost_scsi;
>  } VirtIOSCSI;
>
>  typedef struct VirtIOSCSIReq {
> @@ -456,6 +460,11 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq *req)
>      virtio_scsi_complete_req(req);
>  }
>
> +static VirtIOSCSI *to_virtio_scsi(VirtIODevice *vdev)
> +{
> +    return (VirtIOSCSI *)vdev;
> +}

This function is unused and, because it is static, it will break
compilation.  Not that it would compile anyway, since VHostSCSI is not
defined yet... ;)

>  static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOSCSI *s = (VirtIOSCSI *)vdev;
> @@ -605,6 +614,8 @@ VirtIODevice *virtio_scsi_init(DeviceState *dev, VirtIOSCSIConf *proxyconf)
>
>      s->qdev = dev;
>      s->conf = proxyconf;
> +    s->vhost_started = false;
> +    s->vhost_scsi = proxyconf->vhost_scsi;

No need for this, just use s->conf->vhost_scsi.

Also, please do not register the QEMU SCSI bus if vhost-scsi is active.

Paolo

>
>      /* TODO set up vdev function pointers */
>      s->vdev.get_config = virtio_scsi_get_config;
> @@ -636,5 +647,6 @@ void virtio_scsi_exit(VirtIODevice *vdev)
>  {
>      VirtIOSCSI *s = (VirtIOSCSI *)vdev;
>      unregister_savevm(s->qdev, "virtio-scsi", s);
> +    vhost_dev_cleanup(&s->vhost_scsi);
>      virtio_cleanup(vdev);
>  }
Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop
On 25/07/2012 09:03, Stefan Hajnoczi wrote:
> From: Liu Ping Fan
>
> iohandler/bh/timer may use DeviceState when its refcnt=0,

It's not clear to me how.  The only reference to devices from an
iohandler/bh/timer can be in the opaque.  Now, if you have an
iohandler/bh/timer whose opaque is a DeviceState, you should bump the
refcount before setting it up, and unref after tearing it down.

See for example how hw/scsi-disk.c bumps the refcount of a request
before starting asynchronous I/O, and calls unref at the end of the
asynchronous I/O callback.

Paolo
Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop
On 25/07/2012 10:16, liu ping fan wrote:
>> It's not clear to me how. The only reference to devices from an
>> iohandler/bh/timer can be in the opaque. Now, if you have an
>> iohandler/bh/timer whose opaque is a DeviceState, you should bump the
>> refcount before setting it up, and unref after tearing it down.
>
> Yes, I admit refcnt is a good solution, but I think it means that we
> will fix it with each device's bh. And this way seems lazy.

Most of the time the bh/timer/iohandler is created in the init function
and deleted in the exit function, so in this case the lifecycle is even
easier to manage.

Looking at your patch again, it seems like you're implementing a
poor man's RCU.  That's fine for a proof of concept, but let's replace
it with real RCU sooner or later.

Paolo
Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it
On 25/07/2012 05:31, Liu Ping Fan wrote:
> From: Liu Ping Fan
>
> rwlock:
>   qemu_device_tree_mutex
>
> rd side:
>   --device_del (destruction of device will be postponed until unplug
>     ack from guest)
>   --pci hot-unplug
>   --iteration (qdev_reset_all)
>
> wr side:
>   --device_add
>
> Signed-off-by: Liu Ping Fan
> ---
>  hw/pci-hotplug.c  |  4 ++++
>  hw/qdev-monitor.c | 17 +++++++++++++++-
>  hw/qdev.c         |  2 ++
>  3 files changed, 22 insertions(+), 1 deletions(-)
>
> diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
> index e7fb780..b3b88c1 100644
> --- a/hw/pci-hotplug.c
> +++ b/hw/pci-hotplug.c
> @@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr)
>          return -1;
>      }
>
> +    qemu_rwlock_rdlock_devtree();

This is not defined anywhere; is a piece missing in the patch?

Paolo
Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching
On 25/07/2012 10:12, liu ping fan wrote:
>>>> +    qemu_rwlock_rdlock_devtree();
>>>>      section = phys_page_find(page >> TARGET_PAGE_BITS);
>>>> +    if (!(memory_region_is_ram(section->mr) ||
>>>> +        memory_region_is_romd(section->mr)) && !is_write) {
>>>> +        bk = get_backend(section->mr, addr);
>>>> +        object_ref(bk);
>>>> +    } else if (!memory_region_is_ram(section->mr) && is_write) {
>>>> +        bk = get_backend(section->mr, addr);
>>>> +        object_ref(bk);
>>>> +    }
>>>> +    qemu_rwlock_unlock_devtree();
>>>>
>>>>      if (is_write) {
>>>>          if (!memory_region_is_ram(section->mr)) {
>>>> @@ -3426,6 +3462,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>>>              io_mem_write(section->mr, addr1, val, 1);
>>>>              l = 1;
>>>>          }
>>>> +    object_unref(bk);
>>
>> Currently object_ref()/object_unref() are not atomic. Will you send
>
> We obey the rule:
>     rdlock -> search -> ref_get,
>     wrlock -> remove -> ref_put
> So can it cause problems if object_ref()/object_unref() are not atomic?

Yes, two CPUs can perform object_ref at the same time.  You can find a
header file for atomic operations here:

    https://github.com/bonzini/qemu/commit/atomics.patch

Paolo
virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 25/07/2012 15:26, Boaz Harrosh ha scritto:
>>> In SCSI land most LLDs should support chaining just by virtue of using
>>> the for_each_sg macro. That's all it takes. Your code above does
>>> support it.
>>
>> Yes, it supports it but still has to undo them before passing to virtio.
>>
>> What my LLD does is add a request descriptor in front of the scatterlist
>> that the LLD receives. I would like to do this with a 2-item
>> scatterlist: one for the request descriptor, and one which is a chain to
>> the original scatterlist.
>
> I hate that plan. Why override the scatter element yet again with a third
> union of a "request descriptor"?

I'm not overriding (or did you mean overloading?) anything, and I think
you're reading too much into my words. What I am saying is (for a WRITE
command):

1) what I get is a scsi_cmnd which contains an N-element scatterlist.

2) virtio-scsi has to build the "packet" that is passed to the hardware
(it does not matter that the hardware is virtual). This packet (per the
virtio-scsi spec) has an N+1-element scatterlist, where the first
element is a request descriptor (struct virtio_scsi_cmd_req), and the
others describe the written data.

3) virtio takes care of converting the "packet" from a scatterlist
(which currently must be a flat one) to the hardware representation.
Here a walk is inevitable, so we don't care about this walk.

4) What I'm doing now: copying (and flattening) the N-element
scatterlist onto the last elements of an N+1 array that I pass to
virtio.

     _ _ _ _ _ _
    |_|_|_|_|_|_|    scsi_cmnd scatterlist

     vvv  COPY  vvv
     _ _ _ _ _ _ _
    |_|_|_|_|_|_|_|  scatterlist passed to virtio
     |
     virtio_scsi_cmd_req

Then I hand off the scatterlist to virtio. virtio walks it and converts
it to hardware format.

5) What I want to do: create a 2-element scatterlist, the first being
the request descriptor and the second chaining to scsi_cmnd's N-element
scatterlist.

     _ _ _ _ _ _
    |_|_|_|_|_|_|    scsi_cmnd scatterlist
     _ _/
    |_|C|            scatterlist passed to virtio
     |
     virtio_scsi_cmd_req

Then I hand off the scatterlist to virtio. virtio will still walk the
scatterlist chain, and convert it to N+1 elements for the hardware to
consume. Still, removing one walk largely reduces the length of my
critical sections. I also save some pointer-chasing because the
2-element scatterlists are short-lived and can reside on the stack.

Other details (you can probably skip these): there is also a response
descriptor. In the case of writes this is the only element that the
hardware will write to, so in the case of writes the "written by
hardware" scatterlist has 1 element only and does not need chaining.

Reads are entirely symmetric. The hardware will read the request
descriptor from a 1-element scatterlist, and will write response+data
into an N+1-element scatterlist (the response descriptor precedes the
data that was read). It can be treated in exactly the same way. The
N+1-element scatterlist could also become a 2-element scatterlist with
chaining.

>> Except that if I call sg_chain and my architecture does not define
>> ARCH_HAS_SG_CHAIN, it will BUG out. So I need to keep all the
>> scatterlist allocation and copying crap that I have now within #ifdef,
>> and it will bitrot.
>
> except that with the correct design you don't call sg_chain you just do:
> request_descriptor->sg_list = sg;

From the above it should be clear that request_descriptor is not a
driver-private extension of the scsi_cmnd. It is something passed to
the hardware.

>>> Well that can be done as well (if done carefully). It should be easy to
>>> add chained fragments both at the end of a given chain and at the
>>> beginning. It only means that the last element of the appended-to chain
>>> moves to the next fragment and its place is replaced by a link.
>>
>> But you cannot do that in constant time, can you? And that means you do
>> not enjoy any benefit in terms of cache misses etc.
>
> I did not understand "constant time", it is O(0) if that is what you
> meant.

In the worst case it is a linked list, no? So in the worst case
_finding_ the last element of the appended-to chain is O(n).

Actually, appending to the end is not a problem for virtio-scsi. But
for example the virtio-blk spec places the response descriptor _after_
the input data. I think this was a mistake, and I didn't repeat it for
virtio-scsi, but I cite it because in the past Rusty complained that
the long-sglist implementation was "SCSI-centric".

>> Also, this assumes that I can modify the appended-to chain. I'm not
>> sure I can do this?
>
> Each case on its own. If the appended-to chain is const, yes, you'll
> have to reallocate it and copy. Is that your case?

It will be virtio-blk's case, but we can ignore it.

Paolo
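The chained walk described above can be sketched in plain C. This is a toy userspace model for illustration only — `toy_sg`, `toy_sg_next` and the explicit `chain`/`is_last` fields are invented; the kernel's real `struct scatterlist` packs the same two states into the low bits of `page_link`. It shows the 2-element header (request descriptor + chain link) being walked as one logical N+1-element list:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a chained scatterlist entry: either a data element,
 * a chain link to another array, or the end marker. */
struct toy_sg {
    void *buf;
    unsigned len;
    struct toy_sg *chain;   /* non-NULL: this slot links to another array */
    int is_last;            /* end-of-list marker */
};

/* Rough equivalent of sg_next(): step to the following data element,
 * transparently hopping over chain-link slots. */
static struct toy_sg *toy_sg_next(struct toy_sg *sg)
{
    if (sg->is_last)
        return NULL;
    sg++;
    if (sg->chain)
        sg = sg->chain;     /* follow the link into the next fragment */
    return sg;
}

/* Count data elements, the way a for_each_sg() walk would visit them. */
static unsigned toy_sg_count(struct toy_sg *sg)
{
    unsigned n = 0;
    for (; sg; sg = toy_sg_next(sg))
        n++;
    return n;
}

/* The 2-element header from the mail: slot 0 points at the request
 * descriptor, slot 1 chains to the scsi_cmnd payload list. */
static void toy_sg_chain_header(struct toy_sg hdr[2], void *req,
                                unsigned req_len, struct toy_sg *payload)
{
    hdr[0].buf = req;   hdr[0].len = req_len;
    hdr[0].chain = NULL; hdr[0].is_last = 0;
    hdr[1].buf = NULL;  hdr[1].len = 0;
    hdr[1].chain = payload; hdr[1].is_last = 0;
}
```

Building the header is O(1) regardless of N, which is exactly the saving over the copy-and-flatten approach in step 4.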
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 25/07/2012 17:28, Boaz Harrosh ha scritto:
>> 1) what I get is a scsi_cmnd which contains an N-element scatterlist.
>>
>> 2) virtio-scsi has to build the "packet" that is passed to the hardware
>> (it does not matter that the hardware is virtual). This packet (per
>> virtio-scsi spec) has an N+1-element scatterlist, where the first
>> element is a request descriptor (struct virtio_scsi_cmd_req), and the
>> others describe the written data.
>
> Then "virtio-scsi spec" is crap. It overloads the meaning of
> "struct scatterlist" of the first element in an array to be a
> "struct virtio_scsi_cmd_req". What the holy fuck?

The first element simply _points_ to the "struct virtio_scsi_cmd_req",
just like subsequent elements point to the data. And the protocol of
the device is _not_ a struct scatterlist[]. The virtio _API_ takes that
array and converts it to a series of physical address + offset pairs.

> Since you need to change the standard to support chaining then
> it is a good time to fix this.

Perhaps it is a good time for you to read the virtio spec. You are
making a huge confusion between the LLD->virtio interface and the
virtio->hardware interface. I'm talking only of the former.

>> 3) virtio takes care of converting the "packet" from a scatterlist
>> (which currently must be a flat one) to the hardware representation.
>> Here a walk is inevitable, so we don't care about this walk.
>
> "hardware representation" you mean aio or biovec, whatever the
> IO submission path uses at the host end?

No, I mean the way the virtio spec encodes the physical address +
offset pairs.

I stopped reading here.

Paolo
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 25/07/2012 21:16, Boaz Harrosh ha scritto:
> The picture confused me. It looked like the first element is the
> virtio_scsi_cmd_req, not an sglist element that points to the struct's
> buffer.
>
> In that case then yes, your plan of making a two-element fragment that
> points to the original scsi sglist is perfect. All you have to do is
> that, and all you have to do at virtio is use the for_each_sg macro and
> you are done.
>
> You don't need any sglist allocation or reshaping. And you can easily
> support chaining. Looks like an order of magnitude simpler than what
> you do now.

It is.

> So what is the problem?

That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
care about do). So I need to go through all architectures and make sure
they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
Kconfig define so that dependencies can be expressed properly.

> And BTW you won't need that new __sg_set_page API anymore.

Kind of.

    sg_init_table(sg, 2);
    sg_set_buf(sg[0], req, sizeof(req));
    sg_chain(sg[1], scsi_out(sc));

is still a little bit worse than

    __sg_set_buf(sg[0], req, sizeof(req));
    __sg_chain(sg[1], scsi_out(sc));

Paolo
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 25/07/2012 23:04, Boaz Harrosh ha scritto:
>> That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
>> care about do). So I need to go through all architectures and make sure
>> they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
>> Kconfig define so that dependencies can be expressed properly.
>
> What is actually preventing ARCH_HAS_SG_CHAIN on all these ARCHes?
> Is it that the DMA drivers are not using for_each_sg()? Sounds easy
> to fix.
>
> But yes, a deep change would convert ARCH_HAS_SG_CHAIN to a Kconfig.
>
> If you want to be lazy, like me, you might just put a BUILD_BUG_ON
> in code, requesting the user to disable the driver for this ARCH.
>
> I bet there are more things to do at the ARCH level to enable
> virtualization than just supporting ARCH_HAS_SG_CHAIN. Be it just
> another requirement.

Actually, many arches run just fine with QEMU's binary code translation
(alpha, mips, coldfire). And since it's already pretty slow, using
virtio to improve performance is nice even in that scenario (which does
not require any kernel change). But it's just icing on the cake indeed.
I can live with a BUILD_BUG_ON, or better a "depends on HAVE_SG_CHAIN",
for the time being.

>>> And BTW you won't need that new __sg_set_page API anymore.
>>
>> Kind of.
>>
>>     sg_init_table(sg, 2);
>>     sg_set_buf(sg[0], req, sizeof(req));
>>     sg_chain(sg[1], scsi_out(sc));
>>
>> is still a little bit worse than
>>
>>     __sg_set_buf(sg[0], req, sizeof(req));
>>     __sg_chain(sg[1], scsi_out(sc));
>
> I believe they are the same, especially for the on-stack 2-element
> array. Actually I think in both cases you need to at least call
> sg_init_table() once after allocation, no?

Because the scatterlist here would be allocated on the stack, I would
need to call it every time if I used sg_set_buf/sg_chain. But
sg_init_table is really just memset + set SG_MAGIC + sg_mark_end.
If you use the "full-init" APIs as above, you don't need any of those
(sg_chain undoes the flag changes in sg_mark_end and vice versa).

> But please for my sake do not call it __sg_chain. Call it
> something like sg_chain_not_end(). I hate those "__" which
> for god's sake mean what?
> (A good name is when I don't have to read the code; an "__"
> means "fuck you, go read the code")

Right, better to call them sg_init_buf/sg_init_page/sg_init_chain.

In the meanwhile, we still have a bug to fix, and we need to choose
between Sen Wang's v1 (sg_set_page) or v2 (value assignment). I'm still
leaning more towards v2, if only because I already tested that one
myself.

Paolo
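The flag interplay mentioned above — sg_chain undoing sg_mark_end and vice versa — comes from the way scatterlist entries are tagged. A hedged sketch of that encoding: the bit values mirror `include/linux/scatterlist.h` (entries steal the two low bits of `page_link`, which are free because pages are at least 4-byte aligned), but the helper names here are invented for illustration:

```c
#include <assert.h>

#define SG_CHAIN 0x01UL   /* bit 0: this entry is a chain link */
#define SG_END   0x02UL   /* bit 1: this entry terminates the list */

/* Sketch of sg_mark_end(): set END, clear CHAIN. */
static unsigned long mark_end(unsigned long page_link)
{
    return (page_link | SG_END) & ~SG_CHAIN;
}

/* Sketch of sg_chain(): store the pointer to the next array, set CHAIN,
 * clear END -- which is exactly how chaining "undoes" a prior
 * sg_mark_end on the same slot. */
static unsigned long chain_to(unsigned long next_sgl_addr)
{
    return (next_sgl_addr | SG_CHAIN) & ~SG_END;
}
```

Since each state transition is a couple of bit operations on one word, the "full-init" helpers discussed in the mail can skip the whole-table memset that sg_init_table performs.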
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 26/07/2012 09:56, Boaz Harrosh ha scritto:
>>> In the meanwhile, we still have a bug to fix, and we need to choose
>>> between Sen Wang's v1 (sg_set_page) or v2 (value assignment). I'm still
>>> leaning more towards v2, if only because I already tested that one myself.
>
> It's your call, you know what I think ;-)
>
> I agree none of them are bugs in current code, they will both work
> just the same.

Cool. Then, Sen, please fix the commit message in v2 and repost.

> Please CC me on the "convert to sg copy-less" patches, it looks interesting.

Sure.

Paolo
Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
Il 05/07/2012 13:40, Sasha Levin ha scritto:
> @@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
>  	vring_del_virtqueue(vq);
>  
>  	/* Select and deactivate the queue */
> -	writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
> +	writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>  	writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);

This accesses vq after vring_del_virtqueue has freed it.

Paolo
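The fix this report implies is to read everything derived from vq before the call that frees it. A minimal userspace mock of that ordering — `mock_vq`, `mock_del_vq` and `teardown` are invented stand-ins, not the virtio-mmio driver code:

```c
#include <assert.h>
#include <stdlib.h>

struct mock_vq {
    unsigned index;   /* stands in for the virtqueue's queue index */
};

static unsigned last_deactivated;

/* Stand-in for vring_del_virtqueue(): frees the queue object. */
static void mock_del_vq(struct mock_vq *vq)
{
    free(vq);
}

/* Fixed teardown: capture the index while vq is still alive, free it,
 * then use only the saved copy (the device register writes would go
 * where last_deactivated is assigned). */
static unsigned teardown(struct mock_vq *vq)
{
    unsigned index = vq->index;  /* read before the free */

    mock_del_vq(vq);             /* vq is dangling from here on */
    last_deactivated = index;    /* safe: uses the local copy */
    return index;
}
```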
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 26/07/2012 09:58, Paolo Bonzini ha scritto:
>>> Please CC me on the "convert to sg copy-less" patches, it looks interesting.
>
> Sure.

Well, here is the gist of it (note it won't apply on any public tree,
hence no SoB yet). It should be split into multiple changesets, and you
can make more simplifications on top of it, because
virtio_scsi_target_state is not variable-sized anymore, but that's
secondary.

The patch includes the conversion of virtio_ring.c to use sg_next. It
is a bit ugly because of backwards compatibility; it can be fixed by
going through all drivers — not that there are many.

I'm still a bit worried about the future of this feature though. I
would be the first and (because of the non-portability) presumably sole
user for some time. Someone axing chained sg-lists in the future does
not seem too unlikely. So in practice I would rather just add to virtio
a function that takes two struct scatterlist ** (or an array of struct
scatterlist * + size pairs). Converting to sg_chain once it takes off
would be trivial.

Paolo

>From 57f8d4a20cbe9b3b25e10cd0595d7ac102fc8f73 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini
Date: Thu, 26 Jul 2012 09:58:14 +0200
Subject: [PATCH] virtio-scsi: use chained sg_lists

Using chained sg_lists simplifies everything a lot. The scatterlists we
pass to virtio are always of bounded size, and can be allocated on the
stack. This means we do not need to take tgt_lock and struct
virtio_scsi_target_state does not have a flexible array at the end
anymore, so we can avoid a pointer access.
---
 drivers/block/virtio_blk.c   |    3 +
 drivers/scsi/virtio_scsi.c   |   93 ---
 drivers/virtio/virtio_ring.c |   93 +++
 include/linux/virtio.h       |    8 +++
 4 files changed, 114 insertions(+), 83 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13f7ccb..ef65ea1 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -66,9 +66,6 @@ struct virtio_scsi_target_state {
 	struct virtio_scsi_vq *req_vq;
 
 	atomic_t reqs;
-
-	/* For sglist construction when adding commands to the virtqueue.  */
-	struct scatterlist sg[];
 };
 
 /* Driver instance state */
@@ -362,20 +359,6 @@ static void virtscsi_event_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
 };
 
-static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
-			     struct scsi_data_buffer *sdb)
-{
-	struct sg_table *table = &sdb->table;
-	struct scatterlist *sg_elem;
-	unsigned int idx = *p_idx;
-	int i;
-
-	for_each_sg(table->sgl, sg_elem, table->nents, i)
-		sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
-
-	*p_idx = idx;
-}
-
 /**
  * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
  * @vscsi	: virtio_scsi state
@@ -384,52 +367,57 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
  * @in_num	: number of write-only elements
  * @req_size	: size of the request buffer
  * @resp_size	: size of the response buffer
- *
- * Called with tgt_lock held.
  */
-static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
-			     struct virtio_scsi_cmd *cmd,
-			     unsigned *out_num, unsigned *in_num,
+static void virtscsi_map_cmd(struct virtio_scsi_cmd *cmd,
+			     struct scatterlist *sg_out,
+			     unsigned *out_num,
+			     struct scatterlist *sg_in,
+			     unsigned *in_num,
 			     size_t req_size, size_t resp_size)
 {
 	struct scsi_cmnd *sc = cmd->sc;
-	struct scatterlist *sg = tgt->sg;
-	unsigned int idx = 0;
+
+	sg_init_table(sg_out, 2);
+	sg_init_table(sg_in, 2);
 
 	/* Request header.  */
-	sg_set_buf(&sg[idx++], &cmd->req, req_size);
+	sg_set_buf(&sg_out[0], &cmd->req, req_size);
+	*out_num = 1;
 
 	/* Data-out buffer.  */
-	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
-
-	*out_num = idx;
+	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) {
+		struct sg_table *table = &scsi_out(sc)->table;
+		sg_chain(sg_out, 2, table->sgl);
+		*out_num += table->nents;
+	}
 
 	/* Response header.  */
-	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
+	sg_set_buf(&sg_in[0], &cmd->resp, resp_size);
+	*in_num = 1;
 
 	/* Data-in buffer */
-	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
-
-	*in_num = idx - *out_num;
+	if (sc && sc->sc_data_direction != DMA_TO_DEVICE) {
+		struct sg_table *table = &scsi_in(sc)->table;
+		sg_chain(sg_in, 2, table->sgl);
+		*in_num += table->nents;
+	}
 }
 
-static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
-			     struct virtio_scsi_vq *vq,
+static int virtscsi_kick_cmd(struct virtio_scsi_vq *vq,
			     struct virtio_scsi_cmd *cmd, si
[PATCH 2/2] virtio-scsi: support online resizing of disks
Support the LUN parameter change event. Currently, the host fires this
event when the capacity of a disk is changed from the virtual machine
monitor. The resize then appears in the kernel log like this:

  sd 0:0:0:0: [sda] 46137344 512-byte logical blocks: (23.6 GB/22.0 GiB)
  sda: detected capacity change from 22548578304 to 23622320128

Signed-off-by: Paolo Bonzini
---
 drivers/scsi/virtio_scsi.c  |   31 ++-
 include/linux/virtio_scsi.h |    2 ++
 2 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 8b6b927..c1a5e60 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -279,6 +279,31 @@ static void virtscsi_handle_transport_reset(struct virtio_scsi *vscsi,
 	}
 }
 
+static void virtscsi_handle_param_change(struct virtio_scsi *vscsi,
+					 struct virtio_scsi_event *event)
+{
+	struct scsi_device *sdev;
+	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
+	unsigned int target = event->lun[1];
+	unsigned int lun = (event->lun[2] << 8) | event->lun[3];
+	u8 asc = event->reason & 255;
+	u8 ascq = event->reason >> 8;
+
+	sdev = scsi_device_lookup(shost, 0, target, lun);
+	if (!sdev) {
+		pr_err("SCSI device %d 0 %d %d not found\n",
+		       shost->host_no, target, lun);
+		return;
+	}
+
+	/* Handle "Parameters changed", "Mode parameters changed", and
+	   "Capacity data has changed".  */
+	if (asc == 0x2a && (ascq == 0x00 || ascq == 0x01 || ascq == 0x09))
+		scsi_rescan_device(&sdev->sdev_gendev);
+
+	scsi_device_put(sdev);
+}
+
 static void virtscsi_handle_event(struct work_struct *work)
 {
 	struct virtio_scsi_event_node *event_node =
@@ -297,6 +322,9 @@ static void virtscsi_handle_event(struct work_struct *work)
 	case VIRTIO_SCSI_T_TRANSPORT_RESET:
 		virtscsi_handle_transport_reset(vscsi, event);
 		break;
+	case VIRTIO_SCSI_T_PARAM_CHANGE:
+		virtscsi_handle_param_change(vscsi, event);
+		break;
 	default:
 		pr_err("Unsupport virtio scsi event %x\n", event->event);
 	}
@@ -737,7 +765,8 @@ static struct virtio_device_id id_table[] = {
 };
 
 static unsigned int features[] = {
-	VIRTIO_SCSI_F_HOTPLUG
+	VIRTIO_SCSI_F_HOTPLUG,
+	VIRTIO_SCSI_F_CHANGE,
 };
 
 static struct virtio_driver virtio_scsi_driver = {
diff --git a/include/linux/virtio_scsi.h b/include/linux/virtio_scsi.h
index dc8d305..d6b4440 100644
--- a/include/linux/virtio_scsi.h
+++ b/include/linux/virtio_scsi.h
@@ -72,6 +72,7 @@ struct virtio_scsi_config {
 /* Feature Bits */
 #define VIRTIO_SCSI_F_INOUT                    0
 #define VIRTIO_SCSI_F_HOTPLUG                  1
+#define VIRTIO_SCSI_F_CHANGE                   2
 
 /* Response codes */
 #define VIRTIO_SCSI_S_OK                       0
@@ -108,6 +109,7 @@ struct virtio_scsi_config {
 #define VIRTIO_SCSI_T_NO_EVENT                 0
 #define VIRTIO_SCSI_T_TRANSPORT_RESET          1
 #define VIRTIO_SCSI_T_ASYNC_NOTIFY             2
+#define VIRTIO_SCSI_T_PARAM_CHANGE             3
 
 /* Reasons of transport reset event */
 #define VIRTIO_SCSI_EVT_RESET_HARD             0
-- 
1.7.1
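The field decoding the patch performs can be shown standalone. A userspace sketch — the struct and function names are illustrative, but the byte/bit layout follows the patch: `lun[1]` is the target, `lun[2..3]` the LUN in big-endian order, and `reason` packs ASC in the low byte and ASCQ in the next byte:

```c
#include <assert.h>
#include <stdint.h>

struct param_change {
    unsigned target, lun;
    uint8_t asc, ascq;
};

/* Decode the fields virtscsi_handle_param_change() pulls out of a
 * virtio_scsi event. */
static struct param_change decode_event(const uint8_t lun[8], uint32_t reason)
{
    struct param_change pc;

    pc.target = lun[1];
    pc.lun    = (lun[2] << 8) | lun[3];
    pc.asc    = reason & 255;
    pc.ascq   = reason >> 8;
    return pc;
}

/* Rescan on "Parameters changed" (0x2a/0x00), "Mode parameters changed"
 * (0x2a/0x01), and "Capacity data has changed" (0x2a/0x09). */
static int should_rescan(struct param_change pc)
{
    return pc.asc == 0x2a &&
           (pc.ascq == 0x00 || pc.ascq == 0x01 || pc.ascq == 0x09);
}
```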
[PATCH 0/2] virtio-scsi fixes for 3.6
James,

patch 1 fixes scanning of LUNs whose number is greater than 255. QEMU
passes a max_lun of 16383 (because it uses SAM numbering), but in Linux
it must become 32768 (because LUNs above 255 are "relocated" to 16640).

Patch 2 is a resubmission of the patch for online resizing of
virtio-scsi LUNs, which needs to be rebased. LUNs above 255 now work
for all of scanning, hotplug, hotunplug and resize.

Thanks,

Paolo

Paolo Bonzini (2):
  virtio-scsi: fix LUNs greater than 255
  virtio-scsi: support online resizing of disks

 drivers/scsi/virtio_scsi.c  |   37 +++--
 include/linux/virtio_scsi.h |    2 ++
 2 files changed, 37 insertions(+), 2 deletions(-)
[PATCH 1/2] virtio-scsi: fix LUNs greater than 255
virtio-scsi needs to report LUNs greater than 256 using the "flat"
format. Because the Linux SCSI layer just maps the SCSI LUN to an u32,
without any parsing, these end up in the range from 16640 to 32767.
Fix max_lun to account for the possibility that logical unit numbers
are encoded with the "flat" format.

Cc:
Signed-off-by: Paolo Bonzini
---
 drivers/scsi/virtio_scsi.c |    6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index c7030fb..8b6b927 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -677,7 +677,11 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
 	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
 	shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
-	shost->max_lun = virtscsi_config_get(vdev, max_lun) + 1;
+
+	/* LUNs > 256 are reported with format 1, so they go in the range
+	 * 16640-32767.
+	 */
+	shost->max_lun = virtscsi_config_get(vdev, max_lun) + 1 + 0x4000;
 	shost->max_id = num_targets;
 	shost->max_channel = 0;
 	shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE;
-- 
1.7.1
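The arithmetic behind the 16640/32767/32768 numbers in this series can be checked standalone. A sketch with invented helper names: SAM "flat space" addressing puts 0b01 in the top two bits of the first LUN byte, and since Linux folds the address bytes into a u32 without parsing, flat-format LUN n simply shows up as 0x4000 | n:

```c
#include <assert.h>

/* Where flat-format LUN n lands after Linux's unparsed u32 mapping. */
static unsigned flat_lun(unsigned lun)
{
    return 0x4000 | lun;
}

/* The max_lun fixup from the patch: the largest flat-format LUN the
 * device reports, plus one, relocated into the flat range. */
static unsigned fixed_max_lun(unsigned max_lun_cfg)
{
    return max_lun_cfg + 1 + 0x4000;
}
```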
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 27/07/2012 08:27, Rusty Russell ha scritto:
>>> +int virtqueue_add_buf_sg(struct virtqueue *_vq,
>>> +			 struct scatterlist *sg_out,
>>> +			 unsigned int out,
>>> +			 struct scatterlist *sg_in,
>>> +			 unsigned int in,
>>> +			 void *data,
>>> +			 gfp_t gfp)
>
> The point of chained scatterlists is they're self-terminated, so the
> in & out counts should be calculated.
>
> Counting them is not *that* bad, since we're about to read them all
> anyway.
>
> (Yes, the chained scatterlist stuff is complete crack, but I lost that
> debate years ago.)
>
> Here's my variant. Networking, console and block seem OK, at least
> (i.e. it booted!).

I hate the for loops, even though we're indeed about to read all the
scatterlists anyway... all they do is lengthen critical sections.
Also, being the first user of chained scatterlists doesn't exactly
give me warm fuzzies.

I think it's simpler if we provide an API to add individual buffers to
the virtqueue, so that you can do multiple virtqueue_add_buf_more
(whatever) calls before kicking the virtqueue. The idea is that I can
still use indirect buffers for the scatterlists that come from the
block layer or from an skb, but I will use direct buffers for the
request/response descriptors. The direct buffers are always a small
number (usually 2), so you can balance the effect by making the
virtqueue bigger. And for small reads and writes, you save a kmalloc
on a very hot path.

(BTW, scatterlists will have separate entries for each page; we do not
need this in virtio buffers. Collapsing physically-adjacent entries
will speed up QEMU and will also help avoid indirect buffers.)

Paolo
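The collapsing of physically-adjacent entries suggested in the closing aside is straightforward. A userspace sketch, with an invented `phys_seg` type standing in for the physical address + length pairs that virtio hands to the device:

```c
#include <assert.h>
#include <stddef.h>

struct phys_seg {
    unsigned long addr;
    unsigned long len;
};

/* Coalesce physically contiguous segments in place: a page-per-entry
 * scatterlist often describes one contiguous buffer, which the device
 * can consume as a single address+length pair.  Returns the new
 * segment count. */
static size_t coalesce(struct phys_seg *seg, size_t n)
{
    size_t out = 0, i;

    for (i = 0; i < n; i++) {
        if (out && seg[out - 1].addr + seg[out - 1].len == seg[i].addr)
            seg[out - 1].len += seg[i].len;   /* extend previous segment */
        else
            seg[out++] = seg[i];              /* start a new segment */
    }
    return out;
}
```

Fewer descriptors per request also means a buffer is more likely to fit in the ring directly, which is the "help avoid indirect buffers" part of the remark.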
Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
Il 05/07/2012 12:29, Jason Wang ha scritto:
> Sometimes, a virtio device needs to configure irq affinity hints to
> maximize performance. Instead of just exposing the irq of a virtqueue,
> this patch introduces an API to set the affinity for a virtqueue.
>
> The API is best-effort; the affinity hint may not be set as expected due
> to platform support, irq sharing or irq type. Currently, only the pci
> method is implemented, and we set the affinity according to:
>
> - if device uses INTX, we just ignore the request
> - if device has per vq vector, we force the affinity hint
> - if the virtqueues share MSI, make the affinity OR over all affinities
>   requested
>
> Signed-off-by: Jason Wang

Hmm, I don't see any benefit from this patch; I need to use
irq_set_affinity (which however is not exported) to actually bind IRQs
to CPUs. Example:

with irq_set_affinity_hint:

 43:        89       107       100        97  PCI-MSI-edge  virtio0-request
 44:       178       195       268       199  PCI-MSI-edge  virtio0-request
 45:        97       100        97       155  PCI-MSI-edge  virtio0-request
 46:       234       261       213       218  PCI-MSI-edge  virtio0-request

with irq_set_affinity:

 43:       721001  PCI-MSI-edge  virtio0-request
 44:      0 74601  PCI-MSI-edge  virtio0-request
 45:      00 6580  PCI-MSI-edge  virtio0-request
 46:      001 547  PCI-MSI-edge  virtio0-request

I gathered these quickly after boot, but real benchmarks show the same
behavior, and performance actually gets worse with virtio-scsi
multiqueue + irq_set_affinity_hint than with irq_set_affinity.

I also tried adding IRQ_NO_BALANCING, but the only effect is that I
cannot set the affinity.

The queue steering algorithm I use in virtio-scsi is extremely simple
and based on your tx code. See how my nice pinning is destroyed:

  # taskset -c 0 dd if=/dev/sda bs=1M count=1000 of=/dev/null iflag=direct
  # cat /proc/interrupts
   43:      2690      2709      2691      2696  PCI-MSI-edge  virtio0-request
   44:       109       122       199       124  PCI-MSI-edge  virtio0-request
   45:       170       183       170       237  PCI-MSI-edge  virtio0-request
   46:       143       166       125       125  PCI-MSI-edge  virtio0-request

All my requests come from CPU#0 and thus go to the first virtqueue, but
the interrupts are serviced all over the place.

Did you set the affinity manually in your experiments, or perhaps there
is a difference between scsi and networking... (interrupt mitigation?)

Paolo