Hi Paolo,

Sorry for the late reply and thanks for the review. See my comments inline:
On 02/11/2016 01:17 PM, Paolo Bonzini wrote:
>
> On 16/12/2015 17:55, Alex Pyrgiotis wrote:
>> +/*
>> + * Create a QEMUIOVector from a scatter-gather list.
>> + *
>> + * This function does not copy the data of the scatter-gather list.
>> + * Instead, it uses the dma_memory_map() function to map physical
>> + * memory regions of the virtual device (as interpreted by the guest
>> + * kernel) into the address space of the QEMU process, in order to
>> + * have access to the data.
>> + */
>> +static void dma_map_sg(DMAAIOCB *dbs)
>
> In special cases where the QEMUSGList includes MMIO regions, dma_map_sg
> might not be able to map the whole list. In this case, for regular I/O
> it is possible to break the operation in multiple steps---in fact, this
> breaking of requests is the main purpose of most of the code in
> dma-helpers.c.

You're right. I had even written a comment about that, yet somehow the
implications of rescheduling a DMA operation flew over my head. Great.

So, I've tried to grasp the current situation and this is what I've got
so far. Please correct me where I'm wrong.

Guest kernel space:

1. The guest kernel wants to send a SCSI request to a (para)virtual
   SCSI controller.
2. This SCSI request involves a list of physical address ranges.
   Whether the request refers to a single contiguous physical range or
   to a scatter-gather list depends on the implementation of the
   kernel's low-level device driver (LLDD) and the capabilities of the
   SCSI controller.
3. Some of these ranges may include MMIO regions. Since I'm not aware
   of a scenario where this happens, can you provide an example?
4. The LLDD writes the SCSI CDB to the SCSI controller's queue and
   notifies the controller by writing to some sort of doorbell
   register.

QEMU/hardware space:

5. The SCSI controller code creates a QEMUSGList that points to the
   memory regions of the SCSI request. This QEMUSGList may also
   include the MMIO regions.
6. The QEMU device implementation, e.g.
   scsi-block, chooses to use the dma_* interface.
7. The dma_blk_read/write() code ultimately attempts to map all the
   memory regions pointed to by the QEMUSGList in order to create a
   QEMUIOVector.
8. At some point during the mapping loop, the code encounters an MMIO
   region. Since reading from and writing to an MMIO region requires
   special handling, e.g. calling MemoryRegion->ops->write(), we
   cannot include it in our read/write system call to the host kernel.
9. This leads to a partial read/write, and the mapping loop resumes
   once the partial read/write() has finished.

Are we on the same page so far? If so, let's dig into the
dma_blk_read/write() logic. The following is simplified pseudocode:

    allocate QEMUIOVector

    callback:
        # Undo any changes to the QEMUIOVector from previous operations
        for address in QEMUIOVector:
            unmap address
        reset QEMUIOVector

        # Check if the previous I/O operation was the last one and, if
        # so, call the completion callback of the user and return.
        if operation is completed:
            dma_complete()
            return

        offset = adjust read/write offset

        # Map each sg of the QEMUSGList and add it to the QEMUIOVector
        for sg in QEMUSGList:
            address = dma_memory_map(sg)
            if address == NULL:
                break
            add address to QEMUIOVector

        # When either of the following operations has finished, restart
        # this DMA operation.
        if nothing mapped:
            # If we can't map anything, we need to retry
            create reschedule_dma callback
            cpu_register_map_client(callback)
        else:
            # Else, we can perform a partial read/write
            do readv/writev(offset, QEMUIOVector, callback)

Is the above OK? If so, I have some questions:

a) Is an MMIO region one of the reasons why we can't map an sg?
b) At which point will the relevant ops->write() method for the MMIO
   region be called when we have to DMA into the region? Is it handled
   implicitly in dma_memory_map()?
c) I'm not quite sure about the logic of the "nothing mapped" section.
Correct me if I'm wrong, but what I think it does is register a
callback (reschedule_dma) to be run once some sort of mapping has
completed. What kind of mapping is this? Is there anything more to it?

> However, it is not possible to do the same for ioctls. This is actually
> the reason why no one has ever tried to make scsi-generic do anything
> but bounce-buffering. I think that your code breaks horribly in this
> case, and I don't see a way to fix it, except for reverting to bounce
> buffering.
>
> This would require major changes in your patches, and I'm not sure
> whether they are worth it for the single use case of tape devices...

Well, I wouldn't narrow it down to tape devices. The scsi-generic
interface is the universal interface for all SCSI devices and the only
interface that is fully passthrough, so this patch would effectively
boost the baseline performance of all SCSI devices. I think it's worth
a try.

Thanks,
Alex